
Opus 4.7 vs Codex 5.5: Two Weeks Building With the New Frontier Coding Models

Anthropic shipped Claude Opus 4.7. OpenAI shipped Codex 5.5. We've been building Test-Lab features with both for the last few weeks. Here is what actually changed, where each one shines, and how the new design and image features fit in.

Tags: AI coding, Claude Opus 4.7, Codex 5.5, Claude Code, GPT image generation, developer tools, comparison, AI agents

Two of the biggest labs in the AI coding race shipped new flagship coding models in the back half of April. Anthropic launched Claude Opus 4.7 on April 16. OpenAI followed a week later on April 23 with GPT-5.5 (codename Spud), the model powering Codex 5.5. Both promised faster reasoning, better tool use, and a real jump on production engineering tasks rather than just on leaderboards.

We've been running both since release. Two weeks with Opus 4.7, just over a week with Codex 5.5, building real features on Test-Lab. Same engineer, same repo, same kinds of work. This is what we found.

TL;DR for the impatient

| Model | Quality | Speed | Cost | Where it shines |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | Highest on hard refactors | Fastest in its class | Premium | Multi file refactors, ambiguous specs, taste calls |
| Codex 5.5 | Very strong | Fast | Cheap by comparison | Tight loops, terminal work, well scoped tasks |

If you only have time for one takeaway: Opus 4.7 still produces the best output on the messiest problems, and Codex 5.5 is the sane default for everything else.

What actually changed in Opus 4.7

The headline change is speed. On the browser automation benchmark we ran in April, Opus 4.7 finished the same hardened plan faster than every other model in the smart tier, including GPT-5.4. That same speed gain shows up in Claude Code. Long agent loops that used to feel like waiting for a coffee now feel like a normal IDE response.

Three things we noticed beyond the speed:

Better instruction following on long prompts. We feed Claude pretty heavy system prompts, often 3K to 5K tokens of project conventions, file ownership rules, and workflow guidance. Opus 4.6 would occasionally drift on the fifteenth instruction. 4.7 holds the line.

Sharper "stop and ask" behavior. Opus 4.7 is more willing to push back when a request is ambiguous, instead of just guessing. That sounds like a small thing. In practice it saves a full round trip on most non trivial tasks.

Cleaner diffs. When 4.7 edits a file, the change is almost always the smallest one that does the job. 4.6 would sometimes touch surrounding code "while it was there". 4.7 leaves the surrounding code alone unless you explicitly ask it to refactor.

Cost did not move. Opus 4.7 is priced like Opus 4.6, which is to say it is the most expensive frontier model in regular production use. If you are paying per token, you feel it.

What actually changed in Codex 5.5

OpenAI's pitch for Codex 5.5 was a model that punches at the level of the previous generation's flagship, with materially better latency and unchanged pricing. After a week of building with it, that pitch holds up.

The biggest jump for us was in tool use discipline. Codex 5.3 was already strong here. Codex 5.5 is noticeably better at:

  • Calling the right tool the first time, rather than groping toward it
  • Cleaning up state at the end of a multi step task
  • Reading large blocks of unfamiliar code without losing the thread

It is also less likely to generate confidently wrong type signatures, which was one of our few persistent complaints about 5.3.
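To make that concrete, here is the flavor of mistake we mean. The snippet is a hypothetical illustration made up for this post, not a real diff from our repo: the failure mode is a signature that promises a plain value while the body actually returns a promise.

```typescript
// Hypothetical illustration of the "confidently wrong type signature" failure mode.
interface TestRun {
  id: string;
  status: "passed" | "failed";
}

// The kind of signature 5.3 would sometimes assert with confidence:
// declared as returning TestRun, even though the body is async.
// function getLatestRun(projectId: string): TestRun { ... }

// What it should be (and what 5.5 now gets right more often):
async function getLatestRun(projectId: string): Promise<TestRun> {
  const res = await fetch(`/api/projects/${projectId}/runs/latest`);
  return (await res.json()) as TestRun;
}
```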

Where Codex 5.5 still trails Opus 4.7 is on genuinely ambiguous tasks. "Refactor this file to be more testable" is exactly the kind of request where the right answer depends on a dozen small judgments. Opus 4.7 makes those judgments closer to how a senior engineer would. Codex 5.5 makes reasonable choices, but they read as more mechanical.

Real workloads, side by side

We bucketed the last two weeks of work into four categories. Opus 4.7 covered the full window. Codex 5.5 joined the rotation once it shipped on April 23 and got at least a couple of runs at each category before this post went up.

Bug fix in a familiar module. Roughly tied. Both produced clean, minimal patches. Codex 5.5 was usually faster end to end because it generates fewer intermediate explanations. The cost differential made Codex the obvious default here.

Multi file refactor. Opus 4.7 was the better tool. It held the cross file picture in mind, produced consistent renames, and flagged two cases where our existing API contract would have leaked into the new abstraction. Codex 5.5 finished the job, but we caught a stale import in the diff that it should have cleaned up.

Net new feature, well scoped. Codex 5.5 won on throughput. We could grind through a feature, tests, and a small migration in a single session and pay a fraction of what the same session cost on Opus 4.7. Quality was good enough that we did not have to babysit.

Net new feature, vague spec. Opus 4.7 was clearly stronger. When the input was a short paragraph and a screenshot, Opus 4.7 asked the right clarifying questions, made tasteful choices, and produced something we wanted to ship. Codex 5.5 needed more guidance up front to get to the same place.

That split is consistent with what we wrote in our February comparison. The bias has not flipped. The gap has narrowed.

The new design surface in Claude Code

Worth flagging because it slipped out the same week as 4.7: Claude Code now has a first party design preview surface. You can sketch a UI iteration in the chat, get a rendered preview without leaving your terminal session, and iterate on layouts and copy in the same loop where you are also writing the implementation.

It is not a replacement for Figma. What it is, surprisingly, is a faster way to settle internal "what if the panel sat on the right" arguments without doing a real round trip into a design tool. Engineers who do not love living in Figma will use it. Designers who already live in Figma will mostly ignore it, and that is fine.

We have been using it on a few internal admin tools where we cared about getting something usable in front of the team within an afternoon. It pairs well with the lightweight design first workflow we already use for early stage features.

A quick word on GPT image generation

While we are on the subject of new releases, OpenAI's latest image model is genuinely good. We had been defaulting to Google's Gemini image model (the one nicknamed nano banana inside the community) for marketing assets and quick mockups. The new GPT image generation handles instruction driven edits and text inside images noticeably better, with fewer of the warping artefacts that have been the standard tell on AI generated UI screenshots.

Two practical implications for builders:

  • Generating consistent product illustrations across a blog series no longer requires a custom LoRA or a stack of inpainting tricks
  • Mocking up a feature for an internal review is now genuinely faster than asking a designer to throw together placeholders

We are not switching every workflow over. Gemini is still cheaper for high volume batch jobs. But for the occasional image where quality matters more than volume, GPT pulled ahead.
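For reference, wiring a quick mockup prompt into a script is a few lines. This is a minimal sketch assuming the official openai Node SDK with OPENAI_API_KEY set; the model identifier and prompt are placeholders, so check the current model name before copying.

```typescript
import OpenAI from "openai";
import { writeFile } from "node:fs/promises";

const client = new OpenAI();

// Generate a placeholder UI mockup for an internal review and save it to disk.
async function generateMockup(prompt: string, outPath: string) {
  const result = await client.images.generate({
    model: "gpt-image-1", // placeholder: use whichever GPT image model is current
    prompt,
    size: "1024x1024",
  });

  // This family of models returns base64 encoded image data.
  const b64 = result.data?.[0]?.b64_json;
  if (!b64) throw new Error("No image data returned");
  await writeFile(outPath, Buffer.from(b64, "base64"));
}

await generateMockup(
  "Dashboard panel showing a list of browser test runs with pass/fail badges",
  "mockup.png"
);
```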

The cost story keeps moving

| Plan | Reality after a heavy week |
| --- | --- |
| Codex 5.5 on the OpenAI $20 plan | Did not hit the limit. Fits a full time engineer. |
| Codex 5.5 on the OpenAI $100 plan | Comfortable for a small team sharing the seat. |
| Opus 4.7 on the Anthropic $100 plan | Tight for a single heavy user. Hit limits twice. |
| Opus 4.7 on the Anthropic $200 plan | Where we sit. Comfortable for daily heavy use. |

The pattern holds. If you are budget sensitive and your work is mostly bounded coding tasks, Codex 5.5 is the default. If you are working on hard problems and your time is the most expensive input, Opus 4.7 earns its premium.

What we actually do day to day

For most engineering work on Test-Lab, Opus 4.7 is our default. We run a lot of multi file refactors and ambiguous spec work, and the taste calls plus the willingness to push back when a request is fuzzy are worth the price for the kind of work we do. The speed bump in 4.7 is what tipped this from "use it for the hard things" to "just leave it on".

For tightly scoped tasks (a quick bug fix in a familiar module, a churn heavy migration, a one off script) we sometimes drop down to Codex 5.5. The cost differential makes it an easy choice when the task is well bounded and the throughput matters more than the nuance.

We do not pick one model and stick with it. The gap is narrow enough on simple tasks that price wins, and wide enough on hard tasks that quality wins. Use both.

What this means for the broader market

A few patterns are sharper now than they were six months ago.

The pricing curve is real. Frontier quality keeps getting cheaper. The Codex 5.5 you can run on a $20 plan is meaningfully better than the Opus 4.5 you could run on a $200 plan a year ago.

Speed matters more than peak benchmark scores. A model that is slightly weaker but twice as fast wins more sessions than a model that is slightly stronger but slow. Engineers run agents in tight loops, and the loops feel different at 30 seconds versus 90.

The full product is still a stack. Both Opus 4.7 and Codex 5.5 are excellent components. The IDE you wrap around them, the tools you give them, and the prompts you ship with them still drive most of the user visible difference. Picking the right model is necessary. It is not sufficient.

If you are evaluating these for your own team, run both for at least a week on real work. The benchmarks are useful. Your own diff history is more useful.


Test-Lab.ai is an AI powered browser testing platform. Our take on coding agents comes from using them every day to build and ship the product. If you want to see how AI agents handle browser testing, start a free trial.
