
Opus 4.7 vs GPT-5.5 for Playwright Script Generation: A Focused Benchmark

We ran Claude Opus 4.7 and OpenAI GPT-5.5 through the same Playwright script generation pipeline across hundreds of generation cycles. Here is what the data says about speed, cost, and reliability for AI-driven E2E test authoring.

Tags: AI testing, LLM benchmark, Playwright, script generation, Claude Opus 4.7, GPT-5.5, E2E testing, comparison

Generating a Playwright script from a passing AI run is the part of E2E test authoring where the underlying model has the most leverage. It shapes how fast a clean script lands, how readable it ends up, and how much each generation costs.

Two of the strongest "deep tier" frontier models are credible candidates for the job today. Anthropic's Claude Opus 4.7 has been our incumbent. OpenAI's GPT-5.5 (codename Spud) shipped a couple of weeks later, with a pitch around stronger tool use and tighter reasoning. We swapped both into the same generation flow, against the same source AI runs, and recorded everything that came out.

This post is the writeup.

TL;DR

| Model | Reliability | Avg duration | Avg LLM cost | Cost variance |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | All passes | 67s (3 steps) / 94s (5 steps) | $0.19 / $0.33 | ±2¢, very tight |
| GPT-5.5 | All passes | 262s (3 steps) / 418s (5 steps) | $0.45 / $0.99 | ±15¢, wide |

On these workloads, GPT-5.5 was roughly four times slower and two and a half to three times more expensive on raw LLM spend, with comparable headline reliability inside a controlled bench. The interesting story is in where the cost gap comes from, and how predictable each model is.

The setup

This is a focused comparison rather than the eleven-model sweep we published last month. We took two source AI runs from a real customer test plan, against a real production CRM:

  • Create contact (3 steps)
  • Add property (5 steps)

Same source runs, same target. Both routed through OpenRouter so per-token pricing is directly comparable in one dashboard, and both ran with reasoning effort set to "high".
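
For reference, the call shape looks roughly like the sketch below. It assumes OpenRouter's chat completions endpoint with its unified reasoning-effort and usage-accounting options; the model slugs and the exact response fields are illustrative, so treat them as assumptions rather than copy-paste config.

```typescript
// Hypothetical sketch of the generation call, routed through OpenRouter with
// reasoning effort pinned to "high". The reasoning/usage options and the
// usage.cost field follow OpenRouter's unified API as we understand it; check
// the current docs before relying on the exact field names.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

async function generateScript(model: string, prompt: string) {
  const res = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "anthropic/claude-opus-4.7" or "openai/gpt-5.5" (illustrative slugs)
      messages: [{ role: "user", content: prompt }],
      reasoning: { effort: "high" }, // same setting used for both models in this benchmark
      usage: { include: true },      // ask OpenRouter to attach cost accounting to the response
    }),
  });
  const data: any = await res.json();
  return {
    script: data.choices?.[0]?.message?.content ?? "",
    costUsd: data.usage?.cost, // raw LLM cost as reported by OpenRouter
  };
}
```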

We ran each source repeatedly through both models and aggregated. The Opus pass landed twenty clean generations. The GPT-5.5 pass we cut short at fourteen because the cost and duration profile was already obvious and we wanted the credit balance back. This is a focused snapshot, not a full harness sweep.

Headline numbers

Per-source aggregates, averaged across all completed generations, with raw LLM cost taken straight from the OpenRouter response (so prompt caching and any provider-side discounts are already baked in):

| Source | Model | n | Avg duration | Min/Max duration | Avg LLM cost | Min/Max cost |
| --- | --- | --- | --- | --- | --- | --- |
| Create contact (3 steps) | Opus 4.7 | 10 | 67s | 59s / 79s | $0.189 | $0.18 / $0.20 |
| Create contact (3 steps) | GPT-5.5 | 10 | 262s | 213s / 347s | $0.450 | $0.31 / $0.62 |
| Add property (5 steps) | Opus 4.7 | 10 | 94s | 82s / 107s | $0.333 | $0.32 / $0.34 |
| Add property (5 steps) | GPT-5.5 | 4 | 418s | 337s / 496s | $0.988 | $0.78 / $1.07 |

Rolling all sources together: Opus finished 20 clean generations for $5.22 total. GPT-5.5 finished 14 clean generations for $8.45 total. Same pipeline, same prompts, same browsers, same machine.

Where the cost gap comes from

The dominant factor is reasoning chain length. With reasoning effort pinned to "high", GPT-5.5 visibly spends longer thinking on every model call: Opus 4.7 closes the same prompt in seconds of model time, while GPT-5.5 sits on it for a minute or two. Reasoning tokens are billed, so the per-call invoice scales with that thinking time. The same 5-step plan that Opus 4.7 turns into a script in around 90 seconds and 33 cents takes GPT-5.5 around seven minutes and 98 cents.

The script content itself is also subtly different in ways that matter for what you actually get for the money. We pulled the generated scripts back out and read through them by hand. The two models produce visibly different outputs from the same input, in two dimensions:

Opus 4.7 writes terser scripts. GPT-5.5 writes more thorough ones.

On the simpler 3-step prompt (Create contact), Opus shipped scripts with zero post-action assertions. Every step ends with a click and a screenshot. The test name claims it "verified the details on the record page", but the body just clicks Save. That script "passes" as long as the page does not crash, regardless of whether the contact actually got created. Across multiple runs of the same prompt, this pattern was completely consistent.

GPT-5.5 on the same prompt wrote assertions that mapped to the prompt's acceptance criteria: URL contains /contacts, name visible, email visible, phone visible. Same Save click, but the test actually checks the result.
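
To make the contrast concrete, here is a hypothetical reconstruction of the two styles. The selectors, names, and values are placeholders, not the customer's CRM markup or either model's verbatim output.

```typescript
import { test, expect } from '@playwright/test';

// Opus-style step on the "Create contact" prompt: action plus screenshot,
// nothing verified after Save, so it passes as long as the page does not crash.
test('opus-style: create contact', async ({ page }) => {
  await page.getByRole('button', { name: 'Save' }).click();
  await page.screenshot({ path: 'contact-saved.png' });
});

// GPT-5.5-style step on the same prompt: assertions mapped to the acceptance
// criteria (URL, name, email, phone). All values are placeholders.
test('gpt-5.5-style: create contact and verify the record page', async ({ page }) => {
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page).toHaveURL(/\/contacts/);
  await expect(page.getByText('Jane Doe')).toBeVisible();
  await expect(page.getByText('jane.doe@example.com')).toBeVisible();
  await expect(page.getByText('555-0100')).toBeVisible();
});
```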

On the 5-step prompt the gap was narrower. Both models wrote assertions, but GPT-5.5 still leaned more defensive. It added explicit visibility waits before screenshots, used level: 1 / level: 2 on heading roles to disambiguate, scoped iframe locators with negated title patterns, and added "modal hidden after save" checks. Opus's output was tighter and read better, but skipped some of these guards.
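
In Playwright terms, those guards look roughly like this sketch, again with placeholder selectors rather than the model's actual output:

```typescript
import { test, expect } from '@playwright/test';

test('gpt-5.5-style guards on the 5-step prompt', async ({ page }) => {
  // disambiguate headings by role level rather than matching any heading text
  await expect(page.getByRole('heading', { level: 1, name: 'Contacts' })).toBeVisible();

  // scope into the form iframe while excluding unrelated frames by title
  const form = page.frameLocator('iframe:not([title="chat-widget"])');
  await form.getByLabel('Property name').fill('Main Street Office');

  // explicit visibility wait before the screenshot instead of screenshotting blind
  await expect(form.getByRole('button', { name: 'Save' })).toBeVisible();
  await page.screenshot({ path: 'before-save.png' });

  await form.getByRole('button', { name: 'Save' }).click();

  // "modal hidden after save" check
  await expect(page.getByRole('dialog')).toBeHidden();
});
```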

Opus 4.7 is dramatically more consistent run to run. Two Opus runs of the same prompt were byte-identical. Two GPT-5.5 runs of the same prompt diverged on selector strategy, assertion shape, and which dimensions to verify. So if you regenerate a script later as the UI shifts, Opus output is closer to a deterministic transformation; GPT-5.5 output mutates more.

So the cost-gap story is partly about reasoning tokens, and partly about what each model spends them on. Some of the GPT-5.5 spend buys real coverage (more assertions, more waits). Some of it buys variance.

Variance is the second story

The first table tells you Opus is faster and cheaper on average. The variance numbers tell you Opus is also a lot more predictable.

On the Create contact prompt across ten runs each:

  • Opus 4.7 cost ranged from $0.18 to $0.20. A two cent spread across ten generations. Same plan, same browser, same model, same pricing.
  • GPT-5.5 cost ranged from $0.31 to $0.62. A thirty-one cent spread, with the top of the range double the bottom.

Duration variance follows the same pattern. Opus stays inside a 20-second window for the same prompt repeated. GPT-5.5 swings by 130 seconds.

If you are budgeting a CI plan that runs script generation on every merge, predictability matters as much as the average. A thirty-one cent run that occasionally costs sixty-two is a different operational story from a nineteen cent run that always costs nineteen.

Reliability, with the caveat

Both models produced a script for every single attempt in this run. No retries, no surfaced failures.

The honest caveat: this benchmark was run in a local environment that could not execute the generated scripts end to end against the target. Both models therefore look "all passing" because there was no execution feedback to push back on either one. In a setup where execution feedback is wired in, both would burn additional cycles whenever a generated locator did not stick, and the head to head on retry rate and retry cost is where the more interesting comparison lives. We will run that pass next.

A nuance worth surfacing here: "produces a script that passes" is not the same as "produces a useful test". Opus's Create contact script "passes" trivially because it asserts almost nothing, so a real product regression would not trip it. GPT-5.5's version on the same prompt would actually catch a regression in the contact-detail page rendering. Once execution feedback is wired in and we re-run, the headline reliability number for both models will probably move down, but the relative ordering on "catches real bugs" may shift in GPT-5.5's favor on prompts where verification language matters.

Speed

Setup time per generation was about the same for both models. All the divergence shows up in the model-driven steps.

This matters more than the per-call cost when generation sits in a workflow that humans wait on. A tester clicking "generate script" and getting a passing artifact in 90 seconds has a different experience from one who waits seven minutes.

What this benchmark is not

A few things this run does not tell you.

It only measures one reasoning effort setting. Both models ran with reasoning effort pinned to "high". With effort dialed down to "low" or "medium", both would be cheaper and faster, and the relative gap between them would likely shift. A team optimizing for throughput over peak quality might land somewhere different.

It does not cover Sonnet 4.6 or GPT-5.4. The other "deep tier" candidates in our catalog deserve their own pass. Sonnet 4.6 in particular is the obvious candidate at half of Opus's input price. We will run that comparison as a follow-up.

It does not measure how each model behaves once a script fails to run. Execution feedback against the target was not available in this local setup, so neither model had to recover from a broken locator or a flaky assertion. That recovery behavior is where production differences are most likely to show up, and we have not captured it yet.

It does not say anything about either model on tasks other than Playwright script generation. GPT-5.5 may well dominate on workloads we did not test here. This particular job cares about a narrow set of skills: instruction following over a fixed schema, locator synthesis, conservative diff generation. A model can be excellent at general coding and uneven at this slice.

Pricing context

For reference, OpenRouter list prices for the two models at the time of writing:

| Model | Input ($/M tokens) | Output ($/M tokens) | Context (K tokens) |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 5.00 | 25.00 | 1,000 |
| GPT-5.5 | 5.00 | 30.00 | 400 |

Same input rate, a twenty percent higher output rate for GPT-5.5, and a smaller context window. List pricing alone does not explain the two-and-a-half to three-times spend gap we measured: at the same token mix, GPT-5.5's bill would come out at most about twenty percent higher. The rest is the token count gap, driven by reasoning chain length and patch volume.

Numbers to watch if you run this yourself

If you are evaluating models for your own E2E generation work, these are the metrics worth instrumenting before you commit:

  • Average cost per generation, on a workload you actually run. Synthetic prompts and toy pages will under-report cost differences. Real production targets, with all their modal dialogs and slow third-party widgets, are where the gap shows up.
  • Cost variance across repeated runs of the same plan. Average plus min/max plus standard deviation. A model with a steady cost profile is much easier to budget than one that swings by 2 to 3 times on the same input.
  • Wall-clock duration end to end. Reasoning-heavy models concentrate their wall-clock time in the model-driven steps. A 90-second tool feels different to a tester than a seven-minute one.
  • Sampled output quality on first-draft scripts. A handful of by-hand reads of the script the model produces before any cleanup pass is applied tells you a lot about how much downstream work the model will need.
  • Assertion density per generated step, not just whether the script ran. Count the expect(...) calls per test() block; a minimal counting sketch follows this list. A script that "passes" with no assertions is a smoke test in disguise. Models can vary a lot on this dimension under the same prompt.
  • Recovery behavior when a script fails on its first run. The most interesting differences usually live there. Cheap models that recover well can beat expensive models that produce a clean first draft.
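
For the assertion-density point above, a minimal sketch: a naive regex pass over a directory of generated spec files. The directory name is a placeholder, and a real harness would parse the AST rather than grep, but it is enough for a first signal.

```typescript
// Counts expect() calls per test() block across generated Playwright specs.
// "generated-specs" is a hypothetical output directory for illustration.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

function reportAssertionDensity(specDir: string): void {
  for (const file of readdirSync(specDir).filter((f) => f.endsWith(".spec.ts"))) {
    const source = readFileSync(join(specDir, file), "utf8");
    const tests = (source.match(/\btest\s*\(/g) ?? []).length;
    const expects = (source.match(/\bexpect\s*\(/g) ?? []).length;
    const perTest = tests > 0 ? (expects / tests).toFixed(1) : "n/a";
    console.log(`${file}: ${expects} expect() across ${tests} test() blocks (${perTest} per test)`);
  }
}

reportAssertionDensity("generated-specs");
```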

If you want to dig into our broader benchmark methodology, our eleven-model browser automation benchmark covers it in depth, and discusses why repeatability dominates peak benchmark scores for production agent loops. The pattern in that post (GPT-5.4 was the sweet spot of speed and cost, Opus 4.7 was the speed champion at a premium, models with high variance washed out) holds up here too, on a different slice of the workload.

So which model wins?

The interesting question is not "which one is the best", because the answer depends on what you are willing to spend per script and how much variance you can tolerate. The interesting question is which model gives you the right shape of cost curve for the kind of test plan you generate most. The numbers above are one slice of that curve. Run your own.


Test-Lab.ai is an AI-powered browser testing platform. We benchmark frontier models against real production plans on a regular cadence. If you want to see how AI agents handle browser testing on your own surface, start a free trial.
