
Which LLM Is Best for Browser Automation? 11 Frontier Models Benchmarked on a Real Production Plan

We ran 11 frontier models against the same hardened production plan through our internal benchmarking harness. Here are the pass rates, durations, and costs that matter for AI-driven browser automation.

AI testing, LLM benchmark, browser automation, GPT-5.4, Claude Opus 4.7, Gemini 3, E2E testing, comparison

At Test-Lab.ai we think a lot about which large language model makes a solid foundation for AI-driven browser automation. Prices shift, a new frontier model lands every other week, and the gap between "passes the smoke test" and "behaves well across hundreds of repeated real browser sessions" stays surprisingly wide. Public leaderboards publish plenty of numbers. Very few of them come from real browser sessions hitting real production surfaces.

So we built an internal benchmarking harness, pointed it at a hardened production test plan, and ran a sweep across eleven candidate models. This post is the writeup.

Why another LLM benchmark

The usual suspects (MMLU, SWE-Bench, GPQA, Arena Elo) all measure something useful about abstract language or code ability. They don't tell you much about the skill mix that matters for end-to-end browser testing: instruction following over dozens of steps, clean tool use, graceful recovery from surprise UI changes, and the discipline to clean up after itself.

They also skip the thing that actually gates production use: repeatability. A model that passes once in three attempts is worthless in a CI pipeline. A model that reliably finishes in 60 seconds at four cents a run is gold. We wanted a scorecard tuned to that gating function, not to benchmark aesthetics.

The setup

Every candidate ran the same test plan, against the same production surface, through the same infrastructure. The only thing that changed between runs was which model sat at the other end of the agent loop.

Key properties of the harness:

  • Strict serial execution. Concurrency would have polluted the shared resource that the plan touches, so we queued models one at a time and waited for each to release all state before the next one started.
  • Per-run cleanup verification. The plan creates a transient resource on the target surface. After every single run, an independent checker inspects that surface and flags any artifact the agent forgot to tidy up. Leaked state is both a correctness signal and a trust signal: a model that drifts on cleanup inside a controlled automation context is a model you can't deploy in a customer environment.
  • Instrumented bookkeeping. Every comparison run records the model id, the provider model name, the tier (light or smart), cleanup state, wall-clock duration, LLM cost, token counts per call, and a per-step trace of the tool calls the agent emitted. Queries like "which models take longer than 180 seconds on plans with a modal dialog" are cheap. Fishing through logs has no place in a benchmark pipeline.
  • One aggregator, one billing path. All candidates were routed through OpenRouter, so no vendor got an edge from preferred pricing, prompt caching, or routing tricks the others didn't have.
  • Default configuration only. No prompt engineering overrides, no tuned tool-call retry limits, no system message massaging. The point was to measure the behavior at the seam where the model plugs in, not what a few hours of prompt polish can extract from it.
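To make the bookkeeping concrete, here is roughly the shape of record the bullets above describe, plus the kind of one-line query it enables. This is an illustrative sketch in Python; `BenchmarkRun` and its field names are hypothetical, not our actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical run record; field names are illustrative, not the real schema.
@dataclass
class BenchmarkRun:
    model_id: str          # internal identifier
    provider_model: str    # name as routed through the aggregator
    tier: str              # "light" or "smart"
    passed: bool
    cleaned_up: bool       # did the post-run checker find zero leaked artifacts?
    duration_s: float      # wall-clock seconds for the whole run
    cost_usd: float        # summed LLM cost across every call in the run
    tool_calls: list[str] = field(default_factory=list)  # per-step trace

def slow_runs(runs: list[BenchmarkRun], threshold_s: float = 180.0) -> list[BenchmarkRun]:
    """Questions like 'which runs took longer than 180s' become one-liners."""
    return [r for r in runs if r.duration_s > threshold_s]
```

Once every run lands in a structure like this, the queries mentioned above are filters rather than log archaeology.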

One thing worth flagging before the numbers land: the benchmark harness is the tip of the iceberg. Our production orchestration layer does substantial work on top of whichever frontier model sits underneath: step classification, prompt scaffolding, retry and healing policies, vision-aware action synthesis, an internal library of heuristics that has built up over many thousands of production runs. Swapping to a "better" model on a benchmark doesn't translate one-to-one to user outcomes in the product, because the product is a stack, not a model. The numbers below should read as a model-quality signal in isolation, not as a proxy for how the product behaves end-to-end.

The field

Eleven candidates across two weight classes:

Light tier (fast, cheap, meant for simple steps):

  • Google Gemini 3 Flash
  • Google Gemini 3.1 Flash Lite
  • DeepSeek V3.2
  • Moonshot Kimi K2.6

Smart tier (larger, pricier, meant for multi-step reasoning):

  • OpenAI GPT-5.2
  • OpenAI GPT-5.4
  • OpenAI GPT-5.4 Mini
  • Google Gemini 3.1 Pro
  • Z.ai GLM-5.1
  • Anthropic Claude Sonnet 4.6
  • Anthropic Claude Opus 4.7

That gives coverage of every major frontier lab plus the strongest open-weight options currently in production-ready shape.

The plan

A multi-step flow on a settings surface. The agent has to navigate there, create a named resource, verify the creation, revoke it, and verify it's gone. Simple to describe. Surprisingly easy to get wrong when the agent has to cope with toasts, confirmation dialogs, optimistic UI updates, and asynchronous state that doesn't always land in the order the model expects.
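Spelled out as data, the flow is a handful of ordered steps with a symmetry requirement: everything created must later be revoked. The step vocabulary and resource names below are invented for illustration; the production plan format is internal:

```python
# Hypothetical plan encoding; action names and resource names are invented.
PLAN = [
    {"action": "navigate", "target": "settings"},
    {"action": "create",   "name": "bench-resource-1"},
    {"action": "assert",   "condition": "exists", "name": "bench-resource-1"},
    {"action": "revoke",   "name": "bench-resource-1"},
    {"action": "assert",   "condition": "absent", "name": "bench-resource-1"},
]

def creates_are_revoked(plan: list[dict]) -> bool:
    """Sanity check: every resource the plan creates is also revoked later."""
    created = [s["name"] for s in plan if s["action"] == "create"]
    revoked = {s["name"] for s in plan if s["action"] == "revoke"}
    return all(name in revoked for name in created)
```

The lint-style check is cheap insurance: a plan that can't pass it would leak state by construction, before any model even runs.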

Headline results

Pass rate, average wall-clock duration, and average LLM cost per run, same hardened plan for every candidate:

Tier  | Model                 | Pass rate | Avg duration | Avg cost
------|-----------------------|-----------|--------------|---------
Light | Gemini 3 Flash        | 100%      | 58s          | $0.06
Light | DeepSeek V3.2         | 100%      | 250s         | $0.04
Light | Gemini 3.1 Flash Lite | 50%       | 60s          | $0.03
Light | Kimi K2.6             | 100%      | 136s         | $0.07
Smart | GPT-5.4               | 100%      | 98s          | $0.17
Smart | Claude Opus 4.7       | 100%      | 88s          | $1.88
Smart | Gemini 3.1 Pro        | 100%      | 100s         | $0.24
Smart | Claude Sonnet 4.6     | 100%      | 121s         | $0.77
Smart | GPT-5.2               | 100%      | 270s         | $0.33
Smart | GPT-5.4 Mini          | 50%       | 86s          | $0.07
Smart | GLM-5.1               | 0%        | 697s         | $0.38

Some of those numbers are striking once you sit with them.

What jumps out

GPT-5.4 is the sweet spot of the current generation. Perfect pass rate, a 98-second average that only Claude Opus 4.7 beats in the smart tier, and less than one tenth of what Claude Opus 4.7 costs to complete the same plan. "Most intelligent" and "most useful" aren't the same axis, and this benchmark is a good illustration.

Claude Opus 4.7 is the speed champion. It finished faster on average than every other candidate in the smart tier, including GPT-5.4. The catch is price: roughly eleven times the cost of GPT-5.4 per run, so it only makes sense when latency beats every other consideration.

Reliability variance matters more than peak score. GPT-5.4 Mini in the smart tier and Gemini 3.1 Flash Lite in the light tier both showed inconsistent behavior, flipping between pass and fail across runs on the same prompt. A single successful run on either would have told a very different story. That's why we run extensive sweeps in the harness rather than single-shot comparisons.

GLM-5.1 washed out. It burned through ten-plus minutes on every attempt before giving up, with a repeatable failure mode. "Not ready for production browser automation today," regardless of how it looks on text-only leaderboards.

Cleanup integrity was a clean sweep. Every passing model, across every run, correctly tidied up the transient resource it created. Even the failing runs failed early rather than midway through, so leaked state never actually materialized. The only way you get a signal that specific is to check for it explicitly after every run.
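The checker itself can be small. A minimal sketch, assuming the harness tags every resource it creates with a run-specific prefix so leftovers are identifiable after the fact (that naming convention is an assumption for this example, not a description of our real checker):

```python
def find_leaks(surface_resources: list[str], run_prefix: str) -> list[str]:
    """Independently inspect the target surface after a run and flag
    any artifact the agent forgot to tidy up.

    Assumes everything the run creates carries a run-specific prefix.
    """
    return [name for name in surface_resources if name.startswith(run_prefix)]
```

Because it runs outside the agent loop and reads the surface directly, the agent can't "pass" by merely believing it cleaned up.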

Variance, the part most benchmarks skip

Run any frontier model repeatedly on the same plan and you'll often get wildly different durations, sometimes varying by a factor of three or more. Our GPT-5.2 runs ranged from 143 seconds to 506 seconds on the same plan, same prompt, same target. Same pass result, wildly different exploration paths through the page.

If you don't run enough iterations per model, you're measuring noise, not signal. Your sample size is doing more work than your model choice.

What this means for production

Nothing in this study changes the basic truth that a production agent is a stack, not a model. Prompt design, tool scaffolding, retry policies, step classification, vision strategies, guardrails, plus a large library of internal heuristics, all matter as much as the raw weights plugged in at the bottom. We're disciplined about that machinery precisely so the model is a swappable component, not a load-bearing dependency.

What the benchmark gives us is a clean, repeatable way to decide when a new model release is worth plumbing through the pipeline. Every new frontier launch gets pushed through the same harness, scored on the same plans, and compared against the incumbent. The comparison drives a data-backed integration decision, not a vibes-based one.
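The gating rule can be as blunt as: a challenger replaces the incumbent only if it matches the incumbent's pass rate and beats it on at least one of cost or speed. A hypothetical sketch of that decision, not our actual promotion policy:

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    pass_rate: float       # 0.0 to 1.0 over the full sweep
    avg_duration_s: float
    avg_cost_usd: float

def should_promote(challenger: Scorecard, incumbent: Scorecard) -> bool:
    """Blunt gating rule: never trade away reliability, then require
    an improvement on at least one of cost or latency."""
    if challenger.pass_rate < incumbent.pass_rate:
        return False
    return (challenger.avg_cost_usd < incumbent.avg_cost_usd
            or challenger.avg_duration_s < incumbent.avg_duration_s)
```

Reliability is a hard gate rather than one weighted term, which is the point of the "reliability variance matters more than peak score" observation above.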

Caveats worth reading before acting on the numbers

A benchmark is a snapshot. This one measures one plan, one target surface, each model in its default configuration and without the orchestration layer our product actually wraps around it. A different plan, a different target, different prompting, or a richer retry policy would reshuffle several of these numbers. Frontier pricing also moves quickly. If you're reading this more than a few weeks after publication, pull your own numbers before acting on ours.

Also, light-tier and smart-tier variants of the same family often behave very differently despite similar names. Pick the tier deliberately.

Where we take it from here

The harness is persistent and always on. Every comparison run is recorded, every leak flagged, every duration and cost queryable. When a new model drops tomorrow, we push it through the same workflow, look at the same scorecard, and decide. That's the discipline. The model is a moving part. Everything around it is the mechanism.

If you want us to benchmark a specific model against your plan, reach out. We'll add it to the next sweep.

Ready to try Test-Lab.ai?

Start running AI-powered tests on your application in minutes. No complex setup required.

Get Started Free