← Blog

Claude Fable 5 vs Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: How Big Is Anthropic's Mythos-Class Jump?

Anthropic just shipped Claude Fable 5, its first public Mythos-class model. We break down the benchmarks and pricing against Opus 4.8, GPT-5.5, and Gemini 3.1 Pro, the silent fallback nobody is talking about, and what it means for coding agents and AI browser testing.

Claude Fable 5Claude Opus 4.8AnthropicGPT-5.5Gemini 3.1 ProAI codingLLM benchmarkAI agentsbrowser automationcomparison
Claude Fable 5 vs Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: How Big Is Anthropic's Mythos-Class Jump?

Anthropic shipped Claude Fable 5 today, and for once the "new frontier model" label undersells it. Fable 5 is the first model from the Mythos class, a capability tier that sits above Opus, that anyone can actually buy. Until now, Mythos-class models lived behind Project Glasswing, Anthropic's restricted program for cyberdefenders, after the preview version rattled the security industry with its ability to find and exploit vulnerabilities.

The launch comes with a twist we haven't seen before: a sibling model called Claude Mythos 5 that is the exact same model with fewer safeguards, available only to vetted security teams. Same weights, two packages. The safeguards are the product boundary.

We spent launch day reading everything, pulling the benchmark tables apart, and watching the developer reaction roll in. Here's what stands out, where the numbers are genuinely new territory, and what it means if you run AI agents in production, including agents that test software.

What "Mythos-class" actually means

Anthropic's model ladder used to top out at Opus. Mythos is the tier above it. The first Mythos model shipped quietly in April 2026 to a small group of security partners, and Fable 5 is the general-release version of its successor: same underlying model as Mythos 5, wrapped in safety classifiers for general use. The name is a nod to the split itself, Fable from the Latin fabula, Mythos from the Greek. Two tellings of the same thing.

The safeguards are concrete, not marketing. Three classifiers watch every session:

  • Cybersecurity: offensive cyber work, exploitation, agentic hacking
  • Biology and chemistry: dual-use research and weapons-adjacent tasks
  • Distillation: attempts to extract the model's capabilities

Here's the interesting part. When a classifier triggers, Fable 5 doesn't refuse. Your query gets answered by Claude Opus 4.8 instead, Anthropic's next-most-capable model. Anthropic says this happens in under 5% of sessions on average. We'll come back to why that mechanism matters more for agent builders than the headline benchmarks do.

On the robustness side: an external bug bounty ran over 1,000 hours of red-teaming against the safeguards without finding a universal jailbreak. That's a stronger safety posture than any previous frontier launch shipped with, and it's clearly the price of releasing this capability tier at all.

The benchmarks: an actual gap, not a rounding error

Frontier launches in the past year have mostly traded 1-3 points back and forth on the usual leaderboards. Fable 5 doesn't. Here are the headline numbers, compiled from Anthropic's announcement and third-party roundups:

BenchmarkFable 5Opus 4.8GPT-5.5Gemini 3.1 Pro
SWE-bench Verified95.0%88.6%82.6%-
SWE-bench Pro80.3%69.2%58.6%54.2%
Terminal-Bench 2.188.0%82.7%83.4%70.7%
FrontierCode Diamond29.3%13.4%5.7%-
OSWorld-Verified (computer use)85.0%83.4%78.7%76.2%
GDPval-AA (knowledge work, Elo)1932189017691314

Three things jump out.

The coding gap is the largest single-launch jump in a long time. Eleven points over Opus 4.8 on SWE-bench Pro, and more than double its score on Cognition's FrontierCode Diamond set, which was built specifically to resist saturation. Cognition's CEO called Fable 5 the highest-scoring model on their eval, Cursor reported state of the art on CursorBench, and Stripe's early-access team reported a codebase-wide migration across a 50-million-line Ruby codebase done in a day, work they estimated at two-plus months for a team.

Terminal work flips back to Anthropic. GPT-5.5 had quietly edged ahead of Opus 4.8 on Terminal-Bench (83.4% vs 82.7%). Fable 5 takes the lead back at 88.0%. Worth noting: one analysis of the runs found that roughly a fifth of Fable 5's Terminal-Bench trials tripped the cybersecurity classifier mid-trajectory and finished on Opus 4.8. The measured score is partly a blend of two models, which means the ceiling of the underlying model is higher than the public number.

Computer use moved the least. OSWorld-Verified, the benchmark closest to real browser and desktop automation, improved by 1.6 points over Opus 4.8. That's a real gain, and 85% is the best public score yet. But it's nothing like the coding jump, and it tells you where this model's training effort went.

Beyond the tables, the long-horizon stories are the more interesting signal. Fable 5 completed Pokémon FireRed using vision alone with a minimal harness, where earlier models needed elaborate scaffolding. With persistent file-based memory it reached the final act of Slay the Spire three times more often than Opus 4.8. One research customer reported a physics task done in 36 hours that took GPT-5.5 four days, using a third of the reasoning tokens. Replit's CTO put it bluntly: apps that took a hundred prompts a year ago, it now one-shots.

Pricing: the premium tier costs premium money

ModelInput $/MOutput $/MNotes
Claude Fable 5$10$5090% prompt-caching discount, 1M context, no long-context surcharge
Claude Opus 4.8$5$25
GPT-5.5$5$30Surcharge above 272K context (2x input, 1.5x output)
Gemini 3.1 Pro$2$12$4 / $18 above 200K context

Double Opus 4.8 on paper. In practice the multiplier is bigger, because Fable 5 thinks whether you like it or not. Adaptive thinking is always on, the API rejects requests that try to disable it, and complex agentic sessions routinely burn 500K to 1M tokens. Simon Willison's first day with the model cost him $110.42, with $99.26 of it going to a single agent session. He also called it "something of a beast," so he didn't seem to regret it.

There's a counterweight to the sticker shock, and it's the same cost math we keep landing on with model choice for E2E testing: you don't pay for tokens, you pay for completed work. If Fable 5 finishes a hard task in one session that a cheaper model fails at twice, the expensive model is the cheap one. The physics example above is the extreme case, where fewer reasoning tokens made Fable 5 cheaper than GPT-5.5 in absolute terms despite a 2x rate card. That math only works on tasks hard enough to expose the gap. On routine work, Gemini 3.1 Pro at a fifth of the input price remains very hard to argue with.

If you're on a Claude subscription rather than the API: Fable 5 is included on Pro, Max, Team, and Enterprise plans until June 22, then moves to usage credits. The free window is a good moment to form your own opinion.

The API surface, briefly

For anyone wiring this into a product: the model ID is claude-fable-5, with a 1M token context window and 128K max output. The request shape follows Opus 4.8 with one extra constraint, thinking can't be disabled at all. Sampling parameters (temperature, top_p, top_k) are gone, replaced by prompting and an effort parameter that runs from low to max. Day-one availability is unusually broad: Claude API, Claude Code, GitHub Copilot, Amazon Bedrock, Google Cloud, Microsoft Foundry, and Databricks all lit up within hours of the announcement.

Fable 5 vs Opus 4.8: should you upgrade?

For teams already running Claude in production, the cross-vendor comparison is academic. The real question is whether to move up a tier inside Anthropic's own lineup.

The case for switching: the gap between adjacent Anthropic tiers has never looked like this. Eleven points on SWE-bench Pro, more than double the FrontierCode Diamond score, a 42-point Elo lead on knowledge work. And because Fable 5 keeps Opus 4.8's request shape, the migration is a model ID swap for most codebases, not a rewrite.

The case for staying: cost and predictability. Opus 4.8 is half the per-token price, it lets you disable thinking entirely (Fable 5 doesn't), and it never changes models on you mid-session. On browser and computer use specifically, a 1.6-point OSWorld gap doesn't justify a 2x rate card on its own.

There's also a quirk worth pricing in: every Fable 5 user is partly an Opus 4.8 user already, since classifier triggers silently serve the older model anyway.

Our read: Opus 4.8 stays the default for high-volume agent workloads, Fable 5 takes the work that's currently failing, and June 22, when the included-in-plans window closes, is the natural deadline for running that evaluation on your own tasks.

The day-one split: "a beast" vs "a Ferrari with a 30mph limiter"

The Hacker News reaction sorted itself into two camps fast, and both are right.

The first camp threw genuinely hard problems at it and came away impressed. Willison watched it port a MicroPython sandbox to full CPython on WebAssembly and package the result as a distributable wheel, then estimated another session produced "several days' worth of work" on his own library, with code, tests, and docs. Multiple reports describe noticeably better frontend output, more intentional design, less of the recognizable AI-built look.

The second camp hit the safety classifiers. The "Ferrari with a 30mph limiter" line started circulating within hours, mostly from security researchers, cryptography people, and reverse engineers whose day jobs look exactly like what the cyber classifier is built to catch. Anthropic's "under 5% of sessions" figure is an average across all users. The complaint from the field is that the 5% concentrates precisely in the work professional developers and security teams do. If your workload lives near that boundary, run a pilot before you commit.

The silent fallback is the detail agent builders should read twice

Most coverage treats the Opus 4.8 fallback as a safety footnote. We think it's the most operationally significant design decision in the launch.

In a chat session, falling back to Opus 4.8 instead of refusing is good UX. The user gets an answer, slightly less brilliant, no dead end. Inside an automated pipeline it's something else: you specified a model, and on some fraction of requests you silently get a different one, possibly mid-trajectory, with no error to catch. The Terminal-Bench analysis above showed this isn't theoretical, a fifth of trials on a benign coding benchmark degraded mid-run.

For agent systems this creates a new failure shape. Not a crash, not a refusal, just a quiet capability drop partway through a long task. Your retries won't fire. Your logs will show a completed run. The output will simply be a bit worse, and you won't know why unless you're checking which model actually served each response.

QA automation sits closer to this boundary than most workloads. Plenty of routine test plans read like the opening moves of a security assessment: probe the login flow with bad credentials, verify rate limiting, test permission boundaries between roles, confirm an expired session can't reach protected pages. A classifier tuned to catch "agentic hacking" is going to see some of that. We expect most browser testing to sail through, but anyone running security-adjacent test suites on Fable 5 should monitor for it explicitly rather than assume.

What Fable 5 means for AI browser testing

Our angle on every model launch is the same question: does this change what an agent can do in a real browser session against a real production surface?

The honest read from the public numbers: coding agents got a generational jump, browser agents got an increment. The 1.6-point OSWorld gain matters, and the long-horizon improvements (staying coherent across millions of tokens, file-based memory, stronger vision) target exactly the failure mode that kills complex E2E runs: the model that loses the plot at step 22 of a 30-step flow. That's the bottleneck we documented when we compared coding agents as E2E testers, and Fable 5 is the most direct attempt yet to train it away.

But at $10/$50 with thinking permanently on, the deployment shape matters. Running Fable 5 on every step of every test is the wrong way to spend it. Where a Mythos-class model plausibly earns its rate card in a testing stack: generating the test plan from a vague spec, healing the failures that cheaper models can't diagnose, and finishing the long multi-page journeys that mid-tier models abandon halfway. The steps in between (clicking a button, filling a field, verifying a toast) don't need a model that can refactor 50 million lines of Ruby.

When we benchmarked 11 frontier models on a hardened production plan in April, the takeaway was that pass rate, duration, and cost per completed run tell a different story than leaderboard scores, and that variance across repeated runs matters more than any single number. Fable 5 goes through that same harness next, same plan, same instrumentation, scored against GPT-5.4, Opus 4.7, and Gemini 3.1 Pro as the incumbents. The silent-fallback behavior makes the repeatability sweep more interesting than usual, since some fraction of runs may quietly execute on a different model. We'll publish the numbers when the sweep is done.

The principle from that benchmark post still holds, and this launch reinforces it: a production agent is a stack, not a model. Step classification, prompt scaffolding, retry and healing policies, evidence capture, all of it carries more of the user-visible outcome than the weights underneath. A model this capable raises the ceiling. It doesn't replace the machinery.

The bottom line

Fable 5 is the first time the tier above Opus has been available to everyone, and on hard, long-horizon coding work the gap is real, the largest we've seen from a single launch in over a year. If your problems are hard enough that frontier models currently fail at them, the premium is worth testing this week while it's included in paid plans.

If your work brushes security, cryptography, or reverse engineering, pilot it carefully and watch for the classifier. If you run agents in production, instrument which model actually answers each request, because "Fable 5" is now sometimes Opus 4.8 wearing a different name.

And if your interest is browser automation and AI testing: expect incremental gains today, watch the cost per completed test rather than the rate card, and judge it on repeated runs against your own surfaces, not on launch-day leaderboards. That's what we're doing.


Test-Lab is an AI-powered browser testing platform. We benchmark every frontier model release against real production test plans so you don't have to. Describe what to test in plain English, run it free, and let the harness pick the right model for each step.

Ready to try Test-Lab.ai?

Start running AI-powered tests on your application in minutes. No complex setup required.

Get Started Free
Claude Fable 5 vs Opus 4.8, GPT-5.5, Gemini 3.1 | Test-Lab.ai