AI Models
Claude Opus 4.8: The New King of AI Models
Opus 4.8 just took the crown, and unlike most launch-day hype, the benchmarks actually back it up. It wins six of seven head-to-head evals against GPT-5.5 and Gemini 3.1 Pro, ships a wild new Dynamic Workflows feature, and somehow costs the same as before.
Anthropic does not usually move this fast. Opus 4.7 was barely 41 days old when Opus 4.8 dropped on May 28th, and that cadence alone tells you something: the lab is in a sprint, not a stroll. The IPO race with OpenAI is heating up, and the models are getting better fast enough that “wait for the next one” is now a genuinely bad strategy.
I have spent the launch window reading the model card, running my own coding tasks through it, and lining up the numbers against GPT-5.5 and Gemini 3.1 Pro. My honest take? This is the best general purpose model you can use right now. Not by a landslide on every axis, but by enough, and in the places that matter most for agents and real work.
The best general-purpose model you can use right now — not by a landslide on every axis, but by enough, and in the places that matter most.
What Actually Shipped
Anthropic describes Opus 4.8 as having “sharper judgement, more honesty about its progress, and the ability to work independently for longer than its predecessors.” Strip away the press release voice and that maps to three concrete things you will feel: it makes better calls on ambiguous tasks, it stops pretending it finished something when it did not, and it can run long agentic loops without falling apart halfway through.
It is available everywhere today, with the API identifier claude-opus-4-8. Anthropic also signaled that its next generation “Mythos class” models are coming to customers in the next few weeks, so 4.8 is both a real upgrade and a warm up act. None of that changes the fact that right now, today, it is the model to beat.
The Benchmarks: Where It Actually Wins
Anyone can say their model is the best. The interesting question is whether the numbers survive contact with the competition. Here is how Opus 4.8 stacks up head to head against its own predecessor and the two models everyone actually compares it to.
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 69.2% | 64.3% | 58.6% | 54.2% |
| OSWorld-Verified (computer use) | 83.4% | 82.8% | 78.7% | 76.2% |
| GDPval-AA (knowledge work) | 1890 | n/a | 1769 | 1314 |
| Terminal-Bench 2.1 (terminal coding) | 74.6% | 66.1% | 78.2% | 70.3% |
On agentic coding, the SWE-Bench Pro jump from 64.3% to 69.2% is the headline. That is a real generational step in a single 41 day cycle, and it puts daylight, more than ten points, between Opus 4.8 and both GPT-5.5 and Gemini 3.1 Pro. For anyone running coding agents, that gap compounds across every multi step task.
It also leads on reasoning. On Humanity's Last Exam, Opus 4.8 scores 49.8% with no tools and 57.9% with tools, ahead of all three rivals. On the GDPval-AA knowledge work eval it posts 1890, a clean lead over GPT-5.5 (1769) and a blowout against Gemini 3.1 Pro (1314). And on the agent specific evals it pulls some genuinely impressive firsts: 84% on Online-Mind2Web for browser agents, the first model to break 10% on the Legal Agent Benchmark's all pass standard, and the only model to complete every case on the Super-Agent benchmark end to end, beating GPT-5.5 at equivalent cost.
It is not a clean sweep, and I am not going to pretend it is. On Terminal-Bench 2.1, GPT-5.5 still leads at 78.2% versus Opus 4.8's 74.6%. Across the seven benchmarks people are comparing, Opus 4.8 wins six — “best overall,” not “best at literally everything.”
The Features That Actually Matter
A higher benchmark score is nice. New capabilities you can actually build on are better. Opus 4.8 shipped with three of those, and they change how you use the model, not just how it scores.
Dynamic Workflows
This is the one that made me sit up. Shipping as a research preview in Claude Code, Dynamic Workflows let Claude spin up and coordinate many subagents in parallel, hundreds of them at once, on a single task. Think codebase migrations across hundreds of thousands of lines, where one orchestrator fans the work out, the subagents grind in parallel, and the results get stitched back together.
Why it matters — The difference between an agent that does one thing at a time and a fleet that attacks a huge task from every angle simultaneously.
Effort Control
You now get a dial for how hard the model thinks. Crank it up for deeper reasoning on a gnarly problem; turn it down when you want speed and want to be gentle on your rate limits. Opus 4.8 defaults to high effort, which Anthropic judges to be the best overall balance. The clever part: on coding tasks, that high effort default spends a similar number of tokens as Opus 4.7's default, but delivers better results. You get more, not for more.
Why it matters — One model, tuned per task, instead of juggling a “smart but slow” and a “fast but dumb” model in your stack.
Fast Mode
Fast Mode runs the model at roughly 2.5x the speed, and crucially it is now three times cheaper than fast inference was on previous models. Low latency used to come with a brutal premium. That premium just collapsed, which makes real time, interactive agent experiences a lot more practical to ship.
Why it matters — Snappy UX and acceptable economics used to be a trade-off. Less so now.
The Pricing Twist: Same Price, Better Model
Here is the part that quietly matters most. Normally a better model means a bigger bill. Not this time. Standard pricing is identical to Opus 4.7.
Standard
$5 / M in
$25 / M out
Unchanged from Opus 4.7
Fast Mode
$10 / M in
$50 / M out
2.5x speed, 3x cheaper than before
Pair that with the Effort Control point from earlier, where the high effort default spends about the same tokens as 4.7 did, and the takeaway is simple: you get a meaningfully better model for the same money. That is the rarest kind of upgrade in this space.
The Safety Gains Nobody Is Talking About
Benchmarks get the headlines, but the alignment numbers on this release are arguably the more important story, especially if you are putting this model anywhere near production code.
The “more honesty about its progress” claim ties directly into this. A model that admits when it is stuck, instead of confidently handing you broken work, is dramatically more useful in long agentic loops, the exact place where overconfidence quietly wrecks everything downstream.
Should You Actually Switch?
Short answer: yes, and there is barely an argument against it. Same API, same price, drop in the claude-opus-4-8 identifier, and you immediately get better coding, better reasoning, better knowledge work, and fewer missed bugs. The migration cost is close to zero.
The one place I would pause: if your core workload is heavy raw terminal coding, GPT-5.5 still edges it on Terminal-Bench 2.1, so run your own eval before you commit. And remember Anthropic flagged Mythos class models arriving in weeks, so if you are about to make a big long term bet, factor that in. For everyone else building agents, shipping features, or doing real knowledge work today, Opus 4.8 is the model I would reach for.
The pace is the real story here. A meaningful upgrade every six weeks, for the same price, with the safety numbers moving in the right direction at the same time. That is not normal, and it is exactly why “best model” is now a title that changes hands almost monthly. Right now, it belongs to Opus 4.8.