GPT-5.5, One Week In: What Developers Are Actually Saying

A week after OpenAI shipped GPT-5.5, the gap between 'massive leap' and 'still hallucinating' tells you almost everything about who this model is for. A field report on the wins, the warts, and how it stacks against Opus 4.7.

[Image: Code on a developer's screen at night]

OpenAI shipped GPT-5.5 on April 23. By the following Monday, the API had been live for three days, the price had doubled, and my Twitter feed was a near-perfect split between people calling it a MASSIVE leap forward (Matt Shumer's words) and people quietly tweeting screenshots of it confidently inventing a function that doesn't exist.

So which is it? After a week of watching teams put it through real work, the honest answer is: both, and the gap between those two camps tells you almost everything you need to know about who this model is for.

What actually shipped

GPT-5.5 is the first ground-up retrain since GPT-4.5. Everything between those releases was a refinement on the same architectural base. That detail matters because it explains why this drop feels different from the 5.1, 5.2, 5.3, 5.4 cadence we've been on for the past year. It also explains the price.

The headline numbers, lifted straight from OpenAI's launch:

  • 1,050,000-token context window, 128k max output
  • $5 per million input tokens, $30 per million output (double GPT-5.4)
  • Available in the API since April 24, and in ChatGPT on Plus, Pro, Business, and Enterprise plans
  • 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, score of 60 on the Artificial Analysis Intelligence Index (a new high)
  • A Pro variant at $30/$180 for the heaviest agent workloads

The 1M context window is the spec line everyone fixated on. It's nice to have, but most teams I've talked to aren't using it the way the marketing implies. They're not stuffing it with whole monorepos. They're using it as breathing room so RAG pipelines stop choking on edge cases.

What developers like

The thing people keep mentioning in the same breath as 'actually impressed' is that GPT-5.5 talks less. Reviewers at CodeRabbit described it as quicker, leaner, and more direct, with shorter responses and a stronger bias toward small, working changes instead of sprawling refactors. That sounds like a marketing line until you see it back-to-back with Opus 4.7 on the same task. Opus will hand you a thoughtful 800-line PR. GPT-5.5 will hand you a 90-line patch and a sentence explaining why that's enough.

This shows up in the token math, too. On identical coding prompts, GPT-5.5 produces roughly 72% fewer output tokens than Claude Opus 4.7. Once you account for that, the doubled per-token price stops looking like such a hike. OpenAI claims the effective cost increase per Codex task is closer to 20% because the model uses about 40% fewer output tokens than 5.4. That tracks with what teams I've spoken to are seeing in their bills, give or take.
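The arithmetic behind that claim is easy to check on the output side: doubling the per-token price while emitting ~40% fewer tokens works out to a 2 × 0.6 = 1.2x multiplier. A minimal sketch, with per-task token counts that are purely illustrative and the 5.4 output price assumed to be half of 5.5's (input tokens, where the doubling isn't offset, are left out):

```python
# Sanity check on the "closer to 20%" effective-cost claim, output side only.
# Per-task token counts are illustrative assumptions, not measurements.

GPT_54_OUT_PRICE = 15.0   # $/1M output tokens (assumed: half of GPT-5.5's $30)
GPT_55_OUT_PRICE = 30.0   # $/1M output tokens, per the launch pricing

tokens_54 = 10_000              # hypothetical output for one Codex task on 5.4
tokens_55 = tokens_54 * 0.6     # ~40% fewer output tokens on 5.5

cost_54 = tokens_54 / 1e6 * GPT_54_OUT_PRICE
cost_55 = tokens_55 / 1e6 * GPT_55_OUT_PRICE

print(f"5.4 output cost per task: ${cost_54:.3f}")   # $0.150
print(f"5.5 output cost per task: ${cost_55:.3f}")   # $0.180
print(f"multiplier: {cost_55 / cost_54:.2f}x")       # 1.20x, i.e. ~20% more
```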

Where it really separates from the previous generation is long-running agent work. There's a now-famous example floating around of GPT-5.3-Codex running unattended for 25 hours, burning ~13M tokens, and producing ~30k lines of code while staying on spec and recovering from its own failures. GPT-5.5 keeps that thread and makes it sturdier. The Terminal-Bench 2.0 jump from 75.1% (5.4) to 82.7% isn't a vibes number; it's the difference between 'this agent finishes the workflow' and 'this agent gets 80% there and silently gives up'.

A few specific places it's clearly stronger:

  • Multi-step terminal workflows. Building, testing, debugging, retrying. Holding state across the loop without losing the plot.
  • TypeScript and Python. Handles codebase-aware tasks better than 5.4 did, particularly when the change spans 3-6 files.
  • Security review. Missed-vulnerability rate dropped from 40% in GPT-5 to 10%. Real number, real difference.
  • Drafts and prose. This one surprised me. Several developers said they're moving their writing workflows from Claude back to OpenAI for the first time in a year. The structure GPT-5.5 puts on a longer doc is easier to revise than what Opus 4.7 produces.

[Image: Developer working on agentic systems]

What's annoying, or worse

Now the other side of the ledger.

GPT-5.5 hallucinates. A lot. On the AA Omniscience benchmark its hallucination rate clocked in at 86%, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro Preview. That gap is not a rounding error, and it lines up with what people are saying out loud: this model is more confident when it's wrong than its peers are. On the BullshitBench pushback test (does the model push back when you feed it nonsense?) it scored ~45%, about the same as 5.4. The Pro variant did worse at ~35%.

Translated into something useful: if the failure mode you can't tolerate is wrong but stated like a fact, this is the wrong model for that workload. Legal research, citations, anything customer-facing where a confident error is a real problem — keep verification in the loop, or pick a different model.
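What 'keep verification in the loop' can look like in practice: a second pass that audits the draft against source material you supply, rather than trusting the first answer. A rough sketch using the OpenAI Python client; the model identifier and the prompt wording are assumptions for illustration, not anything from OpenAI's docs.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_verification(question: str, sources: str) -> dict:
    """Draft an answer, then run a second pass that flags claims the sources don't support."""
    draft = client.chat.completions.create(
        model="gpt-5.5",  # assumed identifier from the launch; check your own model list
        messages=[
            {"role": "system", "content": "Answer using only the provided sources."},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content

    audit = client.chat.completions.create(
        model="gpt-5.5",
        messages=[
            {"role": "system", "content": "You are a fact checker. List every claim in the "
                                          "answer that is not directly supported by the sources."},
            {"role": "user", "content": f"Sources:\n{sources}\n\nAnswer to audit:\n{draft}"},
        ],
    ).choices[0].message.content

    return {"answer": draft, "unsupported_claims": audit}
```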

The other complaint is structural. Matt Shumer called this a massive leap but it probably won't matter for 99% of users. That's a useful frame. GPT-5.5's gains live in long-horizon, tool-using, multi-step work. If your product is a chat box that answers questions in two turns, you will not feel the upgrade. You will feel the bill.

Also worth flagging:

  • ChatGPT Plus users were initially capped at 200 GPT-5.5 messages per week. Several Reddit threads called this a real downgrade in effective usage even with smarter responses per call.
  • Claude Code's prompt cache TTL changes earlier this month already raised the cost of long sessions. Layering doubled API pricing on top of that is making some teams rethink their default model in CI pipelines.
  • The Pro variant at $30/$180 is a power tool. Treat it like one. Running it as your default will shred a budget.

How it stacks against Opus 4.7

This is the comparison every team is running this week.

Task                             GPT-5.5                  Claude Opus 4.7
Terminal-Bench 2.0               82.7%                    69.4%
SWE-Bench Pro                    58.6%                    64.3%
MCP Atlas                        75.3%                    79.1%
AA Omniscience hallucination     86%                      36%
Output tokens per coding task    ~28% of Opus baseline    100% (baseline)

Read the table this way: Opus 4.7 still wins on the benchmarks closest to 'can you actually fix the GitHub issue end-to-end'. It also wins on tool-heavy, MCP-heavy work — which is a big deal for anyone building a Cursor-style agent on top. GPT-5.5 wins on terminal automation, raw intelligence scores, and (importantly) on token efficiency. The cheaper-per-token model isn't always the cheaper-to-run model.

If you're picking one as a default for new feature work and command-line agents, GPT-5.5. If you're maintaining a large codebase with PRs that touch a dozen files and you care about instruction-following consistency, Opus 4.7. Most teams I've talked to are running both with a router.
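In practice, 'running both with a router' usually means a small dispatch table keyed on the kind of work, not per-request heuristics. A minimal sketch; the model identifiers and routing rules here are illustrative assumptions, not recommendations from either vendor.

```python
# Minimal task-type router. Model identifiers and routes are illustrative assumptions.

ROUTES = {
    "terminal_agent": "gpt-5.5",          # long-horizon terminal/agent workflows
    "large_refactor": "claude-opus-4.7",  # multi-file PRs where instruction-following matters
    "chat":           "gpt-5.4",          # short chat traffic: don't pay for the upgrade
}

def pick_model(task_type: str) -> str:
    """Return the model for a task type, falling back to the cheap default."""
    return ROUTES.get(task_type, ROUTES["chat"])

# Example: a CI coding-agent job vs. an FAQ query.
assert pick_model("terminal_agent") == "gpt-5.5"
assert pick_model("faq") == "gpt-5.4"
```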

When to reach for it

A practical decision tree, based on what's working in the wild:

  • Building a coding agent that runs unattended for hours? GPT-5.5 or 5.5 Pro. The long-horizon recovery behavior is the differentiator.
  • Writing or refactoring a feature inside an existing codebase? Try both on a representative task. Opus is winning these on average; GPT-5.5 is winning them on cost-per-completion.
  • Running a chatbot, an FAQ assistant, a content rewriter? Stay on 5.4 or a smaller model. You won't feel the upgrade and you'll pay double.
  • Anything where a wrong-but-confident answer is a real liability? Keep a verification step. Or pick a different model. The hallucination data is too consistent to wave away.
  • Doing security work, vuln triage, or red-team prep? GPT-5.5 is genuinely better here, and the system card backs that up.

How to actually prompt it

OpenAI's own prompting guide buries the lede on this, so worth surfacing: GPT-5.5 wants the outcome, not the recipe. Tell it the goal, the constraints, the success criteria, the allowed side effects, the shape of the output you want. Don't walk it through the steps unless the exact path is part of the requirement. It will figure out a path. That's the whole point of the thing.
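To make the contrast concrete, here's a hypothetical pair of prompts for the same task. The wording is made up for illustration, not taken from OpenAI's guide.

```python
# Hypothetical example of the outcome-first prompting style described above.
# The task and wording are invented for illustration.

RECIPE_STYLE = """\
Step 1: open src/billing/invoice.py. Step 2: add a retry loop around the API call.
Step 3: add a unit test. Step 4: run the tests and show me the output."""

OUTCOME_STYLE = """\
Goal: intermittent 502s from the payments provider should no longer fail invoice creation.
Constraints: only touch src/billing/; no new dependencies; keep the public API unchanged.
Success criteria: existing tests pass, plus a new test covering the retry path.
Allowed side effects: adding a small helper module under src/billing/ is fine.
Output: a unified diff and a one-paragraph summary of the change."""
```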

Tool-specific guidance goes in the tool descriptions, not the system prompt. When-to-use, required inputs, retry safety, common failure modes — all of that belongs next to the tool definition. The system prompt should hold cross-tool policy and the agent's general operating principles.
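In the Chat Completions function-calling format, that means the when-to-use and retry guidance rides in the tool's own description field, while the system prompt stays cross-tool. The tool below ('run_tests') is hypothetical, sketched only to show the split.

```python
# Tool-specific guidance lives in the tool's description, not the system prompt.
# The tool itself ("run_tests") is hypothetical, made up for illustration.

run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": (
            "Run the project's test suite. Use after any code change, before reporting success. "
            "Safe to retry: re-running has no side effects. Requires 'path'; pass '.' for the whole repo. "
            "Common failure: missing virtualenv; if setup fails, report it instead of retrying."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory or test file to run."},
            },
            "required": ["path"],
        },
    },
}

# The system prompt stays cross-tool: operating principles and policy, not tool trivia.
SYSTEM_PROMPT = "You are a coding agent. Prefer small, verified changes. Never push to main."
```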

Both are small habits, but they have an outsized effect on how the model performs compared with your existing 5.4 prompts. If you ported your prompts over directly and the output feels worse, this is probably why.

So, is it worth upgrading?

Depends on what you're building.

If you're shipping autonomous coding agents, the answer is mostly yes, with the caveat that you should A/B against Opus 4.7 on your own evals before you commit. If you're building consumer chat features, no. The upgrade isn't going to show up where your users will see it, and the price will. If you're somewhere in the middle (most product teams), the most useful thing you can do this week is set up a router that sends terminal/agent traffic to GPT-5.5, large-PR refactors to Opus 4.7, and chat traffic to whatever was already working. Then look at your bill in two weeks and recalibrate.

The bigger picture, the part that will matter six months from now: this is the first model release where agentic stopped feeling like a marketing word. The 25-hour autonomous Codex run, the Terminal-Bench jump, the verification-and-repair behavior. Those aren't demos. They're workflows people are running in production right now. Whatever you think of OpenAI's pricing strategy, that line moved this week.

It just moved with an 86% hallucination rate. Plan accordingly.

