For nearly a year, the question "which frontier model for production coding work" had a default answer (Claude Sonnet 4.5 for most work, Claude Opus 4.6 for the hardest tasks) and a contender (OpenAI's o1/o3 family for math-heavy reasoning). That has been substantially reset twice in the last nine months.
OpenAI shipped GPT-5 on August 7, 2025 as its first unified reasoning and chat model, replacing the GPT-4o / o1 / o3 family with a single model that dynamically routes between fast and deep thinking (OpenAI, Introducing GPT-5). Anthropic followed with Claude Opus 4.7 on April 16, 2026, shipping a 1M-token context window, a new top-tier xhigh effort level, and an SWE-bench Pro score of 64.3% (Anthropic, What's new in Claude Opus 4.7; llm-stats.com, Opus 4.7 benchmarks).
Both are production-ready for freelance coding work in May 2026. Both have real failure modes. The honest answer to "which one for client work" depends on the shape of the engagement, not the headline benchmark numbers.
What each one is actually good at
GPT-5 strengths. OpenAI's published numbers and independent benchmarks consistently show GPT-5 at or near the top on:
- General-purpose reasoning, including math-heavy and logic-heavy tasks.
- Long-context retrieval and "needle in a haystack" tests.
- Multi-modal work including vision and audio.
- Cost-per-token across the API tier (GPT-5 is materially cheaper than Opus 4.7 for comparable output quality on most tasks).
The unified model architecture means GPT-5 picks its own reasoning depth dynamically. A simple question gets a fast answer; a complex one triggers extended thinking automatically. For freelance work where the developer is not pre-classifying every prompt, this is a meaningful ergonomic win.
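In practice the routing is invisible at the API level: you send one request and the model decides. A minimal sketch against OpenAI's Responses API, with the effort override shown for the cases where you want to force deep thinking (treat the exact effort values as an assumption if your SDK version postdates the launch):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Default call: the unified model picks its own reasoning depth.
resp = client.responses.create(
    model="gpt-5",
    input="Refactor this handler to remove the N+1 query pattern: ...",
)
print(resp.output_text)

# Pinning effort explicitly, for questions where the router's fast
# path is the wrong call (see the weaknesses section below).
resp = client.responses.create(
    model="gpt-5",
    input="Should this service split reads and writes across two stores?",
    reasoning={"effort": "high"},
)
print(resp.output_text)
```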
Claude Opus 4.7 strengths. Anthropic's benchmarks and independent agentic-coding tests show Opus 4.7 at the top on:
- Long-horizon agentic loops that run across many turns and many files (SWE-bench Pro 64.3%).
- Code generation with complex multi-file edits and refactors.
- Tool use and Computer Use precision (the pixel-pointing improvements in 4.7 are real).
- The 1M-token context window, which fits most mid-size codebases in a single prompt at the same price as the 200k window on Opus 4.6.
The new top-tier xhigh effort level, positioned above the previous high tier, is the default for Claude Code as of the 4.7 release, and it is genuinely better at making the right architectural call across a 100-file change than any previous Claude or any current GPT.
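A minimal sketch of opting into the new tier over the Messages API. The model id and the effort field are assumptions inferred from the release notes above; they are passed through extra_body precisely so the SDK does not need to know the field, and you should check the platform docs for the real spelling.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-7",  # hypothetical id for Opus 4.7
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Plan the migration of this 100-file service to async IO.",
    }],
    # Assumed field name and value for the new top effort tier; extra_body
    # forwards it to the API without requiring SDK support.
    extra_body={"effort": "xhigh"},
)
print(resp.content[0].text)
```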
What each one is actually weak at
GPT-5 weaknesses for freelance coding.
- On agentic loops over 10+ turns, GPT-5 is weaker than Opus 4.7. The unified model occasionally drops context in long coding sessions where Opus 4.7 holds it.
- Tool-use reliability in production agentic systems still lags Claude in independent observability data from Cursor and Cline users.
- The "fast-or-deep router" inside GPT-5 occasionally picks fast when deep is the right call, producing confident-sounding wrong answers on architecture decisions.
Opus 4.7 weaknesses.
- Materially more expensive per token. At $5/$25 per million input/output tokens (Anthropic platform docs), Opus 4.7 costs roughly 5-8× GPT-5 for comparable workloads when GPT-5 routes to fast mode.
- The new tokenizer in Opus 4.7 produces 1.0× to 1.35× as many tokens per character as Opus 4.6 (Anthropic). Same prompt, up to 35% more billable tokens; see the sketch after this list.
- Breaking API changes in Opus 4.7 (removed sampling parameters, removed extended-thinking budget controls) require migration work for teams maintaining client integrations.
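The tokenizer change is easy to underestimate because it compounds with the higher rate. A back-of-envelope sketch at the $5/M input rate quoted above; the 30k-token prompt is an arbitrary example, and you should measure the real expansion factor on your own prompts:

```python
# Effect of the 4.7 tokenizer on a single large prompt, at the $5/M
# input rate quoted above. The 1.0x-1.35x range is Anthropic's figure.
RATE_IN = 5.00 / 1_000_000  # dollars per input token

def prompt_cost(tokens_under_46: int, expansion: float) -> float:
    """Input cost of a prompt that tokenized to `tokens_under_46` on Opus 4.6."""
    return tokens_under_46 * expansion * RATE_IN

same = prompt_cost(30_000, 1.00)   # tokenizer-neutral: $0.15 per call
worst = prompt_cost(30_000, 1.35)  # full expansion:    $0.2025 per call
print(f"per-call input cost: ${same:.4f} -> ${worst:.4f} (+{worst / same - 1:.0%})")
```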
What the independent benchmarks actually show
The two models trade leadership depending on which benchmark you weight.
- SWE-bench Pro: Opus 4.7 leads at 64.3%; independent comparisons put GPT-5 at 56-61% depending on the harness (Anthropic, What's new in Claude Opus 4.7; llm-stats.com).
- Aider polyglot benchmark: GPT-5 and Opus 4.7 are within 2-3 percentage points of each other depending on the test cycle.
- GPQA Diamond (science PhD questions): GPT-5 leads on raw reasoning depth.
- MATH-500 and AIME: GPT-5's math-heavy reasoning is consistently stronger.
- Long-context retrieval at 500k+ tokens: Opus 4.7 holds better, partly because GPT-5's effective context is shorter despite the published ceiling.
For a freelance developer the takeaway is uncomfortable but honest: neither model wins everywhere. Trying to pick one for all client work is the wrong frame.
The shape-of-engagement decision matrix
Five engagement shapes and which model wins for each (a routing sketch follows the list):
1. New project scaffolding and component generation. GPT-5 wins on cost and speed. The work is short-horizon, single-file, low-stakes architecturally. Opus 4.7 is overkill at the price.
2. Complex multi-file refactors on existing client codebases. Opus 4.7 wins. The 1M-token context lets the agent see the whole codebase; SWE-bench Pro leadership translates to real productivity on long agentic loops. Cost is justified by the work being unambiguously hard.
3. Math-heavy or scientific computing work. GPT-5 wins. The reasoning strength on AIME and GPQA-style problems translates directly. For a freelance ML or quant-finance engineer, GPT-5 is the right default.
4. Browser automation and Computer Use work. Opus 4.7 wins. The pixel-pointing precision and vision-resolution improvements (2576px, 3.75MP at 1:1 mapping) are genuinely ahead of GPT-5's CUA-style capabilities (Anthropic).
5. Customer-facing chat interfaces. GPT-5 wins on cost and on the consumer-friendly tone defaults. Opus 4.7's outputs are sometimes more formal and verbose than is right for a chat product.
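If you want the matrix as executable policy rather than prose, here is a minimal sketch. The shape names and model ids are this post's shorthand, not anyone's API:

```python
# Illustrative routing table for the five engagement shapes above.
ENGAGEMENT_ROUTES = {
    "scaffolding":         "gpt-5",            # 1: cost and speed
    "multi_file_refactor": "claude-opus-4-7",  # 2: 1M context, agentic loops
    "math_scientific":     "gpt-5",            # 3: AIME/GPQA-style strength
    "computer_use":        "claude-opus-4-7",  # 4: pixel-pointing precision
    "customer_chat":       "gpt-5",            # 5: cost and tone defaults
}

def pick_model(shape: str) -> str:
    if shape not in ENGAGEMENT_ROUTES:
        raise ValueError(f"unknown engagement shape: {shape!r}")
    return ENGAGEMENT_ROUTES[shape]

assert pick_model("multi_file_refactor") == "claude-opus-4-7"
```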
The pricing math for freelance work
A working comparison for a typical freelance engagement: a one-month build that runs ~50M input tokens and ~10M output tokens of model usage.
- GPT-5 (standard pricing): roughly $300-450 in API cost depending on reasoning-routing distribution.
- Claude Opus 4.7: roughly $250 input + $250 output = $500 in API cost at the $5/$25 rates above, plus up to 35% more if the new tokenizer expands your prompts (worked out in the sketch below).
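Reproducing that math from the quoted rates; GPT-5's range is taken as given because it depends on the fast/deep routing mix:

```python
# Opus 4.7 engagement cost from the list rates quoted above
# ($5/$25 per million input/output tokens).
M = 1_000_000
tokens_in, tokens_out = 50 * M, 10 * M

opus = tokens_in / M * 5.00 + tokens_out / M * 25.00
print(f"Opus 4.7 list price: ${opus:,.0f}")  # $500

# Worst case if the tokenizer expands prompts by the full 1.35x
# (applied to input only; output expansion depends on what you generate).
opus_worst = tokens_in / M * 5.00 * 1.35 + tokens_out / M * 25.00
print(f"with full tokenizer expansion: ${opus_worst:,.2f}")  # $587.50
```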
For a $25,000 freelance build, the AI cost differential is rounding error. For a $5,000 build, total AI spend runs roughly 6-12% of the project budget. Plan accordingly.
The pricing tier with the most leverage in 2026 is actually the *non-frontier* model from each provider: Claude Sonnet 4.5 ($3/$15 per million tokens) and GPT-5 mini (cheaper than full GPT-5). Most freelance work — including substantial production coding — should run on the cheaper tier with the frontier model reserved for the genuinely hard tasks.
The harness matters more than the model
A practical observation from senior freelance devs: the *harness* (Cursor, Claude Code, Codex CLI, Cline, Aider) often matters more than the model. A well-tuned harness with prompt caching, careful context management, and good tool definitions can extract more value from Claude Sonnet 4.5 than a sloppy harness gets from Opus 4.7.
For freelance work where you bill the client for AI usage as a pass-through cost, the harness choice is part of the deliverable. Pick one, get good at it, and report savings to the client when you switch the agent to a cheaper tier on appropriate tasks.
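What "route the work" can look like inside a harness, sketched with hypothetical task labels and a deliberately dumb escalation rule; the only load-bearing idea is that the frontier model is the exception path, not the default:

```python
# Hypothetical tier router: default to the cheap tier, escalate to the
# frontier model only on tasks that match a hard-work signal.
CHEAP_TIER = {"anthropic": "claude-sonnet-4-5", "openai": "gpt-5-mini"}
FRONTIER   = {"anthropic": "claude-opus-4-7",   "openai": "gpt-5"}

HARD_SIGNALS = ("refactor", "architecture", "migration", "multi-file")

def route(task_description: str, provider: str = "anthropic") -> str:
    """Pick a model id; the rule is simple on purpose, so it is auditable."""
    hard = any(s in task_description.lower() for s in HARD_SIGNALS)
    return (FRONTIER if hard else CHEAP_TIER)[provider]

print(route("add a date-picker component"))           # claude-sonnet-4-5
print(route("multi-file refactor of the auth flow"))  # claude-opus-4-7
```

A rule this legible is also easy to defend in the per-engagement cost report.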
Related: Claude Opus 4.7 and the 1M context window for freelance engineers, Cursor 2.0 Composer vs GitHub Copilot Agent, and Computer Use APIs and the freelance automation market.
Delivvo gives freelance engineers a branded client portal where the engagement scope, AI usage budget, and per-milestone deliverables live at one URL. When the client asks "what model did you use and what did it cost," the per-engagement reconciliation is already structured. See how it works →
The takeaway
GPT-5 and Claude Opus 4.7 are both production-ready for freelance coding in May 2026. The right freelance answer is not "pick one and use it for everything" — it is "build a harness that routes the work to the right model." GPT-5 wins on cost, math, and quick single-file work. Opus 4.7 wins on agentic loops, long-context refactors, and Computer Use precision. Both lose to careful harness design when you try to use them as a default for everything.
The freelance engineer billing $200/hour does not need to optimise API cost. The freelance engineer billing fixed-price needs to. Both should be running both models, not picking one.
Written by The Delivvo team · May 16, 2026
More from the blog →