Six frontier models in eight days. The Chinese open-weight gap collapsed from quarters to days. Coding is now the only battlefield that matters.

EIGHT DAYS

By Amir H. Jalali · 6 min read
[Hero image: AI generated]
I have been keeping a release tracker for a few years now. Most weeks it gets one or two entries. Last week it got six.

Claude Opus 4.7 landed on April 16, and the rest of the week had to react to it. Kimi K2.6 dropped on the 20th. Qwen 3.6-Max-Preview arrived the same day. GPT Image 2.0 went out to ChatGPT and Codex web users on the 21st and 22nd. Qwen 3.6-27B arrived in open weights on the 22nd. GPT-5.5 and a new Codex default shipped on the 23rd. DeepSeek V4 on the 24th. By Friday I had stopped trying to write thoughtful first impressions and started just stacking model cards in a folder.

The first thing that catches you is how aligned the headline benchmarks are. Five of the six text models lead with SWE-bench. Even GPT Image 2.0 launched first inside Codex. The frontier labs are no longer competing on chat quality or trivia knowledge or vibes. They are competing on whether you can hand them the bug and walk away.

The numbers from the week tell the story flatly. Opus 4.7 hit 87.6 on SWE-bench Verified, up from 80.8. GPT-5.5 posted 82.7 on Terminal-Bench 2.0, the highest of the week. Kimi K2.6 took SWE-bench Pro to 58.6, beating both GPT-5.4-xhigh at 57.7 and Claude Opus 4.6-max at 53.4. Qwen 3.6-27B, a 27-billion-parameter dense model released under Apache 2.0, beat the team's own 397B mixture-of-experts model on agentic coding and matched Claude 4.5 Opus on Terminal-Bench 2.0. The 27B beating the 397B is the most surprising single result of the week.

The gap I keep watching is the one between Western proprietary frontier and Chinese open-weight. A year ago that gap was measured in quarters. This week it was measured in days. Kimi K2.6 beat Opus 4.6 on agentic coding while Opus 4.7 was less than a week old. DeepSeek V4-Pro priced output tokens at $3.48 per million against GPT-5.5's $30 and Opus 4.7's $25. That is roughly an order of magnitude cheaper for nominally similar coding work, and you can run the weights yourself.

The surprising read, the one I have not seen anyone write up clearly, is what DeepSeek did with the silicon. V4 was reportedly optimized for Huawei Ascend 950 chips through a Huawei interconnect they call Supernode, and was made available to Chinese chipmakers but not to Nvidia or AMD. Jensen Huang told Fortune the same week, on the record: the day DeepSeek ships first on Huawei is a horrible outcome for the United States. If V4 trains and serves competitively without American silicon, the export-control thesis is the next thing that has to be revised.

I also do not think the timing was coincidence. Anthropic and OpenAI's release windows had been signaled for weeks. Alibaba and Moonshot tend to telegraph. DeepSeek does not. The cleanest reading of the week is everyone watching the same calendar and trying not to be the eight-day-old model when V4 dropped.

There is a quieter story underneath the benchmark numbers, and it is about attention. Qwen 3.6-27B uses Gated DeltaNet linear attention in three of every four sublayers, which is sub-quadratic in sequence length. DeepSeek V4 ships with a hybrid attention architecture that delivers 1M context using around 27 percent of the compute V3.2 needed. Anthropic priced 1M-token Opus 4.7 at the standard tier with no long-context premium. Three labs, three different mechanisms, all pointing in the same direction. The next round of competition is not going to be about parameter count or expert count. It is going to be about whether you can serve a million tokens cheaply enough that nobody bothers doing retrieval anymore.
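To see why the 3:1 interleaving matters at long context, here is a toy sketch. The layer count, head dimension, and schedule are my own illustrative assumptions, not any lab's released architecture; all it demonstrates is what happens to attention compute when three quarters of the sublayers scale linearly in sequence length.

```python
# Toy model of a 3:1 hybrid attention schedule: three sub-quadratic
# (linear-attention) sublayers for every full-attention one. All names
# and numbers here are illustrative assumptions, not a real architecture.

def sublayer_kind(index: int) -> str:
    """Every fourth sublayer keeps full attention; the rest are linear."""
    return "full" if index % 4 == 3 else "linear"

def attention_cost(seq_len: int, kind: str, head_dim: int = 128) -> int:
    # Asymptotic mixing cost with constants dropped:
    # linear attention ~ O(n * d^2), full attention ~ O(n^2 * d).
    return seq_len * head_dim**2 if kind == "linear" else seq_len**2 * head_dim

n = 1_000_000  # a 1M-token context
layers = 48    # illustrative depth
hybrid = sum(attention_cost(n, sublayer_kind(i)) for i in range(layers))
full = sum(attention_cost(n, "full") for _ in range(layers))
print(f"hybrid/full attention compute at 1M tokens: {hybrid / full:.2f}")
```

Under these toy assumptions the hybrid keeps about a quarter of full-attention compute at a million tokens, dominated almost entirely by the surviving quadratic sublayers. That is the same neighborhood as the savings the labs are reporting, whatever their actual mechanisms are.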

The Opus 4.7 release also slipped in something most coverage missed. The new tokenizer encodes the same English text into 1.0x to 1.35x as many tokens as 4.6's did. Sticker price is unchanged, but the effective cost of a lot of workloads went up by something like twenty percent. I noticed because the prompt-cost line in my own tooling jumped overnight on inputs I had not changed. It is a reasonable change, and the new tokenizer probably helps the model in real ways, but it is the kind of detail you only catch by running the same prompt twice.
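The check itself is a few lines. A back-of-envelope version, with made-up token counts standing in for what you would get by counting the same prompt under both tokenizer versions:

```python
# Back-of-envelope effective-cost check after a tokenizer change. Token
# counts below are illustrative; in practice, count the same prompt under
# both tokenizer versions and substitute your real numbers.

def workload_cost(sticker_per_mtok: float, tokens: int) -> float:
    """Dollar cost of a prompt at a given $/Mtok sticker price."""
    return sticker_per_mtok * tokens / 1_000_000

old_tokens = 1_000  # same prompt, old tokenizer (illustrative)
new_tokens = 1_200  # same prompt, new tokenizer (illustrative 1.2x inflation)

old_cost = workload_cost(25.0, old_tokens)  # sticker price unchanged
new_cost = workload_cost(25.0, new_tokens)
print(f"effective cost increase: {new_cost / old_cost - 1:.0%}")  # 20%
```

A 1.2x token inflation at an unchanged sticker price is exactly a 20 percent effective increase, which is why the line item moves even though the pricing page does not.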

GPT Image 2.0 deserves its own paragraph. The reason people are calling it the biggest jump of the year on Image Arena, with a 242 Elo margin within twelve hours of release, is not resolution and not speed. It is that OpenAI baked reasoning into the generation loop. The model researches and plans before it touches pixels. You can see this most obviously in text rendering, where 2.0 produces print-ready menus with correct prices and Japanese, Korean, Chinese, Hindi, and Bengali characters at character-level accuracy. Diffusion models have been faking text for three years. This one is not faking. The hero image on this article was generated with it, and the words on the screens behind the figures are the words I asked for, not approximate shapes.

The pricing fan is also widening, not converging. GPT-5.5 Pro charges $180 per million output tokens. DeepSeek V4-Flash charges $0.28. That is a 643x spread for tasks that overlap heavily on benchmarks. The market is not collapsing toward a single price for general-purpose intelligence. It is bifurcating into a premium reasoning-agent tier and a commodity inference tier, and there is no sign that the middle is going to fill in.

What I do not know yet, and the thing I want to figure out next, is whether a real workload sees the same gap the benchmarks see. SWE-bench is artificial. Terminal-Bench is artificial. The closest thing to a real test is whether Kimi K2.6's 300-sub-agent swarm or Opus 4.7's adaptive thinking actually finishes the long-horizon ticket I would otherwise close myself. That answer takes weeks to develop. I will know more by the end of May.

For now I am sitting with the strangeness of the calendar. Six frontier-class releases in eight days, all of them serious, none of them obviously dominant, two of them open weights, one of them maybe shipping outside the Nvidia stack entirely. I cannot remember the last time the field moved this much in a week. I am not sure it ever has.