Anthropic explains itself again: from August to early September, Claude really did have issues.
Anthropic has just released a detailed technical report explaining how three infrastructure bugs caused a sharp decline in the quality of Claude's responses.
While they do seem to be telling the truth this time, the report comes a bit late.
What exactly happened to Claude?
Starting in early August, users began complaining that Claude's answers had worsened.
Initially, Anthropic found it difficult to discern whether this was normal fluctuation in user feedback or if there was a real problem.
But as complaints grew, they finally started an investigation in late August.
Anthropic emphasized in the report:
We never reduce model quality due to demand, time of day, or server load.
The issues users encountered were entirely due to infrastructure bugs.
The investigation uncovered three overlapping bugs, making diagnosis unusually difficult.
Anthropic claims: All three bugs have now been fixed.
Three Annoying Bugs
Illustrative timeline of events on the Claude API. Yellow: issue detected, Red: degradation worsened, Green: fix deployed.
First bug: Routing error
On August 5th, some Sonnet 4 requests were incorrectly routed to servers configured for the upcoming 1 million token context window.
This bug initially affected only 0.8% of requests.
However, a routine load balancing adjustment on August 29th inadvertently routed more short-context requests to the long-context servers.
At its worst, within a single hour on August 31st, 16% of Sonnet 4 requests were affected.
Even worse, routing was "sticky": once a request was processed by the wrong server, subsequent conversational turns would continue to be handled by the same incorrect server.
Second bug: Output corruption
On August 25th, an incorrect configuration was deployed on Claude API's TPU servers.
This caused the model to inexplicably output unexpected tokens, such as Thai or Chinese characters suddenly appearing in English conversations, or producing obvious syntax errors in code.
Users might have seen a Thai greeting like "สวัสดี" in an English response.
This issue affected Opus 4.1 and Opus 4 from August 25-28, and Sonnet 4 from August 25 to September 2.
Third bug: XLA:TPU compiler error
On August 25th, they deployed code to improve token selection, but inadvertently triggered a latent bug in the XLA:TPU compiler.
This bug was confirmed to affect Claude Haiku 3.5, and possibly also some Sonnet 4 and Opus 3 requests.
Technical Details of the Compiler Bug
Code snippet of a December 2024 patch to work around the unexpected dropped token bug when temperature = 0.
This bug was the most challenging.
When Claude generates text, it calculates the probability of each possible next token and then samples one from that distribution.
On TPUs, the model runs across multiple chips, and probability calculations occur in different locations, requiring data coordination between chips.
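As a rough sketch of that sampling step (purely illustrative, not Anthropic's actual implementation; the helper function below and the use of JAX's jax.random.categorical are hypothetical stand-ins), the logic looks roughly like this, with temperature = 0 collapsing to picking the single highest-probability token:

```python
import jax
import jax.numpy as jnp

def sample_next_token(logits, key, temperature=1.0):
    """Toy next-token sampler: softmax over the logits, then draw one token.

    At temperature == 0 this degenerates to argmax, i.e. the single
    highest-probability token must always be returned.
    """
    if temperature == 0.0:
        return jnp.argmax(logits)
    # jax.random.categorical samples an index from softmax(logits / temperature)
    return jax.random.categorical(key, logits / temperature)

logits = jnp.array([2.0, 1.5, -0.3, 0.1])   # toy scores for a 4-token vocabulary
key = jax.random.PRNGKey(0)
print(sample_next_token(logits, key))                   # random draw from the distribution
print(sample_next_token(logits, key, temperature=0.0))  # always token 0, the most likely one
```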
The problem stemmed from mixed-precision arithmetic.
The model uses bf16 (16-bit floating point) to calculate probabilities, but the TPU's vector processors natively support fp32, so the compiler converts certain operations to fp32 for performance optimization.
This caused a precision mismatch: operations that should have agreed on the highest probability token diverged because they were running at different precision levels. The highest probability token would sometimes disappear entirely.
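A tiny, purely illustrative JAX snippet shows the general effect (the numbers are invented, not taken from the incident): two scores that are clearly distinct in fp32 round to the same bf16 value, so the argmax lands on a different token.

```python
import jax.numpy as jnp

# Two candidate-token scores that differ by less than bf16's resolution near 1.0.
scores_fp32 = jnp.array([1.0001, 1.0002], dtype=jnp.float32)
scores_bf16 = scores_fp32.astype(jnp.bfloat16)

print(jnp.argmax(scores_fp32))  # 1 -> token 1 is the true maximum at fp32 precision
print(jnp.argmax(scores_bf16))  # 0 -> both scores round to 1.0 in bf16, the tie is
                                #      broken by index, and the true top token vanishes
```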
Code snippet showing minimized reproducer merged as part of the August 11 change that root-caused the “bug” being worked around in December 2024; in reality, it’s expected behavior of the xla_allow_excess_precision flag.
When fixing the precision issue, they exposed a deeper bug: a problem with the approximate top-k operation.
This is a performance optimization used to quickly find the highest probability tokens, but it sometimes returned completely incorrect results.
Reproducer of the underlying approximate top-k bug shared with the XLA:TPU engineers who developed the algorithm. The code returns correct results when run on CPUs.
The behavior of this bug was infuriating.
It would change based on irrelevant factors, such as what operations ran before or after, or whether debugging tools were enabled.
The same prompt might work perfectly one time and fail the next.
Ultimately, they found that the exact top-k operation no longer carried the significant performance penalty it once did, so they switched to the exact algorithm and standardized some operations at fp32 precision.
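For reference, a sketch of the two operations involved, using JAX's public primitives as stand-ins for Anthropic's internal kernels (an assumption, since the report does not name them): jax.lax.top_k is the exact operation, while jax.lax.approx_max_k is the TPU-friendly approximation that trades a recall target for speed.

```python
import jax
import jax.numpy as jnp

# Toy vocabulary-sized score vector.
logits = jax.random.normal(jax.random.PRNGKey(0), (50_000,))

# Exact top-k: guaranteed to return the k largest scores and their indices.
exact_vals, exact_idx = jax.lax.top_k(logits, k=40)

# Approximate top-k: faster on TPU, but only promises to recover each true
# top-k entry with probability `recall_target`.
approx_vals, approx_idx = jax.lax.approx_max_k(logits, k=40, recall_target=0.95)

overlap = jnp.isin(approx_idx, exact_idx).sum()
print(f"{overlap}/40 of the exact top-k indices were recovered")
```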
Why was it so difficult to discover?
Anthropic's validation process typically relies on benchmarks, safety evaluations, and performance metrics.
Engineering teams conduct spot checks, deploying first to small "canary" groups.
However, these issues exposed critical flaws that they should have identified earlier.
Their evaluations simply did not capture the quality degradation reported by users, partly because Claude often recovers well from isolated errors.
Privacy practices also complicated the investigation: internal privacy and security controls limited how and when engineers could access user interactions with Claude.
This protected user privacy but also prevented engineers from examining problem interactions needed to identify or reproduce bugs.
Each bug produced different symptoms at different rates on different platforms, creating confusing mixed reports that pointed to no single cause.
It appeared as random, inconsistent quality degradation.
Improvement Measures
Anthropic has committed to making the following changes:
More sensitive evaluations: Developing evaluation methods that can more reliably distinguish between normal and anomalous implementations.
More widespread quality assessment: Continuously running evaluations on real production systems to catch issues like context window load balancing errors.
Faster debugging tools: Developing infrastructure and tools to better debug community feedback without sacrificing user privacy.
They specifically emphasized that continuous user feedback is crucial.
Users can provide feedback using the /bug command in Claude Code, or the "thumbs down" button in the Claude app.
Netizen Reactions
While Anthropic's transparency is commendable, user reactions have been quite mixed.
Denis Stetskov (@razoorka) stated he felt a huge improvement:
I can already feel a massive improvement. Whatever you fixed, it's working.
Rajat (@DRajat33) praised the transparency:
Thanks for the clarification and details. Transparency is what sets companies apart, no matter their product.
But more users expressed dissatisfaction over the lack of compensation.
Alexandr Os (@iamavalex) directly demanded:
Release the list of affected accounts and issue immediate refunds. I am one of them.
Conor Dart (@Conor_D_Dart) questioned:
Will you be refunding or compensating affected users? This affected a lot of people, and your prices aren't cheap.
The City Keeps Building (@TheCity777) was simple and direct:
Where are our refunds?
peermux (@peermux) commented:
If you admit to not delivering the agreed-upon product between August and September, then you should offer refunds, or at least a month of free service. This would show good faith and help maintain trust.
Baby Yoda (@grogu836) expressed disappointment:
We're not getting refunds for this? Unbelievable. No more Claude Code for me.
Other users pointed out that the problem might not be completely resolved yet.
tuna (@tunahorse21) said:
It obviously still has bugs, and you waited a month to admit the problem.
Daniel Lovera (@dlovera) went a step further, arguing that degrading short-context requests by routing them to long-context servers suggests Anthropic was, in effect, reducing model quality based on demand.
Thomas Ip (@_thomasip) summarized the three bugs:
tldr:
bug 1 - some requests routed to test servers
bug 2 - performance optimization bug assigned high probability to rare tokens
bug 3a - precision mismatch caused highest probability token to be dropped
bug 3b - approximate top-k algorithm was completely wrong
[1] Technical Post-Mortem Report: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues