Xinzhiyuan Report | Editor: Yuanyu
【Xinzhiyuan Guide】Recently, a mysterious model on Google AI Studio not only successfully identified an 18th-century merchant's "mystery" ledger, but also corrected format errors and ambiguous expressions within it. Its demonstrated reasoning capabilities have astonished historians.
Quietly, has Google solved two of AI's oldest problems?
Not long ago, a mysterious model on Google AI Studio attracted widespread attention, including from historian Mark Humphries.
He used a 200-year-old "mystery" ledger from an Albany merchant to test large language models' capabilities in Handwritten Text Recognition (HTR).
A shocking scene unfolded!
The mysterious model not only achieved near-perfect automatic handwritten text recognition but also corrected a formatting error in the original ledger and clarified an ambiguous entry that could have caused confusion.
This means the model can not only recognize letters but also understand the logic and knowledge context behind them.
Furthermore, these capabilities were demonstrated without the model being explicitly prompted.
Expert-level handwritten text recognition, and reasoning without explicit rules: solving these two major challenges marks a leap in AI model capabilities.
Netizens speculate that this mysterious model might be Gemini-3, which Google is expected to launch this year, but this has not been officially confirmed.
Cracking Historians' Challenges
Mark Humphries is a professor of history at Wilfrid Laurier University.
As a historian, he is deeply concerned about whether AI has reached human expert-level reasoning in his professional field.
Therefore, Humphries chose to test large models on historical handwriting, which he considers a gold-standard test for evaluating their overall capabilities.
Recognizing historical handwriting is not just a visual task; it also requires an understanding of the historical context of the manuscript.
Without this knowledge, accurately identifying and transcribing a historical document is almost impossible.
In Humphries' view, this is precisely the most difficult part of historical documents to recognize.
As large models have developed, their HTR accuracy has come to exceed 90%, but the remaining 10% is the most difficult and the most crucial.
Humphries believes that current large models (Transformer architecture) are inherently predictive (their core mechanism is to predict the next token), but spelling errors and inconsistent styles in historical documents are inherently unpredictable, low-probability answers.
Therefore, to transcribe "the cat sat on the rugg" instead of "mat," the model must go against the tendencies of the training distribution.
This is why large models are less adept at transcribing unfamiliar names (especially surnames), obscure place names, dates, or numbers (such as monetary amounts).
For example, was a letter written by Richard Darby or Richard Derby? Was the date March 15, 1762, or March 16, 1782? Was the bill for 339 dollars or 331 dollars?
When such illegible letters or numbers appear in historical documents, the answer often needs to be found through other types of background knowledge.
Humphries believes that this "last mile of accuracy" is the prerequisite for historical handwritten text recognition to be usable by humans.
Is There a "Ceiling" for Predictive Architectures?
To measure handwritten transcription accuracy, Humphries and Dr. Lianne Leddy created a test set comprising 50 documents, totaling approximately 10,000 words.
They took all reasonable precautions to ensure these documents were not included in the large models' training data.
This test set included various writing styles (from illegible scribbles to formal secretarial hands) and images captured with different tools.
In Humphries' view, these documents represent the types most commonly encountered by him and other historians studying 18th and 19th-century English documents.
They used Character Error Rate (CER) and Word Error Rate (WER) to measure the proportion of transcription errors.
Research shows that non-specialists typically have a WER of 4-10%.
Even professional transcription services anticipate a small number of errors, typically guaranteeing a 1% WER, provided the text is clear and legible.
That 1% figure can therefore be regarded as the practical ceiling for transcription accuracy.
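As a rough illustration of how these two metrics work (the article does not describe the authors' actual evaluation code), both are edit distances normalized by the length of the reference transcription, over characters for CER and over words for WER. A minimal Python sketch, reusing the "rugg" example above:

# Minimal sketch: CER/WER as normalized Levenshtein edit distance.
# Not the authors' evaluation code; illustrative only.
def edit_distance(ref, hyp):
    row = list(range(len(hyp) + 1))            # one-row dynamic programming
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       prev + (r != h))   # substitution
    return row[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

print(wer("the cat sat on the rugg", "the cat sat on the mat"))  # 1 wrong word out of 6, about 0.17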
Last year, on Humphries and Leddy's test set, Gemini-2.5-Pro achieved a strict CER of 4% and a WER of 11%.
When errors in capitalization and punctuation, which generally do not change the actual meaning of the text nor affect searchability or readability, were excluded, these error rates dropped to CER 2% and WER 4%.
Humphries also observed that each generation of models has improved steadily.
Gemini-2.5-Pro's results were approximately 50-70% better than Gemini-1.5-Pro, which they tested a few months prior, and Gemini-1.5-Pro was in turn about 50-70% better than the initially tested GPT-4.
This confirms the expectation of scaling laws:
As models grow larger, their performance on such tasks can be roughly predicted based solely on model size.
Performance of the New Model
Using the same dataset, they began testing Google's new model.
The procedure involved uploading images to AI Studio and inputting the following fixed prompt:
"Your task is to accurately transcribe handwritten historical documents, minimizing CER and WER as much as possible. Work word by word, line by line, transcribing the text exactly as it appears on the page. To maintain the authenticity of the historical text, preserve spelling errors, grammar, syntax, punctuation, and line breaks. Transcribe all text on the page, including headers, footers, marginalia, insertions, page numbers, etc. If these elements exist, please insert them where indicated by the author..."
When selecting test documents, Humphries deliberately chose those with the most errors and the most illegible handwriting.
They were not only scribbled but also full of spelling and grammatical errors, lacked proper punctuation, and had extremely inconsistent capitalization.
The goal was simple: to truly test the limits of this mysterious model.
Ultimately, he selected five documents from the test set.
The results were astonishing.
The 5 documents transcribed by the model (just over 1000 words, about one-tenth of the sample) had a strict CER of 1.7% and a WER of 6.5%.
This means that, counting punctuation and capitalization, there was roughly one error for every 59 characters.
Furthermore, almost all errors were in capitalization and punctuation, and the problematic areas were highly ambiguous, with very few true "word" level errors.
When these types of errors were excluded from the count, the error rates dropped to CER 0.56% and WER 1.22%.
In other words, this new Gemini model's performance in HTR reached human expert-level standards.
Cracking the "Mystery" of a 200-Year-Old Ledger in Seconds
Subsequently, Humphries decided to further challenge the new model.
He brought out an Albany merchant's daybook from over 200 years ago.
This was a running ledger recorded in English by a Dutch clerk.
He likely wasn't fluent in English, and his spelling and letter formation were highly irregular, with a mix of Dutch and English throughout.
The accounts also used the old pounds/shillings/pence notation and common shorthand of the time: "To 30 Gallons Rum @4/6 6/15/0."
This indicated that someone purchased (debited to their account) 30 gallons of rum at 4 shillings 6 pence per gallon, totaling 6 pounds 15 shillings 0 pence.
For most people today, these non-decimal currency units are unfamiliar: 1 shilling equals 12 pence, and 1 pound equals 20 shillings.
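The arithmetic behind the rum entry can be checked mechanically; here is a small sketch using only the conversion rules just described:

# Verifying "To 30 Gallons Rum @4/6 6/15/0" under pre-decimal currency rules:
# 1 pound = 20 shillings, 1 shilling = 12 pence.
def to_pence(pounds=0, shillings=0, pence=0):
    return (pounds * 20 + shillings) * 12 + pence

def from_pence(total):
    pounds, rest = divmod(total, 240)    # 240 pence to the pound
    shillings, pence = divmod(rest, 12)
    return pounds, shillings, pence

unit_price = to_pence(shillings=4, pence=6)   # "@4/6" = 4s 6d = 54 pence per gallon
print(from_pence(30 * unit_price))            # (6, 15, 0) -> 6 pounds 15 shillings 0 pence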
Individual transactions were recorded on the fly, separated by horizontal lines, with the date written in numbers in the middle.
Each transaction was noted as a debit (Dr, purchase) or credit (Cr, payment).
Some transactions were crossed out, possibly indicating they had been reconciled or transferred to a customer's account in the general ledger (similar to "pending" becoming "posted").
These records also lacked a standard format.
Large models have consistently struggled with processing such ledgers.
Not only because there is very little relevant training data, but also because there is not much regularity: people could buy any quantity of anything, at any arbitrary unit price, and the total price would not round off in a conventional manner.
Large models could often identify some names and some goods, but would get completely lost on the numbers.
For example, they typically found it difficult to accurately transcribe numbers and tended to mix up unit prices and total prices.
Particularly complex pages would temporarily "break" the model, causing it to endlessly repeat certain numbers or phrases, or sometimes simply fail to respond.
However, Humphries observed that Google's new model performed near-perfectly when recognizing pages from the Albany merchant's daybook.
Astonishingly, not only were all the numerical parts correct, but, more interestingly, it also corrected a minor formatting error made by the original clerk when recording entries.
For example, Samuel Stitt bought 2 punch bowls, and the clerk recorded it as "each 2/", meaning 2 shillings each; to save space, he omitted "0 pence." But to maintain consistency, the model transcribed it as "@2/0", which is actually more standardized and clearer.
Reading through the text, Humphries also noticed an "error" that made his hair stand on end.
He saw Gemini transcribing the original line "To 1 loff Sugar 145 @ 1/4 0 19 1" as "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1."
In the 18th century, sugar was sold in hardened conical loaves, and Mr. Stitt was a shopkeeper who purchased large quantities of sugar for resale.
At first glance, this seemed like a hallucinatory error: the model was asked to transcribe strictly from the original, yet it inserted "14 lb 5 oz," which was not present in the original.
Upon closer examination, Humphries realized that the large model had done something extremely clever.
Gemini correctly inferred that 1, 4, and 5 constituted numerical values for weight units, describing the total weight of the sugar purchased.
To determine the correct weight and decode "145", Gemini also used the final total price of 0/19/1 to reverse-engineer the weight, which required converting back and forth between decimal arithmetic and two different non-decimal systems (pre-decimal currency and pounds/ounces).
Humphries deduced the large model's reasoning process:
The unit price of the sugar was 1 shilling 4 pence per pound, i.e. 16 pence; the total for the line was 0 pounds 19 shillings 1 pence, which converts to 229 pence.
To work out how much sugar was bought, one divides 229 by 16, yielding 14.3125 pounds; the remaining 0.3125 of a pound, at 16 ounces to the pound, is 5 ounces, giving 14 pounds 5 ounces.
Thus, Gemini concluded it was not "1 45," nor "145," but "14 5," and further clarified it as "14 lb 5 oz" in the transcription.
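This reconstructed chain of reasoning is easy to verify; the sketch below redoes the same calculation (it reflects Humphries' inference about what the model did, not an inspection of the model itself):

# Reverse-engineering the sugar weight from the unit price and the line total.
def to_pence(pounds=0, shillings=0, pence=0):
    return (pounds * 20 + shillings) * 12 + pence

unit_price = to_pence(shillings=1, pence=4)              # "@ 1/4" = 16 pence per lb
line_total = to_pence(pounds=0, shillings=19, pence=1)   # "0 19 1" = 229 pence

weight = line_total / unit_price                         # 229 / 16 = 14.3125 lb
lb, oz = int(weight), round((weight % 1) * 16)           # 0.3125 lb * 16 oz = 5 oz
print(lb, "lb", oz, "oz")                                # 14 lb 5 oz -> "14 5", not "145"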
In Humphries' tests, no other model demonstrated similar performance when asked to transcribe the same document.
This example caught Humphries' attention because AI seemed to have crossed boundaries that experts had long claimed existing models could not surmount.
Faced with an ambiguous number, it was able to infer the missing context, perform a series of multi-step conversions between historical currency and weight systems, and arrive at a correct conclusion. This process required abstract reasoning about the world described in the document.
Humphries believes that what occurred might be an emergent, implicit form of reasoning, in which perception, memory, and logic combine spontaneously within a statistical model rather than being explicitly designed for symbolic reasoning, although he is still unsure of the specific underlying mechanism.
If this hypothesis holds, Humphries believes that the "sugar loaf entry" is not only a remarkable transcription but also sends a small, clear signal: pattern recognition is beginning to cross the boundary of true "understanding."
This indicates that large models can not only transcribe historical documents with human expert-level accuracy but are also beginning to demonstrate an understanding of the economic and cultural systems behind these historical documents.
Humphries believes this might also reveal the beginning of something else: machines starting to perform true abstract, symbolic reasoning about the world they perceive.
References:
https://generativehistory.substack.com/p/has-google-quietly-solved-two-of