A new Google model is nearly perfect on automated handwriting recognition

scrlk | 450 points

> In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.

I read the whole reasoning of the blog author after that, but I still gotta know - how can we tell that this was not a hallucination or an error? If the model were just guessing among the plausible readings (1 lb 45 oz, 14 lb 5 oz, or 145 lb), it would land on the correct one a third of the time, so why is the author so sure this was deliberate?
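For reference, the arithmetic the author leans on does check out, assuming the standard reading of the ledger columns: the price "1/4" means 1 shilling 4 pence (16 pence) per pound, and the total "0 19 1" means £0 19s 1d, i.e. 229 pence. A minimal check in Python over the three candidate readings:

```python
# Pre-decimal units: 12 pence (d) per shilling (s), 20 shillings per pound (£),
# and 16 ounces per pound of weight.
def to_pence(pounds: int, shillings: int, pence: int) -> int:
    return (pounds * 20 + shillings) * 12 + pence

price_per_lb = to_pence(0, 1, 4)   # "1/4"    -> 16d per lb
total        = to_pence(0, 19, 1)  # "0 19 1" -> 229d

candidates = {
    "145 lb":     145,
    "14 lb 5 oz": 14 + 5 / 16,
    "1 lb 45 oz": 1 + 45 / 16,
}
for label, lbs in candidates.items():
    pence = lbs * price_per_lb
    print(f"{label}: {pence:7.2f}d  {'matches' if pence == total else 'no match'}")
# Only "14 lb 5 oz" yields exactly 229d (19s 1d), which is the author's argument
# for why the inserted weight is a calculation rather than a guess.
```

That shows only one of the three readings is consistent with the ledger's total, though it doesn't settle whether the model computed that or pattern-matched it.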

I feel a good way to test this would be to create an almost identical ledger entry, constructed so that the correct answer after reasoning (the way the author thinks the model reasoned) has completely different digits.

This way there'd be more confidence that the model itself reasoned rather than made a lucky error.

lelanthran | 9 hours ago

I really hope they have, because I've also been experimenting with LLMs to automate searching through old archival handwritten documents. I'm interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow, reading 16th-century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and insider field knowledge. It doesn't help that the accounts were often written in the field by semi-literate people who misused lots of words. Even the simplest ones require quite a lot of detective work to decipher, relying on subtle signals like that pound sign for the sugar loaf.

> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

This I’m a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What's the kernel look like?

throwup238 | a day ago

I read the whole article, but have never tried the model. Looking at the input document, I believe the model saw enough of a space between the 14 and 5 to simply treat it that way. I saw the space too. Impressive, but it's a leap to say it saw 145 and then used higher-order reasoning to correct it to 14 and 5.

elphinstone | 16 hours ago

My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more often than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, and getting it wrong. Maybe Gemini 3 will pull it off.

Now, Claude did vibe-code a fairly accurate solution to this using more traditional techniques. That is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is (something a five-year-old could probably learn to do) but can vibe-code something using traditional computer vision techniques to do it.

I suppose it's not too surprising: a visually impaired programmer might find it impossible to do reliably themselves but could still code up a solution. Still: it's weird!
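(For the curious, the classical approach doesn't need to be fancy. Below is a hypothetical sketch of the kind of thing a vibe-coded solution might do, assuming an axial slice as a 2D array: score left-right mirror symmetry at a few candidate rotations, since a correctly oriented axial slice is roughly symmetric about the vertical midline. None of this is the actual code Claude wrote.)

```python
import numpy as np
from scipy.ndimage import rotate

def most_symmetric_rotation(slice2d: np.ndarray, angles=(0, 90, 180, 270)) -> int:
    """Pick the candidate rotation that maximizes left-right mirror symmetry.

    A correctly oriented axial brain slice is roughly symmetric about the
    vertical midline, so the rotation whose image best correlates with its
    own horizontal flip is probably aligned with the midline. Note this
    leaves a 180-degree ambiguity (a symmetric image stays symmetric when
    turned upside down); a second cue, e.g. comparing the width profiles
    of the frontal and occipital halves, would be needed to resolve it.
    """
    def symmetry(img: np.ndarray) -> float:
        # Pearson correlation between the image and its mirror image.
        return float(np.corrcoef(img.ravel(), np.fliplr(img).ravel())[0, 1])

    scores = {a: symmetry(rotate(slice2d.astype(float), a, reshape=False))
              for a in angles}
    return max(scores, key=scores.get)
```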

roywiggins | 17 hours ago

I haven’t seen this new Google model but now must try it out.

I will say that other frontier models are starting to surprise me with their reasoning/understanding. I really have a hard time making (or believing) the argument that they are just predicting the next word.

I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.

Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building, and create a wiki, focused on audience and task.

I'm hand-reviewing the 50 wiki pages it created, but overall it did a great job.

I got frustrated about one issue: I have a GitHub issue to create a way to integrate with issue trackers (like Jira), but it's still TODO, and the AI claimed on the wiki's home page that we had issue-tracker integration. It created a page for it and everything; I figured it was hallucinating.

I went to edit the page to replace it with placeholder text and was shocked to find that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers and had written sample code for GitHub, Jira, and Slack (notifications). That truly surprised me.

efitz | 21 hours ago

I will note that the 2.5 Pro preview… March? was maybe the best model I've used yet. The actual release model was… less. I suspect Google found the preview too expensive and optimized it down, but it was interesting to see there was some hidden horsepower there. Google has always been poised to be the AI leader/winner. Excited to see if this is fluff, the real deal, or another preview that gets nerfed.

conception | 21 hours ago

If it can read ancient handwriting, it will be a revolution for historians' work.

My wife is a historian and she is trained to recognize old handwriting. When we go to museums she "translates" the texts for the family.

neves | an hour ago

Am I missing something here? Colonial merchant ledgers and 18th-century accounting practices have been extensively digitized and discussed in academic literature. The model has almost certainly seen examples where these calculations are broken down or explained. It could be interpolating from similar training examples rather than "reasoning."

xx_ns | 20 hours ago

It seems like a leap to assume it has done all sorts of complex calculations implicitly.

I looked at the image and immediately noticed that it is written as “14 5” in the original text. It doesn’t require calculation to guess that it might be 14 pounds 5 ounces rather than 145, especially since that notation was presumably used elsewhere in the document.

MagicMoonlight | 17 hours ago

> So that is essentially the ceiling in terms of accuracy.

I think this is mistaken. I remember... ten years ago? when speech-to-text models came out that could deal with background noise that made the audio sound very much like straight pink noise to my ear, yet the model was able to transcribe the speech hidden within at a reasonable accuracy rate.

So with handwritten text, the only prediction that makes sense to me is that we will (potentially) reach a state where the machine is probably more accurate than humans, even though we wouldn't be able to confirm that ourselves.

But if multiple independent models, say, Gemini 5 and Claude 7, both agree on the result, and a human can only shrug and say, "might be," then we're at a point where the machines are probably superior at the task.
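The cross-check itself is trivial to automate once you have the transcripts. A toy sketch (the model names above and the transcripts here are made up):

```python
from difflib import SequenceMatcher

def agreement(a: str, b: str) -> float:
    """Character-level similarity of two transcripts, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical readings of the same barely legible line by two models:
reading_a = "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1"
reading_b = "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1"

# High agreement between independently trained models is (weak but usable)
# evidence for a reading that no human can confirm directly.
print(f"agreement: {agreement(reading_a, reading_b):.2f}")
```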

gcanyon | 18 hours ago

I’ve seen those A/B choices on Google AI Studio recently, and there wasn’t a substantial difference between the outputs. It felt more like a different random seed for the same model.

Of course, it’s very possible my use case wasn’t interesting enough to reveal model differences, or that it was a different A/B test.

pavlov | 21 hours ago

I've been complaining on HN for some time now that my only real test of an LLM is whether it can help my poor wife with her research; she spends all day, every day, in small-town archives poring over 18th-century American historical documents. I thought maybe that day had come. I showed her the article and she said "good for him, I'm still not transcribing important historical documents with a chat bot and nor should he" - ha. If you wanna play around with some difficult stuff, here are some images from her work I've posted before: https://s.h4x.club/bLuNed45

neom | 20 hours ago

I think the author has become a bit too enthusiastic. "Emergent capabilities" becomes code for: unexpectedly good results that are statistical serendipity, but that I prefer to read as some hidden capability in a model I can't resist anthropomorphizing.

elzbardico | 5 hours ago

Is anyone aware of any benchmark evaluation for handwriting recognition? I have not been able to find one, myself — which is somewhat surprising.

dr_dshiv | 2 hours ago

This is exciting news, as I have some elegantly scribed family diaries from the 1800s that I can barely read (:

With that said, the writing here is a bit hyperbolic, as the advances seem like standard improvements rather than a huge leap or final solution.

jumploops | 20 hours ago

The thinking models (especially OpenAI's o3) still seem to do by far the best at this task: when they run into a confusing word, they look across the document at how the writer formed the same letters in words that are more legible.

I built a whole product around this: https://DocumentTranscribe.com

But I imagine this will keep getting better and that excites me since this was largely built for my own research!

AaronNewcomer | 16 hours ago

This might just be a handcrafted prompt framework for handwriting recognition tied in with reasoning: do a rough pass; make assumptions and predictions; check them; where they hold, use the degree of confidence in them to inform what the other characters might be; and gradually flesh out an interpretation of what was intended to be communicated.
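Sketched as a loop, that framework might look something like the following; `model_transcribe` is a hypothetical stand-in for whatever vision-model call would be used, not a real API:

```python
def model_transcribe(image, context_hints):
    """Hypothetical stand-in for a vision-model call.

    Returns {position: (text, confidence)} for each span it can read,
    conditioned on the spans already confirmed in context_hints.
    """
    return {}

def transcribe(image, max_rounds=5, accept=0.9):
    confirmed = {}  # position -> (text, confidence) we already trust
    for _ in range(max_rounds):
        # Rough pass, informed by everything confirmed so far.
        guesses = model_transcribe(image, context_hints=confirmed)
        progress = False
        for pos, (text, conf) in guesses.items():
            # Predictions that pass the confidence check become context for
            # the next pass, constraining how ambiguous characters are read.
            if conf >= accept and pos not in confirmed:
                confirmed[pos] = (text, conf)
                progress = True
        if not progress:  # nothing new passed; the interpretation has converged
            break
    return confirmed
```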

If they could get this to occur naturally, with no supporting prompts and only one-shot reasoning, then it could extend to complex composition generally, which would be cool.

observationist | 21 hours ago

The author says "It is the most amazing thing I have seen an LLM do, and it was unprompted, entirely accidental." and then jumps back to the "beginning of the story", including a digression about a trip to Canada.

Skip to the section headed "The Ultimate Test" for the resolution of the "most amazing thing..." clickbait. (According to him, the model correctly interpreted a line in an 18th-century merchant ledger using maths and logic.)

netsharc | a day ago

I dunno man, looks like Goodhart's law in action to me. That isn't to say the models won't be good at what is stated, but it does mean it might not signal a general improvement in competence but rather a targeted gain, with more general deficits rising up in untested/ignored areas, some of which may or may not be catastrophic. I guess we will see, but for now Imma keep my hype in the box.

Grimblewald | 11 hours ago

I just used AI Studio to recognize text from a relative's 60-day log of food ingested 3 times a day. I think I am using models/gemini-flash-latest, and it was shockingly good at recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet model (IIRC it's 4.5).

https://pasteboard.co/euHUz2ERKfHP.png

Its response, which I have captured here, https://pasteboard.co/sbC7G9nuD9T9.png, is shockingly good. I could only spot 2 mistakes, and those seem to have been the ones that even I could not read or found very difficult to make out.

ghm2199 | 20 hours ago

Regarding the "14 lb 5 oz" point in the article: a simpler explanation than the hypothesis that it back-calculated the weight is that there seems to be a space between the 14 and the 5, i.e. it reads more like "14 5" than "145"?

sriku | 11 hours ago

It hasn't met my doctor.

koliber | 11 hours ago

Gemini 2.5 Pro is already incredibly good at handwriting recognition. It makes maybe one small mistake every 3 pages.

It has completely changed the way I work: I can write math and text by hand and then convert it with the Gemini app (or with a scanned PDF in the browser). You should really try it.

_giorgio_ | 13 hours ago

> it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts

> As is so often the case with AI, that is exciting and frightening all at once

> we need to extrapolate from this small example to think more broadly: if this holds the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together

> this will be a big deal when it’s released

> What appears to be happening here is a form of emergent, implicit reasoning, the spontaneous combination of perception, memory, and logic inside a statistical model

> model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself

Just another post with extremely hyperbolic wording to hype up another model release. How many times have we seen this kind of unrealistic build-up in the past couple of years?

barremian | 15 hours ago

Pretty hyperbolic reaction to what seems like a fairly modest improvement

greekrich92 | 21 hours ago

Betteridge's law surely applies.

lproven | 21 hours ago

I much prefer this tone about improvements in AI over the doomerism I constantly read. I was waiting for a twist where the author changed their mind and suddenly went "this is the devil's technology" or "THEY T00K OUR JOBS", but it never happened. Thank you for sharing; it felt like breathing for the first time in a long time.

kittikitti | 21 hours ago

No, just another academic with the ominous handle @generativehistory who is beguiled by "AI". It is strange that others can never reproduce such amazing feats.

bgwalter | 21 hours ago

What an unnecessarily wordy article. It could have been a fifth of the length. The actual point is buried under pages and pages of fluff and hyperbole.

Legend2440 | 21 hours ago

Substack: When you have nothing to say and all day to say it.

mmaunder | 18 hours ago

It's a diffusion model, not autocomplete.

phkahler | 20 hours ago

We are probably just a few weeks away from Google completely wiping OpenAI out.

outside2344 | 20 hours ago

Reading HN comments just makes me realize how vastly LLMs exceed human intelligence.

cheevly | 14 hours ago