This reminds me of the experiment that ran paint splatters through OCR to check whether the result is valid Perl code (spoiler: 93% evaluated just fine). <a href="https:&#x2F;&#x2F;www.mcmillen.dev&#x2F;sigbovik&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.mcmillen.dev&#x2F;sigbovik&#x2F;</a>

No, they already have the ability to be uncertain if you are careful about what you ask for. And I don't buy that it's some fundamental weakness: I'm sure we'll be much better at dealing with uncertainty a month and a year from now. But of course, they will always be wrong sometimes, just like humans.

The problem with that is the same as the problem with all modern AI, it seems: AI hallucinations, which get more plausible the better the model is. Humans see things, too, but humans can have the insight to know when they're unsure and leave lacunae in the transcription with a note to come back later and discuss it with others. AIs don't have that ability, and so always seem certain.
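To make the "leave lacunae" idea concrete, here is a minimal sketch of what that could look like in a transcription pipeline. It assumes the recognizer hands back a confidence score per word, which is an assumption on my part and not something every tool exposes; `transcribe_with_gaps`, the 0.8 threshold, and the sample words are all made up for illustration.

```python
def transcribe_with_gaps(words, threshold=0.8, gap="[?]"):
    """Join recognized words, replacing low-confidence ones with a gap marker.

    `words` is a list of (text, confidence) pairs. Returns the transcription
    and the indices of the words that were left as gaps for later review.
    """
    out = []
    review = []
    for i, (text, conf) in enumerate(words):
        if conf < threshold:
            # Leave a lacuna instead of guessing, and remember where it was.
            out.append(gap)
            review.append(i)
        else:
            out.append(text)
    return " ".join(out), review


# Hypothetical recognizer output: (word, confidence) pairs.
words = [("Anno", 0.97), ("1743", 0.91), ("dobt", 0.42), ("Ole", 0.88)]
text, review = transcribe_with_gaps(words)
print(text)    # transcription with a gap marker for the uncertain word
print(review)  # positions to come back to and discuss
```

Whether the scores a given model reports are trustworthy enough to threshold on is a separate question; this only shows the bookkeeping.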

Can I suggest some truly horrible church record handwriting for you to try? Here's a fairly moderate example: <a href="https:&#x2F;&#x2F;media.digitalarkivet.no&#x2F;view&#x2F;7442&#x2F;155" rel="nofollow">https:&#x2F;&#x2F;media.digitalarkivet.no&#x2F;view&#x2F;7442&#x2F;155</a>

It's not cheap, but GPT-4 handles it. I'm in the image processing private beta and its failure rate on text is well below the human baseline. Cursive, damaged, pixelated, a 5-year-old's handwriting, weird lighting or angles, highly stylized, artificially distorted or obstructed, poor contrast. Doesn't matter. My instincts tell me they haven't made it public yet because it will end captchas for good and they're uneasy about rug pulling the entire public internet. Any image obfuscated to the point of defeating the LLM will also defeat the majority of humans.

Know of any good handwriting OCR libraries that are FOSS? And ideally can be called from Python? TrOCR is the best I've seen so far, though it's not amazing.

<a href="https:&#x2F;&#x2F;readcoop.eu" rel="nofollow">https:&#x2F;&#x2F;readcoop.eu</a> Transkribus is quite nice for handwritten OCR.

OCR is hard, but maybe we can make some real progress on it now with modern AI. A context-smart church records handwriting transcriber would be pretty great.

<pre><code> silver searcher
</code></pre>
Thanks for the reminder, I knew something like this existed but I couldn't remember what it was called!

There are faster tools than grep for dealing with large files. ag (The Silver Searcher) works okay.

Would the typos (Chroincling, orthe, etc.) be on purpose?

<pre><code> I've poured over ((ok, grepped) ~500GB of Chroincling America data to find lines that meet my low standard for nonsene, basically ones that match egrep "[^a-zA-Z0-9 ]{3,}"
</code></pre>
I'm super curious to know how fast this was. grep is generally very fast and this should be doable on a normal computer, though it might take a little while.
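For anyone who wants to try the same filter from Python instead of egrep, here is a rough sketch. It uses the same character class as the quoted pattern (three or more consecutive characters that are not ASCII letters, digits, or spaces). The `nonsense_lines` helper and the sample lines are mine, not the author's, and this says nothing about how it would perform on ~500GB.

```python
import re

# Same pattern as the quoted egrep call: a run of 3+ characters that are
# not ASCII letters, digits, or spaces.
NONSENSE = re.compile(r"[^a-zA-Z0-9 ]{3,}")


def nonsense_lines(lines):
    """Yield the lines (without trailing newline) that contain such a run."""
    for line in lines:
        # Strip the newline first so it cannot count toward a run,
        # which matches how grep treats line endings.
        text = line.rstrip("\n")
        if NONSENSE.search(text):
            yield text


# Made-up sample input standing in for lines of OCR text.
sample = [
    "The quick brown fox\n",
    "tbe qu!ck ~~*# brwn f0x\n",
    "ordinary text, nothing odd.\n",
]
matches = list(nonsense_lines(sample))
print(matches)
```

Because `nonsense_lines` takes any iterable of lines, you can pass it an open file object and it will stream through the file rather than loading it all into memory.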

Spent a load of time doing OCR and dealing with its failures... this is absolutely wonderful, thanks for sharing!

Poetry from dirty OCR