PDF to Text, a challenging problem

ingve | 302 points

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.

90s_dev | 17 hours ago

One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td
and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc., the way we can for an HTML page.
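
A rough sketch of how to at least dump that "source" today with pikepdf; the filename and page index below are stand-ins for the example above:

    import pikepdf

    # Open the PDF and walk the content stream of one page, printing each
    # operator with its operands: the BT/Tf/Td/TJ instructions shown above.
    with pikepdf.open("BFI_WP_2022.pdf") as pdf:   # hypothetical local copy
        page = pdf.pages[7]                        # PDF page 8, zero-indexed
        for operands, operator in pikepdf.parse_content_stream(page):
            print(operator, operands)

That only gets you the raw stream, though; the side-by-side, hover-to-highlight inspector would still need to be built on top of it.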

svat | 20 hours ago

This is mostly what I worked on for many years at Apple, with reasonable success. The main secret was to accept that everything was geometry, and to use cluster analysis to try to distinguish between word gaps and letter gaps. On many PDF documents it works really well, but there are so many different kinds of PDF documents that there are always cases where the results are not that great. If I were to do it today, I would stick with geometry, avoid OCR completely, but use machine learning. One big advantage of machine learning is that I could use existing tools to generate PDFs from known text, so that the training phase could be completely automatic. (Here is Bertrand Serlet announcing the feature at WWDC in 2009: https://youtu.be/FTfChHwGFf0?si=wNCfI9wZj1aj9rY7&t=308)
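
A toy illustration of that cluster-analysis idea (a sketch, not the actual Apple implementation): given the horizontal gaps between consecutive glyphs on a line, a simple two-means pass separates letter gaps from word gaps and yields a word-break threshold.

    def word_break_threshold(gaps, iterations=20):
        """Cluster glyph gaps into two groups; gaps above the returned
        threshold are treated as word breaks, the rest as letter spacing."""
        lo, hi = min(gaps), max(gaps)          # initial centroids
        for _ in range(iterations):
            letter = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
            word = [g for g in gaps if abs(g - lo) > abs(g - hi)]
            lo = sum(letter) / len(letter)
            hi = sum(word) / len(word) if word else hi
        return (lo + hi) / 2

    # e.g. gaps (in text-space units) measured along one line of glyphs
    print(word_break_threshold([1.2, 1.1, 1.3, 4.8, 1.2, 5.1, 1.0]))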

herodotus | 8 hours ago

"PDF to Text" is a bit simplified IMO. There's actually a few class of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.
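
(For a sense of how little code that Flash-style replacement takes, here is a hedged sketch using the google-generativeai File API; the model name, prompt, and filename are illustrative, not any particular company's pipeline.)

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    uploaded = genai.upload_file("scanned_report.pdf")   # the File API accepts PDFs
    model = genai.GenerativeModel("gemini-1.5-flash")    # any Flash-class model
    response = model.generate_content(
        [uploaded, "Extract all text from this document as plain text."])
    print(response.text)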

Problems #2 and #3 are much trickier. There's still a large gap for businesses in going from raw OCR outputs -> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai)

kbyatnal | 18 hours ago

This was a great read. You've done an excellent job breaking down what makes PDFs so uniquely annoying to work with. People often underestimate how much of the “document-ness” (like headings, paragraphs, tables) is just visual, with no underlying semantic structure.

We ran into many of the same challenges while working on Docsumo, where we process business documents like invoices, bank statements, and scanned PDFs. In real-world use cases, things get even messier: inconsistent templates, rotated scans, overlapping text, or documents generated by ancient software with no tagging at all.

One thing we’ve found helpful (in addition to heuristics like font size/weight and spacing) is combining layout parsing with ML models trained to infer semantic roles (like "header", "table cell", "footer", etc.). It’s far from perfect, but it helps bridge the gap between how the document looks and what it means.

Really appreciate posts like this. PDF wrangling is a dark art more people should talk about.

snehanairdoc | an hour ago

The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.

dwheeler | 19 hours ago

Below is a PDF. It is a .txt file. I can save it with a .pdf extension and open it in a PDF viewer. I can make changes in a text editor. For example, by editing this text file, I can change the text displayed on the screen when the PDF is opened, the font, font size, line spacing, the maximum characters per line, number of lines per page, the paper width and height, as well as portrait versus landscape mode.

   %PDF-1.4
   1 0 obj
   <<
   /CreationDate (D:2025)
   /Producer 
   >>
   endobj
   2 0 obj
   <<
   /Type /Catalog
   /Pages 3 0 R
   >>
   endobj
   4 0 obj
   <<
   /Type /Font
   /Subtype /Type1
   /Name /F1
   /BaseFont /Times-Roman
   >>
   endobj
   5 0 obj
   <<
     /Font << /F1 4 0 R >>
     /ProcSet [ /PDF /Text ]
   >>
   endobj
   6 0 obj
   <<
   /Type /Page
   /Parent 3 0 R
   /Resources 5 0 R
   /Contents 7 0 R
   >>
   endobj
   7 0 obj
   <<
   /Length 8 0 R
   >>
   stream
   BT
   /F1 50 Tf
   1 0 0 1 50 752 Tm
   54 TL
   (PDF is)' 
   ((a) a text format)'
   ((b) a graphics format)'
   ((c) (a) and (b).)'
   ()'
   ET
   endstream
   endobj
   8 0 obj
   53
   endobj
   3 0 obj
   <<
   /Type /Pages
   /Count 1
   /MediaBox [ 0 0 612 792 ]
   /Kids [ 6 0 R ]
   >>
   endobj
   xref
   0 9
   0000000000 65535 f 
   0000000009 00000 n 
   0000000113 00000 n 
   0000000514 00000 n 
   0000000162 00000 n 
   0000000240 00000 n 
   0000000311 00000 n 
   0000000391 00000 n 
   0000000496 00000 n 
   trailer
   <<
   /Size 9
   /Root 2 0 R
   /Info 1 0 R
   >>
   startxref
   599
   %%EOF
1vuio0pswjnm7 | 17 hours ago

Having built some toy parsers for PDF files in the past, it was a huge wtf moment for me when I realized how the format works. With that in mind, it's even more puzzling how often it's used in text-heavy cases.

I always think about the invoicing use case: digital systems should be able to easily extract data from the file while it's still formatted visually for humans. It seems like the tech world would be much better off if we migrated to a better format.

trevor-e | 10 hours ago

Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that are only discernible as a table once rendered, because they happen to be positioned in a way that draws one.

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinates of the HTML elements for each value within the table to work out columns and extract values for each row.

It's kind of grotty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and the insertion of newlines into the middle of rows.
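
A rough sketch of that x-coordinate approach using Poppler's pdftohtml -xml output; the header labels and filenames below are invented for illustration:

    import subprocess
    import xml.etree.ElementTree as ET

    # pdftohtml -xml emits <text top=".." left=".."> fragments with coordinates
    subprocess.run(["pdftohtml", "-xml", "statement.pdf", "statement"], check=True)
    root = ET.parse("statement.xml").getroot()

    cells = [(int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
             for page in root.iter("page") for t in page.iter("text")]

    # Column x-positions come from the expected header labels (hypothetical names;
    # this assumes the header row is actually present in the document).
    columns = {left: text for top, left, text in cells
               if text in ("Date", "Description", "Amount")}

    rows = {}
    for top, left, text in cells:
        if text in columns.values():
            continue
        nearest = min(columns, key=lambda x: abs(x - left))
        rows.setdefault(top, {})[columns[nearest]] = text

    for top in sorted(rows):
        print(rows[top])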

bartread | 20 hours ago

PDF is a display format. It is optimised for eyeballs and printers. There has been some feature creep. It is a rubbish mechanism for machine data transfer but really good for humans and for, say, storing a page of A4 (Letter in the US).

So, you start off with the premise that a .pdf stores text and you want that text. Well that's nice: grow some eyes!

Otherwise, you are going to have to get to grips with some really complicated stuff. For starters, is the text ... text or is it an image? Your eyes don't care and will just work (especially when you pop your specs back on) but your parser is probably seg faulting madly. It just gets worse.

PDF is for humans to read. Emulate a human to read a PDF.

gerdesj | 12 hours ago

One of my favorite documents for highlighting the challenges described here is the PDF for this article:

https://academic.oup.com/auk/article/126/4/717/5148354

The first page is a classic, with two columns of text, centered headings, and a text inclusion that sits between the columns and changes the line lengths and indentations of the columns. Then we get the fun of page headers that change between odd and even pages, and section-header conventions that vary drastically.

Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.

Some of everything.

ted_dunning | 18 hours ago

Good old https://linux.die.net/man/1/pdftotext and a little Python on top of its output will get you a long way if your documents are not too crazy. I use it to parse all my bank statements into an sqlite database for analysis.
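
Roughly what that looks like as a sketch; the regex and table schema below are invented examples, not the actual statement format:

    import re
    import sqlite3
    import subprocess

    # -layout keeps columns aligned; "-" sends the extracted text to stdout
    text = subprocess.run(["pdftotext", "-layout", "statement.pdf", "-"],
                          capture_output=True, text=True, check=True).stdout

    db = sqlite3.connect("statements.db")
    db.execute("CREATE TABLE IF NOT EXISTS tx (date TEXT, description TEXT, amount REAL)")

    line_re = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})\s*$")
    for line in text.splitlines():
        m = line_re.match(line)
        if m:
            db.execute("INSERT INTO tx VALUES (?, ?, ?)",
                       (m.group(1), m.group(2), float(m.group(3).replace(",", ""))))
    db.commit()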

patrick41638265 | 15 hours ago

> The absolute best way of doing this these days is likely through a vision based machine learning model, but that is an approach that is very far away from scaling to processing hundreds of gigabytes of PDF files off a single server with no GPU.

SmolDocling is pretty fast and the ONNX weights can be scaled to many CPUs: https://huggingface.co/ds4sd/SmolDocling-256M-preview

Not sure what time scale the author had in mind for processing GBs of PDFs, but the future might be closer than “very far away”

lewtun | 3 hours ago

We[1] create "Units of Thought" from PDFs and then work with those for further discovery, where a "Unit of Thought" is any paragraph, title, or note heading - something that stands on its own semantically. We then create a hierarchy of objects from that PDF in the database for search and conceptual search - all at scale.

[1] https://graphmetrix.com/trinpod-server https://trinapp.com

gibsonf1 | 18 hours ago

Definitely recommend docling for this. https://docling-project.github.io/docling/

smcleod | 16 hours ago

When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.
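
That AcroForm case really is the tractable one; a minimal sketch with pypdf (the form and field names are whatever the standardized document defines):

    from pypdf import PdfReader

    reader = PdfReader("standardized_form.pdf")
    fields = reader.get_fields() or {}       # None if the PDF has no AcroForm
    for name, field in fields.items():
        print(name, "=", field.get("/V"))    # /V holds the filled-in value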

bob1029 | 19 hours ago

Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.

xnx | 20 hours ago

Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.

wrs | 20 hours ago

I did some contract work some years back with a company who had a desktop product (for Mac) that would apply some smarts to strip out extraneous things on pages while printing (such as ads on webpages) as well as try to avoid the case where only a line or two was printed on a page, wasting paper. It initially was getting into things at the PostScript layer, which unsurprisingly was horrifying, but eventually worked on PDFs. This required finding and interpreting various textual parts of the passed documents and was a pretty big technical challenge.

While I'm not convinced it was viable at the business level, it feels like something platform/OS companies could focus on to have a measurable impact on environmental and cost overhead.

incanus77 | 12 hours ago

I've been using Azure's "Document Intelligence" thingy (prebuilt "read" model) to extract text from PDFs with pretty good results [1]. Their terminology is so bad that it's easy to dismiss the whole thing as another Microsoft pile, but it actually, like, for real, works.

[1] https://learn.microsoft.com/en-us/azure/ai-services/document...
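
For reference, the prebuilt "read" call is only a few lines. This sketch uses azure-ai-formrecognizer naming, which may differ from the newer Document Intelligence SDK, so treat the class and method names as assumptions:

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        "https://<resource>.cognitiveservices.azure.com/",
        AzureKeyCredential("<key>"))
    with open("scan.pdf", "rb") as f:
        result = client.begin_analyze_document("prebuilt-read", f).result()
    print(result.content)                    # plain text of the whole document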

rekoros | 13 hours ago

Some of the unsung heroes of the modern age are the programmers who, through what must have involved a lot of weeping and gnashing of teeth, have managed to implement the find, select, and copy operations in PDF readers.

Sharlin | 16 hours ago

I've worked on this in my day job: extracting _all_ relevant information from financial services PDFs for a BERT-based search engine.

The only way to solve that is with a segmentation model followed by a regular OCR model, plus whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.

What worked was using DocLayNet-trained YOLO models to get the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet If you don't care about anything but text, you can feed the results into tesseract directly (but for the love of god read the manual). Congratulations, you're done.

Here's some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the resolution from ~700px to ~2100px horizontal for financial data segmentation.
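
A sketch of that segment-then-OCR flow with one of those checkpoints; the weights filename is a placeholder, and the class names come from DocLayNet's label set ("Text", "Title", "Section-header", "Table", ...):

    from PIL import Image
    from ultralytics import YOLO
    import pytesseract

    model = YOLO("yolov8-doclaynet.pt")        # hypothetical local checkpoint
    page = Image.open("page.png")              # render the PDF page at high DPI first

    result = model(page)[0]
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        if result.names[int(cls)] in ("Text", "Title", "Section-header"):
            x1, y1, x2, y2 = map(int, box)
            print(pytesseract.image_to_string(page.crop((x1, y1, x2, y2))))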

VLMs, on the other hand, still choke on long text and hallucinate unpredictably. Worse, they can't understand nested data. If you give _any_ current model something as simple as three nested rectangles with text under each, it will not extract the text correctly. Given that nested rectangles describe every table, no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all, a mining company should own a dozen bulldozers, right? And if they each cost $35.000, it must be an amazing deal they got, right?

noosphr | 16 hours ago

I built a simple OSS tool for qualitative data analysis, which needs to turn uploaded documents into text (stripped HTML). PDFs have been a huge problem from day one.

I have investigated many tools, but two-column layouts and footers etc often still mess up the content.

It's hard to convince my (often non-technical) users that this is a difficult problem.

remram | 11 hours ago

I think using Gemma 3 in vision mode could be a good approach for converting PDF to text. It's downloadable and runnable on a local computer, with decent memory requirements depending on which size you pick. Did anyone try it?

EmilStenstrom | 19 hours ago

So many of these problems have been solved by Mozilla's pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/.

rad_gruchalski | 20 hours ago

I guess I'm lucky the PDFs I need to process mostly have rather dull, unadventurous layouts. So far I've had great success using docling.

PeterStuer | 17 hours ago

Tried extracting data from a newspaper. It is really hard. What is a headline and which headline belongs to which paragraphs? Harder than you think! And chucking it as is into OpenAI was no good at all. Manually dealing with coordinates from OCR was better but not perfect.

coolcase | 13 hours ago

Check out mathpix.com. We handle complex tables, complex math, diagrams, rotated tables, and much more, extremely accurately.

Disclaimer: I'm the founder.

nicodjimenez | 17 hours ago

Recently tested (non-English) PDF OCR with Gemini 2.5 Pro. First, I directly asked it to extract text from the PDF. Result: a random text blob, not usable.

Second, I converted the PDF into JPG pages. Gemini performed exceptionally well: near-perfect text extraction with formatting intact, in Markdown.

Maybe there's an internal difference in how the model processes PDFs vs. JPGs.
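
The PDF-to-JPG step is easy to reproduce; a sketch with pdf2image (which wraps Poppler's pdftoppm; the filenames and DPI are just examples):

    from pdf2image import convert_from_path

    pages = convert_from_path("document.pdf", dpi=200, fmt="jpeg")
    for i, page in enumerate(pages, start=1):
        page.save(f"page_{i:03}.jpg", "JPEG")   # then send these to the model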

elpalek | 16 hours ago

Cloudflare's ai.toMarkdown() function, available in Workers AI, can handle PDFs pretty easily. Judging from speed alone, it seems they're parsing the actual content rather than shoving it into an OCR/LLM pipeline.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.

andrethegiant | 20 hours ago

Maybe it's time for new document formats and browsers that neatly separate content, presentation and UI layers? PDF and HTML are 20+ years old and it's often difficult to extract information from either let alone author a browser.

bickfordb | 16 hours ago

Coincidentally, I posted this over on Show HN today: OCR Workbench, AI OCR & editing tools for OCRing old/hard documents. https://news.ycombinator.com/item?id=43976450. Tesseract works fine for modern text documents, but it fails badly on older docs (e.g. colonial American documents).

viking2917 | 12 hours ago

Why hasn't the PDF standard been replaced or revised to require the text in machine-readable metadata? Seems like a no-brainer.

fracus | 12 hours ago

Mistral OCR is best in class at document understanding.

https://mistral.ai/news/mistral-ocr

ljlolel | 17 hours ago

I currently use ocrmypdf for my private library. Then Recoll to index and search. Is there a better solution I'm missing?

devrandoom | 18 hours ago

They should have called it NDF - the Non-Portable Document Format.

anonu | 18 hours ago

Reminds me of github.com/docwire/docwire

dobraczekolada | 17 hours ago

PDF parsing is hell indeed, with all sorts of edge cases that break business workflows. More on that here: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

constantinum | 18 hours ago

Part of a problem being challenging is recognizing whether it's new, or just new to us.

We get to learn a lot when something is new to us. At the same time, the untouchable parts of PDF to Text are largely being solved with the help of LLMs.

I built a tool to extract information from PDFs a long time ago, and the breakthrough was having no ego or attachment to any one way of doing it.

Different solutions and approaches offered different depth or quality of results, and organizing them to work together, in addition to anything I built myself, provided what was needed: one place where more things work than not.

j45 | 20 hours ago

As someone who has worked on this full-time (S&P, parsing of financial disclosures):

The solution is OCR. Don't fuck with internal file format. PDF is designed to print/display stuff, not to be parseable by machines.

TZubiri | 12 hours ago

People who want others to read their documents[1] should have their PDF point to a more digital-friendly format, an alt document.

Looks like you’ve found my PDF. You might want this version instead:

PDFs are often subpar. Just see the first example: standard Latex serif section title. I mean, PDFs often aren’t even well-typeset for what they are (dead-tree simulations).

[1] No sarcasm or truism. Some may just want to submit a paper to whatever publisher and go through their whole laundry list of what a paper ought to be. Wide dissemination is not the point.

keybored | 17 hours ago

Is this what GoodPDF does?

Obscurity4340 | 19 hours ago

https://github.com/jalan/pdftotext

    pdftotext -layout input.pdf output.txt

    pip install pdftotext
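
Basic usage of those Python bindings (github.com/jalan/pdftotext), as a sketch; layout-preserving options vary by version:

    import pdftotext

    with open("input.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)

    print("\n\n".join(pdf))    # one string per page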

reify | 19 hours ago