Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code

simonw | 200 points

I see a lot of snark in the comments. Simon is a researcher and I really like seeing his experiments! Sounds like the goal here was to delegate a discrete task to an LLM and have it solve the problem, much as one would hand the same task to a junior dev.

And, like a junior dev, it ran into some problems and needed some nudges. Also like a junior dev, it consumed energy resources while doing it.

In the end I like that the chunk size of work that we can delegate to LLMs is getting larger.

bahmboo | 2 days ago

The point of DeepSeek-OCR is not one-way image recognition and textual description of the pixels, but text compression into a generated image, and text restoration from that generated image. This video explains it well [1].

> From the paper: Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.

Its main purpose is to be a compression algorithm from text to image: throw away the text because it costs too many tokens, keep the image in the context window instead of the text, generate some more text, and when text accumulates again, compress the new text into an image, and so on.
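
As a toy sketch of that loop (my own interpretation, not DeepSeek's actual pipeline; the token budget and the page rendering are crude stand-ins):

```python
from PIL import Image, ImageDraw

TEXT_TOKEN_BUDGET = 200                    # pretend budget for raw text tokens
compressed_pages: list[Image.Image] = []   # stand-ins for blocks of vision tokens
buffer = ""

def rough_token_count(text: str) -> int:
    return len(text.split())               # crude stand-in for a real tokenizer

def render_page(text: str) -> Image.Image:
    page = Image.new("RGB", (800, 1000), "white")
    ImageDraw.Draw(page).multiline_text((10, 10), text, fill="black")
    return page

def append_text(new_text: str) -> None:
    global buffer
    buffer += new_text
    if rough_token_count(buffer) > TEXT_TOKEN_BUDGET:
        # the paper's ~10x compression: one page image in place of the raw tokens
        compressed_pages.append(render_page(buffer))
        buffer = ""                        # the original text is discarded
```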

The argument is that pictures store a lot more information than words; "A picture is worth a thousand words," after all. Chinese characters are pictograms, so the idea isn't that strange, but I don't buy it.

I am doing some experiments with removing text as an input for LLMs and replacing it with its summary, and I have already reduced the context window by 7 times. I am still figuring out the best way to achieve that, but 10 times is not far off. My experiments involve novel writing rather than general material, but simply replacing text with its summary still works very well.
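
Roughly, what I'm measuring looks like this (a sketch; `summarize` is a placeholder for the actual LLM summarization call, not a real API):

```python
def summarize(chapter: str) -> str:
    return chapter.split(".")[0] + "."     # stub: keep only the first sentence

def compression_ratio(chapters: list[str]) -> float:
    original = sum(len(c.split()) for c in chapters)
    replaced = sum(len(summarize(c).split()) for c in chapters)
    return original / replaced             # ~7x in my novel-writing experiments
```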

If an image is worth so many words, why not use it for programming after all? There we go, visual programming again!

[1] https://www.youtube.com/watch?v=YEZHU4LSUfU

emporas | 2 days ago

I did the opposite yesterday. I used GPT-5 to brute force dotnet into Claude Code for Web, which eventually involved it writing an entire HTTP proxy in Python to download NuGet packages.
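
Something in the spirit of what it built (not the actual code; a minimal sketch that forwards local requests to the NuGet v3 API):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

UPSTREAM = "https://api.nuget.org"

class NuGetProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # fetch the same path from nuget.org and relay it back unchanged
        with urlopen(UPSTREAM + self.path) as resp:
            body = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type", "application/octet-stream"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), NuGetProxy).serve_forever()
```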

qingcharles | 2 days ago

No idea why Nvidia ships such crusty torch prebuilds for their own hardware. I just finished installing Unsloth on a Thor box for some fine-tuning, and it's a lengthy build marathon, thankfully aided by Grok supplying most of the commands and environment variables. (One finishing touch is to install the latest CUDA from the Nvidia website and then replace the compiler executables in the triton package with the newer ones from CUDA.)
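
A sketch of that finishing touch (where triton keeps its bundled binary varies by version, so treat the layout here as an assumption):

```python
# Overwrite triton's bundled ptxas with the newer one from the system CUDA install.
import pathlib
import shutil
import triton

system_ptxas = pathlib.Path("/usr/local/cuda/bin/ptxas")  # from Nvidia's CUDA toolkit
pkg_root = pathlib.Path(triton.__file__).parent
for bundled in pkg_root.rglob("ptxas"):                   # locate the shipped binary
    shutil.copy2(system_ptxas, bundled)
    print(f"replaced {bundled}")
```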

cat_plus_plus | 2 days ago

From the article:

> Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.

Given the importance of TDD in this style of continuous agentic loop, I was a bit surprised to see that the author only seems to have provided an input but not an actual expected output.

Granted, this is more difficult with OCR since you really don't know how well DeepSeek-OCR might perform, but a simple Jaccard sanity test comparing the OCR output for a very legible input image against its expected text would have made it a little more hands-off.
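
Something like this (a sketch; `expected.txt` here is a hypothetical hand-made transcript of the legible page):

```python
# Token-set Jaccard similarity between the expected transcript and the OCR output.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

expected = open("expected.txt").read()       # your own transcript of a legible page
actual = open("output/result.mmd").read()    # what the model wrote to disk

assert jaccard(expected, actual) > 0.8, "OCR output diverges from expected text"
```

The whitespace-only result.mmd from the article would score 0 here and fail immediately.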

EDIT: After re-reading the article, I guess this was more of a test to see if DeepSeek-OCR would run at all. But I bet you could set up a pretty interesting TDD harness using the aforementioned algorithm, with an LLM in a REPL trying to optimize Tesseract parameters against specific document types, which was ALWAYS such a pain in the past.

vunderba | 2 days ago

My assumption is that there could be higher-dimensional 'token' representations that could be used instead of vision tokens (though obviously not human-interpretable). I wonder if there is an optimisation for compression that takes the vector space at a specific layer in the network to provide the most context for minimal memory space.

achr2 | 2 days ago

I also use Claude Code to install CUDA, PyTorch, and HuggingFace models on my quad-A100 machine. It shouldn't feel like debugging a 2000s Linux driver.

HuggingFace has incredible reach but poor UX, and PyTorch installs remain fragile. There’s real space here for a platform that makes this all seamless, maybe even something that auto-updates a local SSD with fresh models to try every day.
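
Even a cron job in this spirit would be a start (a sketch using huggingface_hub, which is a real library; the selection criteria and SSD path are made-up placeholders):

```python
from huggingface_hub import HfApi, snapshot_download

api = HfApi()
# grab the five most recently updated models and mirror them onto the SSD
for model in api.list_models(sort="lastModified", direction=-1, limit=5):
    print(f"fetching {model.id} ...")
    snapshot_download(repo_id=model.id, local_dir=f"/mnt/ssd/models/{model.id}")
```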

amirhirsch | 2 days ago

> There’s honestly so much material in the resulting notes created by Claude that I haven’t reviewed all of it

I've had the same "problem" and feel like this is the major hazard involved. It is tricky to validate the written work Claude (or any other LLM) produces, due to its high verbosity and the temptation to say "well, it works!"

As ever though, it is impressive what we can do with these things.

If I were Simon, I might have asked Claude (as a follow-up) to create a minimal Ansible playbook, or something of that nature. That might also be more concise and readable than the notes!

hkt | 2 days ago

Compute well spent... figuring out that it needed to download a version- and hardware-appropriate wheel.

BoredPositron | 3 days ago

Am I the only one seeing this Nvidia Spark as meh?

I had it in my cart, but then I watched a few videos from influencers, and it looks like the power of this thing doesn't match the hype.

varispeed | 2 days ago

Ehh, is it cool and time-saving that it figured it out? Yes. But the solution was to get a “better” prebuilt PyTorch wheel. This is a relatively “easy” problem to solve (though figuring out that this was the problem does take time). But it’s (probably, I can’t afford one) going to be painful when you want to upgrade the CUDA version or pin a specific one. Unlike on a typical PC, you’re going to need to build a new image and flash it. I would be more impressed when an LLM can do this end to end for you.

syntaxing | 4 days ago