Use DuckDB-WASM to query TB of data in browser

mlissner | 234 points

My company tried DuckDB-WASM + parquet + S3 a few months ago but we ended up stripping it all out and replacing it with a boring REST API.

On paper it seemed like a great fit, but it turned out the WASM build doesn't have feature parity with the "normal" variant, so the things that made us pick it, like support for Parquet compression and lazy loading, were not supported. It ended up not performing well while introducing a lot of complexity, and it was terrible for first-page load time because of the large WASM blob. Build pipeline complexity was also inherently higher because of the dependency and the data packaging needed.

Just something to be aware of if you're thinking of using it. Our conclusion was that it wasn't worth it for most use cases, which is a shame because it seems like such a cool tech.
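
For readers unfamiliar with the "lazy loading" mentioned above, here is a minimal sketch of the idea behind it, not DuckDB-WASM's actual implementation. A Parquet file ends with its footer metadata, then a 4-byte little-endian footer length, then the magic bytes "PAR1", so a client can fetch only the tail with HTTP Range requests before deciding which column chunks to read. The function names and the `ByteRange` type are hypothetical and exist only for illustration.

```typescript
// Layout of the end of a Parquet file:
//   [footer metadata][4-byte little-endian footer length]["PAR1"]
const PARQUET_MAGIC = "PAR1";

// Inclusive byte range, matching HTTP Range header semantics.
interface ByteRange {
  start: number;
  end: number;
}

// Given the last 8 bytes of the file, return the footer metadata length.
function footerLengthFromTail(tail: Uint8Array): number {
  if (tail.byteLength !== 8) {
    throw new Error("expected exactly 8 trailing bytes");
  }
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  const magic = String.fromCharCode(
    view.getUint8(4),
    view.getUint8(5),
    view.getUint8(6),
    view.getUint8(7)
  );
  if (magic !== PARQUET_MAGIC) {
    throw new Error("missing PAR1 magic: not a Parquet file");
  }
  return view.getUint32(0, true); // true = little-endian
}

// Byte range of the footer metadata, which sits just before the 8-byte tail.
function footerRange(fileSize: number, footerLength: number): ByteRange {
  const end = fileSize - 8 - 1;
  return { start: end - footerLength + 1, end: end };
}

// Format a range as the value of an HTTP Range request header.
function rangeHeader(range: ByteRange): string {
  return "bytes=" + range.start + "-" + range.end;
}
```

A client would first request the last 8 bytes, then request `rangeHeader(footerRange(fileSize, footerLength))` to get the metadata, and only then fetch the row groups and columns a query needs. Whether a given build actually does this is exactly the feature-parity question raised above.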

dtech | 4 days ago

OK, this is really neat:

- S3 is really cheap static storage for files.
- DuckDB is a database that uses S3 for its storage.
- WASM lets you run binary (non-JS) code in your browser.
- DuckDB-Wasm allows you to run a database in your browser.

Put all of that together, and you get a website that queries S3 with no backend at all. Amazing.
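
As a rough sketch of what "no backend" means in practice: the page only needs a URL to a Parquet file and a SQL string that DuckDB's `read_parquet` can run against it. The helper below builds both. The bucket, region, and key are placeholders, the helper names are hypothetical, and it assumes the object is publicly readable over HTTPS with CORS enabled for the page's origin.

```typescript
// Hypothetical description of a publicly readable S3 object.
interface S3Object {
  bucket: string;
  region: string;
  key: string;
}

// Build a virtual-hosted-style HTTPS URL for the object.
function s3HttpsUrl(obj: S3Object): string {
  const encodedKey = obj.key
    .split("/")
    .map(function (part) {
      return encodeURIComponent(part);
    })
    .join("/");
  return "https://" + obj.bucket + ".s3." + obj.region + ".amazonaws.com/" + encodedKey;
}

// Quote a value as a SQL string literal, doubling embedded single quotes.
function sqlString(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// SQL that counts the rows of the remote Parquet file.
function countRowsQuery(obj: S3Object): string {
  return "SELECT count(*) AS n FROM read_parquet(" + sqlString(s3HttpsUrl(obj)) + ")";
}
```

The resulting string would be passed to whatever query method the in-browser database connection exposes; that part is left out here because it depends on the library version and bundling setup.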

mlissner | 5 days ago

Yesterday there was a somewhat similar DuckDB post, "Frozen DuckLakes for Multi-User, Serverless Data Access". https://news.ycombinator.com/item?id=45702831

jdnier | 5 days ago

My initial thought is: why query 1TB of data in a browser? Maybe I'm the wrong target audience for this, but it seems to push the idea that everything has to be in a browser rather than using appropriate tools.

SteveMoody73 | 5 days ago

It's one of the best tricks in the book.

We have been doing it for quite some time in our product to bring real-time system observability with eBPF to the browser, and have even found other techniques to really max it out beyond what you get off the shelf.

https://yeet.cx

r3tr0 | 5 days ago

I built something on top of DuckDB last year but it never got deployed. They wanted to trust Postgres.

I didn't use the in-browser WASM, but I did expose an API endpoint that passed data exploration queries directly to the backend, like a knockoff of what New Relic does. I also use that same endpoint for all the graphs and metrics in the UI.

DuckDB is phenomenal tech and I love to use it with data ponds instead of data lakes, although it is very capable of handling large sets as well.

leetrout | 5 days ago

I tried DuckDB - liked it a lot - was ready to go further.

But found it to be a real hassle to help it understand the right number of threads and the amount of memory to use.

This led to lots of crashes. If you look at the project's GitHub issues you will see many OOM (out of memory) errors.

And then there was some index-related bug that caused crashes seemingly unrelated to memory.

Life is too short for crashy database software so I reluctantly dropped it. I was disappointed because it was exactly what I was looking for.
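
For anyone hitting the same problem: DuckDB exposes `threads` and `memory_limit` as settings that can be changed with `SET`. The sketch below only builds those statements from explicit limits. The helper name, the `ResourceLimits` type, and the example numbers are placeholders, not recommended values, and setting them does not guarantee the crashes described above go away.

```typescript
// Hypothetical container for the limits to apply to a DuckDB session.
interface ResourceLimits {
  threads: number;
  memoryLimitMB: number;
}

function isPositiveInteger(n: number): boolean {
  return isFinite(n) && Math.floor(n) === n && n >= 1;
}

// Build the SET statements that cap worker threads and memory use.
function resourceStatements(limits: ResourceLimits): string[] {
  if (!isPositiveInteger(limits.threads)) {
    throw new Error("threads must be a positive integer");
  }
  if (!isPositiveInteger(limits.memoryLimitMB)) {
    throw new Error("memoryLimitMB must be a positive integer");
  }
  return [
    "SET threads = " + limits.threads + ";",
    "SET memory_limit = '" + limits.memoryLimitMB + "MB';",
  ];
}
```

For example, `resourceStatements({ threads: 4, memoryLimitMB: 2048 })` yields two statements to run right after opening a connection, before any heavy query.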

wewewedxfgdf | 5 days ago

A similar procedure is also used on joblist.today https://github.com/joblisttoday to fetch hiring companies and their jobs, store them in SQLite and DuckDB, and retrieve them on the client side with their WASM modules. The databases are generated by a daily GitHub workflow and hosted as artifacts on a GitHub page.

ngc6677 | 4 days ago

Where do I learn how to set up this sort of stuff? Trial and error? I kinda never need it for personal projects (so far), which always leads me to forget this stuff in between jobs kinda quickly. Is there a decent book?

barrenko | 4 days ago

Neat. Can you use DuckDB backed by another store like RocksDB or something? Also, I wonder how one stops DDoS. Put the whole thing behind Cloudflare?

amazingamazing | 5 days ago

How… does it not blow up browser’s memory?

didip | 4 days ago

This is brilliant guys, omg this is brilliant. If you think about it, freely available data always suffers from this burden... "But but we don't make money, all this stuff is public data by law, and government doesn't give us a budget". This solves that, the "can't afford it" spirit of public agencies.

bzmrgonz | 4 days ago