Hey everyone, I'm a software engineer at Eventual, the team behind Daft! Huge thanks to the OP for the benchmark; we're huge fans of your blog posts, and this gave us some really useful insights. For context, Daft is a high-performance data processing engine for AI workloads that works on both single-node and distributed setups.
We're actively looking into the benchmark results and hope to share some of our findings soon. From our initial investigation, we've found a number of potential optimizations: improving parallelism in our Delta Lake reader and improving pipelining for count aggregations in our groupby operator. We're hoping to roll out these improvements over the next couple of releases.
If you're interested in learning more about our findings, check out our GitHub (https://github.com/Eventual-Inc/Daft) or follow us on Twitter (https://x.com/daftengine) and LinkedIn (https://www.linkedin.com/showcase/daftengine) for updates. And if Daft sounds interesting to you, give it a try via pip install daft!
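For anyone who wants to kick the tires, here's a minimal sketch of what that looks like in Python (the bucket path and column name below are hypothetical placeholders, not from the benchmark):

    # Minimal sketch: lazily read Parquet from S3, filter, then count.
    # The path and column name are made-up placeholders.
    import daft

    df = daft.read_parquet("s3://my-bucket/events/*.parquet")
    df = df.where(daft.col("country") == "DE")
    print(df.count_rows())  # triggers execution of the lazy plan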
650GB? Your data is small; it fits on my phone. Dump the hyped tooling and just use GNU tools.
Here's an oldie on the topic: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
I often crunch 'biggish data' on a single node using duckdb (because I love using the modern style of painless and efficient SQL engines).
I don't use delta or iceberg (because I haven't needed to; I'm describing what I do, not what you can do :)), but rather just iterate over the underlying parquet files using filename listing or wildcarding. I often run queries on BigQuery and suck down the results to a bunch of ~1GB local parquet files - way bigger than RAM - that I can then mine in duckdb using wildcarding. Works great!
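Concretely, the pattern looks something like this in Python (the file names and column here are made up):

    # Rough sketch of the wildcard-over-local-Parquet pattern.
    # 'exports/results_*.parquet' and 'some_key' are placeholder names.
    import duckdb

    con = duckdb.connect()
    top = con.sql("""
        SELECT some_key, count(*) AS n
        FROM read_parquet('exports/results_*.parquet')
        GROUP BY some_key
        ORDER BY n DESC
        LIMIT 10
    """).fetchall()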
I'm in a world where I get into the weeds of 'this kind of aggregation runs much faster on BigQuery than duckdb, or vice versa, so I'll split the job: this part of the SQL runs on BigQuery and then feeds into this part running in duckdb'. It's the fun end of data engineering.
Honestly this benchmark feels completely dominated by the instance's NIC capacity.
They used a c5.4xlarge, which peaks at 10Gbps of network bandwidth. At a constant 100% saturation, that's in the ballpark of 9 minutes just to load those 650GB from S3, making 9 minutes your best-case scenario for pulling the data (without even considering writing anything back!).
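For reference, the back-of-the-envelope math behind that figure (pure arithmetic, no assumptions beyond the numbers above):

    data_gb = 650                      # benchmark dataset size, gigabytes
    nic_gbps = 10                      # c5.4xlarge peak bandwidth, gigabits per second
    seconds = data_gb * 8 / nic_gbps   # gigabytes -> gigabits, divided by line rate
    print(seconds / 60)                # ~8.7 minutes, before writing anything back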
Minute differences in how these query engines schedule I/O would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially in the DuckDB and Polars runs.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
I'm not in data eng, but I do occasionally query the data lake at my company. Where does Snowflake stand in this? (Especially looking at that Modern Data Stack image.)
Presto (which AWS Athena is built on) might be a faster/better alternative. I'd also like to see the numbers if the 650GB of data were available locally.
> Truly, we have not been thinking outside the box with the Modern Lake House architecture. Just because Pandas failed us doesn’t mean distributed computing is our only option.
Well yeah, I would have picked polars as well. To be fair, I didn't know about some of these.
There are other factors as well that drive decision makers toward clusters and big-data tech, even when the benchmarks don't justify it. At the root, the reasons are organizational, not technical: risk aversion seeks to avoid single points of failure, demands accountability, favors outsourcing to specialists, etc. Performance alone is not going to beat all of that.
In the places I've worked that used Databricks, I feel it was chosen for the same reasons big orgs use Microsoft: it comes out of a box and has a big company behind it. Technical benchmarks, or even cost considerations, would be a distant second.
If I understand correctly, polars relies on delta-rs for Delta Lake support, and that is what does not support deletion vectors: https://github.com/delta-io/delta-rs/issues/1094
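A hedged sketch of the dependency in Python (the table URI is made up): reading a Delta table from polars goes through delta-rs under the hood, so a table written with deletion vectors runs into that issue.

    # polars delegates Delta Lake reads to delta-rs, so delta-rs's feature
    # support (including the lack of deletion vectors) applies here.
    # The path is a hypothetical placeholder.
    import polars as pl

    lf = pl.scan_delta("s3://my-bucket/my_delta_table")
    print(lf.select(pl.len()).collect())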
It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.
DuckDB has a new "DuckLake" catalog format that would be another candidate to test: https://ducklake.select/
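For anyone curious, attaching a DuckLake catalog from DuckDB's Python API looks roughly like this; treat it as a sketch and check the DuckLake docs for the exact syntax, since the names and paths here are made up:

    # Rough sketch of attaching a DuckLake catalog (names/paths are placeholders).
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL ducklake")
    con.sql("LOAD ducklake")
    con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data_files/')")
    con.sql("USE my_lake")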
The main reason clusters still make sense is that you'll have a bunch of people regularly accessing subsets of much larger data, or competing processes that need their output ready at around the same time. You distribute not only compute but also I/O, which, as others here point out, likely dominates the runtime of these benchmarks.
Beyond Spark (you shouldn't really be using vanilla Spark anyway; see Apache Comet or Databricks Photon), distributing my compute makes sense because if a job takes an hour to run (ignoring overnight jobs), there will be a bunch of people waiting an hour for that data.
If I run a 6-node cluster that makes the data available in 10 minutes, I save everyone that waiting time. And if I have 10 of those jobs that need to run at the same time, I need a burst of compute to handle it.
That 6-node cluster might not make sense on-prem unless I can use the compute for something else, which is where PAYG on a cloud vendor makes sense.
650GB would've fit in the RAM of a not-at-all-exotic, basically off-the-shelf server a decade ago.
This is somewhat real-world, except a real-world setup would probably index some of the Parquet columns to avoid a full scan like that.
The 650GB is the size of the Parquet files, which are compressed; in reality the data is way more.
32GB of Parquet cannot fit in 32GB of RAM.
One thing I never really see mentioned in these types of articles is that a lot of DuckDB's functionality does not work if you need to spill to disk. IIRC, percentiles/quartiles (among other aggregate functions) caused DuckDB to crash when it spilled to disk.
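If anyone wants to test that for themselves, capping DuckDB's memory and pointing it at a temp directory is an easy way to force spilling on a machine that would otherwise keep everything in RAM (the limit, paths, and column below are arbitrary placeholders):

    # Force DuckDB to spill so you can check which aggregates survive it.
    import duckdb

    con = duckdb.connect()
    con.sql("SET memory_limit = '2GB'")
    con.sql("SET temp_directory = '/tmp/duckdb_spill'")
    con.sql("""
        SELECT quantile_cont(some_value, 0.95)
        FROM read_parquet('big_*.parquet')
    """).show()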
$6 of data does not a compelling story make. This is not 1998.
I hate these screenshots of commands and outputs everywhere.
650GB? We have 72PB in S3, and I know people who have multiple EB in S3.
I love this article! But I think this insight shouldn't be surprising. Distribution always has overheads, so if you can do things on a single machine it will almost always be faster.
I think a lot of engineers expect 100 computers to be faster than 1 simply because there are more of them. But what we're really looking at is a process, and a process that shifts data between machines will almost always have more work to do, and will therefore be slower.
Where Spark/Daft are needed is when you have 1TB of data or something crazy where a single machine isn't viable. If I'm honest though, I've seen a lot of occasions where someone thought that was their situation, and none so far where it actually was.