Optimizing ClickHouse for Intel's 280-core processors

ashvardanian | 176 points

This is my favorite type of HN post, and it's definitely going to be a classic of the genre for me.

> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.

In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index on the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large system performance for it though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.
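
For anyone who hasn't seen it, the hot loop there is FM-index backward search: two rank queries per base, at essentially unpredictable addresses in a genome-sized index. A from-memory sketch (mine, not any particular aligner's code):

```cpp
// Minimal sketch of FM-index backward search, the inner loop of
// BWT-based read aligners. Each step does two Occ() lookups at
// essentially random positions in a genome-sized structure.
// (Construction of bwt/C/occ from the reference is omitted.)
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct FMIndex {
    std::string bwt;                          // BWT of the reference text
    std::array<int64_t, 256> C{};             // C[c]: # of symbols < c in the text
    std::vector<std::array<int64_t, 4>> occ;  // occ[i][code(c)]: # of c in bwt[0, i); size bwt.size()+1

    static int code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

    // Count of character c in bwt[0, i). Real indexes use sampled,
    // bit-packed rank structures; the access is still effectively random.
    int64_t Occ(char c, int64_t i) const { return occ[i][code(c)]; }

    // Half-open suffix-array interval [lo, hi) of suffixes prefixed by `pattern`.
    std::pair<int64_t, int64_t> backwardSearch(const std::string& pattern) const {
        int64_t lo = 0, hi = static_cast<int64_t>(bwt.size());
        for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
            char c = *it;
            lo = C[static_cast<unsigned char>(c)] + Occ(c, lo);  // random read #1
            hi = C[static_cast<unsigned char>(c)] + Occ(c, hi);  // random read #2
        }
        return {lo, hi};
    }
};
```

Every iteration is a couple of dependent, cache-missing loads, so DRAM latency and where the index sits relative to the cores matter far more than raw FLOPS.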

epistasis | 11 hours ago

ClickHouse is excellent, btw. I took it for a spin, loading a few TB of order book changes into it as entire snapshots. The double compression (type-aware and generic) does wonders. It's amazing that you get both small size and quick querying with minimal tweaks. I don't think I changed any system-level defaults, yet I can aggregate across the entire few billion snapshots in a few minutes.
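
The type-aware half is basically delta-style coding applied before the generic codec ever sees the bytes. A toy sketch of why that helps (illustrative only, not ClickHouse's actual codec code):

```cpp
// Toy illustration of "type-aware then generic" compression: delta-encode
// a monotonically increasing timestamp column first, so the byte stream
// handed to a generic codec (LZ4/ZSTD) is mostly small, repetitive values.
#include <cstdint>
#include <vector>

std::vector<int64_t> deltaEncode(const std::vector<int64_t>& column) {
    std::vector<int64_t> out;
    out.reserve(column.size());
    int64_t prev = 0;
    for (int64_t v : column) {
        out.push_back(v - prev);  // nearby timestamps -> tiny deltas
        prev = v;
    }
    return out;
}

std::vector<int64_t> deltaDecode(const std::vector<int64_t>& deltas) {
    std::vector<int64_t> out;
    out.reserve(deltas.size());
    int64_t acc = 0;
    for (int64_t d : deltas) {
        acc += d;                 // running sum restores the original values
        out.push_back(acc);
    }
    return out;
}
```

If I remember the docs right, this is roughly what stacking a column codec like Delta or DoubleDelta with LZ4/ZSTD buys you: the generic compressor sees long runs of tiny, repetitive values instead of raw 64-bit timestamps.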

lordnacho | 9 hours ago

This post looks like excellent low-level optimisation writing just from the first sections, and (I know this is kinda petty, but...) my heart absolutely sings at their use of my preferred C++ coding convention, where the & (ref) belongs neither to the type nor to the variable name!
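
For anyone wondering what that looks like in practice:

```cpp
#include <string>

// Three common placements of the reference symbol; the convention in
// question is the last, free-floating style.
void attachedToType(const std::string& name);
void attachedToName(const std::string &name);
void attachedToNeither(const std::string & name);
```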

pixelpoet | 11 hours ago

288 cores is an absurd number of cores.

Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…

That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).
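
The easiest way to settle the ISA question on real hardware is a runtime feature check, e.g. with the GCC/Clang builtins (a quick sketch; I haven't actually run this on a Sierra Forest box):

```cpp
#include <cstdio>

// Prints which vector extensions the running CPU reports. What shows up
// depends on the specific SKU, so check rather than assume.
int main() {
    __builtin_cpu_init();
    std::printf("avx2:     %s\n", __builtin_cpu_supports("avx2")     ? "yes" : "no");
    std::printf("avx512f:  %s\n", __builtin_cpu_supports("avx512f")  ? "yes" : "no");
    std::printf("avx512bw: %s\n", __builtin_cpu_supports("avx512bw") ? "yes" : "no");
    return 0;
}
```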

bee_rider | 11 hours ago

NUMA is satan. Source: Working in real-time computer vision.

kookamamie | 2 hours ago

>Intel's latest processor generations are pushing the number of cores in a server to unprecedented levels - from 128 P-cores per socket in Granite Rapids to 288 E-cores per socket in Sierra Forest, with future roadmaps targeting 200+ cores per socket.

It seems a single Intel CPU today can replace yesteryear's data center.

Maybe someone can try, just for fun, running 1,000 instances of Red Hat Linux 6.2 in parallel on one CPU, like it's the year 2000 again.

DeathArrow | an hour ago

Those ClickHouse people get to work on some cool stuff

secondcoming | 10 hours ago

Great work!

I like DuckDB, but ClickHouse seems more focused on large-scale performance.

I just noticed that the article is written from the point of view of a single person but has multiple authors, which is a bit weird. Did I misunderstand something?

jiehong | 10 hours ago

I'd like to see ClickHouse change its query engine to use Optimistic Concurrency Control.
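
For readers who haven't bumped into the term: the essence of OCC is to read without taking a lock, compute, then validate-and-commit atomically, retrying on conflict. A minimal single-value sketch (generic, nothing ClickHouse-specific):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative optimistic update: read the current value without any lock,
// compute the new value, and commit only if the value is still what we
// read; otherwise another writer won the race, so retry.
// e.g. optimisticUpdate(counter, [](int64_t v) { return v + 1; });
template <typename F>
int64_t optimisticUpdate(std::atomic<int64_t>& cell, F transform) {
    int64_t seen = cell.load(std::memory_order_acquire);
    for (;;) {
        int64_t proposed = transform(seen);
        // On failure, compare_exchange reloads `seen`, so each retry
        // re-validates against the latest committed value.
        if (cell.compare_exchange_weak(seen, proposed,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire)) {
            return proposed;
        }
    }
}
```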

pwlm | 10 hours ago

I'm generally surprised they're still using the old, unmaintained version of jemalloc instead of a newer allocator like the Bazel-based TCMalloc or mimalloc, which have significantly better techniques thanks to better OS primitives and about a decade of additional R&D behind them.
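
For anyone wanting to experiment, trying a different allocator in a C++ service can be as small as routing global new/delete through the replacement. A rough sketch with mimalloc (generic, not how ClickHouse wires up jemalloc):

```cpp
// Rough sketch of one way to A/B-test an allocator swap without touching
// call sites: route global new/delete through mimalloc. Note mimalloc also
// ships a ready-made mimalloc-new-delete.h header and an LD_PRELOAD-able
// libmimalloc.so that handle the aligned/sized overloads properly.
#include <mimalloc.h>
#include <cstddef>
#include <new>

void* operator new(std::size_t size) {
    if (void* p = mi_malloc(size)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept {
    mi_free(p);
}
```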

vlovich123 | 9 hours ago