Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse

super_ar | 77 points

How is this better than using ReplacingMergeTree in ClickHouse?

RMT dedups automatically, albeit with a potential cost at read time and extra work to design the schema for performance. The latter requires knowledge of the application to do correctly. You need to ensure that keys always land in the same partition, or dedup becomes incredibly expensive for large tables. These are real issues, to be sure, but they have the advantage that the behavior is relatively easy to understand.

Edit: clarity
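
Not tied to GlassFlow, but for readers unfamiliar with the pattern, here is a minimal sketch of the RMT behavior described above, using the clickhouse-connect Python client. The `events` table and its columns are made up for illustration.

    # Minimal sketch of the ReplacingMergeTree pattern described above.
    # Assumes a local ClickHouse server and a made-up `events` table;
    # uses clickhouse-connect (pip install clickhouse-connect).
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Rows sharing the same ORDER BY key (event_id) are collapsed by
    # background merges, keeping the row with the highest `version`.
    client.command("""
        CREATE TABLE IF NOT EXISTS events
        (
            event_id String,
            payload  String,
            version  UInt64
        )
        ENGINE = ReplacingMergeTree(version)
        ORDER BY event_id
    """)

    # Merges are asynchronous, so reads can still see duplicates.
    # FINAL deduplicates at read time; that is the read-time cost
    # mentioned above, since parts get merged while the query runs.
    result = client.query("SELECT * FROM events FINAL WHERE event_id = 'abc'")
    print(result.result_rows)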

hodgesrm | 3 days ago

How does the deduplication itself work? The blog didn't have many details.

I'm curious because it's no small feat to do scalable deduplication in any system. You have to worry about network latency if your deduplication mechanism is not on localhost, about the partitioning/sharding of data in the source streams, and about handling failures when writing to the destination, all of which can cripple throughput.

I helped maintain the Segmentio deduplication pipeline so I tend to be somewhat skeptical of dedupe systems that are light on details.

https://www.glassflow.dev/blog/Part-5-How-GlassFlow-will-sol...

https://segment.com/blog/exactly-once-delivery/
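
To make the problem concrete: the usual starting point is a time-windowed set of already-seen keys, which is exactly where the latency, partitioning, and failure-handling concerns above come in. A generic sketch (not GlassFlow's actual mechanism; the key field and window size are made up):

    # Generic time-windowed dedup filter. An illustration of the problem
    # space, not GlassFlow's actual mechanism; TTL and keys are made up.
    import time

    class WindowedDeduper:
        """Drops events whose key was already seen within `ttl_seconds`.

        State lives in local memory, so this only works per partition or
        worker; a distributed pipeline has to co-partition keys or share
        this state over the network, which is where the latency and
        failure-handling problems start.
        """

        def __init__(self, ttl_seconds: float = 3600.0):
            self.ttl = ttl_seconds
            self._seen: dict[str, float] = {}  # key -> last-seen timestamp

        def is_duplicate(self, key: str) -> bool:
            now = time.monotonic()
            # Evict expired keys so memory stays bounded by the window.
            self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
            if key in self._seen:
                return True
            self._seen[key] = now
            return False

    deduper = WindowedDeduper(ttl_seconds=600)
    for event_id in ["a1", "b2", "a1"]:
        print(event_id, "duplicate" if deduper.is_duplicate(event_id) else "new")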

caust1c | 3 days ago

Seems interesting, but I'm not sure what duplication means in this context. Is Kafka sending the same row several times? And for what reasons?

Could you give practical examples where duplication happens?

My use case is IoT, with devices connecting over MQTT and sending batches of data. Each time we ingest a batch we stream all the corresponding rows into the database. Because we only ingest a batch once, I don't think there can really be duplicates, so I don't think I'm the target of your solution.

But I'm still curious about the cases in which such things happen, and why couldn't Kafka or ClickHouse dedup themselves using some primary key or something?
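
For context (hedged, since every setup differs): duplicates in Kafka pipelines usually come from at-least-once delivery rather than from the data itself. Two classic cases are a producer resending a record after a lost broker ack, and a consumer reprocessing messages after crashing before its offset commit. The producer side, sketched with confluent-kafka (broker address and topic name are assumptions):

    # One common source of duplicates: producer retries under at-least-once
    # delivery. If the broker's ack is lost, the producer resends and the
    # broker appends the record a second time. Broker address and topic
    # are assumptions; requires confluent-kafka (pip install confluent-kafka).
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "acks": "all",
        "retries": 5,                 # retried sends can duplicate records
        "enable.idempotence": False,  # idempotence would suppress this case
    })

    producer.produce("events", key="device-42", value='{"reading": 1.23}')
    producer.flush()

Even batch-once ingestion can hit this if the write to Kafka or the downstream insert is retried after an ambiguous failure.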

oulipo | 3 days ago

Very cool stuff, good luck!

I didn't quickly find this in the documentation, but given that you're using the NATS Kafka Bridge, would it be a lot of work to configure streaming from NATS directly?

maxboone | 3 days ago

Congratulations!!

Questions:

1. Why only to ClickHouse? Can't we make it generic for any DB? Or is it a reference implementation for ClickHouse?

2. Similarly, why only from Kafka?

3. Any default load testing done?

the_arun | 3 days ago

Neat project! Quick question: will this work only if the entire row is a duplicate? Or even if just a set of columns (e.g. the primary key) conflicts and you guarantee only the presence of the latest version of the conflict? I'm assuming the former, because you are deduping before data is ingested into ClickHouse. I could be missing something; just wanted to confirm.

- Sai from ClickHouse
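
For readers following along, here are the two semantics being asked about, sketched generically. This illustrates the question, not what GlassFlow actually does; column names are made up.

    # Full-row vs. key-based dedup, sketched generically. Not a claim
    # about GlassFlow's behavior; column names are made up.
    import hashlib
    import json

    def row_hash(row: dict) -> str:
        """Full-row dedup: rows are duplicates only if every column matches."""
        return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

    # Key-based dedup: rows conflict on a key (e.g. the primary key) and only
    # the latest version survives, even if other columns differ.
    latest: dict[str, dict] = {}

    def upsert_latest(row: dict, key: str = "id", version: str = "updated_at") -> None:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row

    upsert_latest({"id": "a", "status": "new", "updated_at": 1})
    upsert_latest({"id": "a", "status": "done", "updated_at": 2})
    print(latest["a"]["status"])  # "done": the latest version per key wins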

saisrirampur | 3 days ago

How do you avoid creating duplicate rows in ClickHouse?

- What happens when your insertion fails but some of the rows are actually still inserted?

- What happens when your de-duplication server crashes before the new offset into Kafka has been recorded but after the data was inserted into ClickHouse?
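
The second window is the classic at-least-once gap between the sink write and the offset commit. A minimal sketch with a plain Kafka consumer (not GlassFlow's code; the broker address, topic, and sink helper are assumptions):

    # Sketch of the second failure window with a plain Kafka consumer.
    # Not GlassFlow's code; broker, topic, and the sink stub are assumptions.
    # Requires confluent-kafka (pip install confluent-kafka).
    from confluent_kafka import Consumer

    def insert_into_clickhouse(payload: bytes) -> None:
        # Placeholder for the real sink write (e.g. an INSERT via a ClickHouse client).
        print("inserted", payload)

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "clickhouse-sink",
        "enable.auto.commit": False,   # commit manually, only after the insert
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue

        insert_into_clickhouse(msg.value())

        # A crash here (after the insert, before the commit) means the message
        # is re-delivered on restart and inserted again. Avoiding that needs an
        # idempotent/deduplicating insert or an atomic write-plus-offset commit.
        consumer.commit(message=msg)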

YZF | 3 days ago

For what use cases would this be more effective than using ReplacingMergeTree (RMT) in ClickHouse, which eventually (usually within a short period of time) can handle dups itself? We had issues with dups that we solved using RMT and query-time filtering.
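
For anyone curious what that looks like in practice, one common shape of the query-time filtering is picking the latest version per key with argMax instead of relying on FINAL. A sketch with clickhouse-connect; the table and column names are made up:

    # One common shape of "RMT + query-time filtering": pick the latest
    # version per key with argMax instead of relying on FINAL or merges.
    # Table and column names are made up; uses clickhouse-connect.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    result = client.query("""
        SELECT
            event_id,
            argMax(payload, version) AS payload  -- latest row per key
        FROM events
        GROUP BY event_id
    """)
    print(result.result_rows)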

ram_rar | 3 days ago

Are there any load test results available? We would like to use this at Zenskar, but at high scale we really need it to work.

System merges and FINAL are definitely unpredictable, so nice project.

darkbatman | 3 days ago

Just wanna say I dig the design. In-house or outsourced?

brap | 3 days ago