2024 12 18 Chicago DataFusion Meetup 01 Adrian Garcia Badaracco

  • Published: 17 Jan 2025

Comments • 3

  • @starpact • 22 days ago

    Not convinced that this trace_id index approach could scale; like, how can it be only hundreds of trace_ids per row group?
    From my experience, for OTel trace data the trace_ids themselves can take >6% of overall storage space after deduplication, sorting and compression (using ClickHouse AggregatingMergeTree ordered by trace_id as a secondary index), which can be quite large for Postgres. I think it would be very hard for analytical systems to hand over high-cardinality data indexing to Postgres, as the index itself is large and expensive to build.
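
    As a rough illustration of the "hundreds of trace_ids per row group" question, here is a back-of-envelope Python estimate; the row-group size and records-per-trace figures are illustrative assumptions, not numbers from the talk or the thread:

        # Rough estimate of distinct trace_ids per Parquet row group.
        # Both inputs below are illustrative assumptions.
        rows_per_row_group = 100_000  # typical Parquet row-group size, in rows
        records_per_trace = 500       # log records sharing one trace_id

        distinct_trace_ids = rows_per_row_group / records_per_trace
        print(f"~{distinct_trace_ids:.0f} distinct trace_ids per row group")
        # ~200: "hundreds" holds only when many records share a trace_id;
        # at ~10 records per trace the same row group holds ~10,000 ids.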

    • @agarbanzo360 • 22 days ago

      You’re right that it does remain to be seen how far it scales; certainly not indefinitely. Building a system like this is iterative work. Ideally I’d like something like a bloom filter in Postgres, which I do think would scale very, very far. And that’s not far off, given that the documentation for bloom indexes in Postgres says “it is possible to add support for arrays with union and intersection operations in the future”, so there are paths forward. Maybe it’s as simple as implementing that, or we need to write a custom extension to support it. Worst case, we just stick this bit of data into ClickHouse or some other system and deal with the extra complexity of synchronization. The point is that you just need a solution that is workable for the medium term until you’re sure that’s the right problem to solve.
      I’m not sure what sort of compression AggregatingMergeTree does, but do also keep in mind we’re compressing each trace id into 4 bytes, and the index on top of that is lossy. I believe our index is
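
      As an illustration of the 4-byte lossy trace id mentioned above, a minimal Python sketch assuming a simple hash-and-truncate scheme; the actual encoding is not described in the thread, and the trace id below is just an example value:

          import hashlib

          def trace_id_to_u32(trace_id_hex: str) -> int:
              """Map a 128-bit (32-hex-char) OTel trace id to a lossy 4-byte key.

              Illustrative assumption: hash truncation. Collisions are expected,
              so a lookup must re-check the full trace id after the index hit.
              """
              digest = hashlib.sha1(bytes.fromhex(trace_id_hex)).digest()
              return int.from_bytes(digest[:4], "big")

          print(trace_id_to_u32("4bf92f3577b34da6a3ce929d0e0e4736"))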

    • @starpact • 22 days ago

      @agarbanzo360 Thanks for your reply, this is absolutely a very practical and clever idea! I think the major difference from my experience is the size of the trace_ids. 4 bytes is 1/4 of the original, which for my OTel trace dataset (100+ TB) would be more than 1% of overall space, still a lot larger than your number (ignoring the size of the GiST index). I'm not sure if this is because your data is logs, so not every record has a trace_id, or because a single trace_id is shared by many more records than in trace data?
      About the bloom filter: I initially tried a bloom filter on the trace_id column, but because of the high cardinality the bloom filter itself got pretty large (8% of the original with 0.01 fpp) and expensive to use, and it ended up not being performant enough. So I think that, at least for large-volume trace datasets, a trace_id index would eventually become some kind of sorting-based data structure (like common search engines use); I ended up using a separate AggregatingMergeTree table for this.
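
      For reference, the standard Bloom filter sizing formula is roughly consistent with the 8% figure above, assuming 16-byte OTel trace ids:

          import math

          fpp = 0.01                    # target false-positive probability
          bits_per_element = -math.log(fpp) / (math.log(2) ** 2)  # ~9.59 bits
          bytes_per_element = bits_per_element / 8                 # ~1.2 bytes

          trace_id_bytes = 16           # 128-bit OTel trace id
          overhead = bytes_per_element / trace_id_bytes
          print(f"{bits_per_element:.2f} bits/element, {overhead:.1%} of a raw trace id")
          # ~9.59 bits/element, i.e. ~7.5% of a 16-byte trace id -- in the same
          # ballpark as the ~8% observed at 0.01 fpp.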