2024 12 18 Chicago DataFusion Meetup 01 Adrian Garcia Badaracco

  • Published: 17 Jan 2025

Comments • 3

  • @starpact • 22 days ago

    Not convinced that this trace_id index approach could scale; like, how can it be only hundreds of trace_ids per row group?
    From my experience, for OTel trace data the trace_ids themselves can take >6% of overall storage space after deduplication, sorting and compression (using ClickHouse AggregatingMergeTree ordered by trace_id as a secondary index), which can be quite large for Postgres. I think it would be very hard for analytical systems to hand over high-cardinality data indexing to Postgres, as the index itself is large and expensive to build.
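
    As a rough illustration of the "hundreds of trace_ids per row group" question, here is a back-of-envelope Python estimate; the row-group size and records-per-trace figures are illustrative assumptions, not numbers from the talk or the thread:

        # Rough estimate of distinct trace_ids per Parquet row group.
        # Both inputs below are illustrative assumptions.
        rows_per_row_group = 100_000  # typical Parquet row-group size, in rows
        records_per_trace = 500       # log records sharing one trace_id

        distinct_trace_ids = rows_per_row_group / records_per_trace
        print(f"~{distinct_trace_ids:.0f} distinct trace_ids per row group")
        # ~200: "hundreds" holds only when many records share a trace_id;
        # at ~10 records per trace the same row group holds ~10,000 ids.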

    • @agarbanzo360 • 22 days ago

      You’re right that it does remain to be seen how far it scales; certainly not indefinitely. Building a system like this is iterative work. Ideally I’d like something like a bloom filter in Postgres, which I do think would scale very, very far. And that’s not far off, given that the documentation for bloom indexes in Postgres says “it is possible to add support for arrays with union and intersection operations in the future”, so there are paths forward. Maybe it’s as simple as implementing that, or we need to write a custom extension to support it. Worst case, we just stick this bit of data into ClickHouse or some other system and deal with the extra complexity of synchronization. The point is that you just need a solution that is workable for the medium term until you’re sure that’s the right problem to solve.
      I’m not sure what sort of compression AggregatingMergeTree does, but do also keep in mind we’re compressing each trace id into 4 bytes, and the index on top of that is lossy. I believe our index is
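
      As an illustration of the 4-byte lossy trace id mentioned above, a minimal Python sketch assuming a simple hash-and-truncate scheme; the actual encoding is not described in the thread, and the trace id below is just an example value:

          import hashlib

          def trace_id_to_u32(trace_id_hex: str) -> int:
              """Map a 128-bit (32-hex-char) OTel trace id to a lossy 4-byte key.

              Illustrative assumption: hash truncation. Collisions are expected,
              so a lookup must re-check the full trace id after the index hit.
              """
              digest = hashlib.sha1(bytes.fromhex(trace_id_hex)).digest()
              return int.from_bytes(digest[:4], "big")

          print(trace_id_to_u32("4bf92f3577b34da6a3ce929d0e0e4736"))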

    • @starpact • 22 days ago

      @agarbanzo360 Thanks for your reply, this is absolutely a very practical and clever idea! I think the major difference from my experience is the size of the trace_ids. 4 bytes is 1/4 of the original, which for my OTel trace dataset (100+ TB) would be more than 1% of overall space, still a lot larger than your number (ignoring the size of the GiST index). I'm not sure if this is because your data is logs, so not every record has a trace_id, or because a single trace_id is shared by many more records than in trace data?
      About the bloom filter: I initially tried a bloom filter on the trace_id column, but because of the high cardinality the bloom filter itself got pretty large (8% of the original with 0.01 fpp) and expensive to use, and it ended up not being performant enough. So I think that, at least for large-volume trace datasets, a trace_id index would eventually become some kind of sorting-based data structure (like common search engines use); I ended up using a separate AggregatingMergeTree table for this.
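
      For reference, the standard Bloom filter sizing formula is roughly consistent with the 8% figure above, assuming 16-byte OTel trace ids:

          import math

          fpp = 0.01                    # target false-positive probability
          bits_per_element = -math.log(fpp) / (math.log(2) ** 2)  # ~9.59 bits
          bytes_per_element = bits_per_element / 8                 # ~1.2 bytes

          trace_id_bytes = 16           # 128-bit OTel trace id
          overhead = bytes_per_element / trace_id_bytes
          print(f"{bits_per_element:.2f} bits/element, {overhead:.1%} of a raw trace id")
          # ~9.59 bits/element, i.e. ~7.5% of a 16-byte trace id -- in the same
          # ballpark as the ~8% observed at 0.01 fpp.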