Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale

  • Published: 27 Jul 2024
  • We run the common TPC-H benchmark suite at 10 GB, 100 GB, 1 TB, and 10 TB scale in the cloud and on a local machine, and compare performance for common large dataframe libraries.
    No tool does universally well. We look at common bottlenecks and compare performance between the different systems.
    This talk was originally given at PyData NYC 2023. These results are preliminary and come from only a couple of weeks of exploration.
    00:00 Introduction
    01:58 Background
    13:30 Charts
    20:00 Analysis
    30:12 Deployment
    Learn More:
    - Latest TPC-H results and more details: docs.coiled.io/blog/tpch.html
    - Performance improvements for Dask DataFrame: docs.coiled.io/blog/dask-data...
  • Science

Comments • 16

  • @andrewm4894 8 months ago +2

    Great talk, thanks

  • @mooncop 8 months ago

    you are most welcome (suffered well)
    worth it for the duck

  • @randywilliams7696 6 months ago +2

    Great video! I recently switched from Dask to DuckDB on my ~1 TB workloads, so it's interesting to see some of the same issues I found brought up here. One gotcha I've found is that it is REALLY easy to blunder your way into non-performant queries in Dask (things that end up shuffling, repartitioning, etc. a lot behind the scenes). For my use case it was more straightforward to write performant SQL queries for DuckDB, since that is much more of a common, solved problem. The scale-out capability of Dask and Spark is interesting too, as we are weighing the merits of a natively clustered solution vs. just breaking up our queries into chunks that fit on multiple single instances for DuckDB.
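The chunk-based scale-out idea in the comment above can be sketched in plain Python: run the same aggregation independently per chunk (in practice, one DuckDB instance per machine, each reading its own Parquet files), then merge the partial results. This is a minimal sketch under that assumption; the data and helper names are hypothetical, and it only works for aggregates that decompose into mergeable partials (sum, count, mean, min/max).

```python
# Sketch: scale out a per-key mean by aggregating each chunk
# independently, then merging the partial (sum, count) pairs.

def partial_agg(rows):
    """Aggregate one chunk: per-key running (sum, count)."""
    acc = {}
    for key, value in rows:
        s, c = acc.get(key, (0, 0))
        acc[key] = (s + value, c + 1)
    return acc

def merge(parts):
    """Combine partial aggregates from independent instances."""
    out = {}
    for part in parts:
        for key, (s, c) in part.items():
            ts, tc = out.get(key, (0, 0))
            out[key] = (ts + s, tc + c)
    return out

# Hypothetical chunks, e.g. one Parquet file each on separate machines.
chunks = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 3.0), ("b", 4.0), ("b", 6.0)],
]
parts = [partial_agg(chunk) for chunk in chunks]
totals = {k: s / c for k, (s, c) in merge(parts).items()}  # per-key mean
# totals == {"a": 2.0, "b": 4.0}
```

Aggregates like exact medians and percentiles don't merge this way, which is one reason a natively clustered engine such as Spark or Dask can still be worth the operational overhead.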

    • @MatthewRocklin 6 months ago +1

      Yup, totally agreed. The query optimization now in Dask DataFrame should handle what you ran into historically. The problem wasn't unique to you :)

    • @ravishmahajan9314 6 months ago

      But what about distributed databases? Is DuckDB able to query them?
      Is this technology replacing the Spark framework?

  • @richerite 16 days ago

    Great talk! What would you recommend for ingesting about 100-200GB of geospatial data on premise?

  • @rjv 7 months ago

    Such a good video! So many good insights clearly communicated with proper data. I also love the interfaces you've built: very meaningful, clean, and minimalistic.
    Do you have comparison benchmarks where cloud cost is the only constraint, and the number of machines or their size and type (GPU machines with cuDF) is not restricted?

  • @FabioRBelotto 12 days ago

    My main issue with Dask is the lack of community support (very different from pandas!)

  • @o0o0oo00oo00 8 months ago +2

    I don't see DuckDB and Polars kick Spark and Dask's ass at the 10 GB level in my practical usage 😅. We can't always trust TPC-H benchmarks.

  • @taylorpaskett3703 6 months ago

    What software did you use for generating / displaying your plots? It looked really nice.

    • @taylorpaskett3703 6 months ago +1

      Never mind, if I just kept watching you showed the GitHub repo where it says ibis and altair. Thanks!

  • @ravishmahajan9314 6 months ago

    DuckDB is good if your data fits on a single machine, but the benchmarks show a different story when the data is distributed. What about that?

  • @kokizzu 5 months ago

    ClickHouse ftw

  • @bbbbbbao 8 months ago

    It's not clear to me if you can use autoscaling with Coiled.

    • @Coiled 8 months ago +2

      You can use autoscaling with Coiled. See the `coiled.Cluster.adapt` method.

  • @maksimhajiyev7857 4 months ago

    The problem is that in fact Rust-based tooling actually wins, and all the paid promotions just suck. The actual reason Rust-based tooling is sort of suppressed is very simple: hyperscalers (big cloud tech) earn a lot of money, and if things run faster there are no huge bills for your Spark clusters 😊)). I was playing with Rust and huge datasets myself, without external benchmarks, because I don't trust all this marketing. Rust-based EDA may be witchcraft, but it runs like a beast. Try it yourselves with huge datasets, guys.