Querying Parquet files on S3 with DuckDB

  • Published: 8 Sep 2024
  • In this video, we'll learn how to query Apache Parquet files on Amazon S3, using DuckDB.
    #duckdb #s3 #apacheparquet
    Resources
    ► DuckDB - duckdb.org/
    ► Apache Parquet - parquet.apache...
    ► Lightweight Compression in DuckDB - duckdb.org/202...
    ► Build a poor man’s data lake from scratch with DuckDB - dagster.io/blo...
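As a rough illustration of the kind of query the video covers, here is a minimal sketch using DuckDB's Python package and its `httpfs` extension. The bucket path, column names, and region are hypothetical placeholders, not values from the video, and the helper names `parquet_query` and `run_on_s3` are made up for this example. It assumes a publicly readable bucket; private buckets would also need credentials configured.

```python
def parquet_query(s3_glob: str, columns: list[str], limit: int) -> str:
    """Build a SELECT over the Parquet files matched by an s3:// glob."""
    cols = ", ".join(columns) if columns else "*"
    return f"SELECT {cols} FROM read_parquet('{s3_glob}') LIMIT {limit}"


def run_on_s3(sql: str, region: str):
    """Run a query against S3 (needs network access and `pip install duckdb`)."""
    import duckdb  # imported lazily so the query builder works without DuckDB

    con = duckdb.connect()         # in-memory database
    con.execute("INSTALL httpfs")  # extension that adds s3:// and https:// support
    con.execute("LOAD httpfs")
    con.execute(f"SET s3_region = '{region}'")
    return con.execute(sql).fetchall()


# Hypothetical bucket and columns, for illustration only.
sql = parquet_query("s3://my-bucket/data/*.parquet", ["id", "name"], 10)
print(sql)
# rows = run_on_s3(sql, region="us-east-1")  # uncomment to hit a real bucket
```

Selecting only the columns you need, rather than `*`, matters for Parquet because the format stores data column by column.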

Comments • 10

  • @tributarydata 1 year ago +2

    This is the most concise and comprehensive DuckDB video I've seen so far. Great job, Mark!

  • @michaelsimons2560 1 year ago +1

    I love your information density.
    The 4 secs for getting the first 10 results over the wire is ofc because you selected a bunch of different columns that are spread across the Parquet file. With a remote CSV it would have been faster, I guess.

  • @FranciscoRuizA 1 year ago +1

    This is great, thanks. Is ADLS in Azure supported?

    • @learndatawithmark 1 year ago

      I don't think it is out of the box, but I just learnt about support for fsspec, which does support ADLS. So I need to give that a try! duckdb.org/docs/guides/python/filesystems

  • @benjaminwootton 1 year ago +1

    What's happening with the data transfer when we query a view? Do only the relevant columns get pulled back to the local DuckDB process? I'm surprised performance is as good as it is here when we need to do the full table scan.

    • @learndatawithmark 1 year ago

      On the full table scan we have a LIMIT on the query though, so my understanding is it'd only be reading the first row group that contains that data. I think you can read parts of files on S3 without needing to download the whole file, but I'm not an expert in that!
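To make the "read parts of files" idea concrete: a Parquet file ends with its footer metadata, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. A reader can therefore locate the metadata with two small ranged reads instead of downloading the whole object. The sketch below shows only that lookup, using an in-memory blob as a stand-in for an S3 object; `read_range` is a hypothetical callable that would be an HTTP GET with a `Range` header against real storage. It is not a description of DuckDB's internals.

```python
import struct

MAGIC = b"PAR1"  # a Parquet file starts and ends with these four bytes


def footer_byte_range(read_range, file_size):
    """Return (offset, length) of the footer metadata using one 8-byte read."""
    tail = read_range(file_size - 8, 8)  # 4-byte footer length + magic
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return file_size - 8 - footer_len, footer_len


# Stand-in for a remote object: fake column data plus fake footer metadata.
fake_metadata = b"M" * 20
blob = MAGIC + b"D" * 1000 + fake_metadata + struct.pack("<I", len(fake_metadata)) + MAGIC

requests = []  # log of (offset, length) pairs, to count bytes transferred


def read_range(offset, length):
    requests.append((offset, length))
    return blob[offset:offset + length]


offset, length = footer_byte_range(read_range, len(blob))
footer = read_range(offset, length)
bytes_fetched = sum(n for _, n in requests)
print(f"fetched {bytes_fetched} of {len(blob)} bytes")
```

In a real file the footer lists where each row group and column chunk lives, so a reader can follow up with ranged reads for only the columns and row groups a query touches.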

  • @lizardogre 1 year ago

    Were the queries running on a single Parquet file, or across all the Parquet files in the bucket?

  • @CaribouDataScience 1 year ago

    It's not butter it's Parquet!