Tech Talk | Diving into Delta Lake Part 1: Unpacking the Transaction Log

  • Published: Jul 28, 2024
  • Online Tech Talk hosted by Denny Lee, Developer Advocate @ Databricks with Burak Yavuz, Software Engineer @ Databricks
    Link to Notebook: github.com/dennyglee/databric...
    The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
    In this tech talk you will learn about:
    - What is the Delta Lake transaction log?
    - What is the transaction log used for?
    - How does the transaction log work?
    - Reviewing the Delta Lake transaction log at the file level
    - Dealing with multiple concurrent reads and writes
    - How the Delta Lake transaction log solves other use cases including Time Travel and Data Lineage and Debugging
    See full Diving Into Delta Lake tutorial series here:
    databricks.com/diving-into-de...
    Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: databricks.com/databricks-nam...
  • Science

Comments • 21

  • @Databricks
    @Databricks  4 years ago +1

    Check out the Online Meetup playlist for video recordings of these tech talks. This one will be available later today! - dbricks.co/youtube-meetups

  • @Databricks
    @Databricks  4 years ago

    - Watch Part 2, Enforcing and Evolving the Schema: @
    - Watch Part 3: How do DELETE, UPDATE and MERGE work: ruclips.net/video/7ewmcdrylsA/видео.html

  • @joserfjunior8940
    @joserfjunior8940 2 years ago

    Great video!! Thank you!!

  • @no_more_free_nicks
    @no_more_free_nicks 4 years ago +4

    No, not everybody knows what a Data Lake is, so thanks for explaining it briefly.

  • @dheerajkumarsolanki5716
    @dheerajkumarsolanki5716 4 years ago +2

    There is a default 30-day retention period for the transaction log. So, after 30 days, are the older transaction log files automatically deleted?
    Similarly, what happens to logically deleted data files after the deletedFileRetentionDuration period: are they automatically deleted, or do we have to delete them manually?

  • @machinelearninginreallife3558
    @machinelearninginreallife3558 3 years ago +2

    From what you've said, I understand that data versioning is not a real component of DeltaLake. What we want is only to avoid some mistakes (mistaken delete). Am I right?

  • @TheIceSpinner
    @TheIceSpinner 4 years ago

    You don't mention how you actually store reads and writes. Do you store them differentially, and if so, what is the unit? So, e.g., when you delete a single row in a dataframe, is it only the deleted row that's stored in the new Parquet file, with some kind of flag, or is the whole dataset duplicated (minus that row)?

    • @dennyglee
      @dennyglee 4 years ago

      New Parquet files are created so that you can have time travel. You can see which Parquet files were created by looking in the transaction log.
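
To make the reply above concrete, here is a minimal sketch of how a reader could replay the transaction log to find which Parquet files make up a table at a given version. It assumes the commit files in `_delta_log` are newline-delimited JSON, one action per line, with `add` and `remove` actions carrying a `path`, and that commit files are named with a 20-digit zero-padded version. The helper names (`log_file_name`, `active_files`), the in-memory `commits` dictionary, and the file paths are hypothetical and used only for illustration; a real reader would also handle checkpoints, metadata, and protocol actions.

```python
import json


def log_file_name(version):
    """Return the commit file name for a version (20-digit zero-padded, assumed)."""
    return f"{version:020d}.json"


def active_files(commits, as_of_version):
    """Replay commits 0..as_of_version and return the set of live data file paths.

    `commits` maps a version number to the text of that commit file
    (newline-delimited JSON, one action per line).
    """
    live = set()
    for version in sorted(commits):
        if version > as_of_version:
            break
        for line in commits[version].splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                # A newly written Parquet file becomes part of the table.
                live.add(action["add"]["path"])
            elif "remove" in action:
                # The file is only logically deleted; it stays on disk until cleaned up.
                live.discard(action["remove"]["path"])
    return live


# Hypothetical log: version 0 writes two files, version 1 deletes a row by
# rewriting part-0001 into part-0002 (the old file is removed logically).
commits = {
    0: '{"add": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0001.parquet"}}',
    1: '{"remove": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0002.parquet"}}',
}

print(log_file_name(1))
print(sorted(active_files(commits, 0)))  # table as of version 0
print(sorted(active_files(commits, 1)))  # table as of version 1
```

In this sketch, reading the table "as of" an older version simply stops the replay earlier, which is why the old Parquet file has to remain available for time travel to work.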

  • @machinelearninginreallife3558
    @machinelearninginreallife3558 3 years ago +1

    I'm not sure I understand. Can we keep data for more than 30 days? Is it a bad practice? Is it even possible?

  • @amitjaju3351
    @amitjaju3351 7 months ago

    Hello Burak, I need a little help. If we perform a delete operation on a Delta table and later need to keep track of the records we deleted, can we do that from the transaction log folder, or is there another way to keep track of which records we deleted in case someone asks us in the future?
    Waiting for your response. Thanks

  • @ArturSukhenko
    @ArturSukhenko 4 years ago

    my_table/date=2019-01-01. Parquet doesn't support date format :) So date is string there?

  • @stuckinamomentt
    @stuckinamomentt 4 years ago

    So Vacuum does not remove log files (due to GDPR), then when are the log files cleaned up to avoid growing indefinitely?

    • @dennylee4934
      @dennylee4934 4 years ago +1

      That's correct, VACUUM does not remove the logs, only the data (Parquet) files. Note that the logs are converted from JSON to Parquet, which subsequently improves the performance of reading the log.

    • @stuckinamomentt
      @stuckinamomentt 4 years ago +1

      @dennylee4934 Thanks, and I believe delta.logRetentionDuration controls how the logs are cleaned up.

    • @harikrishnasiliveri1364
      @harikrishnasiliveri1364 4 years ago

      @dennylee4934 In that case, once we repartition, we can't use time travel (since the logs no longer point to the data files)? Is that correct?
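
As a rough illustration of the retention settings discussed in this thread, the sketch below builds a SQL statement that sets `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration` as table properties. The helper name `retention_sql` and the table name `events` are hypothetical. The 30-day log default comes from the comments above; the 7-day value for deleted data files is an assumption about the usual default and should be checked against the Delta Lake documentation for your version. The helper only produces a string; running it would require a Spark session with Delta Lake configured, for example by passing it to `spark.sql(...)`.

```python
def retention_sql(table, log_days=30, deleted_file_days=7):
    """Build an ALTER TABLE statement that sets Delta retention properties.

    log_days:          how long transaction log entries are kept
    deleted_file_days: how long logically deleted data files must be kept
                       before VACUUM is allowed to remove them
    """
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {log_days} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days')"
    )


# Hypothetical table name; keep 90 days of log history and 30 days of old data files.
statement = retention_sql("events", log_days=90, deleted_file_days=30)
print(statement)
```

Note that, as the replies above indicate, the two properties govern different things: log entries age out according to the log retention setting, while old data files are only physically removed when VACUUM is run.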

  • @bhanu4j
    @bhanu4j 3 years ago

    Can you share the link for this Python notebook? I did not find it.

    • @dennyglee
      @dennyglee 3 years ago +2

      The notebook link is hiding in the description - here you go: github.com/dennyglee/databricks/blob/master/notebooks/Users/denny.lee%40databricks.com/Delta%20Lake/Diving%20Into%20Delta%20Lake:%20Unpacking%20The%20Transaction%20Log.py

  • @kevingomez-yo3or
    @kevingomez-yo3or 4 years ago +1

    Can we have the slides?

    • @dennylee4934
      @dennylee4934 4 years ago +1

      Sure, you can find them in our tech-talks repo at: github.com/databricks/tech-talks/tree/master/2020-03-26%20%7C%20Diving%20into%20Delta%20Lake%20-%20Unpacking%20the%20Transaction%20Log

  • @irochkalviv
    @irochkalviv 3 years ago

    Burak Yavuz,
    A couple of corrections:
    1. Turks arrived in Anatolia (from Central Asia), starting in the 11th century, no need to falsify history with the intention of giving some rights to occupy Anatolia and the rest of so called "Turkey"
    2. 1453 is also the beginning of five long centuries of abuse, looting, oppression and massacres of the native Christian populations
    3. A very important date seems to be omitted: 1915, the genocide of native Christian populations by the Turks
    Actually, the Turkic tribes that invaded Anatolia have many similarities with the fighters of the Islamic state ISIS. Is it why so called "Turkey" supported them?