Git for Data Lakes: How lakeFS Scales Data Versioning to Billions of Objects - Amit Kesarwani

  • Published: Jan 11, 2023
  • Git for Data Lakes: How lakeFS Scales Data Versioning to Billions of Objects - Amit Kesarwani
    A presentation from ApacheCon 2022
    apachecon.com/acna2022/slides...
    Modern data lake architectures rely on object storage as the single source of truth. We use object stores to hold a growing amount of data that is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility that data quality and resiliency require.
    lakeFS, an open source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.
    In this talk you'll learn about the challenges of using object storage for data lakes and how lakeFS enables you to solve them.
    By the end of the session you'll understand how lakeFS scales its Git-like data model to petabytes of data and billions of objects without affecting throughput or performance. We will also demo branching, writing data with Spark, and merging on a billion-object repository.
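
As a rough illustration of the Git-like model the abstract describes, here is a toy in-memory sketch in Python. It is not lakeFS code and does not use the lakeFS API. The Repo and Commit classes, their method names, and the s3:// addresses are all hypothetical, chosen only to show the idea that a commit is an immutable map from logical paths to object addresses, so branching copies a pointer rather than the data.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True, eq=False)
class Commit:
    """Immutable snapshot: logical path -> object-store address (hypothetical)."""
    objects: Dict[str, str]
    parent: Optional["Commit"] = None
    message: str = ""


def _ancestors(commit: Optional[Commit]) -> Iterator[Commit]:
    """Yield a commit and then each of its parents, newest first."""
    while commit is not None:
        yield commit
        commit = commit.parent


class Repo:
    """Toy model only; a real system persists this metadata and scales it out."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit({}, None, "initial")}
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def branch(self, name: str, source: str) -> None:
        # Branching copies a pointer to a commit, never the underlying objects.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch: str, path: str, address: str) -> None:
        # Uncommitted writes are visible only in this branch's staging area.
        self.staged[branch][path] = address

    def commit(self, branch: str, message: str) -> Commit:
        head = self.branches[branch]
        new = Commit({**head.objects, **self.staged[branch]}, head, message)
        self.branches[branch] = new
        self.staged[branch] = {}
        return new

    def merge(self, source: str, dest: str, message: str = "merge") -> Commit:
        src, dst = self.branches[source], self.branches[dest]
        dst_ids = {id(c) for c in _ancestors(dst)}
        # Merge base: the newest commit reachable from both branch heads.
        base = next(c for c in _ancestors(src) if id(c) in dst_ids)
        merged = dict(dst.objects)
        for path in set(src.objects) | set(base.objects):
            s = src.objects.get(path)
            b = base.objects.get(path)
            d = dst.objects.get(path)
            if s == b:
                continue  # source did not change this path
            if d != b and d != s:
                raise ValueError(f"conflict on {path}")
            if s is None:
                merged.pop(path, None)  # source deleted the path
            else:
                merged[path] = s
        # Simplification: the merge commit records only dest as its parent.
        new = Commit(merged, dst, message)
        self.branches[dest] = new
        return new

    def rollback(self, branch: str) -> None:
        head = self.branches[branch]
        if head.parent is None:
            raise ValueError("nothing to roll back")
        self.branches[branch] = head.parent


repo = Repo()
repo.put("main", "events/day=1.parquet", "s3://bucket/obj-a")
repo.commit("main", "load day 1")
repo.branch("etl", "main")
repo.put("etl", "events/day=2.parquet", "s3://bucket/obj-b")
repo.commit("etl", "load day 2")
isolated = "events/day=2.parquet" not in repo.branches["main"].objects
repo.merge("etl", "main")
visible = "events/day=2.parquet" in repo.branches["main"].objects
```

In this sketch, work on the etl branch stays invisible to main until the merge, and rollback simply moves the branch pointer back to the parent commit. No object is copied or deleted at any step.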