What is this delta lake thing?

  • Published: 4 Feb 2025

Comments • 39

  • @cantTouch948 · 1 year ago +3

    This video is gold, makes it easier to understand Spark and Delta Lake - kudos!

  • @mikitaarabei · 2 months ago

    Have to love the energy of the guy

  • @joannapodgoetsky4382 · 2 years ago +38

    A for Atomicity I think 😊

    • @yosh_2024 · 5 months ago +1

      right 🙂

  • @mohamedtarek-gh4fr · 2 years ago

    Again, another great video from the great series (Azure Synapse Analytics).
    Thanks a lot, guys (in the cube), you are amazing!

  • @Himanshubishnoi-td3nh · 4 months ago

    the guest was awesome

  • @PCGHigh · 2 years ago

    Great video series for getting started with the topic. The follow-up is probably already in production, but I can imagine it would be interesting to see how powerful Delta's functionality is. What exactly does the time travel feature look like? For me it was impressive to see how granularly you can jump back in time and roll back not only changes to rows but also structural changes to a table. If we want to look at it more from an ETL perspective, maybe a look at the change data feed would be interesting.
    Regardless of how you continue this series, I am very excited, because your hands-on way of approaching these things lowers the hurdle for many to begin their journey.
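
    For reference, a minimal PySpark sketch of the time travel feature discussed above - reading an old version, inspecting the commit history, and rolling back. It assumes a notebook with a live spark session; the table path and version number are hypothetical:

        from delta.tables import DeltaTable

        # Read the table as it looked at an earlier commit
        old_df = (spark.read.format("delta")
                  .option("versionAsOf", 3)  # or .option("timestampAsOf", "2024-01-01")
                  .load("/lake/silver/customers"))

        # The commit log that makes time travel possible
        DeltaTable.forPath(spark, "/lake/silver/customers").history().show()

        # Roll the live table back to that version (row and schema changes alike)
        spark.sql("RESTORE TABLE delta.`/lake/silver/customers` TO VERSION AS OF 3")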

  • @VjBroodz · 10 months ago

    Very Clear, thanks

  • @matthiasg.4941 · 2 months ago

    Cool video. Those DROP TABLE IF EXISTS and DROP DATABASE IF EXISTS are precautions so we don't run into an error when we replace what's there, right?
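
    A minimal sketch of that idempotent pattern (database and table names hypothetical, assuming a notebook with a live spark session): with IF EXISTS the drop becomes a no-op when the object is missing, so the whole script can be re-run from the top without erroring:

        # Safe to run repeatedly: no error if the objects don't exist yet
        spark.sql("DROP TABLE IF EXISTS demo.customers")
        spark.sql("DROP DATABASE IF EXISTS demo CASCADE")

        # Recreate from a clean slate
        spark.sql("CREATE DATABASE demo")
        spark.sql("CREATE TABLE demo.customers (id INT, name STRING) USING DELTA")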

  • @maxirojo7829 · 1 year ago

    Hello! Excellent video! Is it recommended to save the data as Parquet in the first (bronze) layer and as Delta in the following two? Thank you!

  • @ChronicSurfer · 1 year ago +2

    Interesting. What is the benefit of using this vs creating incremental loading within your merge statements? Are there more costs associated with using a delta lake? Additionally, will this pick up changes from my source?

  • @radekou · 2 years ago +2

    Hello, thanks for putting out great content and useful videos.
    Delta is certainly cool; however, after having a deeper look, Delta time travel does not seem to be a replacement for proper Type 2 SCD modelled data, since:
    - there is a limited data retention for the delta log (30 days), it can be extended of course
    - you can't leverage that time travel when using Serverless SQL Pool (which is how I'd expose Delta tables to Power BI)
    - or have I missed something obvious?
    Furthermore - the SQL / pySpark interoperability only works to an extent; for example, Synapse Spark SQL doesn't support SQL-based time travel (SELECT * FROM table VERSION AS OF n) - this has to be done via pySpark. On the bright side, pySpark is not that hard to pick up; it takes getting used to, but it's quite powerful :)
    Now only add support for Delta for the Workspace-created Lake Database! :)
    Cheers
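
    On the retention point above, a sketch of extending how far back time travel can reach via table properties (table name and intervals hypothetical; the defaults are 30 days of transaction log and 7 days of deleted data files):

        spark.sql("""
            ALTER TABLE silver.orders SET TBLPROPERTIES (
                'delta.logRetentionDuration' = 'interval 180 days',
                'delta.deletedFileRetentionDuration' = 'interval 180 days'
            )
        """)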

    • @MDevion · 2 years ago

      I'm on the same route as you, but still do not fully understand where they want us to do the transformations.
      Should it be in a notebook, ADF, or where exactly? And do we store the transformed data before it gets into the silver/delta lake?

    • @radekou · 2 years ago

      @@MDevion Re. the transformations - you can do most of those in steps using pySpark or SparkSQL on top of a temp view, and only persist the data physically once they are all applied. Dataframes are evaluated "lazily"; to me it's a bit like layering SQL CTEs on top of each other.
      As for where to do it? Yeah - we're looking at what does it best - is it Dataflow, Pipelines (copy activity doesn't support writing to Delta) or Notebooks.
      Lastly - try creating a lake table based on delta location - the columns / schema don't seem to be picked up properly.
      I used to think Databricks was a half-baked product, but Synapse in its current state (at least for Spark / Serverless) is on another level. No ACLs for folders in the Develop blade? Phew. :)
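
      A sketch of the lazy layering described above (paths and column names hypothetical): each step only adds to the query plan, like stacking CTEs, and nothing executes until the final write:

          from pyspark.sql.functions import col

          raw = spark.read.parquet("/lake/bronze/orders")
          cleaned = raw.dropDuplicates(["order_id"]).filter(col("amount") > 0)
          enriched = cleaned.withColumn("amount_eur", col("amount") * 0.92)

          # The single action that triggers the whole chain and persists once
          enriched.write.format("delta").mode("overwrite").save("/lake/silver/orders")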

    • @MDevion · 2 years ago +1

      @@radekou And what if I only want to apply transformations to the latest data? How does Synapse know that it should only transform the latest data?
      The whole lineup/process right now is just incoherent.

  • @nagoorpashashaik8400 · 1 year ago

    @Guy in a cube - Can we do this same thing in an ADF mapping data flow?

  • @eth6706 · 2 years ago

    You should do videos about machine learning models in Synapse

  • @dancrowell2933 · 2 years ago +1

    How do you handle changes to the source system in a delta lake? For example: when a source table adds 3 columns and drops two?

    • @kancharlasrimannarayana7068 · 24 days ago

      There is a concept called schema merging: if you set the mergeSchema option to true when writing to a Delta table, it will automatically adjust to additive schema changes like new columns. Dropping columns or changing data types requires an explicit overwrite with overwriteSchema instead.
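
      A sketch of that option (DataFrame and path hypothetical) - with mergeSchema, columns present in the incoming data but missing from the table are added on write:

          (new_df.write.format("delta")
              .mode("append")
              .option("mergeSchema", "true")
              .save("/lake/silver/customers"))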

  • @user-uf7ie5pt9e · 10 months ago

    To be clear: bronze layer (Parquet, JSON, CSV, etc.), silver layer (Delta Lake, Iceberg, or Hudi - one open table format), and gold layer (SQL queries/views) to serve data to users? Is that correct?
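
    A sketch of that flow under those assumptions (names hypothetical; the silver and gold databases are assumed to exist): raw files land in bronze as-is, are written once into an open table format in silver, and gold exposes curated views:

        # Bronze -> Silver: raw JSON into a Delta table
        bronze_df = spark.read.json("/lake/bronze/events/")
        bronze_df.write.format("delta").mode("append").saveAsTable("silver.events")

        # Silver -> Gold: a consumer-facing view
        spark.sql("""
            CREATE OR REPLACE VIEW gold.daily_events AS
            SELECT event_date, COUNT(*) AS n_events
            FROM silver.events
            GROUP BY event_date
        """)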

  • @sid0000009 · 1 year ago

    Can an API hosted on an App Service fetch Delta table data in any way? Thanks

  • @mrnagoo2 · 1 year ago

    ACID = "Atomicity" not "Automicity". Thanks for the video.

  • @googlegoogle1812 · 2 years ago

    Do you know what the difference is between lake databases and the Delta Lake project? Both seem to have roughly the same functionality - I can use Spark to do ETL tasks, and then use Spark pools as well as serverless SQL pools to query the data.

  • @noahhadro8213 · 8 months ago

    So is this a true statement: Delta, Delta tables, and delta-parquet are all synonyms and mean the same thing?

  • @crystal9543 · 1 year ago

    Yes, explore the BOG - boots on the ground.

  • @matthiask4602 · 2 years ago +3

    Adam looks different today...

  • @szklydm · 2 years ago +1

    PySQL should be a thing! 😁

  • @RamyNazier · 8 months ago

    Nice video, but the audio quality makes it a bit harder to understand.

  • @martinbubenheimer6289 · 2 years ago +1

    Previously I would not have Parquet files; previously I would have a SQL Server. What problem does a delta lake solve compared to just using a SQL Server?

    • @ghoser1986 · 2 years ago +2

      When your table is in the Delta format, it opens up new use cases for you: streaming, data science, and analytics from a single source of truth, in the language of your choice - SQL, Python, R. Ultimately you get more value from your data without having to store it multiple times. Also, it's open and cheaper to store vs SQL Server.
      I'm biased, but I feel Databricks allows you to get more value from it vs Synapse.

    • @MDevion · 2 years ago +1

      The big advantage is that, if you are in the cloud, computation and storage are separated. This means you only pay for storage when nothing is running, and only pay for processing when needed. For Serverless it's around 5 EUR/USD per 1 TB, with a minimum of 10 MB invoiced per query. It's a lot cheaper than a dedicated pool in Synapse.
      Also, it's easier to scale up. In Azure SQL DB (assuming you are using this), storage and computation scale up and down together. There is more flexibility than there used to be, but they are still linked. Also, Azure SQL DB is not suited for heavy DWH workloads, mainly due to how SQL Server logging works and Azure SQL DBs being locked into the FULL recovery model.

    • @martinbubenheimer6289 · 2 years ago

      Interesting aspect. I would love to discuss with the Exasol guys when they would recommend preferring a delta lake over their Exasol SQL database for scalability, heavy-workload, or analytics use cases.
      Do you know the origin of delta lake? ACID compliance doesn't seem to be the most important DWH requirement for a data storage solution on the silver layer, where access can be controlled by the ETL process: source systems write to the bronze layer, destination systems read from the gold layer, and in between, ETL can be orchestrated to eliminate the need for ACID compliance. This sounds more like a transactional requirement.

  • @helloranjan89 · 2 years ago

    Seems complex 🤔

  • @kshitizaggarwal1 · 1 year ago +1

    Atomicity, not automicity.

  • @kaizer-919 · 7 months ago

    This is really Bananas

  • @willi1978 · 9 months ago

    Guess I'd prefer Snowflake to this.