Let's Build A...Delta Lake Solution using Azure Synapse Analytics Mapping Data Flows

  • Published: 24 Oct 2024

Comments • 29

  • @harryakb11
    @harryakb11 3 months ago

    Your tutorial is really helpful, mate. Big thanks!

  • @ashah2910
    @ashah2910 2 years ago +2

    Easy to assimilate the knowledge from this tutorial into real-life situations. No need to write PySpark etc. scripts when not required.

  • @pankajnakil6173
    @pankajnakil6173 1 year ago

    Thank you for such good content.
    The way you calmly explain the basic concepts is fabulous, would like to binge on other videos of yours.

  • @capoeiracordoba
    @capoeiracordoba 1 year ago

    Thanks for the demo!! Great resource videos.

  • @Poups26
    @Poups26 2 years ago +2

    Hello,
    Thanks for this demo, it helped me start setting up my architecture.
    However, something is quite confusing to me: I did not see you define a primary key or any rule for the delta lake to identify the corresponding record to update when a record is changed.
    In your case, the correct ProductIDs were updated. But how were they identified?
    After you changed the price, how was it able to understand that the new record is actually the same product with a price change, although you never mentioned that a product is identified by the "ProductID" field?
    Could you clarify how it works?
    Thank you and looking forward to the next video!

    • @DatahaiBI
      @DatahaiBI 2 years ago

      Hi, actually there wasn't anything particularly clever here as it was just a Delta update. With Delta you can simply write new data to the Delta "table" and it keeps a history of the changes, so there's no need to join on any columns. However, you can join on columns when doing an UPSERT, and yes, you need a key column (or a set of columns to act as a key), which is configured in the Delta sink. You also need an Alter Row transform before the Delta sink.
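
      For context, here is a minimal PySpark sketch (outside Mapping Data Flows) of the two write patterns described in this reply, assuming the delta-spark package is available; the paths are purely illustrative and ProductID is the key from the video.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

delta_path = "/lake/cleansed/products"               # hypothetical Delta table location
new_rows = spark.read.parquet("/lake/raw/products")  # hypothetical incoming data

# 1) Plain append: just write the new data. Delta versions every commit,
#    so change history is kept without joining on any key.
new_rows.write.format("delta").mode("append").save(delta_path)

# 2) UPSERT: a key column (ProductID, as in the video) lets the merge match
#    incoming rows to existing rows, similar in spirit to the Alter Row
#    transform plus the key columns configured on the Delta sink.
target = DeltaTable.forPath(spark, delta_path)
(target.alias("t")
    .merge(new_rows.alias("s"), "t.ProductID = s.ProductID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```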

  • @jasonwilson6043
    @jasonwilson6043 1 year ago

    Great video! Have you posted the follow-up video covering only capturing changes from the source, and partitioning?

    • @DatahaiBI
      @DatahaiBI 1 year ago

      No, not yet. I've got a question open with MS about destination partition overwriting.

  • @dperezc88
    @dperezc88 2 years ago +2

    Excellent tutorial; looking forward to the one on partitioning.

    • @DatahaiBI
      @DatahaiBI 2 years ago +2

      Thanks, partitioning video coming soon

    • @nagoorpashashaik8400
      @nagoorpashashaik8400 1 year ago

      @@DatahaiBI - Is the partitioning video out?

    • @DatahaiBI
      @DatahaiBI 1 year ago

      @@nagoorpashashaik8400 No, I haven't worked on that yet. Delta is quite limited in functionality with Mapping Data Flows; I'm working on how to optimise the merge process when the data is partitioned.

  • @rafibasha4780
    @rafibasha4780 2 years ago +1

    Excellent tutorial

  • @darta1094
    @darta1094 1 year ago

    Thanks, good session. Perhaps add an AS OF demo to show time travel.

    • @DatahaiBI
      @DatahaiBI 1 year ago

      Yes, good idea. At some point I'll look at time travel.
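
      For reference, a minimal PySpark sketch of the AS OF style time travel mentioned here, reusing the same hypothetical table path as the earlier sketch; the version number and timestamp are illustrative and would come from the table's commit history (DESCRIBE HISTORY).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Delta table as of an earlier version or timestamp (time travel).
# Path, version number and timestamp below are illustrative assumptions.
products_v0 = (spark.read.format("delta")
               .option("versionAsOf", 0)
               .load("/lake/cleansed/products"))

products_before = (spark.read.format("delta")
                   .option("timestampAsOf", "2024-10-23 00:00:00")
                   .load("/lake/cleansed/products"))
```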

  • @souranwaris142
    @souranwaris142 1 year ago

    Hello, I need help with a delta (incremental) load from an Oracle server to blob storage (CSV). First I loaded the table in full with a Copy activity; now I want to make another pipeline with a trigger that runs every week, loads the new data and appends it to the master CSV (the one loaded by the first pipeline). Or could you suggest how I can load a table from Oracle to the data lake in CSV with a delta (incremental) load?
    I'm attempting to obtain fresh data using a query-based Lookup activity, but I have no idea how to use a data flow to append the fresh data to the master CSV in the data lake. Perhaps I'm going about this the wrong way.

  • @RonaldPostelmans
    @RonaldPostelmans 2 years ago +1

    Thanks, great video. I thought Delta Lake was something in Databricks. Question: I would like to see you try getting only the new inserts and changes in the ForEach. Can you also do the delta approach when multiple Parquet files are joined together with left joins? For example, when I want to create an incrementally updated product dimension table that joins to product category and product group in two left joins and finally lands in a Delta dimension table. I'm very curious about your answer, thanks. Greetz, Ronald

    • @DatahaiBI
      @DatahaiBI 2 years ago +1

      Hi, thank you for watching. Yes, Delta is becoming very popular across different vendors. For the incremental load I would use some form of flag in the source system to identify whether something has changed. For changes to Delta via joins, I would construct the dataset from the joins (e.g. the several tables required to make a dimension), then look at what had changed, then do the merge - is this helpful?
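
      A rough PySpark sketch of that approach (build the dimension from joins first, then merge it into the Delta table on its key); the table paths, join keys and column names here are illustrative assumptions, not the exact flow from the video.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Assemble the dimension from several source tables first (illustrative paths and keys).
products   = spark.read.parquet("/lake/raw/products")
categories = spark.read.parquet("/lake/raw/product_categories")
groups     = spark.read.parquet("/lake/raw/product_groups")

dim_product = (products
               .join(categories, "ProductCategoryID", "left")
               .join(groups, "ProductGroupID", "left"))

# Then merge the assembled rows into the Delta dimension table on its key column.
dim_table = DeltaTable.forPath(spark, "/lake/model/DimProduct")
(dim_table.alias("t")
    .merge(dim_product.alias("s"), "t.ProductID = s.ProductID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```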

  • @daniel-florinstefan9252
    @daniel-florinstefan9252 2 years ago +1

    Thanks for the demonstration, Andy! I'm a beginner data engineer and I have a question: have you tried working with Delta tables in Lake Databases on Synapse? The documentation seems kind of unclear to me as to whether they work together or not; from my attempts they don't really seem to, but maybe you have more experience with it? Thanks in advance for the help and looking forward to the next video! :)

    • @DatahaiBI
      @DatahaiBI 2 years ago

      Hi, at the moment if you create a Lake Database and then use the database designer, it only supports delimited (CSV) or Parquet. If you create an external table in a Lake Database using Spark, then you can use Delta.
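
      A minimal sketch of that Spark route from a Synapse notebook: register an existing Delta folder as a table in a lake database so it appears alongside the Lake Database objects. The database, table and path names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create (or reuse) a database, then expose an existing Delta folder as a table.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_lakedb")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_lakedb.products
    USING DELTA
    LOCATION '/lake/cleansed/products'
""")
```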

  • @RonaldPostelmans
    @RonaldPostelmans 2 years ago

    Again, thanks. I have another question. Why do you load the data first from a SQL DB to a Parquet file and then, as a second step, to a different Parquet file as a sort of staging area? I mean, why not do the loading from the SQL database to the Delta format right away in a data flow?

    • @DatahaiBI
      @DatahaiBI 1 year ago +1

      Yes, that will also work. I'm demonstrating the different stages in the data loading cycle, from raw to cleansed etc. But yes, loading straight to Delta is a good idea.

  • @danielwie8472
    @danielwie8472 2 years ago +2

    Thanks for a great video! You do mention a minor change in the flow to not overwrite files but only the changes - coming in the next stream (approx. at ruclips.net/video/o2fJj6SVJlQ/видео.html). Which video is that?

    • @DatahaiBI
      @DatahaiBI 2 years ago +2

      Hi, as in incremental changes to the destination data? Yes, I have something coming; there should be a session video out soon where I talk about UPSERTing data into Delta. It's not as flexible as using Spark, as there's no ability to overwrite partitions: you need to use a key (a single column or multiple columns) to allow the merge process to work.

    • @Alex-cs5mf
      @Alex-cs5mf 1 year ago

      @@DatahaiBI Hi Andy, have you managed to do this video yet? I'd be interested in seeing it; it's the only missing piece in the architecture plans I am proposing internally, mostly thanks to you!

    • @DatahaiBI
      @DatahaiBI 1 year ago +1

      @@Alex-cs5mf Hi, not yet, no. I have a question stuck in the queue at MS to look at how partitions can be switched out.

    • @Alex-cs5mf
      @Alex-cs5mf 1 year ago

      @@DatahaiBI I'd be interested to know the results of that!

    • @vravensteijn
      @vravensteijn 1 year ago

      Me 2!

  • @michaeldemarco82
    @michaeldemarco82 9 months ago

    Just a tangential comment: he has the same vocal intonations as George Michael.