Advancing Spark - Building Delta Live Table Frameworks

  • Published: 28 Nov 2024
  • In previous videos we've worked with Delta Live Tables to make repeatable, reusable templates, and that's cool... but we had to create pipelines for each table we wanted to process, and that's no good when we're working at scale.
    Thanks to some handy pointers from the Databricks team, we can now encapsulate our template inside iterative functions, meaning we can create a whole bunch of delta live tables, with dependencies & logic, all within a single notebook.
    In this video Simon takes an example templated Delta Live Table notebook and builds a list-driven workflow that generates multi-stage live tables, all within a single pipeline (a sketch of the looping pattern follows below).
    As always, don't forget to hit like & subscribe, and get in touch if Advancing Analytics can help you achieve your Data Lakehouse objectives!
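A minimal sketch of that looping pattern, assuming a Databricks DLT Python notebook (where `spark` and the `dlt` module are available); the table names, landing paths and column logic are illustrative and not the exact code from the video:

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical list of source tables driving the pipeline.
table_list = ["customers", "orders", "products"]

def create_bronze_table(table_name):
    # Streaming bronze table fed by Auto Loader; the landing path is a placeholder.
    @dlt.table(name=f"bronze_{table_name}", comment=f"Raw ingest of {table_name}")
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(f"/mnt/landing/{table_name}")
        )

def create_silver_table(table_name):
    # The silver table reads the matching bronze table by name, so DLT can
    # resolve the dependency graph before any Spark jobs actually run.
    @dlt.table(name=f"silver_{table_name}", comment=f"Cleaned {table_name}")
    def silver():
        return dlt.read_stream(f"bronze_{table_name}").withColumn(
            "_loaded_at", F.current_timestamp()
        )

# One loop generates every bronze and silver table within a single notebook.
for t in table_list:
    create_bronze_table(t)
    create_silver_table(t)
```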

Comments • 56

  • @anupamchand3690
    @anupamchand3690 9 months ago +1

    Probably one of the best examples of creating a framework using DLT. I was looking for something like this for some time. Thanks!

    • @GuyBehindTheScreen1
      @GuyBehindTheScreen1 3 months ago

      Exactly! I’m not seeing anything in the documentation on how to do this.

  • @allthingsdata
    @allthingsdata 3 years ago +1

    Love how you present your thought process and how you iteratively improve it and also that you don't take out things that didn't work in the first place. Keep up the great work so we mere mortals can learn as well :-)

  • @madireddi
    @madireddi 2 years ago +4

    Hi Simon, Many thanks. You are the first to demo a metadata-driven framework using DLT. Could you share the code used in this demo?

  • @vap72a25
    @vap72a25 2 years ago

    Beautiful approach... really liked it!

  • @unknown_f2794
    @unknown_f2794 1 year ago

    Great work mate!!! I am very thankful for your video content and all the work you are putting into it.

  • @datoalavista581
    @datoalavista581 2 years ago

    Thank you so much for sharing!! Great presentation.

  • @atpugnes
    @atpugnes 2 years ago +1

    Great video. I've got 2 questions:
    1. If there is a cycle in the dependencies (a->b->c->a), does Databricks detect and flag that?
    2. How does restartability work? Can we restart from the point of failure?
    I would be curious to know more on these points.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Honestly, haven't tried circular references but I'd hope it throws an error during dependency processing before the pipeline commences. From a restart point of view, you can restart workflows, but I can't recall if you can restart a DLT pipeline.
      Time for a quick experiment I think!

  • @MrSvennien
    @MrSvennien 2 years ago

    Hi, and thanks for all the hard work you put into these presentations.
    A Delta Live Table will load all new data that arrives in the defined source (for instance a landing zone), but does the data already loaded need to remain in the source in the future, or can those records be removed or moved elsewhere after the initial load to the bronze layer? Thanks!

    • @anupamchand3690
      @anupamchand3690 7 months ago

      If you ever need to rerun the pipeline with a full refresh, that is the only time the historical data will be retrieved. Let's say you delete the old data: if you re-run the pipeline with a full refresh, only the data that is still present will be loaded. So yes, the raw CSV files should be kept for as long as you intend to keep the history on the tables.

  • @kurtmaile4594
    @kurtmaile4594 3 years ago +1

    Great demo, thanks. Hoping (guessing you will one day) you'll delve into 2 key DLT topics: 1) DLT testing strategies and patterns, 2) CI/CD and topics like rollback (e.g. can we time travel table data & metadata alongside code?)

    • @kurtmaile4594
      @kurtmaile4594 3 years ago

      …and considering SQL too! :)

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Certainly some elements we'll dig into - the tables created are just delta tables, so you can use all the usual delta tricks. For CI/CD & rollback it's interesting, as they're mainly just Databricks notebooks, which can be deployed using Repos, DevOps pipelines etc, but I've not seen a formal CI/CD story around them!

    • @kurtmaile4594
      @kurtmaile4594 3 years ago +1

      @@AdvancingAnalytics Great, thanks for the reply. Definitely need to dig into the operational behaviours (e.g. when updating a pipeline, will Databricks let all batches being processed in different pipeline steps finish first before changing it?), manual replays of quarantined data, and rollback. The mechanics of CI/CD are probably easy enough to figure out; it's the behaviours I'm a little unclear on at the moment. Also, emerging patterns / best practices around DLT testing feel super important. Love your work, thanks heaps for the great content!!!!

  • @zycbrasil2618
    @zycbrasil2618 9 months ago

    Hi Simon. DLT features look great, but my concern is the balance between features and cost. DLT is something like 3x more expensive in terms of DBUs. What's the best approach for cost-efficient designs: DLT or classic workflows?

  • @darryll127
    @darryll127 3 years ago

    Great demo / example.
    How do dependencies on individual tables in the pipeline work for a downstream pipeline that depends on them?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      They need to be in the same "DLT Pipeline" for the dependencies to figure themselves out. But if you run it as an entirely separate job based on the hive tables created by DLT, it should still work if it's scheduled to run after the initial one.
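As a rough illustration of the second option in the reply above, a separately scheduled (non-DLT) job can simply read the tables the pipeline materialised into its target database; the database and table names below are placeholders:

```python
# Assumes a Databricks notebook, where `spark` already exists, and that the
# DLT pipeline's target database is "dlt_db" (placeholder).
orders = spark.read.table("dlt_db.silver_orders")

daily_counts = orders.groupBy("order_date").count()

# Persist the downstream result outside the DLT pipeline.
daily_counts.write.mode("overwrite").saveAsTable("reporting.daily_order_counts")
```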

  • @chitrar2994
    @chitrar2994 2 years ago

    Hi Simon, I love your videos... they're straight to the point and cover the things I'm looking for pointers on (like the missing features :). Anyway, with regards to this DLT framework that you have built, let's say I want to load the curated data and skip the dependency on the silver tables refresh - basically execute just the gold notebook component in the DLT pipeline... is that possible?

  • @letme4u
    @letme4u 2 years ago

    Great demo Simon. Will you help with integrating this gold/silver data into visualization tools like Superset or Plotly?
    If you have covered this already, please share the URL/tutorial.

  • @likhui
    @likhui 2 years ago

    Firstly Simon, thanks a lot for tackling & demonstrating such a framework; much appreciated. Just curious if you've played around with passing parameters into a DLT pipeline, such as via ADF, and wondering what your thoughts are on it; i.e.:
    1) pass a JSON output from an ADF activity as a string to the 'configuration' property of the DLT pipeline's JSON via a REST API Web activity
    2) retrieve the parameters passed in within the DLT notebooks (parse the 'JSON' string as you see fit/relevant)
    3) essentially your 'tableList' is derived from the passed-in parameters
    Keen to know your experiments/approach in implementing a parameterised DLT from an external caller. Thanks in advance.
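For steps 2 and 3 above, one hedged sketch: values placed in the pipeline's `configuration` map are exposed to the notebook through `spark.conf`, so a JSON string can drive the table list. The key name `tableConfig`, the landing path and the table shape are assumptions, not from the video:

```python
import dlt
import json

# "tableConfig" is a hypothetical configuration key set by the external caller
# (e.g. ADF via the REST API); default to an empty JSON array if absent.
raw_config = spark.conf.get("tableConfig", "[]")
table_list = [entry["name"] for entry in json.loads(raw_config)]

def create_bronze_table(table_name):
    # Same factory idea as the earlier sketch; the path is a placeholder.
    @dlt.table(name=f"bronze_{table_name}")
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .load(f"/mnt/landing/{table_name}")
        )

for t in table_list:
    create_bronze_table(t)
```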

    • @ayushmangal4417
      @ayushmangal4417 1 year ago

      Hi Luke, did you get anywhere with passing JSON as a parameter into DLT from an external source?
      Thanks

    • @likhui
      @likhui 1 year ago

      @Ayush Mangal I did it via a REST API call. Not sure if that answers your question?

    • @anupamchand3690
      @anupamchand3690 7 months ago +1

      You can very well do this, but you need 2 API calls: one to update the pipeline with the new configuration, and then the actual call to start the pipeline. The REST API to start the pipeline does not have any ability to pass parameters to it. Also, only 1 instance of a DLT pipeline can execute at a time. You cannot treat this like a workflow, where you can have the same workflow triggered multiple times with different parameters.
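A rough sketch of those two calls against the Databricks Pipelines REST API (endpoint paths and field names as documented at the time of writing; host, token, pipeline ID and the configuration payload are placeholders, so verify against the current API docs):

```python
import requests

host = "https://<workspace-host>"
headers = {"Authorization": "Bearer <personal-access-token>"}
pipeline_id = "<pipeline-id>"

# 1) Fetch the current pipeline spec, overwrite its configuration map, and push it back.
spec = requests.get(f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers).json()["spec"]
spec["configuration"] = {"tableConfig": '[{"name": "customers"}, {"name": "orders"}]'}
requests.put(f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers, json=spec)

# 2) Start an update; this call itself accepts no custom parameters.
requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers=headers,
    json={"full_refresh": False},
)
```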

    • @likhui
      @likhui 7 months ago +1

      @anupamchand3690 thanks for your reply; I believe that's how I implemented it previously. Cheers.

  • @yiyang9489
    @yiyang9489 3 years ago

    Thanks, great demo Simon. I like the for loop you created in the first notebook (bronze to silver) to make it a generic pipeline. Would the loop run in parallel or sequentially? Thinking about whether a failure in one of the DLTs would have any impact on the other tables.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      The definition of the table is performed sequentially, but that doesn't have much relevance. The actual loading of data will be performed in parallel (as much as your cluster sizing allows for), as it works out dependencies & table definitions before kicking any actual spark jobs off.
      Failure is an interesting one - if one of the definitions failed, I assume the whole pipeline will fail to resolve. If there is a failure whilst loading a specific table, it would be interesting to see how the rest of the dependencies are managed! I assume running jobs will complete but no more jobs will be kicked off, but I haven't tested it!
      Simon

  • @Mta213
    @Mta213 3 years ago

    Simon, I love the tutorials! If the data contains GDPR fields, or there was a need to add surrogate keys for Dim tables in the Gold level, I'm not sure this approach would be realistic. It seems a lot more metadata and complexity would be required? Let me know if I am being short-sighted on this. Maybe those would require different patterns?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Oh for sure, the examples I'm running through are the most basic "I need a simple data model" patterns while I kick the tyres on DLT. As consultants, Advancing Analytics uses a home-grown set of python libraries, metadata stores and custom notebooks to automate the complexity you're talking about, and you can't get that inside DLT yet. It's heading in a promising direction, but it's still the "easy path" solution for now!
      Simon

  • @kuldipjoshi1406
    @kuldipjoshi1406 2 years ago

    Wanted to know what CI/CD would look like. Deploying notebooks should be easy enough, but how would the schedule (trigger) be deployed? Is there any way to call the workflow notebook from Data Factory and use triggers, and also pass configs from there to either the workflow or the notebook directly?

  • @datalearn4398
    @datalearn4398 3 years ago

    You rock! I have a problem with switching from streaming (appending) to batch, where I have fact tables and I need to use historical data and update. What is the best practice for transitioning from stream to batch?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      We generally stream into our "Base" layer (call it silver, clean, ODS, whatever) and then batch read from there into our "Curated" (gold, model etc). If we are worried about historical records in our curated layer, we will often store type 2 or type 4 data in Base and ensure our Curated is built in such a way to apply those changes and rebuild the history en-masse.
      There's no current way to do that using DLT, but I'd hope to see things like merges coming in the future so we can apply that kind of pattern!
      Simon
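A hedged sketch of that pattern using plain structured streaming (outside DLT); the paths, database and table names, and aggregation logic are illustrative:

```python
from pyspark.sql import functions as F

# Stream (append) from the landing zone into the Base layer via Auto Loader.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/base_sales_schema")
    .load("/mnt/landing/sales")
    .withColumn("_ingested_at", F.current_timestamp())
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/base_sales")
    .toTable("base.sales")
)

# Separately scheduled batch job: rebuild the Curated fact from Base history,
# so type 2/4 changes can be reapplied en masse.
base = spark.read.table("base.sales")
fact = base.groupBy("sale_date", "store_id").agg(F.sum("amount").alias("total_amount"))
fact.write.mode("overwrite").saveAsTable("curated.fact_sales")
```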

  • @brads2041
    @brads2041 2 years ago

    We are considering DLT, I suspect for a file ingest + validation pipeline, so I'm very interested in the templating, but need more exploration of the exceptions and corresponding logging, as we have a number of validations and some are complex. Also, we logically split the data flow into valid and invalid data sets (both end up stored), so how that would work here is a question.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      The expectations are great, provided you can express the rule as a SQL statement, but if you want to retain & split, I'd instead use similar logic to add a calculated column to the data frame, then define two table destinations (i.e. do it the same way you would in a normal spark batch). Possibly something could be done that's a little more baked into the process, but there are certainly approaches!
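A minimal sketch of the retain-and-split approach described above, alongside the expectation-based alternative; the table names and the validity rule are illustrative:

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical validity rule expressed as a column condition.
validity_rule = F.col("amount").isNotNull() & (F.col("amount") >= 0)

@dlt.table(name="silver_sales_valid")
def silver_sales_valid():
    return dlt.read("bronze_sales").where(validity_rule)

@dlt.table(name="silver_sales_invalid")
def silver_sales_invalid():
    return dlt.read("bronze_sales").where(~validity_rule)

# Alternative: if simply dropping bad rows is acceptable, a DLT expectation
# handles it within a single table definition.
@dlt.table(name="silver_sales_clean")
@dlt.expect_or_drop("non_negative_amount", "amount IS NOT NULL AND amount >= 0")
def silver_sales_clean():
    return dlt.read("bronze_sales")
```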

  • @briancuster7355
    @briancuster7355 2 years ago

    Can you use a metadata-based framework with Delta Live Tables? If so, how would you do that?

  • @trodenn4977
    @trodenn4977 1 year ago

    Is it possible to use writeStream inside Delta Live Tables?

  • @jordanfox470
    @jordanfox470 2 years ago +1

    Does it run the tables in parallel (concurrently) or in sequence (one at a time)?

    • @rajkumarv5791
      @rajkumarv5791 2 years ago

      Had the same question - did you figure this out, Jordan?

    • @anupamchand3690
      @anupamchand3690 7 months ago

      The actual loading happens concurrently, unless there is a dependency between 2 different loads; DLT will understand this and load the dependency first.

  • @datalearn4398
    @datalearn4398 3 years ago

    Another question: at big scale, with hundreds of tables, how do you recover a broken stream and fix the staging tables?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      With most of the streaming patterns, the unfortunate answer is to delete the checkpoint and reprocess, which can sometimes lead to HUGE reprocessing time/cost. If you're working with standard streaming you can include the delta transaction starting version, but with DLT you don't have that option; if you want to restart, it's incremental loading or a complete rebuild!
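For the standard (non-DLT) streaming case mentioned above, the Delta source's `startingVersion` option lets a rebuilt stream resume from a known table version rather than reprocessing everything; the version number, paths and table names are placeholders:

```python
# Requires a fresh checkpoint location, since startingVersion is only honoured
# when the stream starts without existing checkpoint state.
(
    spark.readStream.format("delta")
    .option("startingVersion", 1234)
    .table("base.sales")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/base_sales_retry")
    .toTable("curated.sales_stream")
)
```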

  • @kaurivneet1
    @kaurivneet1 2 years ago

    Great content as always Simon! One question on adding the notebook list (aka libraries) in the pipeline configuration: is the notebook execution carried out only sequentially? Is there a way that my 2 gold notebooks, which are not dependent on each other, can run in parallel after completing the first (bronze + silver) notebook?
    Also, if my gold table is dependent on only 3-4 silver tables, will it still wait for the first notebook (bronze + silver) to complete?

    • @anupamchand3690
      @anupamchand3690 7 months ago

      DLT is intelligent enough to know what the sequence is, if there is one at all. If there is no dependency, the notebooks will be executed together, provided there is enough computing power to do so.

  • @rajkumarv5791
    @rajkumarv5791 2 years ago

    Hi Simon, do those different bronze to silver extractions happen in parallel or in sequence? If it's a sequence, perhaps an ADF-based parameterization where you can call multiple bronze to silver ingestions in parallel would be a better approach?

    • @anupamchand3690
      @anupamchand3690 7 months ago

      Bronze to silver will happen in parallel if there is no dependency. If there is a dependency, that will be honoured and the processing will happen in sequence.

  • @jithendrakanna2532
    @jithendrakanna2532 2 years ago

    Hi, it is a great video. Could you also let us know how to drop a Delta Live Table?

  • @aaronpanpj
    @aaronpanpj 3 years ago

    1. Does it mean that to have visibility of all end-to-end lineage, we need to include all notebooks in a single pipeline?
    2. When is it going to support updates into a table?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      Yep, to see all lineages together it would have to be in a single pipeline - although the demo slides around Unity Catalog showed similar lineage, so that might bridge that gap? Maybe?
      Regarding updates I have no idea of timescales, but Databricks know there are a lot of use cases for people needing merge functionality!

  • @dipeshvora8621
    @dipeshvora8621 3 years ago

    How can we merge incoming data into the silver layer using Delta Live Tables?

  • @pappus88
    @pappus88 3 years ago

    I tried a similar pattern 2 weeks ago, but the "second layer" table was unable to refer to the "first layer" table since they weren't created in the same function call. I was also unable to refer a live table to another live table created in a different notebook (although part of the same pipeline). Got a weird "out of graph" or something error...
    Also interested to know any info on update instead of append/overwrite?
    Appreciate your contributions Simon!

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Hrm, weird - were you referring to the tables using their full hive name, or using live. as the "database"? Seemed to work fine as long as everything was live.tablename, whether they were in different functions, notebooks or whatever, as long as they're all in the same pipeline!
      No news on updates, the databricks team know it's a big customer ask, but I'll share when there's news!
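A small sketch of the working pattern described in the reply above: a parameterised function whose table definition refers to another live table via the LIVE keyword, which resolves across functions and notebooks as long as everything sits in the same pipeline (the table names are illustrative):

```python
import dlt

def create_gold_table(table_name):
    @dlt.table(name=f"gold_{table_name}")
    def gold():
        # "LIVE." resolves within the pipeline, even if silver_<table_name>
        # is defined in a different function or notebook.
        return spark.table(f"LIVE.silver_{table_name}")

for t in ["customers", "orders"]:
    create_gold_table(t)
```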

    • @pappus88
      @pappus88 3 years ago

      @@AdvancingAnalytics I used live.tablename. It worked fine if the DLT was declared in absolute terms (i.e. not parameterised inside a function). I've got to retry then...

  • @RishiAnand-n5e
    @RishiAnand-n5e 1 year ago

    Hi Simon, could you please share a link to the code used in this demo?

  • @naginisitaraman9715
    @naginisitaraman9715 2 years ago

    Can we create 4 different Delta Live Tables in the same single pipeline?

    • @AdvancingAnalytics
      @AdvancingAnalytics  1 year ago

      Yep, you can create as many tables as you want, even across separate notebooks.