Advancing Spark - Delta Live Tables Generally Available!

  • Published: 28 Jul 2024
  • There's a huge amount of work that goes into building out a data processing framework, and getting it right with auditing, logging and data quality monitoring is a complex task in itself. Delta Live Tables provides a handy framework that takes your table definitions, interprets the dependencies and produces an end-to-end data pipeline for you!
    In this video, Simon looks at the Launch notes, reviews the current state of DLT as we hit General Availability, and quickly recaps what can be done through Delta Live Tables for those new to the tool!
    The release announcement can be found here: databricks.com/blog/2022/04/0...
    As always, if you need help accelerating your Lakehouse journey, get in touch with Advancing Analytics!

Comments • 32

  • @richardjtomlinson 2 years ago +1

    Great video Simon! Would love to see you dig deeper into the streaming / continuous pipeline side in another vid.

  • @alexischicoine2072 2 years ago +2

    Thanks for covering this Simon :)
    I remember when I tried it that you can't run this outside of the pipeline, which makes it hard to run pieces individually without breaking up the pipeline. The other issue is the source control integration, which is the same problem as for clusters, jobs, cluster policies, etc. Our Databricks consultant suggested we use Terraform and do infrastructure as code, but my organization is nowhere near that level of maturity. It would be nice to have this saved to source control like with notebooks. I had a problem with multi-task jobs where a job disappeared; fortunately I had a backup of the JSON, but imagine having to redo the configuration of over 50 tasks, each with multiple dependencies, retry settings, etc. I'm going back to orchestrating in Synapse pipelines mainly for the Git integration, even though I'll lose out on cluster reuse (though I'm using pools and it'll be easier to right-size the jobs, so it shouldn't be a big difference).
    Most of the benefits of DLT you can code in a few functions easily. Just add parameters to the functions you use for saving your tables. How hard would it be to pass a dictionary with the assertions and process them? A function to simplify merge is also just a few lines of simple code (see the sketch below). I have many such functions of my own that my colleagues can use easily without understanding all the details.
    I guess the target audience is users with simple use cases who won't be building too much of their own functionality.
    If DLT could automate real change data capture, bypassing the append-only limitations of streaming when running in batch, that would be a killer feature making it worthwhile. Databricks already has change data feeds on Delta tables, but the logic for processing these can get quite involved, so having it automated would bring a lot of value.
    This reminds me of mapping data flows in Data Factory or Synapse, where you lose quite some flexibility and pay a lot more while not getting that much more simplicity. If you're building something that's expensive to run, I don't feel those tools save you enough development hours to justify their cost.
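    A minimal sketch of the kind of helpers described above, outside DLT: a dictionary of assertions applied as filters plus a small merge wrapper. The names (apply_assertions, upsert_to_delta) and the choice to drop failing rows are illustrative assumptions; only plain PySpark and the Delta Lake Python API are used.

        from delta.tables import DeltaTable
        from pyspark.sql import DataFrame, SparkSession
        import pyspark.sql.functions as F

        def apply_assertions(df: DataFrame, assertions: dict) -> DataFrame:
            """Drop rows failing any SQL predicate in `assertions` (name -> predicate),
            loosely mimicking DLT's expect_or_drop behaviour."""
            for name, predicate in assertions.items():
                df = df.filter(F.expr(predicate))
            return df

        def upsert_to_delta(spark: SparkSession, df: DataFrame, target: str, keys: list):
            """Small merge wrapper: update matching rows, insert new ones."""
            condition = " AND ".join(f"t.{k} = s.{k}" for k in keys)
            (DeltaTable.forName(spark, target).alias("t")
                .merge(df.alias("s"), condition)
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())

        # Example usage with a hypothetical table and rule:
        # clean = apply_assertions(df, {"valid_id": "order_id IS NOT NULL"})
        # upsert_to_delta(spark, clean, "silver.orders", keys=["order_id"])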

    • @mahmuzicamel 1 year ago

      Same for me. Implementing and debugging the code is really frustrating, despite the great capabilities of Delta Live Tables. Another area with room for improvement is that you currently can't add external Maven dependencies. Being able to would be a real advantage if you want to pull in other external sources, such as Elastic Cloud, through DLT. Currently, that's only possible if the bronze data is already in the data lake.

  • @kaurivneet1 2 years ago +1

    The moment DLT went GA, I was searching YouTube for your new videos :D. Great content as always! I have been experimenting with it for some time. One problem I have faced is how to build a Type 2 style table: as readStream is a streaming concept, it doesn't allow window lead/lag functions, so I am unable to build history. The other problem is with apply_changes: what if I want to pass in additional parameters to join on (e.g. iscurrent='Y')? It only allows a column list in keys. Any suggestions?
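    For reference, DLT later added built-in SCD Type 2 handling to apply_changes, which sidesteps the lead/lag workaround; a hedged sketch based on the documented API, with hypothetical table and column names. Extra join predicates still aren't supported in keys, so conditions like iscurrent='Y' would need to be applied when defining the source view.

        import dlt
        from pyspark.sql.functions import col

        # Target streaming table that DLT maintains as SCD Type 2.
        # (Older releases named this API create_streaming_live_table.)
        dlt.create_streaming_table("dim_customer_scd2")

        dlt.apply_changes(
            target = "dim_customer_scd2",
            source = "customer_changes",     # a streaming table/view defined elsewhere in the pipeline
            keys = ["customer_id"],          # business key columns only
            sequence_by = col("change_ts"),
            stored_as_scd_type = "2"         # DLT adds __START_AT / __END_AT history columns
        )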

  • @ilia8265 1 year ago

    Great content!! But how does this affect DevOps?

  • @dangz90 2 years ago

    Hi Simon, how would you recommend integrating Delta Live Tables and feature tables? Would you recommend saving into feature tables inside the DLT pipeline?

  • @mvryan 2 years ago

    Has anyone been able to use libraries dependent on JARs? I've tried a number of ways but cannot seem to coerce the clusters to install JARs on launch. Even setting spark.jars.packages didn't help, which was surprising since I thought it would install from Maven regardless.

  • @jasperputs 2 years ago

    Hey Simon, amazing video as usual. I love how you start with an explanation of a new concept followed by some examples. One thing I still struggle with in Databricks is how you manage where the data is actually stored. As I understood it, you store your Delta tables in different databases/schemas, each of which is typically a different folder in your data lake. Is this also possible with DLT? It seems that you use bronze, silver, ... in the table name? Is it possible to change the storage location per table?

    • @AdvancingAnalytics 2 years ago +2

      Hey! So with DLT, you specify the "location" at the pipeline level, so all tables are folders within that main root folder. If you want separate storage locations, you would need to build separate DLT pipelines and chain them together with a DBX Workflow!

    • @jasperputs 2 years ago

      @AdvancingAnalytics Thanks for the answer! That makes sense. But if you create different DLT pipelines (to store the tables in different folders), can you build on DLTs in other folders? I haven't found a way to do that yet.

    • @danrichardson7691 2 years ago

      @jasperputs I believe you need to define them as views. It's a shame you have to do this, but you can programmatically generate the views - it just doesn't give you that bronze-to-gold dependency view that you'd get with one pipeline. The best compromise I found was putting everything in one pipeline and specifying the "path" option alongside the table name - here you can define the physical storage location (see the sketch below). The problem with this solution is that everything appears under one Hive metastore database. Trying to cheat your way around this by manually specifying the schemas gives this error: "Materializing tables in custom schemas is not supported".
      I hope they expand the framework to be a bit more dynamic - putting all your bronze, silver and gold tables in one db with prefixes (like in this demo) seems a bit disorganised :|
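      A short sketch of the "path" option mentioned above; the storage URI and table names are placeholders.

          import dlt
          from pyspark.sql.functions import col

          @dlt.table(
              name = "silver_orders",
              path = "abfss://lake@mystorageacct.dfs.core.windows.net/silver/orders"  # physical location for this table
          )
          def silver_orders():
              # Read from another table defined in the same pipeline.
              return dlt.read_stream("bronze_orders").where(col("order_id").isNotNull())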

  • @tjommevergauwen 1 year ago +1

    Hi, is it actually possible to use existing Delta tables as an input source for Delta Live Tables?
    Preferably so that it only processes the changed rows from the existing Delta table into the Delta Live Table.
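    One possible approach, sketched under the assumption that the existing table is append-only (updates and deletes would need the change data feed or similar handling); the table names are hypothetical.

        import dlt

        @dlt.table(name = "dlt_orders")
        def dlt_orders():
            # Incremental read of an existing Delta table registered in the metastore;
            # a streaming read only picks up newly appended rows.
            return spark.readStream.table("existing_db.orders")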

  • @vikrantrathaur_ 1 year ago

    It would be great if we could have access to the notebooks you are using for the demo. I would be happy to run the code myself and play around with it.

  • @kurtmaile4594 2 years ago

    Great vid! Do you know if you can read the change stream on the other side of the merge / apply_changes? My understanding is that you can read changes into a table, but not yet on the other side - reading the changes from the target table after the write.

    • @AdvancingAnalytics 2 years ago

      Ooh, haven't yet tried! Delta Streaming prefers append-only transactions, but you can build around it normally. Not sure if we have any restrictions in DLT. That's one for a future video!
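      Outside DLT, a streaming read of the change data feed on the merge target would look roughly like this, assuming delta.enableChangeDataFeed is enabled on the table (table name hypothetical); whether DLT itself restricts this is the open question above.

          # Stream the change data feed (inserts, updates, deletes) from the target table.
          changes = (spark.readStream
              .option("readChangeFeed", "true")
              .table("silver.customers"))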

  • @rayt0t 1 year ago

    Hi Simon, great videos! Been googling this problem and I couldn't find the answer. How can I trigger a Delta Live pipeline from Azure Data Factory? Any chance you know how to do this? I want to leverage the pipeline, not just the notebooks, from ADF. Regards

  • @rajkumarv5791 2 years ago

    Nice one! Would it be possible to trigger the Delta Live pipelines from Azure Data Factory?

    • @AdvancingAnalytics 2 years ago +1

      Not sure if you can directly, but you can certainly trigger it from a Databricks Job, then trigger that job from an ADF API Call?
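      Either route comes down to an HTTP call that ADF's Web activity (or any client) can make against the Databricks REST API; a hedged sketch with placeholder workspace URL, token and IDs.

          import requests

          host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
          headers = {"Authorization": "Bearer <databricks-token>"}       # placeholder token

          # Option 1: start a DLT pipeline update directly.
          requests.post(f"{host}/api/2.0/pipelines/<pipeline-id>/updates", headers=headers)

          # Option 2: run a Databricks Job that wraps the pipeline.
          requests.post(f"{host}/api/2.1/jobs/run-now", headers=headers, json={"job_id": 123})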

  • @NeumsFor9 2 years ago

    Do we know if the storage of the Data Quality violations is anything like the Kimball architecture for Data Quality?

    • @AdvancingAnalytics 2 years ago

      We do, they're not! The DQ event metrics are currently part of the event stream Delta table - I put a video together exploring these metrics a while ago: ruclips.net/video/He8lSBZjZlQ/видео.html
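      A rough sketch of pulling expectation metrics out of that event log; the storage path is the pipeline's configured storage location (placeholder here), and the exact JSON layout of the details column may differ between releases.

          # The DLT event log is itself a Delta table under the pipeline storage location.
          events = spark.read.format("delta").load("<pipeline-storage>/system/events")

          dq = (events
                .where("event_type = 'flow_progress'")
                .selectExpr(
                    "timestamp",
                    "get_json_object(details, '$.flow_progress.data_quality.expectations') AS expectations")
                .where("expectations IS NOT NULL"))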

  • @lancasm 2 years ago

    DLT! The hairy cornflake?

  • @user-zv9um9pb6w 1 year ago

    You're honestly telling me you support using notebooks in production? Wow.

    • @user-zv9um9pb6w 1 year ago

      And it doesn't seem to do parameterisation... so if you have a few pipelines it's "maybe" OK. But scale this across 1k pipelines? Ick.

    • @AdvancingAnalytics 1 year ago

      I 100%, absolutely, with zero doubts support using notebooks in production! - Simon

    • @AdvancingAnalytics 1 year ago

      @user-zv9um9pb6w And yep, you can parameterise and build generic DLT pipelines. Run over some metadata and spin up dependencies across hundreds of pipelines with a single notebook. I'll admit we don't use it for 1k+ pipeline workloads, mainly because of flexibility around the underlying lake structure!
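      The metadata-driven pattern described here usually looks something like the sketch below: loop over a config list and register one table per entry. The sources list, paths and file format are hypothetical.

          import dlt

          sources = [
              {"name": "orders",    "path": "/mnt/landing/orders"},
              {"name": "customers", "path": "/mnt/landing/customers"},
          ]

          def make_bronze(cfg):
              @dlt.table(name = f"bronze_{cfg['name']}")
              def _bronze():
                  return (spark.readStream.format("cloudFiles")        # Auto Loader
                          .option("cloudFiles.format", "parquet")
                          .load(cfg["path"]))
              return _bronze

          for cfg in sources:
              make_bronze(cfg)   # each call registers one table in the pipeline graph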

    • @user-zv9um9pb6w 1 year ago

      @AdvancingAnalytics Well, that's where we differ. Production workloads should be done with a little more quality. Notebooks are at best hacky.

    • @user-zv9um9pb6w 1 year ago

      @AdvancingAnalytics I get what you're saying for that style of pattern. I'll look at the REST API again, but I was referring to passing it a value used to process generically. If I have 2k tables/files to process, I'm going to trigger based on something unique like a path. I'm not going to have a list of tables to check for updates. JMO

  • @gardnmi 2 years ago +3

    No mixed-language support in a single notebook. No Scala support. No interactive cluster support. I'm not impressed.
    Edit: Been testing it for two days now. It seems like it was originally built for SQL only and they jammed in Python at the last second, which makes a lot of sense when you look at the non-Pythonic code examples and the fact that you can't create DataFrames outside of the dlt functions.

    • @alexischicoine2072 2 years ago +1

      Yeah, you have to work around that by creating functions that build your DataFrame and calling those inside the table definition. A bit annoying, but it works and lets you test your transformations outside of the pipeline.
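      A sketch of that workaround: keep the transformation in a plain function so it can be unit-tested against any DataFrame, then call it from the table definition. Names are hypothetical.

          import dlt
          from pyspark.sql import DataFrame
          from pyspark.sql.functions import col

          def clean_orders(df: DataFrame) -> DataFrame:
              # Pure transformation - testable outside the pipeline, no dlt required.
              return df.where(col("order_id").isNotNull()).dropDuplicates(["order_id"])

          @dlt.table(name = "silver_orders_clean")
          def silver_orders_clean():
              return clean_orders(dlt.read_stream("bronze_orders"))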

    • @danrichardson7691 2 years ago

      @alexischicoine2072 This is the way