Advancing Spark - Making Databricks Delta Live Table Templates

  • Published: 9 Aug 2021
  • The new Delta Live Tables functionality within Databricks is intended to simplify data engineering tasks and automate a whole load of traditionally complex tasks for you... but at first glance this appears to come at the cost of flexibility and reusability.
    In this video Simon takes the example workbook from the documentation and deconstructs it to build a generic, metadata driven template that can be used across multiple DLT Pipelines.
    The sample code used in the video can be found on the databricks documentation: docs.microsoft.com/en-us/azur...
    As always, don't forget to hit like & subscribe, and get in touch if Advancing Analytics can help you achieve your data lakehouse objectives!

Comments • 34

  • @NeumsFor9
    @NeumsFor9 a year ago +1

    Took this and wrote to our custom metadata repo (data quality subsection), using the Kimball Architecture for Data Quality as the metadata output and the modernized Marco Metamodel for the input settings, including expected file formats and source-to-target mappings. Also integrated the audit keys in the target tables with both our audit dimensions and our data quality subsystem.

  • @RadThings
    @RadThings a year ago +1

    Awesome video. Really liked the templating example. As for making this more dynamic: you can set the DLT table name from a variable. So what you do is create a function called create_bronze_table(mytable: str, df: DataFrame), and inside it do:
    @dlt.table(name=mytable)
    def create_bronze_table():
        return df
    Then you can parameterize the config to point at a specific source and loop through all the tables, calling create_bronze_table and passing in the table name and the DataFrame (see the sketch below).
    You can take this to the next level and set up Auto Loader to listen to S3 events and pass those to your function to load a whole source of tables, which can drastically accelerate your source ingestion by giving you one generic pipeline.
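    A minimal sketch of that looping pattern, assuming Auto Loader reads CSV files from per-table landing paths (the table names and paths below are hypothetical):

    import dlt

    # Hypothetical source config; in practice this could come from the
    # pipeline's spark.conf settings or an external metadata store.
    sources = {
        "customers": "/mnt/landing/sales/customers/",
        "orders": "/mnt/landing/sales/orders/",
    }

    def create_bronze_table(table_name: str, source_path: str):
        # Setting the name on the decorator lets one function register
        # many tables in the DLT graph.
        @dlt.table(name=f"bronze_{table_name}")
        def _bronze():
            return (spark.readStream.format("cloudFiles")
                    .option("cloudFiles.format", "csv")
                    .load(source_path))

    for table_name, source_path in sources.items():
        create_bronze_table(table_name, source_path)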

  • @polakadi
    @polakadi 3 years ago

    Interesting feature indeed, thank you, Simon, for creating this video!

  • @jimzhang2562
    @jimzhang2562 a year ago

    Great videos! Thank you.

  • @datoalavista581
    @datoalavista581 2 years ago

    Brilliant! Thank you for sharing.

  • @julsgranados6861
    @julsgranados6861 2 years ago

    Thank you Simon!! Just great :)

  • @alexischicoine2072
    @alexischicoine2072 3 years ago

    I thought I'd go ahead and use Delta Live Tables for a project even though it's in preview. I didn't have many problems with the SQL endpoint, but this feature really isn't ready, as you warned us in another video. I had used it only for a few simple steps, so it was easy to redo using normal Spark streaming.
    A few times the pipeline seemed to be corrupt and wouldn't load properly; recreating it from the same notebooks and configuration fixed it. Another problem was that it sometimes took almost 10 minutes to start after the cluster initialized, which is just too long for something that took 1-2 minutes to run once started. As I explored it I found there were too many issues to justify the two main benefits I saw: seeing the graph of my steps and the data expectations. Looking forward to what they do with it in the future though; it could be amazing if done right.

  • @stvv5546
    @stvv5546 3 years ago

    Hey, that was a great example of getting certain functionality in a maybe not-so-standard way. But it also confirms how much stuff they (Databricks) still need to roll out in order to have fully functional generic/dynamic pipelines, right?
    While watching, I was really hoping that by the end we would see a convenient way to loop over those 'Address' and 'Product' tables without having to go to the pipeline JSON and manually change the pipeline parameter. Hopefully we'll get something like this in future releases of DLT. When we got to the point that we can't really parameterize the storage, well, I was a bit disappointed. I really hope they give us more control over that as well.
    Thanks Simon for those great insights into the Databricks world! Amazing!

  • @WhyWouldYouDrawThat
    @WhyWouldYouDrawThat 2 years ago

    All going well, we will absolutely be using this. We are looking at using Live Tables to build a data hub. This will primarily supply data to business apps, secondarily power analytics. Can you please help me out by doing a video on this? From everything I’ve read this is absolutely the best tool for this job. Essentially the source of data for 95% of enterprise ETL jobs will be live tables. We like the idea that the data we are using is the same data that is being reported, and is also 100% up to date. I’m also interested in publishing changes from delta tables to Azure data hubs for ease of consumption. Very keen to hear your thoughts and comments.

  • @briancuster7355
    @briancuster7355 2 years ago

    I tried Delta Live tables on a project and it worked out pretty well. I didn't use them entirely for my ETL but I did use them to go from silver to other intermediate silver tables and to gold tables. I found it to be pretty practical and easy to use.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      Nice! That's a great use case in my head - using it where it's likely that we have business users expressing logic in SQL, but who still want the elements of engineering applied!

    • @briancuster7355
      @briancuster7355 2 years ago

      @@AdvancingAnalytics Yes, we did it expressly for that reason. We have also been trying to automate the whole process and ended up creating a notebook where a user can come in, build a pipeline, and save all of the metadata to a SQL database. Then, when it comes time to build the pipeline in Databricks, we have a method that translates the configuration from the database back out into a Databricks pipeline. It's pretty cool!

    • @maheshatutube
      @maheshatutube a year ago

      @@briancuster7355 Hi Brian, do you have any video or documentation around automating the whole process of building the pipeline in Databricks using a metadata approach? Any pointers would be highly appreciated.

  • @ferrerolounge1910
    @ferrerolounge1910 a year ago

    Wondering if there is a similar feature in ADF or Azure Synapse. Well explained as usual!

  • @doniyorturemuratov6999
    @doniyorturemuratov6999 a year ago

    Awesome job! Where can we find the final result notebooks used in the video? Thanks!

  • @mimmakutu
    @mimmakutu 2 years ago

    We did this using normal Delta tables with a JSON config, using a generic merge, i.e. the insertAll and updateAll APIs (see the sketch below). This gets data up to the raw or bronze zone in the data lake. For us this works for ~700 tables.
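    A minimal sketch of that generic merge pattern using the Delta Lake Python API, assuming the target table already exists (the config entry and key columns here are hypothetical):

    from delta.tables import DeltaTable

    # Hypothetical entry from a JSON config describing one table.
    cfg = {"target": "bronze.customers", "keys": ["CustomerID"]}

    def generic_merge(spark, source_df, cfg):
        target = DeltaTable.forName(spark, cfg["target"])
        condition = " AND ".join(f"t.{k} = s.{k}" for k in cfg["keys"])
        (target.alias("t")
            .merge(source_df.alias("s"), condition)
            .whenMatchedUpdateAll()      # generic update of all columns
            .whenNotMatchedInsertAll()   # generic insert of all columns
            .execute())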

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Yep, absolutely you can achieve this with straight delta and a bit of engineering. The point is that this aims to make doing that super easy for people who are not deep into the data engineering side of things - it can't do everything you could do manually of course, but it's a decent start in making these things more accessible.
      Simon

  • @abhradwipmukherjee3697
    @abhradwipmukherjee3697 a year ago

    Excellent video & thanks for the valuable insight. But after creating the raw table from the dataframe, can we read the data from the newly created Delta table into another dataframe and create the silver table from that new dataframe?
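    One common way to do this in DLT is to reference the upstream table with dlt.read, which returns a DataFrame you can transform before declaring the silver table. A sketch, with hypothetical table and column names:

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(name="silver_customers")
    def silver_customers():
        # Read the previously declared raw/bronze DLT table as a DataFrame,
        # then transform it before it is written as the silver table.
        df = dlt.read("raw_customers")
        return (df.dropDuplicates(["CustomerID"])
                  .withColumn("_loaded_at", F.current_timestamp()))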

  • @harisriniram
    @harisriniram a year ago

    Can we configure the notebook path under the libraries section in the JSON, as well as the target? We want these to be populated by the output from a previous task (notebook task type).

  • @plamendimitrov9097
    @plamendimitrov9097 2 years ago

    Did you try the SQL syntax, building those SQL statements dynamically and executing them with spark.sql('...') in a loop?
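    A minimal sketch of that dynamic-SQL idea in plain Spark, outside a DLT pipeline (the table list and paths are hypothetical):

    tables = ["Address", "Product"]

    for t in tables:
        # Build each statement as a string and execute it; whether this
        # registers tables in a DLT graph is a separate question.
        spark.sql(f"""
            CREATE TABLE IF NOT EXISTS bronze_{t}
            USING DELTA
            AS SELECT * FROM parquet.`/mnt/landing/{t}/`
        """)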

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Not yet, haven't had a chance to dig further into it - easy to give it a quick try though! Hoping we won't need any workarounds once it matures.

  • @NM-xg7kd
    @NM-xg7kd 3 years ago

    It will be interesting to see how Databricks develop this. It currently looks a bit unwieldy, and when you compare it to something like multi-task jobs, which look organised and easy to follow, it begs the question: where are they going with this, centralised vs decentralised?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Like most things, Databricks tend to be code-first - get it working in a techy, codey way, worry about putting a UI over the top later (if at all). If this is going to be the "citizen engineering" approach they push going forwards, it'll need a bit more polish, for sure!
      If you look at architecture approaches like data mesh, a lot of it is supported by democratising the tools & making it easier for the domain owners to engineer... which this is certainly heading towards.

    • @NM-xg7kd
      @NM-xg7kd 3 years ago

      @@AdvancingAnalytics I saw the recent Databricks vid on data mesh and yes, it's an interesting decentralised approach. Re data ownership though, I cannot see that flying in most organisations without some serious risk analysis, but like you say, this definitely lends itself to the methodology.

  • @saivama7816
    @saivama7816 a year ago

    Awesome, thanks a lot Simon. Only ingestion into Bronze can be generic; is there a way to generalize the transformation into the Silver layer?

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago

      Depends how you define silver! For us, we perform cleaning, validation and audit stamping when going into Silver. We can provide the transformations required as metadata that is looked up by the process, so the template is nice and generic - but you then need to build & manage a metadata framework ;)
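      A rough sketch of that metadata-driven silver step, where cleaning rules are looked up per table rather than hard-coded (the metadata structure and column names here are hypothetical):

      import dlt
      from pyspark.sql import functions as F

      # Hypothetical metadata record for one table, e.g. loaded from a config store.
      silver_meta = {
          "source": "bronze_customers",
          "dedupe_keys": ["CustomerID"],
          "not_null_columns": ["CustomerID", "EmailAddress"],
      }

      @dlt.table(name="silver_customers")
      @dlt.expect_all({c: f"{c} IS NOT NULL" for c in silver_meta["not_null_columns"]})
      def silver_customers():
          df = dlt.read(silver_meta["source"])
          # Cleaning + audit stamping driven by the metadata record.
          return (df.dropDuplicates(silver_meta["dedupe_keys"])
                    .withColumn("_audit_loaded_at", F.current_timestamp()))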

  • @krishnag5624
    @krishnag5624 2 years ago +1

    Hi Simon,
    I need your help with Azure Databricks. You're excellent.
    Thanks,
    Krish

  • @joyyoung3288
    @joyyoung3288 a year ago

    How do you connect and load AdventureWorks into DBFS? Can you share more information? Thanks.

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago

      In this case, I ran a quick Data Factory job to scrape the tables from an Azure SQL DB using a copy activity. There isn't a quick way to get it into DBFS - however there are a TON of datasets mounted under the /databricks-datasets/ mount that are just as good!

  • @alexischicoine2072
    @alexischicoine2072 3 years ago

    I like your generic approach. Using the Spark config might get a bit unwieldy once you have a lot of steps in your pipeline. I think you could put this generic functionality in a Python module that you import into your notebooks, and keep the notebooks themselves very simple, containing just the pipeline configuration and the function calls (see the sketch below).
    Otherwise, to support complex pipelines you'll end up having to create a complete pipeline definition language based on the text parameters you supply as config. I could see that being worth it if you're trying to integrate this with some other tool where the config would come from, but if you're working directly in Databricks as a data engineer, I'm not sure what the advantage would be of defining your pipelines in this format instead of using the language framework DLT provides.
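    A rough sketch of that module-based approach (the module name and functions are hypothetical): the shared logic lives in an importable module, and each pipeline notebook is reduced to a short, readable set of calls.

    # my_dlt_framework.py - shared module available to the pipeline notebooks
    import dlt

    def bronze_from_autoloader(name: str, path: str, fmt: str = "csv"):
        # Register one bronze DLT table fed by Auto Loader.
        @dlt.table(name=name)
        def _table():
            return (spark.readStream.format("cloudFiles")
                    .option("cloudFiles.format", fmt)
                    .load(path))

    # pipeline notebook - just configuration plus function calls:
    # from my_dlt_framework import bronze_from_autoloader
    # bronze_from_autoloader("bronze_address", "/mnt/landing/address/")
    # bronze_from_autoloader("bronze_product", "/mnt/landing/product/")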

    • @gamachu2000
      @gamachu2000 3 years ago

      You're absolutely right about creating a function. We deploy just two zones, silver and bronze, and we use a function with a for loop to iterate over the config that holds all the information about our tables. Works like a charm. I hope Simon can show that in a follow-up to this video. Delta Live doesn't support merge, so for our gold zone we went back to standard Spark. They're saying the merge feature is coming soon in DLT; once that happens, everything will be DLT. We are also looking into the CDC part of DLT. Great video for the community, Simon. Keep up the great work.

    • @alexischicoine2072
      @alexischicoine2072 3 years ago

      @@gamachu2000 Ah yes, I had the same issue with the merge, and I don't think you can use foreachBatch either. Interesting to know the merge is coming.
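      Outside DLT, the usual fallback for streaming upserts is a foreachBatch merge in plain structured streaming. A sketch with hypothetical table and column names:

      from delta.tables import DeltaTable

      def upsert_batch(batch_df, batch_id):
          # Merge each micro-batch into the gold Delta table.
          gold = DeltaTable.forName(spark, "gold.customers")
          (gold.alias("t")
              .merge(batch_df.alias("s"), "t.CustomerID = s.CustomerID")
              .whenMatchedUpdateAll()
              .whenNotMatchedInsertAll()
              .execute())

      (spark.readStream.table("silver.customers")
          .writeStream
          .foreachBatch(upsert_batch)
          .option("checkpointLocation", "/mnt/checkpoints/gold_customers")
          .start())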

  • @roozbehderakhshan2053
    @roozbehderakhshan2053 2 years ago

    Interesting, just a quick question: can the source of the data be a streaming source (i.e. Kinesis or Kafka) instead of cloud storage? So instead of:
    CREATE INCREMENTAL LIVE TABLE customers
    COMMENT "The customers buying finished products, ingested from /databricks-datasets."
    TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
    AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");
    can we do:
    CREATE INCREMENTAL LIVE TABLE customers
    COMMENT "The customers buying finished products, ingested from /databricks-datasets."
    TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
    AS SELECT * FROM ******Kinesis;
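    One way this is commonly approached is via DLT's Python API with Spark's standard Kafka source rather than the SQL cloud_files syntax; a sketch with hypothetical broker and topic names (Kinesis would need its own connector):

    import dlt

    @dlt.table(name="customers_raw_events")
    def customers_raw_events():
        # Streaming DLT table reading directly from Kafka instead of cloud storage.
        return (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "broker-1:9092")
                .option("subscribe", "customers")
                .load())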