Great video Simon! Would love to see you dig deeper into the streaming / continuous pipeline side in another vid.
It's on the future videos list!
Thanks for covering this Simon :)
I remember when I tried it that you can't run this outside of the pipeline, which makes it hard to run pieces individually without breaking up the pipeline. The other issue is the lack of source control integration, which is the same problem as for clusters, jobs, cluster policies, etc. Our Databricks consultant suggested we use Terraform and do infrastructure as code, but my organization is nowhere near that level of maturity. It would be nice to have this saved to source control like with notebooks. I had a problem with multi-task jobs where a job disappeared, and fortunately I had a backup of the JSON, but imagine having to redo the configuration of over 50 tasks, each with multiple dependencies, retry settings, etc. I'm going back to orchestrating in Synapse pipelines mainly for the git integration, even though I'll lose out on cluster reuse (though I'm using pools and it'll be easier to right-size the jobs, so it shouldn't be a big difference).
Most of the benefits of DLT you can code in a few functions easily. Just add parameters to the functions you use for saving your tables. How hard would it be to pass a dictionary with the assertions and process them? A function to simplify merge is also just a few lines of simple code. I have many such functions of my own that my colleagues can use easily without understanding all the details.
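As a rough sketch of that idea, assuming plain PySpark rather than DLT (the save_table helper, the table names and the assertion dictionary below are purely illustrative):

```python
from pyspark.sql import DataFrame, SparkSession
from delta.tables import DeltaTable

def save_table(df: DataFrame, target: str, keys: list, assertions: dict = None):
    """Illustrative helper: check a dict of assertions, then merge into a Delta table."""
    spark = SparkSession.getActiveSession()

    # Process a dictionary of assertions, e.g. {"valid_id": "customer_id IS NOT NULL"}
    for name, condition in (assertions or {}).items():
        failed = df.filter(f"NOT ({condition})").count()
        if failed > 0:
            raise ValueError(f"Assertion '{name}' failed for {failed} rows")

    if spark.catalog.tableExists(target):
        # Simple upsert keyed on the given columns
        merge_cond = " AND ".join(f"t.{k} = s.{k}" for k in keys)
        (DeltaTable.forName(spark, target).alias("t")
            .merge(df.alias("s"), merge_cond)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    else:
        df.write.format("delta").saveAsTable(target)

# Usage: save_table(clean_df, "silver.customers", keys=["customer_id"],
#                   assertions={"valid_id": "customer_id IS NOT NULL"})
```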
I guess the target audience is users with simple use cases who won't be building too much of their own functionality.
If DLT could automate real change data capture that bypasses the append-only limitations of streaming when running in batch, that would be a killer feature and make it worthwhile. Databricks already has the change data feed on Delta tables, but the logic for processing it can get quite involved, so having it automated would bring a lot of value.
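For reference, reading the change data feed itself is the simple part; it's the downstream merge and ordering logic that gets involved. A minimal sketch, assuming the feed is already enabled on a hypothetical bronze.events table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read changes from a Delta table that has delta.enableChangeDataFeed = true.
# The table name and starting version are placeholders.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)   # or .option("startingTimestamp", "2024-01-01")
    .table("bronze.events")
)

# Each row carries _change_type (insert / update_preimage / update_postimage / delete),
# plus _commit_version and _commit_timestamp - reconciling these into the target table
# is the part that still has to be hand-written.
changes.filter("_change_type != 'update_preimage'").show()
```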
This reminds me of mapping data flows in Data Factory or Synapse, where you lose quite a bit of flexibility and pay a lot more while not getting that much more simplicity. If you're building something that's expensive to run, I don't feel like those tools save you enough development hours to justify their cost.
The same is true for me. Implementing and debugging the code is really frustrating, despite the great capabilities of Delta Live Tables. Another area with room for improvement is that you currently can't pull in external Maven dependencies. That would certainly be an advantage if you want to ingest other external sources, such as Elastic Cloud, using DLT. Currently, this is only possible if the bronze data is already in the data lake.
Hi, is it actually possible to use existing Delta tables as an input source for Delta Live Tables?
Preferably so it only processes the changed rows from the existing Delta table into the Delta Live Table.
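One pattern worth testing for this, as a sketch rather than a confirmed answer: stream from the existing table inside a DLT definition, which only picks up newly appended rows on each update (the table names are placeholders, and the append-only caveats of Delta streaming still apply):

```python
import dlt

# Sketch only: "mydb.existing_table" is a placeholder for the pre-existing Delta table,
# and `spark` is provided by the Databricks/DLT runtime. Delta streaming reads appended
# rows incrementally; updates and deletes in the source would need extra handling
# (e.g. the change data feed).
@dlt.table(name="bronze_from_existing")
def bronze_from_existing():
    return spark.readStream.table("mydb.existing_table")
```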
The moment DLT went GA, I was searching YouTube for your new videos :D. Great content as always! I have been experimenting with it for some time. One problem I have faced is how to build a Type 2 style table. As readStream is a streaming concept, it doesn't allow window lead/lag functions, so I am unable to build history. The other problem is with apply_changes: what if I want to pass in additional predicates to join on (e.g. iscurrent='Y')? It only allows a column list in keys. Any suggestions?
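For context, the shape of the apply_changes API in recent DLT runtimes looks roughly like this (table and column names are placeholders); it only exposes key columns and a sequencing column, which is exactly the limitation described, though newer runtimes can at least materialise the Type 2 history for you:

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("silver_customers")

dlt.apply_changes(
    target="silver_customers",
    source="bronze_customers",
    keys=["customer_id"],           # column names only - no arbitrary join predicates
    sequence_by=col("updated_at"),  # ordering column used to resolve out-of-order events
    stored_as_scd_type=2,           # keeps SCD Type 2 history without manual lead/lag logic
)
```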
Great content!! But how does this affect DevOps?
It would be great if we could have access to the notebooks you are using for the demo. I would be happy to run the code myself and play around with it.
Hi Simon, great videos! I've been googling this problem and could not find the answer: how can I trigger a Delta Live Tables pipeline from Azure Data Factory? Any chance you know how to do this? I want to leverage the pipeline, not just the notebooks, from ADF. Regards
Hi Simon, how would you recommend integrating Delta Live Tables and feature tables? Do you recommend including the writes to feature tables inside the DLT pipeline?
Has anyone been able to use libraries dependent on JARs? I've tried a number of ways but cannot seem to coerce the clusters into installing JARs on launch. Even setting spark.jars.packages didn't help, which was surprising since I thought it would install from Maven regardless.
Great vid! Do you know if you can read the change stream on the other side of the merge / apply_changes? My understanding is that you can apply changes into a table, but not yet read the change stream back out of that table after the write.
Ooh, haven't tried that yet! Delta Streaming prefers append-only transactions, but you can normally work around it. Not sure if we have any restrictions in DLT. That's one for a future video!
Nice one! Would it be possible to trigger the Delta Live pipelines from Azure Data Factory?
Not sure if you can directly, but you can certainly trigger it from a Databricks Job, then trigger that job from an ADF API Call?
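One way to wire that up, as a sketch (the workspace URL, IDs and token below are placeholders, and ADF would typically issue the same request from a Web activity rather than Python):

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token-or-AAD-token>"                          # placeholder credential
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Option 1: start a pipeline update directly via the Pipelines API
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/pipelines/<pipeline-id>/updates",
    headers=HEADERS,
)

# Option 2: trigger a Databricks job that wraps the pipeline (the pattern suggested above)
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers=HEADERS,
    json={"job_id": 123},  # placeholder job id
)
print(resp.status_code, resp.json())
```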
Hey Simon, amazing video as usual. I love how you start with an explanation of a new concept followed by some examples. One thing I still struggle with in Databricks is how you manage where the data is actually stored. The way I understood it, you store your Delta tables in different databases/schemas, each of which is typically a different folder in your data lake. Is this also possible with DLT? It seems that you use bronze, silver, ... in the table name. Is it possible to change the storage location per table?
Hey! So with DLT, you specify the "location" at the pipeline level, so all tables are folders within that main root folder. If you want separate storage locations, you would need to build separate DLT pipelines and chain them together with a DBX Workflow!
@AdvancingAnalytics Thanks for the answer! That makes sense. But if you create different DLT pipelines (to store the tables in different folders), can you build on DLTs in other folders? I haven't found a way to do that yet.
@jasperputs I believe you need to define them as views. It's a shame you have to do this, but you can programmatically generate the views - it just doesn't give you the bronze-to-gold dependency view that you'd get with one pipeline. The best compromise I found was putting everything in one pipeline and specifying the "path" option alongside the table name - there you can define the physical storage location. The problem with this solution is that everything appears under one Hive metastore database. Trying to cheat your way round this by manually specifying the schemas gives this error: "Materializing tables in custom schemas is not supported".
I hope they expand the framework to be a bit more dynamic - putting all your bronze, silver and gold tables in one db with prefixes (like in this demo) seems a bit disorganised :|
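For anyone wanting to try the "path" option mentioned above, it looks roughly like this (the storage path and table names are placeholders):

```python
import dlt

@dlt.table(
    name="silver_customers",
    path="abfss://lake@mystorage.dfs.core.windows.net/silver/customers",  # placeholder location
    comment="Stored outside the pipeline's root storage folder",
)
def silver_customers():
    return dlt.read("bronze_customers").dropDuplicates(["customer_id"])
```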
Do we know if the storage of the Data Quality violations is anything like the Kimball architecture for data quality?
We do, they're not! The DQ event metrics are currently part of the event stream Delta table - I put a video together exploring these metrics a while ago: ruclips.net/video/He8lSBZjZlQ/видео.html
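For anyone wanting to poke at those metrics directly, a rough sketch of querying the event log (the pipeline storage location is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The DLT event log is itself a Delta table under the pipeline's storage location
events = spark.read.format("delta").load("dbfs:/pipelines/<storage-location>/system/events")

# Expectation (data quality) results are recorded inside flow_progress events
(events
    .filter("event_type = 'flow_progress'")
    .selectExpr("timestamp", "details:flow_progress.data_quality.expectations AS expectations")
    .show(truncate=False))
```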
DLT! The hairy cornflake?
Hah, no relation to any disgraced radio DJs 😅
You're honestly telling me you support using notebooks in production? Wow.
And it doesn't seem to do parameterization... so if you have a few pipelines it's "maybe" OK. But scale this across 1k pipelines? Ugh...
I 100%, absolutely, with zero doubts support using notebooks in production! - Simon
@user-zv9um9pb6w And yep, you can parameterise and build generic DLT pipelines. Run over some metadata and spin up dependencies across hundreds of pipelines from a single notebook. I'll admit, we don't use it for 1k+ pipeline workloads, mainly because of flexibility around the underlying lake structure!
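A rough sketch of that metadata-driven pattern (the config list, table names and transformation are purely illustrative):

```python
import dlt

# Hypothetical metadata describing the tables to generate - in practice this
# could come from a config file or a metadata table.
tables_config = [
    {"name": "silver_customers", "source": "bronze_customers", "keys": ["customer_id"]},
    {"name": "silver_orders",    "source": "bronze_orders",    "keys": ["order_id"]},
]

def generate_table(conf):
    # Wrapping in a function avoids Python late-binding issues in the loop
    @dlt.table(name=conf["name"])
    def _table():
        return dlt.read(conf["source"]).dropDuplicates(conf["keys"])
    return _table

# One loop defines every table in the pipeline
for conf in tables_config:
    generate_table(conf)
```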
@AdvancingAnalytics Well, that's where we differ. Production workloads should be done with a little more quality. Notebooks are at best hacky.
@AdvancingAnalytics I get what you're saying for that style of pattern. I'll look at the REST API again, but I was referring to passing it a value to drive generic processing. If I have 2k tables/files to process, I'm going to trigger based on something unique like a path; I'm not going to have a list of tables to check for updates. Just my opinion.
No mixed-language support in a single notebook. No Scala support. No interactive cluster support. I'm not impressed.
Edit: I've been testing it for 2 days now. It seems like it was originally built for SQL only and Python was jammed in at the last second, which makes a lot of sense when looking at the non-Pythonic code examples and the inability to create DataFrames outside of the dlt functions.
Yeah, you have to work around that by creating functions that build your DataFrame and calling those inside the table definition. A bit annoying, but it works and lets you test your transformations outside of the pipeline.
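Roughly like this (names are placeholders): the transformation lives in a plain function you can test with any DataFrame, and the @dlt.table function is just a thin wrapper.

```python
import dlt
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

# Plain function: testable outside the pipeline with any DataFrame you build yourself
def clean_customers(df: DataFrame) -> DataFrame:
    return df.filter(col("customer_id").isNotNull()).dropDuplicates(["customer_id"])

# Thin DLT wrapper: the only code that has to run inside the pipeline
@dlt.table(name="silver_customers")
def silver_customers():
    return clean_customers(dlt.read("bronze_customers"))
```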
@alexischicoine2072 This is the way