The way the code was explained was outstanding!
Wow!!! Thank you so much. For the last couple of months, I have been struggling to understand DLT. I wish I had known sooner that a ~30-minute video would do the trick.
Wow, that's amazing. Thank you Simon!
Love these videos. Thank you Simon!
Oh crap. I wrote my own Delta Live Table-like implementation (not as fancy, of course) early this year. Now I need to make a choice... Need to read the docs and get on a call with the Databricks folks. Got a lot of questions. Thanks for the video!
I think there are a lot of people in the same place! Build and maintain your own framework with all the flexibility, or take the out-of-the-box option for simplicity. It'll be interesting to see, as it matures, how feasible the latter is!
Haha, me too! I have also been working on my own implementation that aims to populate a pipeline of Delta Lake tables. My biggest challenge is figuring out which part of a downstream table needs to be updated because the corresponding part of an upstream table was updated. Somehow I think the 'inode' concept in OS file systems might help...
Would be interesting to see DLT's approach!
Thank you, Simon, for covering DLT in a few minutes. Very helpful as always.
Great explanation. Thanks. Wondering how to do incremental loading, reprocessing, watermarks and all that good stuff.
Very good!! Explained perfectly!
For anyone coming from visual ETL, just think check constraints + the SSIS error path captured in metadata when check constraints or other constraints are violated + data quality output process metadata + the ability to define your own hardware... minus the overhead of the RDBMS transaction log.
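As a rough illustration of that analogy, here's a minimal sketch of what expectations look like in the DLT Python API (table and column names here are hypothetical):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned clickstream rows")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # failing rows are dropped and counted in the quality metrics
@dlt.expect("recent_click", "click_ts >= '2020-01-01'")   # failing rows are kept, violations just get logged
def clickstream_clean():
    # read the upstream live table and add a simple audit column
    return dlt.read("clickstream_raw").withColumn("loaded_at", F.current_timestamp())
```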
Thanks, as always, for those awesome videos you create. A question: how can we make the 3 tables created appear in the Data tab under a schema (a.k.a. database)? And what happens to the 3 data folders created in storage if we don't specify a location/path while configuring? Where do they go?
This looks really cool indeed. The expectation checks are neat; I wonder what else they will introduce to make DQ/testing of the pipelines easier.
Delta Live Tables is an odd name. However, the workflow concept is really cool. I've been playing with it and like the expectations bit. The dependencies appearing as a diagram is cool too... somewhat of a lineage concept. I prefer Python over SQL and find the SQL bit limiting, though it probably works for the Data Analyst role, as you mentioned.
Nice work! Not sure where to use it now, but looks cool!
I always develop/test my notebook code locally, and then deploy to Databricks as a final step. With DLT, I feel the costs will skyrocket with those clusters needing to be running, and it is also very slow. I am really hesitant to use this in its current state.
Hi, thanks for the awesome video. I would like to know whether DLT can read data from Kafka or not. At our company, we wish to read data from Kafka, transform it and then load it into Cosmos DB. I want to know whether this is possible using DLT.
It almost seems like they asked a Scala developer to write some Python code and he took some creative liberties. That is the craziest-looking code I've ever seen from a professional company trying to sell a product.
Great video!
Databricks has decided to grow the number of services provided over time. However, it's getting a bit confusing, since we now seem to have services competing within the same ecosystem.
When would you recommend using ADF instead of DLT?
When you have data processing steps that are not encapsulated in the Databricks environment, use ADF. If all your ETL steps are in Databricks, then use DLT.
Can we implement CDC capture and column name changes or transformations in a single layer with DLT?
best explanation
Hi Simon... that was a really nice video and I loved it. All my doubts are cleared. Do you have a video on merging Delta tables using Z-ordering & multi-level partitioning to optimize incremental loads? If yes, please share the link.
Thanks for this awesome video. However, a quick question:
Using Python we can apply lots of additional functionality to a DataFrame inside the Delta Live Table function, like custom functions, UDFs and multi-step programming (as you showed), but I don't think we can do all of that using SQL in DLT. Could you please correct me if I'm wrong?
Nope, you don't have the same iterative power as you would in PySpark, but you can certainly achieve a lot. I've not tested whether Databricks SQL functions work inside DLT, but if they do, that's most of the functionality you list covered!
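For example, the sort of Python-only flexibility being referred to might look like this sketch (function, table and column names are made up for illustration):

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def parse_region(postcode):
    # arbitrary Python logic that would be awkward in pure SQL
    return postcode[:2].upper() if postcode else "UNKNOWN"

def add_audit_columns(df):
    # reusable helper you could share across several table definitions
    return df.withColumn("processed_at", F.current_timestamp())

@dlt.table
def orders_enriched():
    df = dlt.read("orders_raw").withColumn("region", parse_region(F.col("postcode")))
    return add_audit_columns(df)
```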
Can we do model scoring within the Delta Live Table definition script? Like picking a model from the registry, loading it as a UDF and applying it live?
Is it an incremental load or a full load? What does the behind-the-scenes write statement look like? Can we partition or bucket while writing?
Love your channel.
Regarding the import dlt issue, I found it annoying as well. I think it might be possible to hack together a fake dlt module so that the function definitions at least run locally and provide some autocomplete with docstrings (a rough guess at such a stub is sketched below). I'm going to look into it if I can grab some source code at runtime.
Have you had a chance to look at the new multi-task jobs / orchestration? For example, I have a use case where I run a merge from a Parquet source into a Delta table that I then use as the first streaming source in my Delta Live Tables pipeline. With this new feature they can be run one after the other in the same job.
Keep up the good work - your videos are really clear and you have an engaging presentation style.
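A very rough guess at what that local stub could look like, purely for autocomplete and letting notebooks import outside a pipeline. The real module is injected by the DLT runtime, so the names and signatures below are assumptions:

```python
# dlt.py - local no-op stand-in, NOT the real library; API shape is guessed.
import functools

def table(name=None, comment=None, **kwargs):
    """Stand-in for @dlt.table: works with or without arguments, does nothing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            return func(*args, **kw)
        return wrapper
    if callable(name):          # used as @dlt.table with no parentheses
        return decorator(name)
    return decorator            # used as @dlt.table(...)

def view(name=None, comment=None, **kwargs):
    """Stand-in for @dlt.view."""
    return table(name, comment, **kwargs)

def expect(name, condition):
    """Stand-in for the expectation decorators; real behaviour only exists in a pipeline."""
    def decorator(func):
        return func
    return decorator

expect_or_drop = expect
expect_or_fail = expect

def read(table_name):
    raise NotImplementedError("dlt.read is only available inside a running DLT pipeline")

def read_stream(table_name):
    raise NotImplementedError("dlt.read_stream is only available inside a running DLT pipeline")
```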
I'm holding off on mocking up the dlt library for now - I'm hoping that it'll be baked into future DBX runtimes (at least for autocomplete etc as you say), and it's just the preview nature that means it uses a custom runtime... but we'll see what it looks like as it moves towards general availability!
Haven't looked at multi-task jobs yet - I checked yesterday and my workspace isn't enabled yet. I'll have a check early next week and put together a quick vid!
Simon
How do you actually specify the storage location for the Delta Live Tables pipeline as Azure Data Lake without mounting it to DBFS?
Is clickstream_raw mapped to the bronze layer, and clickstream_cleaned to the silver layer? How can I map each Delta table to the medallion layers?
good work Man :)
Thanks for the video, Simon. I've enjoyed watching it, as always. I have a quick question: can we execute Delta Live Tables pipelines from orchestrators such as Data Factory or Apache Airflow?
I'm not Simon, but what I've been doing is creating a job that runs the pipeline and then starting the job using the REST API. If you need to wait for the job to finish, you can write a while loop that polls the job status via the API. You can make the calls from your tool, or do it in a notebook on a tiny machine if that's easier.
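Roughly, that pattern looks like this with the Jobs API; the workspace URL, token and job_id are placeholders you'd swap for your own:

```python
import time
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Trigger the job that wraps the DLT pipeline
run = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": 123}).json()
run_id = run["run_id"]

# Poll until the run reaches a terminal state
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                         headers=HEADERS, params={"run_id": run_id}).json()["state"]
    if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Finished with result:", state.get("result_state"))
        break
    time.sleep(30)
```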
Does anyone know how to start a full refresh without clicking the button in the UI? I've only been able to set up a regular refresh, but not a full one.
I have a use case where I run a streaming query, but periodically I save the output and restart with a new streaming input so it doesn't grow too large for the steps that run over the whole dataset and aren't streamed. Right now I send myself an email reminder to go and do it manually, which isn't ideal. Otherwise the refresh throws an error, as expected, after modifying the streaming input.
Is it possible to use some other source, like a JDBC database or Azure Event Hubs, instead of cloud files?
BTW, I watch your videos regularly. Great work!! Thanks.
Yep, anything that has a Spark DataFrame reader! I'm sure there's a little bit of nuance with the weirder ones like Event Hubs, but it's just spinning up a Spark job, so it should be doable with most things Spark can read!
Simon
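As a rough, hypothetical example of that with a JDBC source (connection details are placeholders; Event Hubs would need its own connector options but follows the same pattern):

```python
import dlt

@dlt.table(comment="Raw customers pulled over JDBC instead of cloud_files")
def customers_raw():
    # spark and dbutils are provided by the Databricks runtime
    return (
        spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales")
             .option("dbtable", "dbo.customers")
             .option("user", "etl_user")
             .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
             .load()
    )
```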
We write our notebooks in Scala, but looking at your video the supported languages are Python and SQL.
Do you know if Scala will be a possible language to use in DLT?
Honestly don't know! Some of the more abstracted Databricks elements (table access control, AAD passthrough, etc.) are Python/SQL only, so it may be a similar limitation? No idea what the future plan is inside Databricks!
Simon
Please can you help with the storage location? I've bumped into some problems.
Thoughts on dbt compared to this? Seems very similar.
Yeah, it seems to be aiming at a similar space, but has a lot less polish so far. Honestly, I've only dabbled with dbt so can't comment much further!
The transformations are very similar to what we have in dbt.
Does it replace Azure Data Factory to some extent?
Potentially - it certainly covers some of the orchestration elements, but it isn't as good at other workflow elements, like copying data into the platform, etc.!
Would it be possible to use Delta Live Tables for temporary jobs, like giving a user decrypted data, where a decryption job runs and, once the use is over, we delete the data from the workspace?
You certainly could - seems like a lot of setup if it's a throwaway bit of data. Probably easier to manage that via a custom notebook?
@AdvancingAnalytics We would like to expose this to other clients like Tableau. Within Databricks a custom notebook is fine, but for external clients we wanted to use this option.
Thanks Simon for this new video, I love your channel !
I'm not very familiar with Databricks, but I just did some practice on Azure Synapse (based on your video ruclips.net/video/lpBM4Yn2k3U/видео.html ), and after watching this new video on Delta Live Tables I was wondering if the same "outcome" couldn't be achieved with a Synapse (scheduled) pipeline and Data Flows (with a Delta table in the sink)...
Or did I completely miss the point (which is entirely possible :p)?
Hey! Yep, you could build a pipeline loading a delta table using Synapse pipelines & data flows that would achieve the same loading for the quick batch example - the deeper point of DLT is around incremental loading, the data quality elements and (hopefully) building some reusable transformation functions, which you wouldn't be able to do to the same level in a Synapse data flow!
Simon
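To make the incremental loading and data quality point above concrete, here's a rough sketch of incremental ingestion plus a quality rule in one pipeline (paths and column names are invented, and the syntax may shift as the preview evolves):

```python
import dlt

@dlt.table(comment="Incrementally ingested raw events via Auto Loader")
def events_raw():
    # spark is provided by the Databricks runtime
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/landing/events/")
    )

@dlt.table(comment="Clean layer with a data quality rule attached")
@dlt.expect_or_drop("has_event_id", "event_id IS NOT NULL")
def events_clean():
    # streaming read of the upstream live table, so only new data is processed
    return dlt.read_stream("events_raw")
```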
Expectations... assert transforms... check constraints... same stuff, different day.
How is this different from a SQL view? Also, can I do upserts and deletes on a Delta table using this?
The view objects literally are just SQL views; the only difference is the extra wrapping that lets you materialise the data back to physical Delta tables.
Updates & Deletes aren't currently supported, but I'm hoping we'll see them in there eventually!
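A tiny sketch of that distinction (names are hypothetical):

```python
import dlt

@dlt.view(comment="Not persisted - behaves like a SQL view scoped to the pipeline")
def recent_orders_vw():
    return dlt.read("orders_raw").where("order_date >= '2022-01-01'")

@dlt.table(comment="Materialised back to storage as a physical Delta table")
def recent_orders():
    return dlt.read("recent_orders_vw")
```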
Speak of the devil - just announced: databricks.com/blog/2022/02/10/databricks-delta-live-tables-announces-support-for-simplified-change-data-capture.html
@AdvancingAnalytics That is amazing. Thanks for sharing the link :)