The way the code was explained was outstanding!
Wow!!! Thank you so much. For the last couple of months, I have been struggling to understand DLT. I wish I had known sooner that a ~30-minute video would do the trick.
Wow, that's amazing. Thank you Simon!
Love these videos. Thank you Simon!
Oh crap. I wrote my own Delta Live Table-like implementation (not as fancy, of course) early this year. Now I need to make a choice... Need to read the docs and get on a call with the Databricks folks. Got a lot of questions. Thanks for the video!
I think there are a lot of people in the same place! Build and maintain your own framework with all the flexibility, or take the out-of-the-box option for simplicity. It'll be interesting to see, as it matures, how feasible the latter is!
Haha, me too! I have also been working on my own implementation that aims to populate a pipeline of Delta Lake tables. My biggest challenge is figuring out which part of a downstream table needs to be updated because the corresponding part of an upstream table was updated. Somehow I think the 'inode' concept in OS file systems might help...
Would be interesting to see DLT's approach!
Thank you, Simon, for covering DLT in a few minutes. Very helpful as always.
Great explanation. Thanks. Wondering how to do incremental loading, reprocessing, watermarks and all that good stuff.
Very good!! Explained perfectly!
For anyone coming from visual ETL, just think check constraints + the SSIS error path captured in metadata when check constraints or other constraints are violated + data quality output process metadata + the ability to define your own hardware... minus the overhead of the RDBMS transaction log.
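As a rough illustration of that analogy, here's a minimal sketch of what expectations look like in the DLT Python API (table and column names here are hypothetical):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned clickstream rows")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # failing rows are dropped and counted in the quality metrics
@dlt.expect("recent_click", "click_ts >= '2020-01-01'")   # failing rows are kept, violations just get logged
def clickstream_clean():
    # read the upstream live table and add a simple audit column
    return dlt.read("clickstream_raw").withColumn("loaded_at", F.current_timestamp())
```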
Thanks, as always, for those awesome videos you create. A question: how can we make the 3 tables created appear in the Data tab under a schema (a.k.a. database)? And what happens to the 3 data folders created in storage if we don't specify a location/path while configuring? Where do they go?
This looks really cool indeed. The expectation checks are neat; I wonder what else they will introduce to make DQ/testing of the pipelines easier.
Delta Live Tables is an odd name. However, the workflow concept is really cool. I've been playing with it and like the expectations bit. The dependencies appearing as a diagram is cool too... somewhat of a lineage concept. I prefer Python over SQL and find the SQL bit limiting, though it probably works for the Data Analyst role, as you mentioned.
Nice work! Not sure where to use it now, but looks cool!
I always develop/test my notebook code locally, and then deploy to Databricks as a final step. With DLT, I feel the costs will skyrocket with those clusters needing to be running, and it is also very slow. I am really hesitant to use this in its current state.
Hi, thanks for the awesome video. I would like to know whether DLT can read data from Kafka or not. At our company, we wish to read data from Kafka, transform it and then load it into Cosmos DB. I want to know whether this is possible using DLT.
It almost seems like they asked a Scala developer to write some Python code and he took some creative liberties. That is the craziest-looking code I've ever seen from a professional company trying to sell a product.
Great video!
Databricks has decided to grow the number of services provided over time. However, it's getting a bit confusing, since we now seem to have services competing within the same ecosystem.
When would you recommend using ADF instead of DLT?
When you have data processing steps that are not encapsulated in the Databricks environment, use ADF. If all your ETL steps are in Databricks, then use DLT.
Can we implement CDC capture and column name changes or transformations in a single layer with DLT?
best explanation
Hi Simon... that was a really nice video and I loved it. All my doubts are cleared. Do you have a video on merging Delta tables using Z-ordering & multi-level partitioning to optimize incremental loads? If yes, please share the link.
Thanks for this awesome video. However, a quick question:
Using Python we can apply lots of additional functionality to a DataFrame inside the Delta Live Table function, like custom functions, UDFs and multi-step programming (as you showed), but I don't think we can do all of that using SQL in DLT. Could you please correct me if I'm wrong?
Nope, you don't have the same iterative power as you would in PySpark, but you can certainly achieve a lot. I've not tested whether Databricks SQL functions work inside DLT, but if they do, that's most of the functionality you list covered!
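For example, the sort of Python-only flexibility being referred to might look like this sketch (function, table and column names are made up for illustration):

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def parse_region(postcode):
    # arbitrary Python logic that would be awkward in pure SQL
    return postcode[:2].upper() if postcode else "UNKNOWN"

def add_audit_columns(df):
    # reusable helper you could share across several table definitions
    return df.withColumn("processed_at", F.current_timestamp())

@dlt.table
def orders_enriched():
    df = dlt.read("orders_raw").withColumn("region", parse_region(F.col("postcode")))
    return add_audit_columns(df)
```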
Can we do model scoring within the Delta Live Table definition script? Like picking a model from the registry, loading it as a UDF and applying it live?
Is it an incremental load or a full load? What does the behind-the-scenes write statement look like? Can we partition or bucket while writing?
Love your channel.
Regarding the import dlt issue, I found it annoying as well. I think it might be possible to hack together a fake dlt module so that the function definitions at least run locally and provide some autocomplete with docstrings (a rough guess at such a stub is sketched below). I'm going to look into it if I can grab some source code at runtime.
Have you had a chance to look at the new multi-task jobs / orchestration? For example, I have a use case where I run a merge from a Parquet source into a Delta table that I then use as the first streaming source in my Delta Live Tables pipeline. With this new feature they can be run one after the other in the same job.
Keep up the good work - your videos are really clear and you have an engaging presentation style.
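A very rough guess at what that local stub could look like, purely for autocomplete and letting notebooks import outside a pipeline. The real module is injected by the DLT runtime, so the names and signatures below are assumptions:

```python
# dlt.py - local no-op stand-in, NOT the real library; API shape is guessed.
import functools

def table(name=None, comment=None, **kwargs):
    """Stand-in for @dlt.table: works with or without arguments, does nothing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            return func(*args, **kw)
        return wrapper
    if callable(name):          # used as @dlt.table with no parentheses
        return decorator(name)
    return decorator            # used as @dlt.table(...)

def view(name=None, comment=None, **kwargs):
    """Stand-in for @dlt.view."""
    return table(name, comment, **kwargs)

def expect(name, condition):
    """Stand-in for the expectation decorators; real behaviour only exists in a pipeline."""
    def decorator(func):
        return func
    return decorator

expect_or_drop = expect
expect_or_fail = expect

def read(table_name):
    raise NotImplementedError("dlt.read is only available inside a running DLT pipeline")

def read_stream(table_name):
    raise NotImplementedError("dlt.read_stream is only available inside a running DLT pipeline")
```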
I'm holding off on mocking up the dlt library for now - I'm hoping that it'll be baked into future DBX runtimes (at least for autocomplete etc as you say), and it's just the preview nature that means it uses a custom runtime... but we'll see what it looks like as it moves towards general availability!
Haven't looked at multi-task jobs yet - I checked yesterday and my workspace isn't enabled yet. I'll have a check early next week and put together a quick vid!
Simon
How do you actually specify the storage location for the Delta Live Tables pipeline as Azure Data Lake without mounting it to DBFS?
Is clickstream_raw mapped to the bronze layer, and clickstream_cleaned to the silver layer? How can I map each Delta table to the medallion layers?
good work Man :)
Thanks for the video, Simon. I've enjoyed watching it, as always. I have a quick question: can we execute Delta Live Tables pipelines from orchestrators such as Data Factory or Apache Airflow?
I'm not Simon, but what I've been doing is creating a job that runs the pipeline and then starting the job using the REST API. If you need to wait for the job to finish, you can write a while loop that polls the job status via the API. You can make the calls from your tool, or do it in a notebook on a tiny machine if that's easier.
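Roughly, that pattern looks like this with the Jobs API; the workspace URL, token and job_id are placeholders you'd swap for your own:

```python
import time
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Trigger the job that wraps the DLT pipeline
run = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": 123}).json()
run_id = run["run_id"]

# Poll until the run reaches a terminal state
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                         headers=HEADERS, params={"run_id": run_id}).json()["state"]
    if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Finished with result:", state.get("result_state"))
        break
    time.sleep(30)
```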
Does anyone know how to start a full refresh without clicking the button in the UI? I've only been able to set up a regular refresh, but not a full one.
I have a use case where I run a streaming query, but periodically I save the output and restart with a new streaming input so it doesn't grow too large for the steps that run over the whole dataset and aren't streamed. Right now I send myself an email reminder to go and do it manually, which isn't ideal. Otherwise the refresh throws an error, as expected, after modifying the streaming input.
Is it possible to use some other source, like a JDBC database or Azure Event Hubs, instead of cloud files?
BTW, I watch your videos regularly. Great work!! Thanks.
Yep, anything that has a Spark DataFrame reader! I'm sure there's a little bit of nuance with the weirder ones like Event Hubs, but it's just spinning up a Spark job, so it should be doable with most things Spark can read!
Simon
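As a rough, hypothetical example of that with a JDBC source (connection details are placeholders; Event Hubs would need its own connector options but follows the same pattern):

```python
import dlt

@dlt.table(comment="Raw customers pulled over JDBC instead of cloud_files")
def customers_raw():
    # spark and dbutils are provided by the Databricks runtime
    return (
        spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales")
             .option("dbtable", "dbo.customers")
             .option("user", "etl_user")
             .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
             .load()
    )
```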
We write our notebooks in Scala, but looking at your video the supported languages are Python and SQL.
Do you know if Scala will be a possible language to use in DLT?
Honestly don't know! Some of the more abstracted Databricks elements (table access control, AAD passthrough, etc.) are Python/SQL only, so it may be a similar limitation? No idea what the future plan is inside Databricks!
Simon
Please can you help with the storage location? I've bumped into some problems.
Thoughts on dbt compared to this? Seems very similar.
Yeah, it seems to be aiming at a similar space, but has a lot less polish so far. Honestly, I've only dabbled with dbt so can't comment much further!
The transformations are very similar to what we have in dbt.
Does it replace Azure Data Factory to some extent?
Potentially - it certainly covers some of the orchestration elements, but it isn't as good at other workflow elements, like copying data into the platform, etc.!
Would it be possible to use Delta Live Tables for temporary jobs, like giving a user decrypted data, where a decryption job runs and, once the use is over, we delete the data from the workspace?
You certainly could - seems like a lot of setup if it's a throwaway bit of data. Probably easier to manage that via a custom notebook?
@AdvancingAnalytics We would like to expose this to other clients like Tableau. Within Databricks a custom notebook is fine, but for external clients we wanted to use this option.
Thanks Simon for this new video, I love your channel !
I'm not very familiar with Databricks, but I just did some practice on Azure Synapse (based on your video ruclips.net/video/lpBM4Yn2k3U/видео.html ), and after watching this new video on Delta Live Tables I was wondering if the same "outcome" couldn't be achieved with a Synapse (scheduled) pipeline and Data Flows (with a Delta table in the sink)...
Or did I completely miss the point (which is entirely possible :p)?
Hey! Yep, you could build a pipeline loading a delta table using Synapse pipelines & data flows that would achieve the same loading for the quick batch example - the deeper point of DLT is around incremental loading, the data quality elements and (hopefully) building some reusable transformation functions, which you wouldn't be able to do to the same level in a Synapse data flow!
Simon
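To make the incremental loading and data quality point above concrete, here's a rough sketch of incremental ingestion plus a quality rule in one pipeline (paths and column names are invented, and the syntax may shift as the preview evolves):

```python
import dlt

@dlt.table(comment="Incrementally ingested raw events via Auto Loader")
def events_raw():
    # spark is provided by the Databricks runtime
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/landing/events/")
    )

@dlt.table(comment="Clean layer with a data quality rule attached")
@dlt.expect_or_drop("has_event_id", "event_id IS NOT NULL")
def events_clean():
    # streaming read of the upstream live table, so only new data is processed
    return dlt.read_stream("events_raw")
```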
Expectations... assert transforms... check constraints... same stuff, different day.
How is this different from a SQL view? Also, can I do upserts and deletes on a Delta table using this?
The view objects literally are just SQL views; the only difference is the extra wrapping that lets you materialise the data back to physical Delta tables.
Updates & Deletes aren't currently supported, but I'm hoping we'll see them in there eventually!
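A tiny sketch of that distinction (names are hypothetical):

```python
import dlt

@dlt.view(comment="Not persisted - behaves like a SQL view scoped to the pipeline")
def recent_orders_vw():
    return dlt.read("orders_raw").where("order_date >= '2022-01-01'")

@dlt.table(comment="Materialised back to storage as a physical Delta table")
def recent_orders():
    return dlt.read("recent_orders_vw")
```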
Speak of the devil - just announced: databricks.com/blog/2022/02/10/databricks-delta-live-tables-announces-support-for-simplified-change-data-capture.html
@AdvancingAnalytics That is amazing. Thanks for sharing the link :)