Advancing Spark - Exploring DLT Event Metrics

  • Published: 1 Dec 2021
  • One of the huge benefits of Delta Live Tables is the metrics that are generated automatically - logging rows processed, durations, audit information and, crucially, the results of the expectations placed on each table. However, it can be pretty tricky to pull out the metrics from the underlying tables.
    In this video, Simon walks through the improvements to the DQ monitoring within DLT pipelines themselves, and then runs through the process of manually querying the event log and augmenting his DLT pipeline with some prepared data quality metrics.
    For more info on Delta Live Tables, check out the docs here: docs.microsoft.com/en-us/azur...
    As always, feel free to get in touch with Advancing Analytics if we can help you get to where you need to be on your Lakehouse journey.
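    A minimal sketch of the kind of event-log query walked through in the video is below, assuming the default layout where the event log is a Delta table under the pipeline's storage location at <storage>/system/events; the path, the pipeline id and the exact shape of the details JSON are assumptions to adapt for your own pipeline.

```python
from pyspark.sql.functions import col, explode, from_json, get_json_object

# Assumed location: by default the DLT event log is written as a Delta table
# under the pipeline's storage location at <storage>/system/events.
event_log = spark.read.format("delta").load(
    "dbfs:/pipelines/<pipeline-id>/system/events"
)

# Assumed schema for the expectations array inside the `details` JSON payload.
expectation_schema = (
    "array<struct<name:string,dataset:string,"
    "passed_records:long,failed_records:long>>"
)

dq_metrics = (
    event_log
    .where(col("event_type") == "flow_progress")
    .select(
        col("timestamp"),
        col("origin.flow_name").alias("flow_name"),
        explode(
            from_json(
                get_json_object(
                    col("details"),
                    "$.flow_progress.data_quality.expectations",
                ),
                expectation_schema,
            )
        ).alias("expectation"),
    )
    .select(
        "timestamp",
        "flow_name",
        col("expectation.name").alias("expectation_name"),
        col("expectation.passed_records").alias("passed_records"),
        col("expectation.failed_records").alias("failed_records"),
    )
)

dq_metrics.show(truncate=False)
```

    Rows without data-quality details (maintenance events, for example) drop out naturally because explode skips null arrays, leaving one row per flow update per expectation.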

Comments • 9

  • @chrisstephenson9890 · 2 years ago

    Very informative, thank you. I very much like the additional features DLT and autoloader are adding to provide for standardised ELT frameworks.

  • @kurtmaile9977 · 2 years ago

    Great video as always, thanks, some of my favourite listening, particularly around DLT, which we are actively looking at ourselves! And great topic on DQ and how to handle it with DLT.
    Similar to Darryll's question, I would love to see you explore the process of handling failed expectations as the natural next step after trapping and collecting the metrics. We next need to handle those records in the DLT pipeline that failed. For me there are two key scenarios:
    1) A failure requiring manual intervention on a single record within a batch, where it can be patched within Databricks itself, e.g. identify the suspect row(s), write SQL to update the row in the source, and let the CDC change stream feed it into the processing again and pass through. I can then see some level of incident 'run(note)book automation' to fix this if it is predictable enough and the issue can't be fixed at source.
    2) Transient failures - for me this is the biggest / most common one, and it is often observed with late-arriving data at the point of a stream join where that enrichment is needed (and thus expressed as an expectation on the target table). E.g. a SalesOrderDetail row being enriched with Product/SKU info (i.e. another DLT table), but for whatever reason (e.g. the product master feed is down for a period of time) the product row is not present at the time of the SalesOrderDetail join. BUT we expect it to eventually arrive within some acceptable bound of time.
    This can and should be expected in a distributed system anyway - ideally what you would want is some automated retry of the failed rows (under some condition, e.g. up to 24 hours, or x retries, retrying every x period of time), where we expect them to eventually join correctly and then pass the expectation on the target table, without any intervention needed. In essence you suspect it will self-heal over time when the dependent data arrives; it's just a matter of not 'hard failing' and then needing manual intervention.
    Would love to hear your thoughts on this (scenario 2 in particular) and even some level of demo! :) My understanding is you'd need to hand-roll something now for both use cases. Point 1) is understandable; for 2), I would love to see something natively built into DLT (perhaps there is and I'm not aware), but in the absence of that, what are your thoughts on handling such a scenario?
    Thanks heaps, keep up the good work
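    [A minimal, hand-rolled sketch of the quarantine idea behind scenario 2: alongside the enriched table, a second table keeps the rows that fail the join expectation so they can be replayed once the late-arriving reference data lands. DLT has no built-in retry for this; the table, column and rule names below are hypothetical.]

```python
import dlt
from pyspark.sql.functions import expr

# The rule an enriched row must satisfy: the order line found a matching product.
HAS_PRODUCT = "product_id IS NOT NULL"

@dlt.table(name="sales_order_detail_enriched")
@dlt.expect_or_drop("valid_product", HAS_PRODUCT)
def sales_order_detail_enriched():
    # Hypothetical upstream datasets -- adjust names to your own pipeline.
    orders = dlt.read_stream("sales_order_detail_raw")
    products = dlt.read("product_master")
    return orders.join(products, on="sku", how="left")

@dlt.table(name="sales_order_detail_quarantine")
def sales_order_detail_quarantine():
    # Inverse of the rule: keep rows where the join found no product yet, so a
    # later run or a separate repair job can re-process them once it arrives.
    orders = dlt.read_stream("sales_order_detail_raw")
    products = dlt.read("product_master")
    return (
        orders.join(products, on="sku", how="left")
              .where(expr(f"NOT ({HAS_PRODUCT})"))
    )
```

    [The retry loop itself, re-driving quarantined rows back through the join after x hours or x attempts, still has to be scheduled outside the pipeline, which is exactly the gap the comment points at.]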

  • @drummerboi4eva · 1 year ago

    Thanks a lot Simon, clear ideas, architecture and execution!! :)

  • @TechMomentAI · 2 years ago · +3

    Great video thanks. Can you share the notebooks you used?

  • @RajanieshKaushikk · 2 years ago

    Simply Awesome!!

  • @darryll127 · 2 years ago · +1

    How can we specifically identify which rows and which expectations on individual rows failed?

  • @ravirajuvysyaraju123 · 2 years ago

    Thanks

  • @kaurivneet1 · 2 years ago

    As always, brilliant content. I like the way you structure your approach. One question: can we create a column (e.g. dqcheckfailed Y/N) to flag rows in the table which don't meet expectations? That way, if the data is corrected back in the source and reprocessed into the lake, the column value gets updated. This would give an accurate count of problematic rows in the data.
    The above can be achieved by building a custom data quality framework, but I was wondering if it can be baked into DLT.
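    [A minimal sketch of how that flag could be hand-rolled today: declare the rules as warning-only expectations so the pass/fail metrics still land in the event log, and derive the flag column from the same rule strings. The rule, table and column names are hypothetical.]

```python
import dlt
from pyspark.sql.functions import expr

# Hypothetical rules -- the same strings drive both the expectations and the flag.
RULES = {
    "valid_order_id": "order_id IS NOT NULL",
    "valid_amount": "amount >= 0",
}

@dlt.table(name="orders_with_dq_flag")
@dlt.expect_all(RULES)  # warning-only: metrics are recorded, no rows are dropped
def orders_with_dq_flag():
    df = dlt.read("orders_cleansed")  # hypothetical upstream table
    # Flag a row when it violates any rule; if the source row is corrected and
    # reprocessed (CDC upsert or full refresh), the flag is simply recomputed.
    failed_any = " OR ".join(f"NOT ({cond})" for cond in RULES.values())
    return df.withColumn("dq_check_failed", expr(failed_any))
```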

  • @NeumsFor9 · 1 year ago

    The vendor has a lot of nerve charging premium prices and then leaving the engineer to "keep on digging". They've just got to do better. If I am buying the Expensive Car, I should not need to jury-rig and jiggle the steering wheel to get my car to start.
    Simon: Many thanks.
    Databricks: Sew it together more professionally.