Databricks - Slowly Changing Dimension & CDC with Delta Live Tables and Azure SQL Database

  • Published: Oct 4, 2024

Comments • 5

  • @andriifadieiev9757
    @andriifadieiev9757 6 months ago

    Thank you for sharing! Great explanation.

  • @VenkatesanVenkat-fd4hg
    @VenkatesanVenkat-fd4hg 6 months ago

    Super video, as always.

  • @hectoroviedo960
    @hectoroviedo960 6 months ago

    Hello, when I make some deletes and updates, I get this error in "users_cdc_clean" and "user_cdc_quarantine": "Detected a data update (for example part-00000-3795a007-fac7-4ed9-9a5a-f446b34caef9-c000.snappy.parquet) in the source table at version 3. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory". Why is that? How can I manage deletes and updates?

    • @AthanasiouApostolos
      @AthanasiouApostolos 6 months ago

      You can't update or delete rows in a table that feeds a streaming table and then restart the pipeline. This is why we enabled CDC on the database: every change carries its update type information, which can be used for SCD2. If that is what you are asking.
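To illustrate the idea in the reply above, here is a minimal pure-Python sketch of how append-only CDC rows with an operation type can be folded into SCD Type 2 history. It is not the video's pipeline code: the `CdcEvent` and `Scd2Row` classes, the `operation` values ("INSERT", "UPDATE", "DELETE"), and the `seq` ordering column are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CdcEvent:
    """One append-only change record (hypothetical shape)."""
    key: int
    name: Optional[str]   # None for deletes
    operation: str        # assumed values: "INSERT", "UPDATE", "DELETE"
    seq: int              # assumed ordering column (e.g. a change sequence number)


@dataclass
class Scd2Row:
    """One version of a record in the SCD Type 2 history."""
    key: int
    name: Optional[str]
    start_seq: int
    end_seq: Optional[int] = None  # None means "current version"


def apply_cdc_scd2(events: List[CdcEvent]) -> List[Scd2Row]:
    """Fold change events into SCD2 history, processing them in seq order."""
    history: List[Scd2Row] = []
    current: Dict[int, Scd2Row] = {}  # key -> currently open version
    for ev in sorted(events, key=lambda e: e.seq):
        # Any new event for a key closes the version that was open so far.
        open_row = current.pop(ev.key, None)
        if open_row is not None:
            open_row.end_seq = ev.seq
        # Inserts and updates open a new version; deletes only close the old one.
        if ev.operation != "DELETE":
            row = Scd2Row(ev.key, ev.name, ev.seq)
            history.append(row)
            current[ev.key] = row
    return history


events = [
    CdcEvent(1, "Ann", "INSERT", 1),
    CdcEvent(1, "Anna", "UPDATE", 2),
    CdcEvent(1, None, "DELETE", 3),
]
for row in apply_cdc_scd2(events):
    print(row)
```

The point of the sketch is that the source of the stream is never modified in place: updates and deletes arrive as new rows, and the history is derived from them.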

    • @peuzeltje
      @peuzeltje 5 months ago

      @AthanasiouApostolos I have the same problem when implementing this. The pipeline can only run as a full refresh; it cannot be updated when new CDC records arrive in the source tables. The 'middle' streaming table (users_cdc_clean in your example) fails because of changes in its source table (users_cdc_bronze in your example). After the initial run (or a full refresh), if I make updates to the source table (which would be TEST123 in your example) and then run a pipeline update, it fails with this error: (org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 98392bb5-e5ee-419c-887a-514bbd0d4b8d, runId = 8fd2d375-9796-4ad0-a24f-eaf7bdd3f6f4] terminated with exception: [DELTA_SOURCE_TABLE_IGNORE_CHANGES] Detected a data update (for example part-00000-50168213-b0c8-4b5e-a20b-251aa77c3d4b-c000.snappy.parquet) in the source table at version 38. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'). The updates that I made in TEST123 should only lead to appends in users_cdc_bronze, but somehow the Databricks runtime seems to think there are updates in this table. Even when setting the option skipChangeCommits to true for the users_cdc_bronze ingest, the pipeline fails on the second run.
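For reference, the `skipChangeCommits` option named in the error message is a read option on the Delta streaming source, so one thing worth checking is whether it is set on the stream that reads the changed table (the reader of users_cdc_bronze inside users_cdc_clean) rather than on the ingest that writes it. The sketch below is an assumption about how that read could look, not the video's code: `read_append_only_stream` is a hypothetical helper, and `spark` is assumed to be an existing SparkSession with Delta available.

```python
def read_append_only_stream(spark, table_name: str):
    """Open a streaming read over a Delta table, skipping commits that rewrite rows.

    `spark` is assumed to be an active SparkSession. With skipChangeCommits set,
    commits that update or delete existing rows are ignored by this stream, so
    those changes are NOT propagated downstream; only appended data is read.
    """
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table(table_name)
    )
```

Inside a pipeline this might be called as `read_append_only_stream(spark, "LIVE.users_cdc_bronze")`; the `LIVE.` prefix is an assumption about how the tables are referenced. Note that skipping change commits hides the updates rather than applying them, so it does not replace handling updates and deletes through the CDC operation type.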