Use BigQuery Transfer Service to Optimise a File Based CDC

  • Published: 31 Dec 2024

Comments • 9

  • @abednegosantoso • 2 months ago • +1

    Thank you for this superb explanation. If I understand correctly, this is about how we ingest data into BigQuery, right?
    At the moment, our solution uses Datastream. Which one do you think is better?

    • @practicalgcp2780 • 2 months ago

      Yes, it's about ingesting data into BigQuery. However, it's more focused on situations where your CDC solution's target isn't BigQuery but GCS (Cloud Storage).
      Not everyone ingests into BigQuery directly; some built their pipelines years ago, when the BigQuery connector wasn't supported, or they have issues handling schema changes. For these customers, switching to another solution can be really costly and time consuming.
      Datastream is a solution that does CDC all the way into BigQuery; if you already have it, there's no need to change. A solution like Datastream that ingests data directly into BigQuery offers much better latency (near real time) compared with file-based CDC. (A sketch of a Transfer Service load from GCS follows this thread.)

    • @abednegosantoso • 2 months ago

      @practicalgcp2780 If I'm not mistaken, Datastream also allows GCS as a destination?
      Anyway, have you tried Iceberg on GCS with BigQuery on top of it?
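
A note on the reply above: the video itself is about scheduling loads of CDC files from GCS into BigQuery with the BigQuery Data Transfer Service. Below is a minimal sketch of such a setup using the Python client library; the project, bucket, dataset and table names are placeholders, and the exact params keys and schedule string should be checked against the Cloud Storage transfer documentation.

```python
# Sketch: schedule a Cloud Storage -> BigQuery load for CDC files.
# All resource names below are hypothetical placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="cdc_landing",             # landing dataset (hypothetical)
    display_name="orders-cdc-file-load",
    data_source_id="google_cloud_storage",
    params={
        # Glob matching the files the CDC tool drops into the bucket.
        "data_path_template": "gs://my-cdc-bucket/orders/*.parquet",
        "destination_table_name_template": "orders_changes",
        "file_format": "PARQUET",
        # APPEND keeps every change record; SCD2/merge logic happens downstream.
        "write_disposition": "APPEND",
    },
    schedule="every 15 minutes",
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),  # hypothetical project
    transfer_config=transfer_config,
)
print(f"Created transfer config: {created.name}")
```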

  • @andrelsjunior • 2 months ago • +1

    Nice tips!
    For that SCD2 type, how do you recommend doing the most optimised soft delete?

    • @practicalgcp2780 • 2 months ago

      Thank you! I don't know what's most optimised, but the model I see most often is to have valid_from and valid_to columns with timestamps.

    • @andrelsjunior • 2 months ago

      But then do you run an update query after ingesting to expire the old records?

    • @practicalgcp2780 • 2 months ago

      @andrelsjunior I haven't done this for a while, but on BigQuery it typically isn't done with a traditional UPDATE but with a DML MERGE (I believe dbt can manage this quite easily without you having to write the DML yourself; see the MERGE sketch after this thread).
      The other way to do it (again, opinions are mixed; some like it, some don't) is simply to add a row with the new data the first time you see it. You then have both rows with their own timestamps, and it works well with incremental workloads without needing to update any pre-existing records. I'm not sure this follows any open standard, but personally I find it easier to deal with.
      Whichever way you do it, it's really important to make sure the whole thing can always be rebuilt from the data lake. This is easy enough to do and manage with tools like dbt, but as a concept it's very important regardless of the tooling choice: if the logic is wrong or needs updating, a full rebuild fixes everything.

    • @andrelsjunior • 2 months ago

      @practicalgcp2780 Appreciate your POV!
      I was thinking about using dbt specifically, but it adds an extra layer haha.
      Seems like this is the best way.
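
To make the SCD2 discussion in this thread concrete, here is a minimal sketch of the MERGE pattern mentioned above, with valid_from / valid_to columns and a soft close of superseded rows, run through the BigQuery Python client. All project, table and column names (dim_customer, customer_changes, change_ts, address) are hypothetical, NULL handling is omitted for brevity, and a tool such as dbt (snapshots) can generate similar DML for you.

```python
# Sketch: SCD2 upsert on BigQuery -- close the old version of a changed row
# and insert the new version in a single MERGE. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my-project.analytics.dim_customer` AS d
USING (
  -- Change records landed via the file-based CDC + Transfer Service load.
  SELECT customer_id AS merge_key, *
  FROM `my-project.cdc_landing.customer_changes`
  UNION ALL
  -- Changed keys are duplicated with a NULL merge_key so they miss the ON
  -- clause and fall through to the INSERT branch as the new current version.
  SELECT NULL AS merge_key, c.*
  FROM `my-project.cdc_landing.customer_changes` AS c
  JOIN `my-project.analytics.dim_customer` AS cur
    ON c.customer_id = cur.customer_id AND cur.valid_to IS NULL
  WHERE c.address != cur.address
) AS s
ON d.customer_id = s.merge_key AND d.valid_to IS NULL
WHEN MATCHED AND s.address != d.address THEN
  -- Soft-close the superseded version.
  UPDATE SET valid_to = s.change_ts
WHEN NOT MATCHED THEN
  -- Brand-new keys and new versions of changed keys get an open-ended row.
  INSERT (customer_id, address, valid_from, valid_to)
  VALUES (s.customer_id, s.address, s.change_ts, NULL)
"""
client.query(merge_sql).result()
```

The append-only alternative described in the reply skips the UPDATE branch entirely and derives the current version at query time from the latest timestamp per key. For soft deletes specifically, a delete flag on the change record can drive an extra WHEN MATCHED branch that only sets valid_to without inserting a replacement row.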

  • @practicalgcp2780 • 2 months ago

    A good question was asked about schema management; see www.linkedin.com/feed/update/urn:li:activity:7248816174312960000?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7248816174312960000%2C7248834731771858944%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7248816174312960000%2C7248943929612468224%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287248834731771858944%2Curn%3Ali%3Aactivity%3A7248816174312960000%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287248943929612468224%2Curn%3Ali%3Aactivity%3A7248816174312960000%29