SQL Best Practices - Designing An ETL - Part 1

  • Published: 19 Oct 2024

Comments • 40

  • @abnuru1784
    @abnuru1784 5 years ago +19

    This is down to earth. I do appreciate your level of understanding and your passing it on to me as a beginner.

    • @SeattleDataGuy
      @SeattleDataGuy  5 years ago +2

      Thank you for your kind words, Abdul! Glad I could pass on some knowledge.

  • @yaqubhassan51
    @yaqubhassan51 5 years ago +4

    By far one of the best I have seen on data mapping and the stages of ETL... Thanks, Ben! Keep them coming!!

    • @SeattleDataGuy
      @SeattleDataGuy  5 years ago

      Thank you! I am trying to find time between work and consulting!

  • @higiniofuentes2551
    @higiniofuentes2551 3 years ago +3

    Very interesting video, with a lot of ideas. The content is a bit different from what the title suggests; I thought it would be more focused on presenting the "players" used to construct an ETL.
    The idea of raw data seems to come more from the operational database than from the flat files (CSV, etc.), and after the cleaning/staging/mapping comes the stage DB and then the data warehouse (see the sketch below).
    I can't wait to see part 2!
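
    A rough sketch of those hops in SQL, assuming PostgreSQL; the schema and table names (raw, stage, dw, customers) are hypothetical illustrations, not from the video:

        -- Hop 1: the operational extract lands untouched in the raw DB
        INSERT INTO raw.customers SELECT * FROM ops_extract.customers;

        -- Hop 2: cleaning and mapping produce the stage DB version
        INSERT INTO stage.customers (customer_id, full_name, region_code)
        SELECT CAST(id AS BIGINT), TRIM(name), UPPER(region)
        FROM raw.customers;

        -- Hop 3: conformed rows load into the warehouse dimension
        INSERT INTO dw.dim_customer (customer_id, full_name, region_code)
        SELECT customer_id, full_name, region_code
        FROM stage.customers;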

  • @ablack0
    @ablack0 4 years ago +11

    Awesome explanation. Thank you. I have lots of questions about this topic. How do you handle additions of new data? Do the raw and stage databases get cleared out each time new data is added to the data warehouse? Also how do you handle changes in ETL logic over time? How would you handle a situation where a portion of historical data needs to be reloaded into the data warehouse possibly using new ETL logic? Should it always be possible to recreate the entire data warehouse from the flat files? If so how do you ensure that the current state of the data warehouse is the same as it would be if it were blasted away and recreated from scratch? Are there any strategies for version control of a data warehouse?

  • @jpank11
    @jpank11 3 years ago +1

    You’re a natural teacher. Great vid!

  • @passais
    @passais 5 years ago +22

    Is a Part 2 coming? I could not find it. Very good vid!

    • @zhangleo9192
      @zhangleo9192 3 years ago

      I can’t find it either

    • @elysel9424
      @elysel9424 3 years ago +3

      ruclips.net/video/2qM3UlX8zTo/видео.html&ab_channel=SeattleDataGuy (I think this is it)

  • @09soleil
    @09soleil 2 years ago +1

    No part 2? What about the logging and error-tracking video? Really nice video, thank you very much!

  • @Georgehwp
    @Georgehwp 2 years ago +2

    It would be great if you could make this more specific, with something like dbt or Prefect/Dagster/Airflow involved.

  • @KoldFiyure
    @KoldFiyure 5 years ago +3

    Thank you so much. I needed this for an interview I have coming up. I need some formal concepts for what I am doing at my current job, where things aren't exactly referred to in these ways.

    • @SeattleDataGuy
      @SeattleDataGuy  5 years ago

      Let me know if you have any other specific questions or concepts you would like covered!

  • @michailo87
    @michailo87 2 years ago

    Have you ever built a data quality monitoring system that sends notifications for loading issues? It helps to track issues at every level of loading.

  • @JanUnitra
    @JanUnitra 2 years ago

    Where should the slowly changing dimension process be done, in staging or in the DW?

  • @hakank.560
    @hakank.560 3 years ago +2

    As a financial auditor I want to extract data from our clients' databases and then manipulate it into audit information. Is learning the SQL language the best thing to do? I'd like to hear from you.

    • @SeattleDataGuy
      @SeattleDataGuy  3 years ago

      That's interesting. Is there any other way to extract the data? Learning SQL is a big lift and might not be a good time trade-off. If you need to do analytics on the data, then I would say yes. But if you're just auditing, it might not provide the same benefit.

  • @Thiago280690
    @Thiago280690 2 years ago +1

    Great explanation!

    • @SeattleDataGuy
      @SeattleDataGuy  2 years ago

      Thank you! Man, this video is old... I need to make a new one.

  • @00EagleEye00
    @00EagleEye00 3 years ago +2

    Hi there.
    Got a question on raw data (flat files).
    These don't have identities or keys, so you formulate a candidate key by combining some columns (product, location, target_year).
    Here's the question: if some columns need correction and they belong to the candidate-key combination, how can the data be corrected or updated? What approach should be taken?

    • @SeattleDataGuy
      @SeattleDataGuy  3 years ago

      I am sorry for leaving this question for so long; I really wasn't sure of the best way to respond.
      If you're still curious, do you mind rephrasing the question?

    • @00EagleEye00
      @00EagleEye00 3 years ago +1

      @@SeattleDataGuy In the absence of unique IDs or keys for flat files, there are suggestions to use composite keys, i.e. combine columns to serve as a candidate key (e.g. product, area, and product_year columns, giving apple-washington-1999 as the identity). The question is: what if one of the candidate columns has a null or empty value on the first ingestion (e.g. no area value, giving apple-1999 as the key), and a later update arrives with the value filled in (say the area is indiana, which would give apple-indiana-1999 as the key)? What is the best approach to handle this kind of update? Would it result in data loss, since the first key combination was incomplete? Once the update arrives and completes it, should the record with the incomplete key combination be removed/cleaned up?
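
      One hedged way to approach that scenario, sketched in PostgreSQL-flavored SQL (the fact_sales and staged_rows names and the 'unknown' sentinel are hypothetical): derive the key with COALESCE so the missing part is explicit, then retire the incomplete row when a complete one arrives for the same natural columns.

          -- Derive a deterministic composite key; COALESCE makes a missing
          -- area explicit instead of silently shortening the key
          -- (apple-unknown-1999 rather than apple-1999)
          SELECT CONCAT_WS('-', product, COALESCE(area, 'unknown'), target_year) AS row_key,
                 product, area, target_year
          FROM staged_rows;

          -- When a later load fills in the area, remove the incomplete row
          -- for the same (product, target_year) before inserting the complete
          -- one, so the update is a replace rather than a silent duplicate
          DELETE FROM fact_sales f
          WHERE f.area = 'unknown'
            AND EXISTS (SELECT 1
                        FROM staged_rows s
                        WHERE s.product = f.product
                          AND s.target_year = f.target_year
                          AND s.area IS NOT NULL);

          INSERT INTO fact_sales (row_key, product, area, target_year)
          SELECT CONCAT_WS('-', product, COALESCE(area, 'unknown'), target_year),
                 product, COALESCE(area, 'unknown'), target_year
          FROM staged_rows;

      Under this sketch nothing is lost: the old record is removed only when a more complete version of the same logical row shows up.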

  • @kshitijpathak8646
    @kshitijpathak8646 3 years ago +2

    Great video! I have one clarifying question: is there a need to create CSV, XML, etc. files from the operational DB and then load the data into the raw DB? Wouldn't it be easier and more efficient to simply load the data from the operational DB into the raw DB without creating any files in between?

    • @SeattleDataGuy
      @SeattleDataGuy  3 years ago +3

      This can depend. Some people like to do this for observability's sake.
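
      For context, a minimal sketch of that file-in-the-middle pattern, assuming PostgreSQL on both ends (the paths and table names are hypothetical): the CSV left on disk becomes an inspectable, replayable record of exactly what was extracted.

          -- On the operational side: snapshot the table to a file
          COPY operational.orders TO '/extracts/orders_2024_10_19.csv'
              WITH (FORMAT csv, HEADER true);

          -- On the raw side: load the same file as-is
          COPY raw.orders FROM '/extracts/orders_2024_10_19.csv'
              WITH (FORMAT csv, HEADER true);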

  • @weouthere6902
    @weouthere6902 2 years ago

    Wish you made more content like this

  • @lambdakicks
    @lambdakicks 5 years ago +2

    What about removing nulls and malformed entries -- would you recommend doing that prior to Staging or afterwards?

    • @lambdakicks
      @lambdakicks 5 years ago

      Excellent video by the way, thank you!

    • @SeattleDataGuy
      @SeattleDataGuy  5 years ago +3

      Personally, I prefer loading raw data as is, regardless of whether there are errors. Why? Because how do you know where the errors came from? If you add a bunch of logic into your raw load, that logic could mess up your data and you might not know it. However, if you have a problem in your raw data and all you are really doing is loading, then you know the problem is in the data itself. This isn't a rule, more of a guideline; it depends how messy your data is. If 90% of your data is messy and needs cleaning, then consider doing it prior, but if only 5% is, it might be better to do it later.
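
      A minimal sketch of that guideline (hypothetical names, PostgreSQL-flavored): the raw load stays a dumb copy, and the null/malformed filtering lives in the raw-to-staging step, where it is easy to audit.

          -- Raw load: no logic, every row lands even if it is malformed
          INSERT INTO raw.events SELECT * FROM ext.events_feed;

          -- Cleaning happens on the way to staging, where it can be audited
          INSERT INTO stage.events (event_id, event_time, payload)
          SELECT CAST(event_id AS BIGINT),
                 CAST(event_time AS TIMESTAMP),
                 payload
          FROM raw.events
          WHERE event_id IS NOT NULL                    -- drop null keys
            AND event_time ~ '^\d{4}-\d{2}-\d{2}';      -- drop malformed dates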

  • @peekguyy3194
    @peekguyy3194 2 years ago +1

    thanks for this

  • @mohamedarif2303
    @mohamedarif2303 3 years ago

    Very nice, but could you provide more videos on ETL please? Thanks!

  • @ninjaturtle205
    @ninjaturtle205 1 year ago

    I think I now understand everything.

  • @roshanshah5028
    @roshanshah5028 5 years ago

    Hi, I have a few questions related to this topic.

  • @50tigres79
    @50tigres79 4 years ago

    Where is part 2?

    • @elysel9424
      @elysel9424 3 years ago

      ruclips.net/video/2qM3UlX8zTo/видео.html&ab_channel=SeattleDataGuy

  • @user-yq9kr9sy9i
    @user-yq9kr9sy9i 1 year ago

    Please provide an example and explain this again.

  • @ninjaturtle205
    @ninjaturtle205 1 year ago

    mindd blowwnnn