This is down to earth. I appreciate your level of understanding and for passing it on to me as a beginner.
Thank you for your kind words Abdul! Glad I could pass on some knowledge.
By far one of the best I have seen on data mapping and the stages of ETL... Thanks, Ben! Keep them coming!!
Thank you! I am trying to find time between work and consulting!
You’re a natural teacher. Great vid!
You're too kind! Thank you!
Very interesting video, with a lot of ideas. The content is a bit different from what the title suggests; I thought it would be more focused on presenting the "players" used to construct an ETL.
The raw data seems to come more from the operational database than from flat files (CSV, etc.), and after the cleaning/staging/mapping come the Stage DB and then the data warehouse.
I can't wait to see part 2 !
where is part 2?
Awesome explanation, thank you. I have lots of questions about this topic. How do you handle additions of new data? Do the raw and stage databases get cleared out each time new data is added to the data warehouse? Also, how do you handle changes in ETL logic over time? How would you handle a situation where a portion of historical data needs to be reloaded into the data warehouse, possibly using new ETL logic? Should it always be possible to recreate the entire data warehouse from the flat files? If so, how do you ensure that the current state of the data warehouse is the same as it would be if it were blasted away and recreated from scratch? Are there any strategies for version control of a data warehouse?
Great explanation!
Thank you! Man, this video is old... I need to make a new one.
No part 2? What about the logging and error tracking video? Really nice video, thank you very much!
Is a Part 2 coming? I could not find it. Very good vid!
I can’t find it either
ruclips.net/video/2qM3UlX8zTo/видео.html&ab_channel=SeattleDataGuy - I think this is it.
Wish you made more content like this
thanks for this
Glad you liked it!
Would be great if you could make this more specific with something like DBT or prefect/dagster/airflow involved.
Have you ever built a data quality monitoring system that sends notifications for loading issues? It helps to track issues at every level of loading.
Thank you so much. I needed this for an interview I have coming up. I need some formal concepts for what I am doing at a current job where things aren't exactly referred to in these ways.
Let me know if you have any other specific questions or concepts you would like covered!
As a financial auditor, I want to extract data from our clients' databases and then manipulate it to produce audit information. Is learning SQL the best thing to do? I'd like to hear from you.
That's interesting. Is there any other way to extract the data? Learning SQL is a big lift and might not be a good time trade-off. If you need to do analytics on the data, then I would say yes. But if you're just auditing, it might not provide the same benefit.
Great video! I have one clarifying question. Is there a need to create CSV, XML, etc. files from the operational DB and then load the data into the raw DB? Wouldn't it be easier and more efficient to simply load the data from the operational DB into the raw DB without creating any files in between?
This can depend. Some people like to do this for observability's sake.
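If it helps, here is a minimal sketch of that file-in-between pattern, using SQLite and hypothetical table/file names purely as stand-ins. The intermediate CSV becomes an artifact you can archive, inspect, or replay later:

```python
# A minimal sketch of "extract to file, then load", using only the standard
# library and SQLite stand-ins. Table and file names are hypothetical.
import csv
import sqlite3

def extract_to_csv(operational_db: str, table: str, out_path: str) -> None:
    """Dump an operational table to a CSV file that can be archived or replayed."""
    with sqlite3.connect(operational_db) as conn:
        cursor = conn.execute(f"SELECT * FROM {table}")
        headers = [col[0] for col in cursor.description]
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)
            writer.writerows(cursor)

def load_csv_to_raw(raw_db: str, table: str, csv_path: str) -> None:
    """Load the CSV into the raw database exactly as-is (no cleaning here)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        placeholders = ", ".join("?" for _ in headers)
        with sqlite3.connect(raw_db) as conn:
            conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(headers)})")
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)

# Usage: the CSV sitting in between is the observability checkpoint.
extract_to_csv("operational.db", "orders", "orders_2024_01_01.csv")
load_csv_to_raw("raw.db", "orders", "orders_2024_01_01.csv")
```

The trade-off is extra storage and an extra step versus being able to see and re-run exactly what was extracted on a given day.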
Where should the slowly changing dimension process be done, in staging or in the DW?
What about removing nulls and malformed entries -- would you recommend doing that prior to Staging or afterwards?
Excellent video by the way, thank you!
Personally, I prefer loading raw data as-is, regardless of whether there are errors. Why? Because how do you know where the errors came from? If you add a bunch of logic into your raw load, that logic could mess up your data and you might not know it. However, if you have a problem in your raw data and all you are really doing is loading, then you know the problem is in the data itself. This isn't a rule, more of a guideline. It depends on how messy your data is. If 90% of your data is messy and needs cleaning, then consider doing it prior, but if only 5% is, then it might be better to do it later.
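To make that concrete, here is a rough sketch (with made-up column names like order_id, amount, and order_date) of keeping the raw load logic-free and pushing all cleaning into the raw-to-stage step, so rejected rows can be traced and counted:

```python
# Hedged sketch of "load raw as-is, clean on the way to stage".
# Column names and file names are hypothetical.
import pandas as pd

def load_raw(csv_path: str) -> pd.DataFrame:
    """Raw load: no logic at all, so any bad values are provably in the source."""
    return pd.read_csv(csv_path, dtype=str)  # keep everything as text

def raw_to_stage(raw: pd.DataFrame) -> pd.DataFrame:
    """Staging: cleaning rules live here, where they can be audited and changed."""
    stage = raw.copy()
    stage["amount"] = pd.to_numeric(stage["amount"], errors="coerce")
    stage["order_date"] = pd.to_datetime(stage["order_date"], errors="coerce")
    # Drop rows that failed conversion or are missing a key, and report how many.
    bad = stage["amount"].isna() | stage["order_date"].isna() | stage["order_id"].isna()
    print(f"rejected {bad.sum()} of {len(stage)} rows during staging")
    return stage[~bad]

raw_df = load_raw("orders_raw.csv")
stage_df = raw_to_stage(raw_df)
```

Because the raw table is untouched, you can always diff a rejected row against the source to see whether the data or the cleaning rule is at fault.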
Hi there.
Got a question on raw data (flat files).
These don't have identities or keys, so you formulate a candidate key by combining some columns (product, location, target_year).
Here's the question: if some of the columns that need correcting belong to the candidate key combination, how can the data be corrected or updated? What approach should be taken?
I am sorry for leaving this question unanswered for so long. I really wasn't sure of the best way to respond.
If you're still curious, do you mind rephrasing the question?
@SeattleDataGuy In the absence of unique/ID keys for flat files, one suggestion is to use a composite key, combining columns into a candidate key as the identity (e.g., product, area, and product_year columns resulting in apple-washington-1999). The question is: what if one of the candidate columns has a null or empty value on the first ingestion (e.g., no area value, resulting in apple-1999 as the key), and a later update arrives with the value filled in (say area is indiana, resulting in apple-indiana-1999 as the key)? What is the best approach to handle updates in this scenario? Would it result in data loss, since the first key combination is incomplete? When the update arrives and the key becomes complete, should the record with the incomplete key be removed/cleaned?
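For illustration only, here is one possible way to handle that scenario (not from the video; the column names come straight from the example above): build the composite key, and when a fuller record arrives, let it absorb and replace an earlier record whose only missing key part is the area. The obvious risk, which is really the data-loss concern in the question, is merging two records that are genuinely different, so this only works if an incomplete key can safely be assumed to refer to the same logical record.

```python
# One possible (hedged) approach: a fuller record supersedes an earlier record
# whose only difference is a missing key component. Purely illustrative.
from typing import Optional

def composite_key(product: str, area: Optional[str], year: str) -> str:
    parts = [product, area or "", year]
    return "-".join(p for p in parts if p)  # apple-washington-1999 or apple-1999

def upsert(records: dict, product: str, area: Optional[str], year: str, payload: dict) -> None:
    """Insert or update keyed by the composite key, merging with an incomplete twin."""
    key = composite_key(product, area, year)
    if area:
        # An earlier ingestion may have stored this row without the area.
        incomplete_key = composite_key(product, None, year)
        if incomplete_key in records:
            # Carry old attributes forward, then drop the incomplete row so the
            # same logical record is not counted twice.
            payload = {**records.pop(incomplete_key), **payload}
    records[key] = payload

store: dict = {}
upsert(store, "apple", None, "1999", {"qty": 10})       # key: apple-1999
upsert(store, "apple", "indiana", "1999", {"qty": 12})  # replaces apple-1999
print(store)  # {'apple-indiana-1999': {'qty': 12}}
```

In a warehouse you would typically do the same thing with a MERGE/upsert in the staging layer, and keep the original incomplete row in the raw layer so nothing is truly lost.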
Very nice, but could you provide more videos on ETL please? Thanks!
Hi, I have a few questions related to this topic.
I think I now understand everything.
where is part 2?
ruclips.net/video/2qM3UlX8zTo/видео.html&ab_channel=SeattleDataGuySeattleDataGuy
Please provide an example and explain this again.
mindd blowwnnn