Understand Slowly Changing Dimensions

Bryan Cafferky

Просмотров 20 тыс.

Добавить в
- Мой плейлист
- Посмотреть позже
Поделиться

HTML-код

Размер видео:

Показать панель управления

Автовоспроизведение

Автоповтор

Опубликовано: 10 июл 2024
This lesson answers the common interview question, 'What are SCDs?
and 'Explain the SCD types'. I'll explain Slowly Changing Dimensions in detail so you will ace this interview question but more importantly, you will know how to handle this pivotal issue in your data warehouse architecture.
Наука

Комментарии • 55

@sheldonwayne1390 2 года назад ⁺²
I'm from China and it's frustrating that you can't find any video doing a good job explaining knowledges about DWH, glad I found this on RUclips. Cheers!
@sandrafield9813 2 года назад ⁺¹
Thanks for this description! Its better than all the other ones I found. Its more clear with extra details without being convoluted and messy.
@houstonfirefox 10 месяцев назад
Great explanation Bryan w/real-world examples and analogies. New sub!
@Jel.Awesh.M 2 года назад
Thank you for sharing! A very good resource for SCDs!
@guillermopalaciosgaray 2 года назад
Thanks for your explanation! This is the best video I've found :)
@marcosoliveira8731 3 года назад
Topics like this still so relevant.
I hope you have more to share about DW Architecture.
@italomacelloneable 3 года назад ⁺⁵
Excellent content and very well explained. Thank you
@MuhammadMustafa-rd6vg Год назад
this is a good tutorial , watched dimensional modelling & SCD tutorial, you were pretty clear , explained in detail without making it boring !
@helovesdata8483 2 года назад
you've helped me to understand data warehouses more
@shivam6565 2 года назад
Thank you Bryan for this wonderful tutorial. Concept lucidly explained and I like your teaching style. Liked and subscribed.
@I677000 3 года назад ⁺¹
Awesome, :) hop to get this question asked on my interview :) now I know how to answer it:)
@cluebcke 3 года назад ⁺⁶
This is really awesome, thank you so much. Just to add, one use case for Type 5 (which I had no idea was called Type 5 before watching this) is that if you have a lot of data in a record that doesn't change much, and only a few fields that change a lot, partitioning them into different dimensions can substantially reduce table bloat (as you're not creating many copies of unchanged data) and improve query performance. There are tradeoffs of course and you really have to evaluate (and test empirically) for your use case.
@BryanCafferky 3 года назад ⁺²
Thanks. Great point too.
@zerovine8025 Год назад
Thanks for the easy to understand explanation, man. (btw nice mustache)
@a29miller 2 года назад
Well done. I really appreciate the detail
@AC-hh2cb 3 года назад ⁺¹
Great explanation, will help me at work :)
@marcosoliveira8731 3 года назад ⁺¹
THANK YOU! Real good explanation!
btw: Nice R mug :)
@sauravmisra6061 2 года назад
Thank you sir for explaining the code EDW concepts in both the videos - Dimensional modeling and SCD, so well and precisely. I also followed your Databricks and Spark and these are the best I have found so far on these topics. Thank you , sir and really appreciate your effort.
@BryanCafferky 2 года назад
Glad it is helpful. Thanks for your kind words.
@bestherk4522 3 года назад ⁺¹
Good video, Clear Explanations. Thank you.
@BryanCafferky 3 года назад
YW and thanks for watching!
@maisamabbas6872 4 месяца назад
This is the best explaination for slowly changing dimensions that I've come across on RUclips. You definitely deserve my like and subscribe. Thanks for such an amazing video. I hope you make a lot more videos on Data Engineering topics.
@BryanCafferky 4 месяца назад
Thanks
@LAPOVETS 2 года назад
thank you!
@dataofbitcoin6685 2 года назад
Amazing video, again! The best in biz.
@BryanCafferky 2 года назад
Thanks
@usman_oz 3 года назад
Thanks for the explanation. I just wanted to check if the "Dim customer" table at around 13:20 playtime shows the correct Start & End dates. Shouldn't the "Start Date" be +1 day of the previous record's End Date ?
@batmanbuddha 3 года назад ⁺¹
Awesome Explanation, even sharing to my friend too,
@BryanCafferky 3 года назад ⁺¹
Thanks. Yeah. The more that can be helped the better.
@fatallny Год назад ⁺¹
Awesome video, now everything is clear
@BryanCafferky Год назад ⁺¹
Glad to hear it. Thanks.
@elinadiary9357 Месяц назад
Thank you sir
@rickstratton1714 3 года назад ⁺¹
Good video, clear explanations and visual examples. Have to wonder what kind of cocktail parties you attend! :)
@BryanCafferky 3 года назад
Hmmm... Come to think of it, I was not invited back. :-) Thanks
@anirudhvyas6069 8 месяцев назад
nicely presented thank you!
@BryanCafferky 8 месяцев назад
You're welcome!
@aaronvr_ 3 года назад ⁺²
Great video, very well explained! Is there an efficient way to actually identify/locate/spot such changes in large data sets? Slowly changing dimensions help when dealing with pre-identified changes, but as to how one could locate them in the first place..I'm stuck!
@BryanCafferky 3 года назад ⁺²
Hi. SQL Server has a feature called change data capture. docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver15
@BryanCafferky 3 года назад ⁺²
Or you can use a merge statement that compares the data. Ideally, the source tables have a modified date/time that can be used to get only changes since the last ETL load. This is easier than using Change Data Capture, i.e. you just keep track of the last date/time you did a load and only pull data that comes after that.
@joshi1q2w3e 2 года назад ⁺²
Thank you so much for this! I’d love to see more videos on this.
Btw, how does this work for Snowflake Schemas? Are those schemas only used in OLTP Databases?
@BryanCafferky 2 года назад ⁺¹
Snowflake schemas are to be avoided if at all possible. I think the product Snowflake is causing confusion about this due to its name,
@SwapperTheFirst 2 года назад ⁺¹
@@BryanCafferky I guess these to be avoided only if (1) compute and storage are expensive (2) compute and storage are slow (3) you don't have MPP or vectorized calculations. In current environment, compute and storage cost much less than data engineer and data architect time, so data modeling approaches changed.
What was true for Teradata or Netezza in 1999 is not true for Snowflake or Big Query in 2022.
On the topic of SCD, you can find the video of Airflow original creator, where he asserts, that all these approaches with SCD are just thing of the past, and now we should create daily or weekly full snapshots of dimensional tables. I don't fully understand his approach (for example, Kimball Type 2 for me sounds very intuitive), but I don't fully grasp how to solve the SCD reporting problem, if you have (say) daily snapshots of all your dimensional tables.
@BryanCafferky 2 года назад ⁺¹
@@SwapperTheFirst SCD concepts I explained are based on Kimball. It's more about the functionality required rather than the technology used. Airflow has nothing to do with this subject so not show how the Airflow creator's comments apply. You should choose the best approach for your needs. Thanks
@javedabdool1187 11 месяцев назад ⁺¹
great video boss. I have 2 questions though maybe a lil stupid :P if eveer i missed something:
should we put the source key in the fact table?if not then why bring the source key then?
do fact table need a fact id?
@BryanCafferky 11 месяцев назад ⁺¹
Thanks. The source dimension key like ProductKey should be stored in the dim table and the dim surrogate key (created by the BI ETL) is the Pk to teh Dim table and FK to the Fact table. You don't need a fact table key per se but SQL Tables perform much better with a primary key(pk) so an Identity column suffices. See my video on dim modeling ruclips.net/video/lWPiSZf7-uQ/видео.html
@javedabdool1187 11 месяцев назад
@@BryanCafferky cools thnx!!
@nameunknown007 2 года назад ⁺²
Writes a book about Microsoft and sells it on Amazon like a legend 👍
@canyurt1 2 года назад ⁺¹
Hey Bryan - I think Type 4 resembles Data Vault initiatives. Would you not agree?
@BryanCafferky 2 года назад
I have not studied Data Vault modeling in depth but from the wiki on the subject, I would be careful equating concepts of the two methods as Data Vault modeling does not cleanse and optimize for querying per this link en.wikipedia.org/wiki/Data_vault_modeling
@joshi1q2w3e 2 года назад ⁺²
Also why would a “NULL” be better than “9999-12-31” etc. ?
@BryanCafferky 2 года назад ⁺²
a NULL means unknown whereas '9999-12-31' is a value. If you filter where date > 2022-01-01, the later would return TRUE but usually a NULL compare would not return TRUE. Forcing vlaues is an option but technically inaccurate with repercussions. Bear in mind the 9999-12-31 would appear in groupings on reports, etc. A text value of 'NULL' is just a glitch of the extract from a database and not a true NULL.
@gokukanishka 2 года назад ⁺¹
Hey Bryan , can you please make videos on fast changing dimensions and how to handle them
@houstonfirefox 10 месяцев назад
Apparently Gender went from a static dimension to a fast changing dimension 🤣
@thepassingpawn Год назад
man I spent a fortune to get an IT degree and here i am , learning it from a youtube tutorial which is far better and easier to understand than my database lectures. thanks a lot sir for your effort.
@BryanCafferky Год назад
YW. So glad it was helpful.

Следующие

Автовоспроизведение