Great video. So many of your peers online talk about data modeling from the point of view of working at Uber, Airbnb, Netflix, and Facebook-worlds that are event-oriented. As a data leader, I find this frustrating. 99% of the jobs freshers will compete for are in mid-market or large enterprises across retail, healthcare, services, finance, and more. These industries are rich in transactions from complex ERP, F&A, EHR/EMR, supply chain, and HR/employee management systems. The schemas, business rules, quality issues, temporal complexities, and more are highly intricate-not to mention the desire to fully integrate them over the value chain of the firm in a data warehouse (Inmon?). I’d love to see you address the skills needed to tackle these systems. Thanks again.
I’ve been thinking about talking on the same subject (dimensional modeling), excited to take a look at this and seeing this knowledge propagated correctly. Nice work.
If you guys want to learn more about data engineering, then sign up for my newsletter here seattledataguy.substack.com/ or join the discord here discord.gg/2yRJq7Eg3k
I’ve actually used “bridging” or “linking” tables in databases I’ve designed. I just have logic included that makes sure to grab the latest item when bridging/linking like an IS_Current or IS_Latest column.
Nice video that touch on the basics - but maybe some reflection on today’s capabilities e.g. compression/ column store, that were not around when he wrote the book, that could change some of the design patterns. But overall his principles are still valid today.
An example of where a bridge table is necessary (and I havend found a good way around it): UserID and DeviceID stitching in clickstream data. You want to follow user behavior across devices, but you will likely only discover the link between two devices much later. I.e. both devices will generate data in the big fact table, but only much later (often days) will you realize that this was actually the same user. You don't want to run "update" queries against a clickstream fact table for cost reasons, but you need some way to say that this data is all from the same user.
can we have two dimensions instead of a bridge table - users and devices. Both of these are connected to the fact table in a 1:N relationship with user_id and device_id, as the FKs. So, from the users' slicer when you select a specific user - corresponding devices will appear in the table view and when you select a device from devices' slicer corresponding users will appear in the table view. Is my understand correct or is the requirement different?
@@poojasikdar3583 whay happens if a visitor browses the site without being logged in (no userID available) and then logs in on the same device 1 week later? You have stored events with userID 'null' in your fact table (you didn't know the correct value at insertion time). Any slicing and joining on userID will now filter out these events, but they are crutial to track aquisitions. You can fix this by doing a self-join of the fact table to work out the missing userID at query time ... but the performance is dysmal and you are essentially building a dynamic bridge table.
Good info. One related question that I cannot seem to get a straight answer to online - how do interviews work in terms of laying out your data model? I understand the interviewee talks through their thought process and that's the main point, but is it on pen and paper? A whiteboard? A lucid diagram? How do you actually display your tables and do they give you a prompt right then and there or ahead of time?
it depends - they would give you a problem statement and would ask you to design a data model around it- that can be on paper and you just have to walk them through your thought process or they would ask you to share your system and structure it on a notepad or that could be over a shared file.
what's the difference of a bridge table on a star scheme? Is bridging like 2 databases that have no relevance with one another like for example a PK(customer ID) on one table and just a reference transaction number PK on the other? The question is what makes bridging different from a star scheme/snowflake scheme?
Thank you for this really great content! Which is the book you are referring to in your video? I like the structure much and am considering buying it. Thanks in advance!
Just a recommendation from my POV, I think it would be better if you focus on less examples, but spend more time on the ones you do show. For example, the section on Bridge tables was unclear, you jumped around from one example to the next within seconds without really explaining anything other than bridge tables are tricky. Instead, spend that time on one example to drive the point home more clearly.
Thank you, I think this is fair! That's on me. I'll have to go back and maybe make a bridge table video. I also appreciate that you didn't say "This video sucks" but instead provided a clear point of improvement. Thank you!
Can you make a video about Databricks for Students who are interested to learn on how much it would cost for learning it using pay as you go and what all the techniques we can use to reduce the cost. There are many videos in youtube unless we do hands on nothing goes into head
can you make videos on these 2 topics 1. Day in the life of a Data Engineer (where you'll actually show your routine by also going in your company and stuff like that) 2. Your programming setup
Great video. So many of your peers online talk about data modeling from the point of view of working at Uber, Airbnb, Netflix, and Facebook-worlds that are event-oriented. As a data leader, I find this frustrating. 99% of the jobs freshers will compete for are in mid-market or large enterprises across retail, healthcare, services, finance, and more. These industries are rich in transactions from complex ERP, F&A, EHR/EMR, supply chain, and HR/employee management systems. The schemas, business rules, quality issues, temporal complexities, and more are highly intricate-not to mention the desire to fully integrate them over the value chain of the firm in a data warehouse (Inmon?). I’d love to see you address the skills needed to tackle these systems. Thanks again.
I’ve been thinking about talking on the same subject (dimensional modeling), excited to take a look at this and seeing this knowledge propagated correctly. Nice work.
Thank you, I am doing my best to stay true to the sources and add in some real life experience
If you guys want to learn more about data engineering, then sign up for my newsletter here seattledataguy.substack.com/ or join the discord here discord.gg/2yRJq7Eg3k
I’ve actually used “bridging” or “linking” tables in databases I’ve designed. I just have logic included that makes sure to grab the latest item when bridging/linking like an IS_Current or IS_Latest column.
Nice video that touch on the basics - but maybe some reflection on today’s capabilities e.g. compression/ column store, that were not around when he wrote the book, that could change some of the design patterns. But overall his principles are still valid today.
Snapshot tables are also used when you need to track the balance of a lot of accounts, like in an accounting system.
Thanks for the video!
I would love to learn about One big table model and its use cases and its pros, cons.
Thanks
size, redundancy, performance, compute time.. - few of the factors to be mindful of when dealing with big flat tables
An example of where a bridge table is necessary (and I havend found a good way around it): UserID and DeviceID stitching in clickstream data.
You want to follow user behavior across devices, but you will likely only discover the link between two devices much later. I.e. both devices will generate data in the big fact table, but only much later (often days) will you realize that this was actually the same user. You don't want to run "update" queries against a clickstream fact table for cost reasons, but you need some way to say that this data is all from the same user.
can we have two dimensions instead of a bridge table - users and devices. Both of these are connected to the fact table in a 1:N relationship with user_id and device_id, as the FKs. So, from the users' slicer when you select a specific user - corresponding devices will appear in the table view and when you select a device from devices' slicer corresponding users will appear in the table view. Is my understand correct or is the requirement different?
@@poojasikdar3583 whay happens if a visitor browses the site without being logged in (no userID available) and then logs in on the same device 1 week later?
You have stored events with userID 'null' in your fact table (you didn't know the correct value at insertion time). Any slicing and joining on userID will now filter out these events, but they are crutial to track aquisitions.
You can fix this by doing a self-join of the fact table to work out the missing userID at query time ... but the performance is dysmal and you are essentially building a dynamic bridge table.
Yay, dimensional modeling. Kimball and Adamson and Reeves and Ross.
the increase in quality and production is awesome
Bridge table looks like the Excel version of link workbooks... always ends in tears :)
Good info. One related question that I cannot seem to get a straight answer to online - how do interviews work in terms of laying out your data model? I understand the interviewee talks through their thought process and that's the main point, but is it on pen and paper? A whiteboard? A lucid diagram? How do you actually display your tables and do they give you a prompt right then and there or ahead of time?
it depends - they would give you a problem statement and would ask you to design a data model around it- that can be on paper and you just have to walk them through your thought process or they would ask you to share your system and structure it on a notepad or that could be over a shared file.
@@poojasikdar3583i have hard time in understanding data model vs database model, can you explain it?
what's the difference of a bridge table on a star scheme? Is bridging like 2 databases that have no relevance with one another like for example a PK(customer ID) on one table and just a reference transaction number PK on the other?
The question is what makes bridging different from a star scheme/snowflake scheme?
Need that Slowly Changing Dimension video please!
Thanks for the reminder!
Hey can you post please post those other data modeling books
Is there a service or scripting language that allows me to write relationships between tables/databases in a modern material design style?
Great video. What other books you may suggest to read on this topic?
Thank you for this really great content! Which is the book you are referring to in your video? I like the structure much and am considering buying it. Thanks in advance!
Just a recommendation from my POV, I think it would be better if you focus on less examples, but spend more time on the ones you do show. For example, the section on Bridge tables was unclear, you jumped around from one example to the next within seconds without really explaining anything other than bridge tables are tricky. Instead, spend that time on one example to drive the point home more clearly.
Thank you, I think this is fair! That's on me. I'll have to go back and maybe make a bridge table video. I also appreciate that you didn't say "This video sucks" but instead provided a clear point of improvement. Thank you!
Great content! Please what is the name of the book that was shown in the video?
Yay even I need the name there are multiple books by Kimball
Can you make a video about Databricks for Students who are interested to learn on how much it would cost for learning it using pay as you go and what all the techniques we can use to reduce the cost. There are many videos in youtube unless we do hands on nothing goes into head
Do you recommend any books on data modelling for relational databases?
(Besides the text mentioned in the video)
Thanks very much for this great video!
My pleasure!
great video thank you so much
can you make videos on these 2 topics
1. Day in the life of a Data Engineer (where you'll actually show your routine by also going in your company and stuff like that)
2. Your programming setup
Sql is the most important 👀
it is very important!
Learn a lot from your content! But putting yellow bold text in the beginning really strains the eyes...