If you enjoyed discussing this question with us, explore more on our website!
Yelp-Like System: www.interviewquery.com/questions/yelp-like-system?
Clickstream Data: www.interviewquery.com/questions/clickstream-data?
Click Data Schema: www.interviewquery.com/questions/click-data-schema?
The data model design should not consider actual implementation details, and it is too early to consider optimizations, which can be added and fine-tuned later on. The first thing to do is to understand the functional requirements and draft how many "entities" make sense to the business, then work out the ER diagram to satisfy the requirements and future needs.
Exactly my thoughts at the start, but overall a good approach from the guy.
I feel the approach used here is very disorganised. Jitesh jumped directly to the optimisations and to the modelling part. Also, Jay asked about the schema of the dimension and fact tables, which was never provided explicitly. A NoSQL database was chosen, but the question specifically asks about the join conditions and how we would maintain relationships between different tables.
In my view the approach should be:
1. Clarifying questions: use cases / volume of data / analytics or transactional / is it tied to a particular location? / what metrics are we looking at / final goal of the system? etc.
2. Functional and non-functional requirements: discussion of scalability, latency, availability, consistency, etc. The CAP theorem can be used. Trade-offs: high read throughput or high write throughput? etc.
3. Key metrics involved (if any): daily number of rides, average daily active users, average rides over a month or 90 days, etc.
4. High-level design: storage, ETL process (if any), final data warehouse, data lake, relational database vs. NoSQL vs. data warehouse, etc.
5. Low-level design: actual table and schema modelling. Dimension and fact tables with their attributes. Partition or index keys. Any aggregate tables to calculate metrics, etc.
6. Security and compliance issues (if any). Optimisations (if any).
7. Summarise the whole approach and communicate it effectively.
I am not an expert, but this is what I learned after giving multiple data system design interviews at companies (including FAANG). Thanks
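The metrics in step 3 can all come out of a single fact table of rides. A minimal sketch with Python's stdlib `sqlite3`; the table and column names (`fact_rides`, `rider_id`, `ride_date`) are my own illustrative assumptions, not anything from the video:

```python
import sqlite3

# Hypothetical fact table of rides; names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_rides (
        ride_id   INTEGER PRIMARY KEY,
        rider_id  INTEGER,
        ride_date TEXT      -- ISO date, e.g. '2024-05-01'
    )
""")
conn.executemany(
    "INSERT INTO fact_rides VALUES (?, ?, ?)",
    [(1, 10, "2024-05-01"), (2, 11, "2024-05-01"),
     (3, 10, "2024-05-02"), (4, 12, "2024-05-02")],
)

# Daily number of rides and daily active riders (metric 3 above).
rows = conn.execute("""
    SELECT ride_date,
           COUNT(*)                 AS rides,
           COUNT(DISTINCT rider_id) AS active_riders
    FROM fact_rides
    GROUP BY ride_date
    ORDER BY ride_date
""").fetchall()
print(rows)  # [('2024-05-01', 2, 2), ('2024-05-02', 2, 2)]
```

Longer windows (monthly or 90-day averages) are the same aggregation with a coarser `GROUP BY` bucket.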
Hi, do you mind sharing some of the resources you used for preparation? Your outline is what I'm focusing my preparation on too.
After listening to the first 3 minutes, I do not agree with the approach. Why did you not start with an entity-relationship model? You immediately jumped to the technical requirements; I think one should first complete the functional requirements with data modeling.
🤔 It’s probably always good to talk through the ERD and lock it in with the interviewer. However, in this specific example, the candidate (TC) quickly chose a NoSQL DB because they were optimizing for latency, and generally speaking, NoSQL means there isn’t an entity-relationship model in the classic sense. I do think they should have started with:
1. Use cases
2a. Filter the interview to 1-2 use cases (e.g., finding a driver/requesting a trip, providing feedback on a trip, payments, etc.)
2b. Talk through limitations (e.g., one language so no L10N/I18N, no accessibility, worldwide or national scale, no saving favorite drivers/trips locally, therefore no need to have a DB on the local surface, like a mobile phone)
3. Align on what we are optimizing for… that’s a bit different from use cases, but TC called these “use cases” - e.g., optimizing for speed/performance, for analytics, or for revenue generation
4. Align on compliance (GDPR, CCPA, etc.): simply mention it to the interviewer, e.g., “As TC I will assume I must comply with some of the key data governance regulations like GDPR, so the data will have to be stored (in some fashion) where the customers and/or drivers reside.”
5. Design and get feedback
6. Optimize later, then get feedback again
DynamoDB reads are eventually consistent by default, a terrible choice for an actual production app at scale. Transactional databases can scale very well; if Facebook can scale on MySQL, surely any other app can as well. Trips are too crucial a concept to sacrifice ACID for low-cost latency gains with NoSQL.
Also, in dimensional modeling it's not practical to avoid joins and put everything in one table. Scalability matters in analytics too: one needs to be able to create new fact tables on demand at different grains. Having conformed dimensions and being able to perform proper joins to low-grain facts is very important.
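The point about conformed dimensions can be made concrete: the same dimension table serves fact tables at different grains, so roll-ups stay consistent. A sketch with stdlib `sqlite3`; the schema (`dim_city`, `fact_trips`, `fact_daily`) is my own illustrative assumption, not the design from the video:

```python
import sqlite3

# One conformed city dimension shared by two fact tables at different grains.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    -- Low-grain fact: one row per trip.
    CREATE TABLE fact_trips (trip_id INTEGER, city_id INTEGER, fare REAL);
    -- Coarser (daily) grain fact, using the SAME dimension.
    CREATE TABLE fact_daily (day TEXT, city_id INTEGER, trips INTEGER);

    INSERT INTO dim_city VALUES (1, 'Springfield', 'OH'), (2, 'Springfield', 'MO');
    INSERT INTO fact_trips VALUES (100, 1, 12.5), (101, 1, 8.0), (102, 2, 20.0);
    INSERT INTO fact_daily VALUES ('2024-05-01', 1, 2), ('2024-05-01', 2, 1);
""")

# Both grains roll up by state through the same conformed dimension.
fine = conn.execute("""
    SELECT d.state, SUM(f.fare)
    FROM fact_trips f JOIN dim_city d USING (city_id)
    GROUP BY d.state ORDER BY d.state
""").fetchall()
coarse = conn.execute("""
    SELECT d.state, SUM(f.trips)
    FROM fact_daily f JOIN dim_city d USING (city_id)
    GROUP BY d.state ORDER BY d.state
""").fetchall()
print(fine)    # [('MO', 20.0), ('OH', 20.5)]
print(coarse)  # [('MO', 1), ('OH', 2)]
```

Note how the dimension also disambiguates the two Springfields via the surrogate `city_id`, which a single denormalized table keyed on city name cannot do.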
Do you know where we can get access to such content if we are preparing for database design interviews?
@@tahiliani22 From having sat on both sides of the interview for over a decade: it doesn’t matter nearly as much as you think. If I wanted to fail this guy I would. I am not interested in a perfect response and never have been; there is no right answer. Facebook didn’t settle on their current solution at the start. They started off with MySQL and built on it, customising it a lot and automating a load of things around it. In an interview you have to show that you can think, have a good approach, and are a team player who is willing to learn and has passion. While the approach this guy took was more than OK, it wouldn’t guarantee you a job. You can always pass or fail depending on the interviewer alone.
I hired two different people whom I had to train up for over 18 months each. Really enjoyed working with them. They would not have been able to come up with a particularly good database design for this when I hired them. By the time I left the company they were really good, confident, and independent.
Yes I’ve worked on some really big sharded deployments and also diddly little DB systems. They can all be fun.
Interesting how Jay initially seemed to be asking for the schema of the app's backend transactional database, yet because of what the interviewer clarified early on, the whole question pivoted towards analytical data warehouse design.
Exactly. He has mixed OLTP into OLAP 😂
If you’re looking for a tailored interview prep for DE, check out our newly launched Data Engineering learning path. Our learning path will help you practice core skills that are commonly evaluated in data engineering interviews. Try it here: www.interviewquery.com/learning-paths/data-engineering
I don't understand why this person chose a NoSQL solution based on performance if the objective is to create an OLTP database. The main reason to choose NoSQL (in my ignorant opinion) is to avoid the rigidity of structured databases, not performance, right? Additionally, how do you jump from a NoSQL model to a relational DB for analytics?
It feels like Jay needs to prepare more for these interviews
This is one of the most useful videos for data engineers. Great job!
Should have built a dimensional schema. It would have taken longer, but it would answer everything and give you an easy time when going to an end user.
Please upload business intelligence interview questions and answers....
Real talk from an experienced data engineer, that's very good content!
The data warehousing model was actually very impressive
This is one of the best channels for the analytics part. God bless you, Jay 🙌🏻, keep up the good work 🎉🎈
Who's Jitesh Soni? Totally love his clarity of thought.
Any reason why cloud DWHs such as Snowflake/BigQuery weren't considered?
It’s interesting this did not immediately cover DQ checks, thresholds, error handling, etc., which are among the key responsibilities of a data engineer.
Of course, in this example the candidate was building a brand-new system, so there were no data loads or ingestion. STILL, I cannot imagine a systems design interview not covering this topic. At most tech companies, there are two loops:
1. Design a system from a database perspective - this type of video; this would include “coding”, i.e., running SQL statements
and
2. What are the design considerations for designing an entire DB platform? E.g., how to know when to do a data warehouse vs. a lake (batch vs. streaming), ETL/ELT methods, DQ, putting together shards, data recall, etc.
Mixing both of these outcomes into one video/interview is hard to do, and watching such a video the user would likely get lost in the conversation.
This channel is fantastic. Thanks kindly for your content!
I agree
This is great content, good for my brain to think through some of this and think to myself why I agree and disagree with him
Thanks for this video. During the actual interview, what if the interviewer is only interested in designing in relational DB style, can you please also share how to proceed in that case?
Jitesh highlights this in the second half of the video!
That sharding is considered before indexing is mind-boggling to me. An unsharded, but well-indexed table would run circles around this solution in terms of speed and flexibility. And no, in a dimensional model, the model does not depend on any one ML algorithm
Yeah, I agree he should have mentioned indexing. That being said, he was talking about partitioning, not sharding. Sharding is when you split the data in a table across several nodes (shards). Partitioning of a table can be done within a single relational DB on a single server. Sharding becomes relevant when the data is too large to fit on a single server.
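The distinction can be sketched in a few lines of Python. The function names and the `trip_id`/month scheme here are my own hypothetical illustrations, not anything from the video: sharding routes a row to one of several nodes (here via a hash, which spreads hot keys far more evenly than, say, the initials of a city name), while partitioning just picks a segment of one table on one server:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Sharding: route a row to one of n_shards nodes by hashing its key.
    Hashing avoids the hotspots a value-based key (city initials) creates."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

def partition_for(ride_date: str) -> str:
    """Partitioning: pick a monthly range partition of one table on one
    server, from an ISO 'YYYY-MM-DD' date."""
    return "rides_" + ride_date[:7].replace("-", "_")

print(partition_for("2024-05-01"))  # rides_2024_05
print(shard_for("trip-42", 8))      # some stable value in 0..7
```

The routing is deterministic, so the same key always lands on the same shard; rebalancing when `n_shards` changes is the hard part that schemes like consistent hashing address.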
@@thndesmondsaid I simply don't see how a data engineering interview can be considered "good" if indexing wasn't even mentioned. Partitioning is great and usually comes after and as a complement to indexing, not a replacement for it.
@@Expateer Most data warehouses don't support indexes
@@zakaria5304 Wow .. now you've left me speechless
How are we talking about optimization at the very beginning?
That’s how we do it
where can I get more of these?
You can check out similar questions here:
Yelp-Like System: www.interviewquery.com/questions/yelp-like-system?
Clickstream Data: www.interviewquery.com/questions/clickstream-data?
Click Data Schema: www.interviewquery.com/questions/click-data-schema?
In this interview, latency is solved by NoSQL... that's a generic answer, isn't it? And the attributes listed are just the generally expected ones... regional segregation... that's it?
Awesome video
Rule #1: Never, ever dual-use an operational table for analytics. Also, why did he choose NoSQL for this use case?
Good point. What would be the best approach for analytics? Create a separate table on the backend?
@@alahad4723 Exactly. Insert new records into the analytics tables as needed, in batches or even micro-batches. And remember that if the information in an analytics table is being used to facilitate transaction processes, then it's no longer an analytics table.
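A common way to do those micro-batches is an incremental load driven by a high-water mark. A minimal sketch with stdlib `sqlite3`; the table names and the use of `trip_id` as the watermark column are my own assumptions for illustration (in practice a timestamp or change-data-capture feed is more typical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips_oltp      (trip_id INTEGER PRIMARY KEY, fare REAL);
    CREATE TABLE trips_analytics (trip_id INTEGER PRIMARY KEY, fare REAL);
    INSERT INTO trips_oltp VALUES (1, 10.0), (2, 12.5), (3, 7.0);
""")

def load_micro_batch(conn):
    """Copy only rows newer than the analytics table's high-water mark."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(trip_id), 0) FROM trips_analytics").fetchone()
    conn.execute(
        "INSERT INTO trips_analytics SELECT * FROM trips_oltp WHERE trip_id > ?",
        (watermark,))
    conn.commit()

load_micro_batch(conn)                              # loads trips 1-3
conn.execute("INSERT INTO trips_oltp VALUES (4, 9.0)")
load_micro_batch(conn)                              # picks up only trip 4
print(conn.execute("SELECT COUNT(*) FROM trips_analytics").fetchone())  # (4,)
```

The operational table is read, never written, by the load, and the analytics side never serves the transaction path, which keeps the two concerns cleanly separated as the comment suggests.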
I agree, and I did not use the operational table twice. The idea is to solve the latency use case using NoSQL, with the last denormalized table for analytics.
Next time we will use a drawing/charting tool so it's clearer.
-- Jitesh
@@jiteshsoni3030 Don't mean to pick on you, Jitesh, but this video was wrong on so many levels that I found myself cringing. You're sharding on the initials of a city without any consideration of data distribution or even of dealing with cities with the same name in multiple states (almost every state has a "Springfield"). What if 80% of your trips happen in LA, NYC, and SF? Isn't your sharding going to be worthless? How do you tell the difference between Springfield, OH and Springfield, MO? What if you want to aggregate by State or by country? With no joins, your single table is going to get awfully wide.
There were many such issues here, so I won't go through all of them - but I'll say that you're not wrong in answering like this. Despite the poorly designed solution, the interviewer is going to nod in agreement -- and you'll get the job. People like me will get washed out because the interviewer has nothing more than a rudimentary understanding of how indexing works and likely got into the organization to begin with by cramming videos on NoSQL and sharding as opposed to having actual experience in data engineering.
This is the first video like this I've seen on DE, and frankly, I'm appalled.
Where can I learn such skills?
Very useful and informative video. Thanks a lot!!
Great content but too many ads
This is pretty bad. They spent more time talking about partitioning than planning a schema that would contain the basic trip info: for whom, by whom, when it started, where it started, when it ended, where it ended...
You recommend NoSQL, then go on to suggest a table storing minute-by-minute driver position.
You guys should have given this more thought, done a practice run, pre-planned the schema, and then faked the interview being live/off the cuff. This was disjointed, with too many "what the hell are you thinking" moments.
Not a fan of this approach 6:43 minutes in.
I'm confused: he uses a NoSQL DB but queries the data in SQL like an RDBMS?
The transactional table is NoSQL, but the analytical one is SQL. There would be an ETL process to move data from the transactional table to the analytical table.
There are also SQL APIs available for NoSQL DBs, e.g. Azure Cosmos DB.
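The ETL step described above is essentially flattening: projecting a nested NoSQL document onto a flat relational row. A sketch in Python with stdlib `sqlite3`; the document shape and column names are hypothetical, invented for illustration:

```python
import sqlite3

# A trip as it might be stored as a NoSQL document (hypothetical shape).
doc = {
    "trip_id": "t-100",
    "rider":  {"id": 10, "name": "Ana"},
    "driver": {"id": 7},
    "fare":   {"amount": 12.5, "currency": "USD"},
}

def flatten(doc: dict) -> tuple:
    """ETL step: project the nested document onto a flat analytical row."""
    return (doc["trip_id"], doc["rider"]["id"], doc["driver"]["id"],
            doc["fare"]["amount"], doc["fare"]["currency"])

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trips_analytical
    (trip_id TEXT, rider_id INT, driver_id INT, fare REAL, currency TEXT)""")
conn.execute("INSERT INTO trips_analytical VALUES (?, ?, ?, ?, ?)",
             flatten(doc))
print(conn.execute("SELECT * FROM trips_analytical").fetchone())
# ('t-100', 10, 7, 12.5, 'USD')
```

Once the rows are relational, the join- and aggregation-heavy analytical queries the thread discusses become straightforward SQL.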
M:M: riders-rides-drivers.
Each rider takes zero to many rides from drivers; each driver can provide zero to many rides for riders.
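That many-to-many is resolved by making rides the associative table between riders and drivers. A minimal relational sketch with stdlib `sqlite3`; the schema and sample data are my own illustration of the cardinality described above:

```python
import sqlite3

# rides is the associative table resolving the riders<->drivers M:M.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE riders  (rider_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE drivers (driver_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rides (
        ride_id   INTEGER PRIMARY KEY,
        rider_id  INTEGER REFERENCES riders(rider_id),
        driver_id INTEGER REFERENCES drivers(driver_id)
    );
    INSERT INTO riders  VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO drivers VALUES (7, 'Dee');
    INSERT INTO rides   VALUES (100, 1, 7), (101, 2, 7);
""")

# Driver 7 has carried both riders: the M:M lives entirely in `rides`.
rows = conn.execute("""
    SELECT r.name FROM riders r
    JOIN rides USING (rider_id)
    WHERE rides.driver_id = 7
    ORDER BY r.name
""").fetchall()
print(rows)  # [('Ana',), ('Bo',)]
```

Each `rides` row carries exactly one rider and one driver, while both sides can appear in zero or many rows, which is precisely the "zero to many" cardinality in each direction.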