A few questions: - Even in the Spark solution, how would you know which keys to aggregate? Either you have to emit the keys that changed, or scan the DB by time to get them, which is again costly. - Doesn't this fall under Lambda architecture, since we are using both real-time stream processing and batch processing to ensure data integrity?
Hi Evan, absolutely gold content. I have one doubt here. For a really hot ad, you are adding a random number (1-n) to build the partition key before adding it to the Kinesis stream. Now this particular ad can land on multiple Spark consumers. How will the different consumers aggregate the data for this ad? Is there anything I am missing? Are you referring to keyBy(adId) in Flink?
Thanks for the great videos - they are extremely helpful. I noticed at around 24 mins in you mention querying for the number of distinct userIDs. I don't think you're going to be able to serve queries like that using the pre-aggregation you suggest doing with Flink. I don't know a good solution to this problem other than dumping the list of userIDs for each minute window into your OLAP DB. You might be able to use HLL hashes for this, but depending on how granular the dimensions are in your DB, it may not be worth it.. I think it's at least worth mentioning this if we think users care about unique counts.
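For anyone curious what the HLL idea above looks like in practice, here is a toy sketch (illustrative only; a real system would lean on Redis's PFADD/PFCOUNT or the OLAP engine's approximate distinct-count function rather than hand-rolling this):

```python
import hashlib
import math

class ToyHyperLogLog:
    """Toy HyperLogLog for approximate distinct userID counts.
    Not production-grade; shown only to illustrate the idea."""

    def __init__(self, p=10):
        self.p = p                       # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)           # low p bits pick a register
        w = h >> self.p                  # remaining 54 bits
        rho = (64 - self.p) - w.bit_length() + 1  # leading-zero rank
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self):
        e = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:  # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e

hll = ToyHyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))             # within a few percent of 10000
```

With 1024 registers the standard error is about 3%, and the sketch is a few KB regardless of cardinality, which is why it pairs well with per-dimension rows in an OLAP DB.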
For Kinesis hot shards, we don't know if an ad is hot beforehand. So are these ad_id 0-N keys always active? Is it OK to use 10x the number of shards we need under normal circumstances? For Flink, do we have the same number of Flink servers as Kinesis shards? If a server dies, how will the new server pick up the pointer from the old server? Are they stateful backups instead of stateless?
This is a great question. In reality you can make predictions here: we know, based on budget and historical performance, which ads we'd need to worry about beforehand.
Question: For the sharding while processing the events through Kinesis, the adId was suggested as the sharding key. This doesn't look like the best approach. At scale, millions of ads are being run on the platform and a good share of them have high enough volume. Going by the presented logic, the number of shards would explode. What do you think about this?
I honestly feel you should hire @Jordan Has No Life as a system design expert on your channel. The depth of system design in his videos is quite good and honestly lives up to senior-engineer expectations. As for Staff SWE expectations, that honestly depends on the individual. I think it can only come from experience or from reading books such as Database Internals and/or DDIA. No amount of videos can make up for the Staff SWE expectations in system design.
Hi this was super helpful! My question is how would you handle the offline channels case? i.e. how would you aggregate data for one ad shown across multiple channels? I feel like the design wouldn't have to change that much because the adId could just remain the same and you can just add a "channel" metadata field for where it was shown.
This is another great video!! Please keep it coming. Can we use mirroring in Kafka and have spark read from the mirror and provide data to reconciliation service?
The best system design content. Thanks a lot for helping me prepare for my upcoming interviews. Can you please clarify the difference between product design and system design at Meta?
The system design videos on this channel are the best out there. Thanks for putting in so much time! I did have a question regarding the proposed reconciliation architecture: I get that data accuracy is important and it acts as a sort of "auditor" in our system. However, you mentioned that errors might stem from e.g. bad code pushes or out-of-order events in the stream. The proposed reconciliation architecture would really only fix issues that occur *within Flink* though, right? At the end of the day, the Spark job is still acting upon that same data from Kinesis, so in the case of out-of-order events or bad code pushes, it would also be affected, no?
Yah, if you messed something up in the server before Kinesis you'd still be screwed. But you'd want to keep that part as simple as possible. You can trust Kinesis to be reliable, and out-of-order won't matter for reconciliation.
Thank you for sharing! I have a question about the decision to not use checkpointing in Flink. If you don't enable it, then where would you store the Kafka consumer offsets for recovery?
Could we use a bloom filter instead of redis to a) avoid storing a huge number of ad impression ids and b) eliminate the (albeit minimal) redis read latency?
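Worth spelling out the trade-off being asked about: a Bloom filter answers "definitely new" vs "maybe already seen", so its false positives would drop some legitimate clicks as duplicates. A toy sketch (not from the video) of what that looks like:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for impression-id dedup. The catch for this use
    case: a false positive means a legitimate click is wrongly treated
    as a duplicate and dropped, trading a little accuracy for memory."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # k independent positions derived from salted SHA-256 hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("impression-123")
print(bf.might_contain("impression-123"))  # True (no false negatives)
print(bf.might_contain("impression-999"))  # almost certainly False
```

Whether that error budget is acceptable depends on how strictly the ad network bills per click; Redis with a TTL gives exact answers at a higher memory cost.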
When we do the idempotency check, could we save the last time we saw the ad in Redis for that adId, and on every later request compare against the current one to see if it has been a few days? That way the system knows it should be a different impression id: impression id 1 and impression id 2 would be 3 days apart. Is that valid?
Please can you clarify this? You mentioned the count query on Cassandra would be really slow. Would it really be? If the partition key is ad_id and the sort key is timestamp, I assume all the data for the same id will be on the same partition, sorted by timestamp. Why would it be slow?
Can we partition Cassandra on AdId and use timestamp as the sort key? This will make our query faster for smaller time intervals, but we will still need to aggregate data if the time range is too large.
That's a very informative video! Two questions: 1. To solve the idempotency issue, should this ad impression id be user-unique? Otherwise, we should check whether the combination of ad impression id and user id exists in Redis to know if a user has clicked on this specific ad before. 2. You talked about the Kappa and Lambda architectures and said that the solution uses a hybrid of the two. I am not too familiar with them, but after doing some research, I feel this approach uses the Lambda architecture, since Lambda has both a batch layer and a streaming layer and merges batch and streaming results to show a unified result to the user.
Yes, the dedupe key (ad impression id) needs to be user-unique. Good question on the architecture; there are a couple of related questions below in the comments where I share my answer. Sorry for making you scroll, just easier than re-typing :)
Why would it be a problem for Flink to be sharded across hot ad ids? Multiple rows per (ad id, minute) key would be emitted instead of just one, but an OLAP query could trivially SUM them
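The commenter's point can be shown with a toy OLAP-style query (sqlite standing in for the real column store; table and column names invented for illustration):

```python
import sqlite3

# If hot-key salting makes Flink emit several partial rows per
# (ad_id, minute), the OLAP query just SUMs them back together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_clicks (ad_id TEXT, minute TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO ad_clicks VALUES (?, ?, ?)",
    [
        ("ad-42", "2024-01-01T00:01", 120),  # partial row from shard ad-42:0
        ("ad-42", "2024-01-01T00:01", 95),   # partial row from shard ad-42:1
        ("ad-42", "2024-01-01T00:01", 85),   # partial row from shard ad-42:2
        ("ad-7",  "2024-01-01T00:01", 10),
    ],
)
rows = conn.execute(
    "SELECT ad_id, minute, SUM(clicks) FROM ad_clicks GROUP BY ad_id, minute"
).fetchall()
print(rows)  # ad-42 totals 300 despite arriving as three partial rows
```

The cost is a few extra rows per hot key and a slightly larger scan, which column stores handle easily; the main argument for merging in Flink first is keeping the OLAP table small and queries simple.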
I rewatched and had some new thoughts. I wonder what the costs of the streaming solution are. It seems like the database for clicks used in the batching solution is completely replaced by the streaming components, so the benefits of the previous database queries are lost? 34:52: in the streaming solution, real time is achieved by dumping to OLAP?
I didn't quite follow the point around not needing checkpoints in Flink. If a node goes down and then comes back up, are we just accepting that the data is lost and relying on the reconciliation worker to fix it? It doesn't seem obvious why checkpointing wouldn't make sense here.
@@hello_interview But how would we know where (which minute) to pick up from if we never checkpointed the state of the Flink node? As far as I understand, checkpointing will usually store something like the queue offset (or in this case maybe the last full minute we processed?) to know more-or-less where we've got up to with the previous node that failed. If we're not using checkpointing, I'm a little lost about how we'd recover
Same here. A checkpoint should be needed in this example. Say in the middle of a one-minute window the Flink job goes down, and the Kafka brokers save the offset where the Flink consumer group left off. Once the job recovers from the checkpoint, it continues to read from where it left off; more importantly, it can complete that one-minute window ACCURATELY. Otherwise, the first one-minute window after recovery could be inaccurate, or some clicks to aggregate and report could be lost.
The last Redis piece you put there probably needs to be cleaned up at some point. Also, what happens when it goes down? I would probably add another dedup point along the way, maybe in the reconciliation layer, or add another layer just for that.
That's a great system design for an ad click aggregator. I don't know much about Kinesis and Flink, tbh. Question: why use Kinesis when you can use SNS as a fan-out to SQS? It feels similar to me.
Can you please expand on the questions below, or link a small video/article if possible? 1. How will the "Click Processor SVC" know which AdID is popular/hot? 40:24 2. How will Flink handle further aggregation of AdID:0, AdID:1, ..., AdID:N back into AdID? 40:43
1. Could be based on past performance or budget. Realistically, companies have ML models to predict this. 2. A single job will read from different partitions and aggregate.
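A toy sketch of how that salting-plus-reaggregation might look (the fan-out factor and function names are made up for illustration):

```python
import random
from collections import defaultdict

N_SALTS = 8  # fan-out for keys known (or predicted) to be hot

def partition_key(ad_id, hot_ads):
    """Salt only the hot keys so their traffic spreads over N_SALTS shards."""
    if ad_id in hot_ads:
        return f"{ad_id}:{random.randrange(N_SALTS)}"
    return ad_id

# Stage 1: per-shard partial counts (what each parallel consumer would hold).
partials = defaultdict(int)
hot = {"ad-hot"}
for _ in range(1000):
    partials[partition_key("ad-hot", hot)] += 1
partials[partition_key("ad-cold", hot)] += 1

# Stage 2: strip the salt and merge, recovering one total per ad.
totals = defaultdict(int)
for key, count in partials.items():
    totals[key.split(":")[0]] += count

print(dict(totals))  # {'ad-hot': 1000, 'ad-cold': 1}
```

The second stage is exactly the "single job reads from the different partitions and aggregates" step from the reply; in Flink terms, stage 1 is keyed by the salted id and stage 2 re-keys by the plain adId.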
Hey Evan, I'm not preparing for an interview, but I find these videos incredibly helpful. I'm an L5 at Amazon trying to learn more about systems design. I see mock interviews as a great way to solidify my understanding of concepts that I'm reading about in books, because, after all, it's very difficult to get hands-on experience in actually building these big systems. I'd find it super valuable if I could self-study a design pattern, like event streaming, then do a mock interview on a related problem to test my understanding, like when to choose lambda vs. kappa vs. hybrid architecture. Does Hello Interview offer this?
Hey! Kudos for the focus on continuous learning, and glad to hear you're finding the videos useful. There is absolutely nothing stopping someone who does not have an upcoming interview from doing a mock! Of course, the sessions are tailored toward making sure you know all the tricks to pass the interview, but you could always give your coach a heads-up that you're more interested in just evaluating your design skills.
Thank you for the video! I am learning a lot from this! Btw, I have a question on the Lambda vs Kappa architecture. If the Lambda architecture is the combination of real-time and batch processing, then isn't your approach just the Lambda architecture?
Yah, a bit of nuance here, nuance that I don't think is all that important frankly, but: while we do have both real-time (Flink) and batch processing (Spark), the integration and reliance on real-time stream processing make it lean more towards a Kappa-like approach. The batch layer is secondary and primarily for reconciliation, not a core component. Hence, it's a hybrid but not a pure Lambda architecture.
@@hello_interview that makes sense. So in a true Lambda architecture, we rely more on the batch layer? I.e., it runs periodically to fix things up.
Can the same design be used for a Top K service that finds the top K videos per minute (aggregation window 1 min), per hour (aggregation window 1 hr with checkpointing), and per day (aggregation window 1 day with checkpointing), storing them in a Redis cache for the "Top K service" to query? And for longer time periods like 1 year or all time, a daily cron job could query the OLAP DB and update the Redis cache.
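The per-window top-K step described above could look something like this toy sketch (real pipelines would merge per-window counts rather than re-scan raw events):

```python
import heapq
from collections import Counter

def top_k(view_events, k):
    """Top-K videos for one aggregation window, O(n log k) via a heap."""
    counts = Counter(view_events)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# One minute's worth of view events for a handful of videos.
window = ["v1"] * 5 + ["v2"] * 3 + ["v3"] * 7 + ["v4"]
print(top_k(window, 2))  # [('v3', 7), ('v1', 5)]
```

One caveat worth raising in an interview: composing hourly top-Ks out of per-minute top-Ks is only approximate, since a video that is never top-K in any single minute can still be top-K for the hour. Keeping the full per-minute counts (or a count-min sketch) and re-ranking over the longer window avoids that.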
Calling the Ad DB from the Click Processor Svc might not be the best pattern (the DB is shared between microservices). This is an area that could have been improved by calling the Ad Placement Service, or some other service responsible for the ad metadata, and caching that URL in Redis.
I can't access the answer keys and vote for questions page on the website. Don't know if that's by design or a bug. Btw really love how you start from the "bad" but intuitive solution first and build on top of that
Great content! I wonder how long the ad impression IDs stay in the Redis database? I imagine they would expire after some time. If they expire after one day, I can imagine a malicious user who requests an ad, keeps the browser open for a day, then spams the ad with 1000s of clicks. Maybe the signed ad impression "token" should have an expiresAt so we can ignore those ad clicks and free up the Redis DB.
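A hedged sketch of the expiresAt idea above, using a hypothetical HMAC-signed token (the key, TTL, and field names are invented for illustration; a real system would rotate keys):

```python
import base64, hashlib, hmac, json, time

SECRET = b"server-side-secret"  # hypothetical signing key

def issue_token(impression_id, ttl_seconds=3600):
    payload = json.dumps({"id": impression_id,
                          "expiresAt": int(time.time()) + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # base64 both parts; the alphabet has no '.', so '.' is a safe separator
    return (base64.urlsafe_b64encode(payload).decode() + "." +
            base64.urlsafe_b64encode(sig).decode())

def verify_token(token):
    p64, s64 = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(p64)
    sig = base64.urlsafe_b64decode(s64)
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        return None  # forged token
    claims = json.loads(payload)
    if claims["expiresAt"] < time.time():
        return None  # expired: lets the Redis dedup entry use a matching TTL
    return claims["id"]

token = issue_token("imp-123")
print(verify_token(token))                                   # imp-123
print(verify_token(issue_token("imp-456", ttl_seconds=-1)))  # None (expired)
```

Because the expiry lives inside the signed payload, the click processor can reject stale tokens without any lookup, and the Redis dedup set only needs to retain ids for the token TTL.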
The out-of-scope non-functional requirements seem to be more like out-of-scope functional requirements. I feel that (spam detection, demographic profiling, conversion tracking) are essentially features rather than characteristics of the system. How should I be thinking about this?
If we don't use checkpointing and Flink goes down, then after restarting, how would Flink know which offset to resume from? Kafka itself does not manage offsets for Flink consumers directly. While Kafka maintains offsets in the __consumer_offsets topic for consumer groups, Flink does not rely on these offsets for its fault tolerance; instead, it uses the offsets stored in its checkpoints. Can you please clarify what you mean when you say not using checkpointing is OK in the case of Flink processor failures?
Flink would commit the offset to Kafka upon flush. So if it goes down, it knows where to start again based on the last completed write to the aggregated DB.
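A toy model of that commit-on-flush behavior (a pure-Python stand-in, not actual Flink/Kafka APIs): the offset advances only when a window flushes, so a crash replays at most the in-flight window.

```python
from collections import defaultdict

class WindowedConsumer:
    """Toy 'commit the offset only when a window is flushed' consumer.
    On restart we re-read from the last committed offset, so the
    in-flight window is recomputed from the log and nothing is lost."""

    def __init__(self, log):
        self.log = log            # stands in for a Kafka/Kinesis partition
        self.committed = 0        # offset of the first unflushed event
        self.flushed = []         # (window_id, counts) written downstream

    def run(self, upto, crash_at=None):
        window, window_id = defaultdict(int), None
        for offset in range(self.committed, upto):
            if crash_at is not None and offset == crash_at:
                return            # crash: committed offset stays put
            minute, ad_id = self.log[offset]
            if window_id is not None and minute != window_id:
                self.flushed.append((window_id, dict(window)))
                self.committed = offset     # commit only on flush
                window = defaultdict(int)
            window_id = minute
            window[ad_id] += 1

log = [(0, "ad-1"), (0, "ad-1"), (0, "ad-2"), (1, "ad-1"), (1, "ad-2")]
c = WindowedConsumer(log)
c.run(upto=len(log), crash_at=4)   # crash mid-way through minute 1
c.run(upto=len(log))               # restart replays from the last commit
print(c.flushed)                   # [(0, {'ad-1': 2, 'ad-2': 1})]
```

The minute-1 window is still in flight at the end (it would flush on the next window boundary or a timer); the point is that the flushed minute-0 counts are exact despite the crash, which is the recovery story the reply describes.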
Instead of Spark, can you use AWS Lambda (serverless) to do that job? Or directly send a task from the click processor service through a Kafka queue so the item can be batch-added to an aggregated, read-optimized DB?
Do you think knowing OLAP is important for a senior/staff role? Having no experience with analytics, I'd just go for RDS. I guess it'd probably be fine?
This kind of content can make someone fall in love with software engineering.
Finding your channel feels like finding gold!
There are ton of SD videos on youtube with shallow content basically exactly like you mention what a junior or mid level candidate would do.
Going indepth for senior and staff is one of the highlight of your content. Please continue doing that.
Also please don't worry about length of the video. Keep the gold coming :)
@hello_interview waiting eagerly for next video
Watch this and then imagine if Evan puts together a System Design Learning Course. Just imagine that!!!! I mean, we (the learners) would jump on it so quickly. This is just absolutely amazing: years of experience and a hands-on, practical approach that works, combined with book content, presented in a very professional manner. Evan, think about it 🙂
Maybe one day! For now just happy with all the people learning about hello interview and getting tons of free value
This is by far the best system design interview content I've ever seen on the internet. Keep up the great work, sir...
You somehow managed to make preparing for system design interviews really fun. Massively underrated channel
You’re the best!
Awesome walkthrough! As a junior engineer I learned a lot. Near the end you said you wanted to keep it short but I appreciate the nuances you point out for your decisions including design choices between mid and higher level candidates. Time didn't faze me at all, I watched every second!
Hell yah!
You are the best at what you do. After listening to your videos I can no longer tolerate other system design videos. Please make more system design content!
Coming soon! 🫡
Really helpful videos - especially the breakdown for different expectations at mid/senior/staff levels, common things you see from candidates, and context into the actual systems like the shard limits for events streams. I used to work on Kinesis - happy you chose it!
How cool! That must’ve been fun to work on :)
I rarely leave comments but this is BY FAR the best system design video I have seen on youtube. The thing I like most is that it differentiates from common textbook solutions that we see everywhere and you explain why certain choices were made. Thanks!
Thanks for commenting!! Glad you enjoyed it
I discovered your channel a few days ago and love all the videos so far. I love the solution to enforce idempotent click tracking. There are a lot of SD videos, but only yours provide guidelines on the expectations based on the candidate's level. So far I watch one video per day, so I will run out of videos to watch very soon xD
Ah, nice, you re-uploaded! Thanks a lot for taking the feedback and acting quickly on this. And sorry if it caused inconvenience for you 😄 Thanks a lot for all of your hard work. 🙏
Thanks so much for calling that out! Glad to get it fixed within the first day :)
Is e-commerce (design amazon / ebay) not as common as it once was?
One of the best channels on system design! Please keep going!
This video helped me ace the system design interview. The detailed explanations provided in-depth knowledge of various components, which was extremely helpful for answering follow-up questions during my interview.
I felt this was much better than Alex Xu's System Design Vol 2 on the same topic. Great job!
High praise!
This is the best system design video I have seen on youtube till now. Really loved the in depth discussion. Would love to see more videos. 👍
Thanks a lot for uploading these videos. They are very informative. Keep doing the good work.
This was such a pleasure to watch. Thank You. I would love to see a video on a metrics monitoring system. There will be some common components with ad-click aggregators.
You honestly make the best content on system design. Can you do a playlist on the system design topics themselves?
I mean videos where you discuss replication in depth, concurrency, etc.
Will definitely consider this!
Great job with creating this content to help us prep for interviews!
Just one thing to note: when a browser follows a 302 redirect of a POST, it generally re-issues the request as a GET. If you want the method handled explicitly, 303 (always switch to GET) or 307 (preserve the POST) are the precise status codes.
These are so good, such deep dives, and so clear! Thanks, because they cover so many different aspects. I am looking forward to many more videos. I am currently preparing for interviews and these are so helpful.
Thank you for a great video.
For a senior candidate it would be helpful, in my opinion, to narrate the data structures that underpin these solutions in addition to the supporting vendor products/technologies. For instance, for fast aggregation of counters, one could demonstrate fast but slightly imprecise counters using a count-min-sketch-like data structure, and for slower but more accurate counting, MapReduce jobs. Aggregates and statistics over varying windows are almost a necessity for contemporary monitoring, analytics, and ML systems. In such cases, retaining data in memory, backed by persistent storage in the form of tree data structures keyed by aggregation windows, is useful for range queries at varying granularities. For example: root window [0-100), with immediate child node windows [0-50), [50-100), etc.
It could also be helpful to talk about idempotency within queue consumers, and about out-of-sequence arrival of events in the queue (handled through watermarking).
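The count-min-sketch idea from the comment above, as a toy implementation (fixed memory; estimates can overshoot but never undershoot a key's true count):

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch for fast, slightly imprecise counters."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        # one salted hash per row
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key):
        # the minimum across rows bounds the collision error
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(key)))

cms = CountMinSketch()
for _ in range(500):
    cms.add("ad-hot")
for i in range(1000):
    cms.add(f"ad-{i % 100}")
print(cms.estimate("ad-hot"))  # >= 500, usually very close to it
```

The sketch uses depth x width counters regardless of how many ad ids exist, which is exactly the fast-but-imprecise trade-off the comment contrasts with an accurate MapReduce pass.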
Can have some future videos which go deeper on probabilistic data structures or other more foundational topics.
Love these! And can't recommend the Hello Interview mock interviews enough!
Wahoo thanks Ben!
Excellent, as usual! Thanks so much 🙏 While concluding, you mentioned you want to keep the videos short. Please don't reduce the length. 1 hour is a sweet spot and it's necessary to capture all the important "juicy" tidbits and details you highlight. Please keep it coming 🙌
If you HAVE to reduce something, please reduce the time between videos to 1 week 😛
No, seriously, thanks so much.
Haha trying 😝
Thanks so much for doing this! Greatly appreciated! By far the best system design videos I've seen.
Beautiful Design and Amazing explanation - just impressed with the elegance of the design and the beauty of software engineering.
😍
Very useful video, the best among others. I got this same question twice in my interview loop. Very happy with how I answered.
Nice. Good job!
Absolutely brilliant content, mate. Keep 'em coming. The only channel I have notifications on for.
Hey, I love these videos. I only used your videos and Designing Data-Intensive Applications, and that was enough for an E4 offer at Meta. I love the advice you give and the common pitfalls you point out.
Crushed it. Congrats on your offer!
Simply the best resource !!!!
Thanks for the detailed explanation! Definitely learned some new things in this video.
I love this channel. Very good job, sir. Your strategy is really good and comprehensive, straight to the main points. Bravo!
You are a legend, man. Please make more of the videos mentioned on your website: search, e-commerce, hotel booking system, etc.
These interview preps make it feel like, if you know enough system design, have good cross-team examples for behavioral questions, and can solve LeetCode medium-hards fast, you can reach a higher level more quickly than by going through internal promotions.
I've watched some of your videos and the content is so great. Thank you so much for clarifying the SQL vs NoSQL debate. I always thought that bringing it into an interview was irrelevant, but I was afraid to leave it out. 😅
Keep up the amazing work.
Yah funny how that was evangelized in a couple books and then just stuck
Great video! I learnt a lot! One question I have is regarding the hot shard handling. When the click processor detects there is a hot shard and decides to change the key from AdId to AdId:n, how would it let Flink know that it now needs to change the aggregation logic (and sharding logic) for that particular AdId? (I believe this would also have a race condition when it changes within a minute window, but any data integrity issues that arise from it should get resolved by the batch job.)
Looking forward to your next videos. Please upload more design problems. It's been almost 1 month since you uploaded. Love your content.
Sorry, was traveling. Recording a video today! Up by EOW
Amazing content! Very much appreciate you posting these 🙌
System design padawan here. I have a question about the hybrid approach: what makes us trust the "Kinesis -> Connector -> S3 -> Spark -> Worker -> OLAP" results more than the "Kinesis -> Flink -> OLAP" results? Is there a guarantee that the connector always successfully writes the data to S3? Or does Flink make some kind of trade-off for speed? Kind of confused about that piece and figured I'd ask. Thanks again!
I am also curious about this
+1 on this question.
IMO, Spark is useful when you have really out-of-order events, like events arriving half an hour late or something. Then, by using Spark, you can reorder the events and get a more accurate count. On the other hand, for events that are only a few minutes late, you can configure Flink with the allowed lateness of a few minutes.
That being said, the cost of maintaining 2 code bases and ensuring that they are updated at the same time (to avoid triggering discrepancy alerts) doesn't seem worth it for such edge cases.
I'd be interested to hear @hello_interview's thoughts on this though.
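The allowed-lateness trade-off in this thread can be simulated with a toy event-time aggregator (real Flink drives this with watermarks; this just shows the shape of the trade-off):

```python
from collections import defaultdict

def aggregate(events, window_s=60, allowed_lateness_s=120):
    """Toy event-time aggregation. An event older than the allowed
    lateness (relative to the max event time seen) is dropped by the
    streaming path; only a batch replay could recover it."""
    windows = defaultdict(int)   # window start second -> click count
    dropped = []
    max_event_time = 0
    for event_time, ad_id in events:
        max_event_time = max(max_event_time, event_time)
        if event_time < max_event_time - allowed_lateness_s:
            dropped.append((event_time, ad_id))
            continue
        windows[event_time // window_s * window_s] += 1
    return dict(windows), dropped

events = [(0, "ad-1"), (65, "ad-1"), (50, "ad-1"),   # 50 is 15s late: kept
          (200, "ad-1"), (30, "ad-1")]               # 30 is 170s late: dropped
windows, dropped = aggregate(events)
print(windows)   # {0: 2, 60: 1, 180: 1}
print(dropped)   # [(30, 'ad-1')]: only a batch (Spark) pass would count it
```

Widening the allowed lateness recovers more of these events in streaming, at the cost of holding window state longer, which is exactly the knob the comment is weighing against running a second batch pipeline.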
This is great content. However, I would like to point out that the Lambda architecture includes both batch and speed layers, which process historical and real-time data in parallel. It’s not solely reliant on batch processing.
The best The best.. Loved it. Thanks for doing this. ❤
My pleasure 😊
Incredible video with excellent drawing and explanation.
Thanks for the content; it's great. A few queries:
1. Since we compute 10-second aggregates and write them to the OLAP DB, a deep dive on how the user queries a 1-min window, a 1-hr window, and maybe 1-day and all-time windows would be great info.
2. What would the OLAP database choice be, and what factors should we consider?
3. Another aspect is freshness of the data: how much does each component in the system contribute to the latency before an event becomes available in the aggregates?
Other aspects are great; thanks.
Why is this not lambda architecture? It has both real-time and batch processing, so what is the difference?
Thank you, btw
I like your videos, I have learned a lot.
A couple of comments on this video:
a. I think the system would benefit from a Redis cache in the click processor service, not the idempotency lock but a redirectUrl cache {adId: redirectUrl} to reduce reads on the Ad DB. It might be an MRU cache to avoid overloading Redis.
b. I'm not sure why you are pushing Kinesis so hard in this solution. I mean, yes, I learned something about Kinesis, but it would be more practical to just use Kafka, which can handle the load peaks and has event history as well, so it is possible to write the reconciliation procedures from it.
c. I learned about Flink, thanks. I used a Redis aggregator in my own solution.
Thank you so much for your work!
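Suggestion (a) above can be sketched in a few lines. This is a toy in-process stand-in for the proposed Redis cache, with a hypothetical `loader` callback representing the Ad DB read on a miss; the eviction here is LRU, though a capped Redis with an eviction policy would play the same role.

```python
from collections import OrderedDict

class RedirectUrlCache:
    """Bounded {adId: redirectUrl} cache for the click processor,
    standing in for the Redis cache suggested in comment (a)."""
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader        # hypothetical fallback to the Ad DB
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, ad_id):
        if ad_id in self.data:
            self.hits += 1
            self.data.move_to_end(ad_id)   # mark as recently used
            return self.data[ad_id]
        self.misses += 1
        url = self.loader(ad_id)           # miss: read from the Ad DB
        self.data[ad_id] = url
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return url

cache = RedirectUrlCache(2, loader=lambda ad_id: f"https://ads.example/{ad_id}")
cache.get("ad1"); cache.get("ad1"); cache.get("ad2")
```

Since ad metadata is small and read-heavy, even a modest hit rate here takes significant load off the Ad DB on the redirect path.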
Amazing system design. I've been searching for something exactly like that, which is interview driven, not "let me show you all the ways you can do it" driven.
May I just make a small suggestion: please use "etcetera", not "eccetera"...
Thank you for the video! Isn't this exactly what Lambda architecture is and not a "hybrid between lambda and kappa"?
The final design has both real-time data processing and batch processing. Why is it not lambda architecture?
Love the content! Thank you for making these!
Honestly, it is the best SD showcase I’ve ever seen. You are the best. I watched all your videos and whiteboard them myself then. Thank you!
So glad you like them and very smart to try them yourself and not just blindly consume!
These are Excellent! Please keep going.
This is awesome. Thanks Evan. I have a question on the usage of blob storage. Isn't that supposed to be used for storing unstructured data like images, audio, and video files? Could you please elaborate on 1) how the Spark task reads data from S3, and 2) how the Spark job would sustain reading a full day's data? Is checkpointing used in this context?
A few questions:
- Even in the Spark solution, how would you know which keys to aggregate? Either you have to emit the keys that changed, or scan the DB by time to get the keys, which is again costly.
- Doesn't this fall under lambda architecture, since we are using both real-time stream processing and batch processing to ensure data integrity?
Great content, keep going!
Hi Evan, Absolutely gold content. I have one doubt here.
For a really hot ad, you are adding a random number (1-n) to build the partition key before adding it to the Kinesis stream. Now this particular ad can land on multiple Spark consumers. How will the different Spark consumers aggregate this data for this ad? Is there anything I am missing?
Are you referring to keyBy(adId) in Flink?
21:48 I like how a DB can consistently be used for the simplest case in these approaches.
Excellent walkthrough!
Thanks for the great videos - they are extremely helpful.
I noticed at around 24 mins in you mention querying for the number of distinct userIDs. I don't think you're going to be able to serve queries like that using the pre-aggregation you suggest doing with Flink. I don't know a good solution to this problem other than dumping the list of userIDs for each minute window into your OLAP DB. You might be able to use HLL hashes for this, but depending on how granular the dimensions are in your DB, it may not be worth it.
I think it's at least worth mentioning this if we think users care about unique counts.
For Kinesis hot shards, we don't know if an ad is hot beforehand. So are these ad_id 0-N keys always active? Is it OK to use 10x the number of shards we need under normal circumstances?
For Flink, do we have the same number of Flink servers as Kinesis shards? If a server dies, how will the new server keep track of the pointer from the old server? Are they stateful backups instead of stateless?
This is a great question. In reality you can make predictions here. We know based on budget and historical performance which ads we'd need to be worried about beforehand.
Best content, amazing job
Question: For the sharding while processing the events through Kinesis, the adId was suggested as the sharding key. This doesn't look like the best approach. At scale, millions of ads are being run on the platform and a good share of them have high enough volume. Going by the presented logic, the number of shards would explode. What do you think about this?
I honestly feel you should hire @Jordan Has No Life as a system design expert on your channel. The depth of system design in his videos is quite good and honestly it meets the bar for a senior engineer. As for Staff SWE expectations, that honestly depends on the individual. I think it can only come from experience or from reading books such as Database Internals and/or DDIA. No amount of videos can make up for the Staff SWE expectations in system design.
We love Jordan ♥️
@@hello_interview Me Too!! That guy's an OG in System Design.
I think bloom filter would be a good choice to check on duplicate impression id. I think, it is also supported by redis.
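The bloom filter idea above can be sketched quickly. Redis does support this via the RedisBloom module (`BF.ADD` / `BF.EXISTS`); the in-memory toy below just shows the mechanics, with illustrative sizes, and the usual caveat that bloom filters can false-positive but never false-negative.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter for impression-id dedup, as suggested in
    the comment above. Sizes are illustrative, not tuned."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0   # big int used as a bit array

    def _positions(self, item):
        # derive k bit positions from independent-ish hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("impression-123")
seen = bf.might_contain("impression-123")   # guaranteed True
```

The false-positive side matters here: a colliding new impression would be wrongly treated as a duplicate and its click dropped, so the filter must be sized generously for the retention window.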
Hi this was super helpful! My question is how would you handle the offline channels case? i.e. how would you aggregate data for one ad shown across multiple channels? I feel like the design wouldn't have to change that much because the adId could just remain the same and you can just add a "channel" metadata field for where it was shown.
This is another great video!! Please keep it coming. Can we use mirroring in Kafka and have spark read from the mirror and provide data to reconciliation service?
you know i have less familiarity there. potentially, but not totally sure.
The best system design content.
Thanks a lot for helping me prepare for my upcoming interviews.
Can you please clarify the difference between product design and system design at Meta?
www.hellointerview.com/blog/meta-system-vs-product-design :)
The system design videos on this channel are the best out there. Thanks for putting in so much time!
I did have a question regarding the proposed reconciliation architecture: I get that data accuracy is important and that it acts as a sort of "auditor" in our system. However, you mentioned that errors might stem from e.g. bad code pushes or out-of-order events in the stream.
The proposed reconciliation architecture would really only fix issues that occur *within Flink* though, right? At the end of the day, the Spark job is still acting upon that same data from Kinesis, so in the case of out-of-order events or bad code pushes, it would also be affected, no?
Yah if you messed something up in the server before kinesis you’d be screwed still. But you’d want to keep that as simple as possible. You can trust kinesis will be reliable, out of order won’t matter for reconciliation.
@@hello_interview got it. Thanks for the quick response. :)
If we use Cassandra with adId as primary key and timeStamp as the sort key, will the read be fast enough?
Not to aggregate over large periods
Looking forward to more great videos from you! :)
Can I use this aggregator flow for designing the top-K YouTube videos system? What are the major differences besides ranking?
Will Spark Streaming also not cut it? I thought Spark Streaming was the way.
When will you post the next interview video? I've been waiting for about a month!!! Really appreciate the effort.
Tomorrow!!
Amazing video.
How is it fault tolerant if we are not checkpointing? Via the reconciliation worker?
Thank you for sharing!
I got a question about the decision to not use checkpointing in Flink. If you don’t enable that then where would you store the Kafka consumer offsets for recovering?
Could we use a bloom filter instead of redis to a) avoid storing a huge number of ad impression ids and b) eliminate the (albeit minimal) redis read latency?
You’d probably still have the bloom filter in redis tbh. So latency not a concern. But if you had some memory constraint then this is reasonable
Thank you for this!
For the idempotency, could we save in Redis the last time we saw the ad for that adId, and on every later request compare against the current ad request to see if it has been a few days? That way we know it should be a different impression id: impression id 1 and impression id 2 would be 3 days apart. Is that valid?
Please can you clarify this? You mentioned the count query on cassandra will be really slow. Would it really be slow? If the partition key is ad_id and the sort key is timestamp. I assume all the data for the same id will be on the same partition sorted by timestamp. Why would it be slow?
Can we partition cassandra on AdId and use timestamp as sort key? This will make our query faster for smaller time intervals, but we will still need to aggregate data if the time range is too large.
you'd want to do this for the map-reduce approach, yes
Very helpful!
good content. thanks.
That's a very informative video! Two questions: 1. To solve the idempotency issue, should this ad impression id be user unique? Otherwise, we should check whether the combination of ad impression id and user id exists in Redis to know if a user has clicked on this specific ad before. 2. You talked about the Kappa and Lambda architectures and said that the solution uses a hybrid of the two. I am not very familiar with those architectures, but after doing some research, I feel this approach uses the Lambda architecture, since Lambda has both a batch layer and a streaming layer and merges batch results with streaming results to show a unified result to the user.
Yes, the dedup key (ad impression id) needs to be user unique. Good question on the architecture; there are a couple of related questions below in the comments where I share my answer. Sorry for making you scroll, just easier than re-typing :)
I believe lambda uses a probabilistic data structure.
Why would it be a problem for Flink to be sharded across hot ad ids? Multiple rows per (ad id, minute) key would be emitted instead of just one, but an OLAP query could trivially SUM them
True!
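The point above, that salted rows are harmless because the OLAP query can just SUM them, is easy to demonstrate. The rows and salt format below are illustrative, following the AdId:n scheme from the video.

```python
from collections import defaultdict

# Partial counts as Flink might emit them when a hot ad's key is
# salted across shards: several rows for the same (ad_id, minute).
rows = [
    ("ad42:0", "12:01", 500),
    ("ad42:1", "12:01", 450),
    ("ad42:2", "12:01", 520),
    ("ad7",    "12:01", 30),
]

# Equivalent of: SELECT ad_id, minute, SUM(clicks) GROUP BY ad_id, minute
totals = defaultdict(int)
for salted_id, minute, clicks in rows:
    ad_id = salted_id.split(":")[0]   # strip the :n salt suffix
    totals[(ad_id, minute)] += clicks
```

So the cost of salting is a few extra rows per hot ad per window, not any correctness problem at query time.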
I rewatched and had some new thoughts. I wonder what the costs of the streaming solution are. It seems like the database for clicks used in the batching solution is completely replaced by the streaming components, so the benefits of the previous database queries are lost?
34:52 So the streaming solution gets real-time results by dumping to OLAP?
The final solution _is_ literally lambda architecture.
I didn't quite follow the point around not needing checkpoints in Flink. If a node goes down and then comes back up, are we just accepting that the data is lost and relying on the reconciliation worker to fix it? It doesn't seem obvious why checkpointing wouldn't make sense here.
We have retention on our stream, so we’ll just pick back up reading the data from the start of the minute again (or as far back as we missed)
@@hello_interview But how would we know where (which minute) to pick up from if we never checkpointed the state of the Flink node? As far as I understand, checkpointing will usually store something like the queue offset (or in this case maybe the last full minute we processed?) to know more-or-less where we've got up to with the previous node that failed.
If we're not using checkpointing, I'm a little lost about how we'd recover
My interpretation is that the stream has a cursor that tells the system where it's out of date and where to start recovering.
Same here. A checkpoint should be needed in this example. Say in the middle of a one-minute window the Flink job goes down, and the Kafka brokers save the offset where the Flink consumer group left off. Once the job is recovered from the checkpoint, it continues to read from where it left off; what's more important is that it can complete that one-minute window ACCURATELY. Otherwise, the first one-minute window after recovery could be inaccurate, or some clicks to aggregate and report could be lost.
The last Redis piece you put there probably needs to be cleaned up at some point. Also, what happens when it goes down? I would probably add another dedup point along the way, maybe in the reconciliation layer, or add another layer just for that.
That's a great system design on Ad Click aggregator. I don't know much about Kinesis and Flink tbh
Question: Why use kinesis when you can use SNS as a fan out to SQS? It feels similar to me.
SQS could maybe replace kinesis (some throughput considerations there), but not sure I understand the question.
Can you please expand on the questions below? Or link a short video/article if possible.
1. How will the "Click Processor SVC" know which AdID is popular/hot? 40:24
2. How will "Flink" handle further aggregation of AdID:0, AdID:1, ..., AdID:N back into AdID? 40:43
1. Could be based on past performance or budget. Realistically, companies have ML models to predict this.
2. A single job will read from different partitions and aggregate.
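Both answers above can be sketched together: the producer side salts the partition key only for ads predicted to be hot, and a downstream job aggregates them back by stripping the suffix. The hot-ad set and salt count here are illustrative assumptions.

```python
import random

HOT_ADS = {"ad42"}   # assumed known from budget / historical performance
NUM_SALTS = 8        # spread each hot ad over this many partition keys

def partition_key(ad_id):
    """Stream partition key: plain adId normally, adId:n for ads the
    click processor predicts will be hot (the 40:24 question)."""
    if ad_id in HOT_ADS:
        return f"{ad_id}:{random.randrange(NUM_SALTS)}"
    return ad_id

def base_ad_id(key):
    """What the aggregation job does per record: strip the :n suffix
    so all salted partitions roll up to one AdID (the 40:43 question)."""
    return key.split(":")[0]

hot_keys = {partition_key("ad42") for _ in range(1000)}
cold_key = partition_key("ad7")
```

The salt only changes which shard a click lands on; since every record still carries the real AdID inside the key, re-aggregation is a cheap string split plus a sum.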
why not use kafka for storing/doing streaming aggregations into the OLAP?
Ah, spoke too soon lol :)
Hey Evan, I'm not preparing for an interview, but I find these videos incredibly helpful. I'm an L5 at Amazon trying to learn more about systems design. I see mock interviews as a great way to solidify my understanding of concepts that I'm reading about in books, because, after all, it's very difficult to get hands-on experience in actually building these big systems. I'd find it super valuable if I could self-study a design pattern, like event streaming, then do an mock interview on a related problem to test my understanding, like when to choose lambda vs. kappa vs. hybrid architecture. Does Hello Interview offer this?
Hey! Kudos for the focus on continuous learning and glad to hear you're finding the videos useful. There is absolutely nothing stopping someone who does not have an upcoming interview from doing a mock! Of course, the sessions are tailored toward making sure you know all the tricks to pass the interview, but you could always give your coach a heads up that you're more interested in just evaluating your design skills.
Thank you for the video! I am learning a lot from this!
Btw I have a question on the Lambda vs Kappa architecture. If the lambda architecture is the combinations of the realtime and the batch process, then isn't your approach using just the lambda architecture?
Yah, bit of nuance here, nuance that I don't think is all that important frankly, but while we do have both real-time (Flink) and batch processing (Spark), the integration and reliance on real-time stream processing make it lean more towards a Kappa-like approach. The batch layer is secondary and primarily for reconciliation, not a core component. Hence, it's a hybrid but not a pure Lambda architecture.
@@hello_interview that makes sense. So in true Lambda architecture, we rely more on the batch layer? i.e. it's going to run periodically to fix things up faster.
Can the same design be used to design a Top K Service which finds top K videos per minute(Aggregation Window 1 min), Per Hour(Aggregation Window 1 hr with checkpointing), Per Day (Aggregation Window 1 day with checkpointing) and store them in a Redis Cache for the "Top K service" to query. And for longer time periods like 1 year or forever, a daily cron job can query the OLAP DB to get those and update that in the Redis Cache.
Actually checkout our website! We have a written breakdown for topK there
Calling the Ad DB from the Click Processor Svc might not be the best pattern (the DB is shared between microservices); an area that could be improved is calling the Ad Placement Service, or some other service responsible for the ad metadata, and caching that url in Redis.
I can't access the answer keys and vote for questions page on the website. Don't know if that's by design or a bug.
Btw really love how you start from the "bad" but intuitive solution first and build on top of that
Should be fixed!
Great content! I wonder how long the ad impression IDs stay in the Redis database? I imagine they would expire after some time. If we imagine they expire after one day, I can imagine a malicious user who requests an ad, keeps the browser open for one day, then spams the ad with 1000s of clicks. Maybe the signed ad impression "token" should have an expiresAt to ignore those ad clicks, so that we can free up the Redis DB?
Exactly. The signature can have a timestamp to expire the impression after a few hours or day
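The reply above (a timestamp baked into the signature) can be sketched with HMAC. Everything here is illustrative: the secret, token format, and expiry are assumptions, not the video's actual implementation.

```python
import hashlib, hmac

SECRET = b"server-side-secret"   # assumed: held only by the ad server

def mint_impression_token(impression_id, now):
    """Sign the impression id together with an issued-at timestamp, so
    the expiry is tamper-proof without any server-side state."""
    payload = f"{impression_id}.{int(now)}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify(token, now, max_age_sec=86400):
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                        # forged or tampered token
    issued_at = int(payload.rsplit(".", 1)[1])
    return now - issued_at <= max_age_sec   # reject stale impressions

tok = mint_impression_token("imp-123", now=1_000_000)
fresh = verify(tok, now=1_000_000 + 3600)    # within a day
stale = verify(tok, now=1_000_000 + 90_000)  # past the 24h expiry
```

This also answers the Redis-sizing concern: entries only need to live as long as `max_age_sec`, since anything older is rejected by the signature check alone.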
The out-of-scope non-functional requirements seem to be more like out-of-scope functional requirements. I feel that (spam detection, demographic profiling, conversion tracking) are essentially features rather than characteristics of the system. How should I be thinking about this?
Honestly, fair point
If we don't use checkpointing, then if Flink goes down, how would it know after restarting which offset to resume from? Kafka itself does not manage offsets for Flink consumers directly. While Kafka maintains offsets in the __consumer_offsets topic for consumer groups, Flink does not rely on these offsets for its fault tolerance; instead, it uses the offsets stored in its checkpoints. Can you please clarify what you mean by saying that not using checkpointing is OK in case of Flink processor failures?
Flink would commit the offset to kafka upon flush. So if it goes down it knows where to start again based on the last completed write to the aggregated db
@@hello_interview Then why would we ever need checkpointing instead of always using Kafka offsets?
Isn't lambda architecture an extension to kappa?
Instead of Spark, could you use AWS Lambda (serverless) to do that job? Or directly send a task from the click processor service through a Kafka queue, for the item to be batch-added into an aggregated, read-optimized DB?
I should have watched the entire video before asking this question
Do you think knowing OLAP is important for a senior/staff role? Having no experience with analytics, I'd just go for an RDS - guess it'd probably be fine?
Yah you’d be fine likely :)
@@hello_interview Thanks, also thank you for all the resources you've created. Amazing.