System Design Interview: Design an Ad Click Aggregator w/ an Ex-Meta Staff Engineer
- Published: Jun 16, 2024
- 00:00 - Intro
- 01:55 - The Approach
- 04:16 - Requirements
- 10:49 - System Interface & Data Flow
- 14:12 - High Level Design
- 29:43 - Deep Dives
- 52:10 - Conclusion
A step-by-step breakdown of the popular FAANG+ system design interview question, Design an Ad Click Aggregator, which is asked at top companies like Meta, Google, Amazon, Microsoft, and more.
Evan, a former Meta Staff Engineer and current co-founder of Hello Interview, walks through the problem from the perspective of an interviewer who has asked it well over 50 times.
Resources:
1. Detailed write up of the problem: www.hellointerview.com/learn/...
2. System Design In a Hurry: www.hellointerview.com/learn/...
3. Excalidraw used in the video: link.excalidraw.com/l/56zGeHi...
4. Vote for the question you want us to do next: www.hellointerview.com/learn/...
Checkout the previous video breakdowns:
Ticketmaster: • System Design Intervie...
Uber: • System Design Intervie...
Dropbox: • System Design Intervie...
Connect with me on LinkedIn: / evan-king-40072280
Preparing for your upcoming interviews and want to practice with top FAANG interviewers like Evan? Book a mock interview at www.hellointerview.com.
Good luck with your upcoming interviews!
Finding your channel feels like finding gold!
There are a ton of SD videos on YouTube with shallow content, basically exactly what you mention a junior or mid-level candidate would do.
Going in-depth for senior and staff is one of the highlights of your content. Please continue doing that.
Also, please don't worry about the length of the videos. Keep the gold coming :)
@hello_interview waiting eagerly for next video
Watch this and then imagine if Evan puts together a System Design Learning Course. Just imagine that! I mean we (the learners) would jump on it so quickly. This is just absolutely amazing. It combines years of experience with a hands-on, practical approach that works, along with book content, presented in a very professional manner. Evan, think about it 🙂
Maybe one day! For now I'm just happy with all the people learning about Hello Interview and getting tons of free value.
One of the best channels on system design! Please keep going!
Thanks a lot for uploading these videos. They are very informative. Keep doing the good work.
This is by far the best system design interview I've ever seen on the internet. Keep doing the great work, sir...
Absolutely brilliant content, mate. Keep 'em coming. The only channel I have notifications turned on for.
This was such a pleasure to watch. Thank You. I would love to see a video on a metrics monitoring system. There will be some common components with ad-click aggregators.
Honestly, it is the best SD showcase I've ever seen. You are the best. I watched all your videos and then whiteboarded them myself. Thank you!
So glad you like them and very smart to try them yourself and not just blindly consume!
Thanks for the detailed explanation! Definitely learned some new things in this video.
Thanks so much for doing this! Greatly appreciated! By far the best system design videos I've seen.
Really helpful videos - especially the breakdown for different expectations at mid/senior/staff levels, common things you see from candidates, and context into the actual systems like the shard limits for events streams. I used to work on Kinesis - happy you chose it!
How cool! That must’ve been fun to work on :)
Love these! And can't recommend the Hello Interview mock interviews enough!
Wahoo thanks Ben!
This video helped me ace the system design interview. The detailed explanations provided in-depth knowledge of various components, which was extremely helpful for answering follow-up questions during my interview.
I saw some videos and your content is so great. Thank you so much for clarifying the SQL vs NoSQL debate. I always thought that bringing that into an interview was irrelevant but was afraid to do it. 😅
Keep up the amazing work.
Yah funny how that was evangelized in a couple books and then just stuck
Incredible video with excellent drawing and explanation.
Hey, I love these videos. I only used your videos and Designing Data-Intensive Applications, and that was enough for an E4 offer at Meta. I love the advice you give and the common pitfalls you point out.
Crushed it. Congrats on your offer!
I love this channel. Very good job, sir; your strategy is really good and comprehensive. Straight to the main points. Bravo.
These are Excellent! Please keep going.
Ah, nice, you re-uploaded! Thanks a lot for taking the feedback and acting quickly on this. And sorry if it caused inconvenience for you 😄 Thanks a lot for all of your hard work. 🙏
Thanks so much for calling that out! Glad to get it fixed within the first day :)
Is e-commerce (design amazon / ebay) not as common as it once was?
Excellent walk-though!
Looking forward to your next videos. Please upload more design problems. It's been almost a month since you last uploaded. Love your content.
Sorry, was traveling. Recording a video today! Up by EOW
Looking forward to more great videos from you! :)
Thank you for a great video.
For a senior candidate it would be helpful, in my opinion, to narrate the data structures that underpin these solutions in addition to the supporting vendor products/technologies. For instance, for fast aggregation of counters, one could demonstrate the use of fast but slightly imprecise counters using a count-min-sketch-like data structure, and for a slower but more accurate count, the use of MapReduce jobs. Aggregates and statistics over varying windows are almost a necessity for contemporary monitoring, analytics, and ML systems. And in such cases, retaining data in memory, backed by persistent storage, in the form of tree data structures keyed by aggregation windows is useful for range queries at varying granularities. E.g.: root window [0-100), immediate child node windows [0-50), [50-100), etc.
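To make the fast-but-imprecise counter idea concrete, here is a minimal count-min sketch in Python. This is a sketch only; the class name, widths, and hashing scheme are illustrative choices, not anything from the video:

```python
import hashlib

class CountMinSketch:
    """Approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent hash per row, derived from a salted SHA-256.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum across rows limits the damage from collisions.
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("ad_123")
print(cms.estimate("ad_123"))  # 1000 (exact here; collisions can only inflate it)
```

Memory stays fixed at width × depth counters no matter how many distinct ad IDs flow through, which is what makes this attractive for hot-path aggregation.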
It could be helpful to talk about idempotency within queue consumers. And also out-of-sequence arrival of events in the queue (handled through watermarking)
Can have some future videos which go deeper on probabilistic data structures or other more foundational topics.
You honestly have the best content on system design. Can you do a playlist on the system design topics themselves?
I mean videos where you discuss replication in depth, concurrency, etc.
Will definitely consider this!
These interview preps make it feel like if you have enough system design knowledge, good cross-team examples for BQ, and can solve LeetCode medium-hards fast, you can get to a higher level quicker than by going through internal promotions.
Great video!
21:48 I like how a DB can be used for the simplest case consistently in these approaches
Amazing content! Very much appreciate you posting these 🙌
System design padawan here. I have a question about the hybrid approach .. what makes us trust the "Kinesis -> Connector -> S3 -> Spark -> Worker -> OLAP" results more than the "Kinesis -> Flink -> OLAP" results? Is it a guarantee where the connector always successfully writes the data to S3? or does Flink make some kind of tradeoff for speed? kind of confused about that piece and figured i'd ask. thanks again!
I think a bloom filter would be a good choice for checking duplicate impression IDs. I think it's also supported by Redis.
When will you post the next interview video? I've been waiting about a month! Really appreciate the effort.
Tomorrow!!
this is amazing!
The system design videos on this channel are the best out there. Thanks for putting in so much time!
I did have a question regarding the proposed reconciliation architecture: I get that data accuracy is important and it acts as sort of an "Auditor" in our system. However, you mentioned that errors might stem from e.g. bad code pushes or out-of-order events in the stream.
The proposed reconciliation architecture would really only fix issues that occur *within Flink* though, right? At the end of the day, the Spark job is still acting upon that same data from Kinesis, so in case of out-of-order events or bad code pushes, it would also be affected, no?
Yah if you messed something up in the server before kinesis you’d be screwed still. But you’d want to keep that as simple as possible. You can trust kinesis will be reliable, out of order won’t matter for reconciliation.
@@hello_interview got it. Thanks for the quick response. :)
The best system design content.
Thanks a lot for helping me prepare for my upcoming interviews.
Can you please clarify the difference between product design and system design at Meta?
www.hellointerview.com/blog/meta-system-vs-product-design :)
Thanks, pls upload more
On it! One every 3 weeks :)
Thanks for the great videos - they are extremely helpful.
I noticed at around 24 mins in you mention querying for the number of distinct userIDs. I don't think you're going to be able to serve queries like that using the pre-aggregation you suggest doing with Flink. I don't know a good solution to this problem other than dumping the list of userIDs for each minute window into your OLAP DB. You might be able to use HLL hashes for this, but depending on how granular the dimensions are in your DB, it may not be worth it.
I think it's at least worth mentioning this if we think users care about unique counts.
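To illustrate the HLL suggestion, here is a toy HyperLogLog. The register/rank mechanics follow the standard algorithm, but this is a sketch; in practice you'd lean on Redis PFADD/PFCOUNT or your OLAP store's approximate-distinct function rather than rolling your own:

```python
import hashlib
import math

class HyperLogLog:
    """Estimates distinct count with ~1.04/sqrt(m) relative error."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # 1024 registers, one byte each
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                        # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 54 bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of first 1 bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:               # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user_{i}")
print(round(hll.estimate()))  # within a few percent of 50000
```

A thousand or so one-byte registers per (ad, window) is far cheaper than storing the userID list, and HLL sketches merge cleanly across windows, which is exactly what range queries at varying granularities need.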
Thank you and great video. Can we assume the click processor service will scale? Can we make this serverless, or shard it and place it behind another LB?
Yah I breeze over this at one point in the video. Easy for it to horizontally scale.
I rewatched and had some new thoughts. I wonder what the costs of using the streaming solution are. It seems like the database for clicks that was used in the batching solution is completely replaced by the streaming components, so the benefits of the previous database queries are lost?
34:52 The streaming solution achieves real time by dumping to OLAP?
Hey Evan, I'm not preparing for an interview, but I find these videos incredibly helpful. I'm an L5 at Amazon trying to learn more about systems design. I see mock interviews as a great way to solidify my understanding of concepts that I'm reading about in books, because, after all, it's very difficult to get hands-on experience in actually building these big systems. I'd find it super valuable if I could self-study a design pattern, like event streaming, then do a mock interview on a related problem to test my understanding, like when to choose lambda vs. kappa vs. hybrid architecture. Does Hello Interview offer this?
Hey! Kudos for the focus on continuous learning, and glad to hear you're finding the videos useful. There is absolutely nothing stopping someone from doing a mock who does not have an upcoming interview! Of course, the sessions are tailored toward making sure you know all the tricks to pass the interview, but you could always give your coach a heads up that you're more interested in just evaluating your design skills.
Fantastic work. Is it safe to assume this is a regional setup, and we need cross region synchronous replication mechanisms on the choice of OLAP to allow the query service layer to be consistent across regions? I mean, a write locally, read anywhere type of architecture for OLAP needs to be called out in the deep dive right?
Yah, this does come up sometimes in the interview depending on the company. Most common at Google. Write locally, read globally is sufficient.
@@hello_interview Great. Thank you for your reply.
Great Vid, redis cache needs expiry, how do we manage eviction?
For the idempotency keys? Just have a reasonable TTL based on cost constraints
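Concretely, the idempotency-key pattern boils down to a single atomic set-if-absent with an expiry (with real Redis, `SET key 1 NX EX ttl`). The in-memory TTLStore below is a hypothetical stand-in so the sketch runs without a server:

```python
import time

class TTLStore:
    """In-memory stand-in for Redis SET ... NX EX (illustrative only)."""

    def __init__(self):
        self._expiry = {}

    def set_nx_ex(self, key, ttl_seconds):
        now = time.monotonic()
        exp = self._expiry.get(key)
        if exp is not None and exp > now:
            return False                     # unexpired key exists: duplicate
        self._expiry[key] = now + ttl_seconds
        return True                          # first sighting within the TTL

def process_click(store, impression_id, ttl=24 * 60 * 60):
    # With redis-py this would be: r.set(impression_id, 1, nx=True, ex=ttl)
    if not store.set_nx_ex(impression_id, ttl):
        return "duplicate-dropped"
    return "counted"

store = TTLStore()
print(process_click(store, "imp-1"))  # counted
print(process_click(store, "imp-1"))  # duplicate-dropped
```

The TTL is what bounds memory: you only pay to remember a key for as long as a duplicate click could plausibly arrive.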
Thank you for the video! I am learning a lot from this!
Btw, I have a question on the Lambda vs Kappa architecture. If the Lambda architecture is the combination of real-time and batch processing, then isn't your approach just the Lambda architecture?
Yah, bit of nuance here, nuance that I don't think is all that important frankly, but: while we do have both real-time (Flink) and batch processing (Spark), the integration and reliance on real-time stream processing make it lean more towards a Kappa-like approach. The batch layer is secondary and primarily for reconciliation, not a core component. Hence, it's a hybrid but not a pure Lambda architecture.
@@hello_interview That makes sense. In the real Lambda architecture, we rely more on the batch? I.e., it's going to run periodically to fix things up.
Hi, I may have missed this: is it possible to use Apache Kafka + ksqlDB to build aggregations with materialized windowed tables, where you can also set a flush interval? Is that acceptable for such an interview?
Please can you clarify this? You mentioned the count query on cassandra will be really slow. Would it really be slow? If the partition key is ad_id and the sort key is timestamp. I assume all the data for the same id will be on the same partition sorted by timestamp. Why would it be slow?
Instead of Spark, can you use AWS Lambda serverless to do that job? Or directly send a task from the click processor service to a Kafka queue, for the item to be batch-added into an aggregated, read-optimized DB?
I should have watched the entire video before asking this question
Awesome content. Would it be possible for you to make a video on a video service like Netflix with focus on uploading and streaming?
I think it’s already on the list, but you can vote for it via the link in the description!
Do we need to add a hop by forwarding events into Kinesis? Is it perhaps a better idea to fan out the click event to the processor for redirects, as well as to Kinesis, for better throughput?
Not sure I follow, mind rephrasing? :)
In order for the event to get to kinesis, it looks like it goes through a middle service.
Is it possible to route it directly to kinesis from the load balancer as opposed to through the middle service
The last Redis piece you put there probably needs to be cleaned up at some point. Also, what happens when it goes down? I would probably add another dedup point along the way, maybe in the reconciliation layer. Or add another layer just for that.
Can you cover an authentication-related system? You previously mentioned using tokens, but I'd like something more in-depth pls
Feel free to vote for new questions for me to do here! www.hellointerview.com/learn/system-design/answer-keys/vote
For Kinesis hot shards, we don't know if an ad is hot beforehand. So are these ad_id 0-N always active? Is it OK to use 10x the number of streams we need under normal circumstances?
For Flink, do we have the same number of Flink servers as Kinesis shards? If a server dies, how will the new server keep track of the pointer from the old server? Are they stateful backups instead of stateless?
This is a great question. In reality you can make predictions here. We know, based on budget and historical performance, which ads we'd need to be worried about beforehand.
I can't access the answer keys and vote for questions page on the website. Don't know if that's by design or a bug.
Btw really love how you start from the "bad" but intuitive solution first and build on top of that
Should be fixed!
That's a very informative video! Two questions: 1. To solve the idempotency issue, should this ad impression ID be user-unique? Otherwise, we'd have to check whether the combination of ad impression ID and user ID exists in Redis to know if a user has clicked on this specific ad before. 2. You talked about the Kappa and Lambda architectures and said the solution uses a hybrid of the two. I'm not that familiar with them, but after doing some research, I feel this approach uses the Lambda architecture, since Lambda has both a batch layer and a streaming layer and merges batch results with streaming results to show a unified result to the user.
Yes, the dedup key (ad impression ID) needs to be user-unique. Good question on the architecture; there are a couple of related questions below in the comments where I share my answer. Sorry for making you scroll, just easier than re-typing :)
I believe Lambda uses probabilistic data structures
I honestly feel you should hire @Jordan Has No Life as a system design expert on your channel. The depth of system design in his videos is quite good and honestly meets the bar for a senior engineer. As for Staff SWE expectations, well, that honestly depends on the individual. I think it can only come from experience or from reading books such as Database Internals and/or DDIA. No amount of videos can make up for the Staff SWE expectations in system design.
We love Jordan ♥️
@@hello_interview Me Too!! That guy's an OG in System Design.
Evan, will we be seeing the rest of the write-ups in video format too in the coming days?
One every 3 weeks or so
@@hello_interview when can we see the next video?
25:20 I thought that for the DB, time-series databases can write fast and also handle range-based queries quickly? Or some wide-column databases.
yah can be a good consideration. don't know enough about the ins and outs of popular TS DBs to offer a strong justification either way
Thanks! I see it getting name-dropped in a lot of books, but outside the books I haven't seen it a lot.
Can the same design be used for a Top K Service that finds the top K videos per minute (aggregation window 1 min), per hour (aggregation window 1 hr with checkpointing), and per day (aggregation window 1 day with checkpointing), storing them in a Redis cache for the "Top K service" to query? And for longer time periods like 1 year or all-time, a daily cron job could query the OLAP DB and update the Redis cache.
Actually, check out our website! We have a written breakdown for Top K there.
The out-of-scope non-functional requirements seem to be more like out-of-scope functional requirements. I feel that (spam detection, demographic profiling, conversion tracking) are essentially features rather than characteristics of the system. How should I be thinking about this?
Honestly, fair point
Question: For the sharding while processing the events through Kinesis, the adId was suggested as the sharding key. This doesn't look like the best approach. At scale, millions of ads are being run on the platform and a good share of them have high enough volume. Going by the presented logic, the number of shards would explode. What do you think about this?
Why not use Kafka for storing/doing streaming aggregations into the OLAP?
Ah, spoke too soon lol :)
Can you tell me which tool you're using for drawing? TIA
Excalidraw. File linked in description.
The extra compute in click processor to check legitimacy of ad impressions based on signed impressions is still likely vulnerable to DOS attacks. Perhaps that should have been stated as out of scope.
Fair!
I think if you use gateways like Amazon's managed ones, they do a great job of preventing DOS attacks as well, which is an additional capability they provide on top of routing.
Calling the Ad DB from the Click Processor Svc might not be the best pattern (the DB is shared between microservices); an area that could have been improved by calling the Ad Placement Service or some other service responsible for the ad metadata, and caching that URL in Redis.
A bit unrelated question: do you feel it's worth trying to do some self-learning about ML/AI and attempting to switch to that area?
And how do you feel about the market in that regard
Thanks
Hmm, maybe possible for a start up. Would be really difficult to pull off for FAANG or FAANG adjacent. The easier path is to work on a ML infra team and spend time closer to the modeling to learn that way. This is actually what I did. I don't claim to be an ML engineer, but I got a lot of exposure working on a team alongside ML PHDs and doing applied ML off and on. Then the switch internally becomes more natural.
Thank you!
Would a question like this be asked for Product Architecture or System Design interview?
System design in meta world.
Can you please create a video on an ad blocker?
Add it to the backlog via the link in the description!
Why do we need OLAP here? The query service can query Flink directly.
I talk about this a bit in the video. The main two reasons I'd advise against it are contention and isolation. If the click stream breaks, advertisers can still query data; if the aggregated DB goes down, we still track clicks.
good good
Can you please upload system design interviews for all the basic topics?
What did you have in mind?
@@hello_interview I really like the way you differentiate the expectations and learning required for different levels. Designs like a Rate Limiter, Amazon (scale during big sale days), Payments, etc. would be very helpful.
I didn't quite follow the point around not needing checkpoints in Flink. If a node goes down and then comes back up, are we just accepting that the data is lost and relying on the reconciliation worker to fix it? It doesn't seem obvious why checkpointing wouldn't make sense here.
We have retention on our stream, so we’ll just pick back up reading the data from the start of the minute again (or as far back as we missed)
@@hello_interview But how would we know where (which minute) to pick up from if we never checkpointed the state of the Flink node? As far as I understand, checkpointing will usually store something like the queue offset (or in this case maybe the last full minute we processed?) to know more-or-less where we've got up to with the previous node that failed.
If we're not using checkpointing, I'm a little lost about how we'd recover
My interpretation is that the stream has a cursor that tells the system where it's out of date and where the recovery starting point is.
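A toy simulation of that rewind-and-replay recovery (the offsets, window sizes, and counts below are made up purely for illustration):

```python
# Retained stream: (offset, ad_id, minute). Ten events per "minute" window.
stream = [(i, "ad_1", i // 10) for i in range(35)]

def aggregate_from(stream, start_offset):
    """Re-read the retained stream from start_offset and rebuild counts."""
    counts = {}
    for offset, ad_id, minute in stream:
        if offset < start_offset:
            continue  # already flushed downstream; safe to skip
        counts[(ad_id, minute)] = counts.get((ad_id, minute), 0) + 1
    return counts

# Old node flushed minutes 0-1 (offsets 0-19) to the OLAP DB, then died
# mid-minute-2. The replacement rewinds to the start of minute 2 and
# re-reads; the partially built in-memory window is simply recomputed.
recovered = aggregate_from(stream, start_offset=20)
print(recovered)  # {('ad_1', 2): 10, ('ad_1', 3): 5}
```

The point is that no fine-grained checkpoint is needed as long as the stream's retention covers the largest aggregation window: the effective "checkpoint" is just the last window boundary that was flushed.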
Would this show up in product design interviews?
Unlikely.
Re-upload? I thought I saw this video 10 hours back, or am I dreaming? 🤣🤣
Yes, the old version of the video had some editing issues
Yup, missed including a deep dive while editing :)
Why do you put OLAP as a DB instead of giving the DB name? OLAP is not a DB, and it gets confusing when you say that as a Staff engineer.
It's a quality of a database. Choosing a specific technology is oftentimes less interesting than articulating the qualities you'd select for. I mentioned some specific example databases verbally, if I remember correctly.
I don't have knowledge about Flink and Kinesis; in fact, I'd never heard of them prior to going through this video. Does that almost certainly mean I'm going to tank the Staff-level interview? What's the best way to handle such a scenario?