System Design Interview: Design a Web Crawler w/ a Ex-Meta Staff Engineer

  • Published: Jun 28, 2024
  • 00:00 - Intro
    01:58 - The Approach
    04:08 - Requirements
    10:31 - System Interface & Data Flow
    14:48 - High Level Design
    18:20 - Deep Dives
    1:04:09 - Conclusion
    A step-by-step breakdown of the popular FAANG+ system design interview question, Design a Web Crawler, which is asked at top companies like Meta, Google, Amazon, Microsoft, and more.
    Evan, a former Meta Staff Engineer and current co-founder of Hello Interview, walks through the problem from the perspective of an interviewer who has asked it well over 50 times.
    Resources:
    1. Detailed write up of the problem: www.hellointerview.com/learn/...
    2. System Design In a Hurry: www.hellointerview.com/learn/...
    3. Excalidraw used in the video: link.excalidraw.com/l/56zGeHi...
    4. Vote for the question you want us to do next: www.hellointerview.com/learn/...
    Check out the previous video breakdowns:
    Ticketmaster: • System Design Intervie...
    Uber: • System Design Intervie...
    Dropbox: • System Design Intervie...
    Ad Click Aggregator: • System Design Intervie...
    Connect with me on LinkedIn: / evan-king-40072280
    Preparing for your upcoming interviews and want to practice with top FAANG interviewers like Evan? Book a mock interview at www.hellointerview.com.
    Good luck with your upcoming interviews!

Comments • 77

  • @_launch_it_
    @_launch_it_ 11 days ago +10

    I had an interview last Friday (June 14) and I followed your exact steps. The question was to design the Ticketmaster. The Redis cache solution was the best. Thank you for these amazing videos

  • @jk26643
    @jk26643 9 days ago +2

    Please please keep posting more! It educates so many people and you make the world better!! :) Absolutely the best system design series!

  • @crackITTechieTalks
    @crackITTechieTalks 11 days ago +2

    I don't often comment on videos, but I couldn't resist commenting on yours just to say "What valuable content". Thanks a lot for all your videos!! Keep doing this.

  • @rupeshjha4717
    @rupeshjha4717 9 days ago +1

    Bro, pls don't stop posting this kind of content, I've really loved all of your videos so far.
    I can relate to the small but impactful problems and solutions you mention in your videos, which indirectly affect the interviews.

  • @vigneshraghuraman
    @vigneshraghuraman 12 days ago +2

    by far the best System design interview content I've come across - please continue making these. you are doing an invaluable service!

  • @qwer81660
    @qwer81660 12 days ago +3

    By far the most inspiring, relevant and practical system design interview content. I found them really useful to perform strongly in my system design interviews

  • @TheKarateKidd
    @TheKarateKidd 6 days ago

    This is the first video of yours I watched and I loved it. Your pace is just right and you explain things well, so I didn't feel overwhelmed like I usually do when I watch systems design videos. Thank you!

  • @user-qc8mx8hu5c
    @user-qc8mx8hu5c 7 days ago +1

    Again, the best System Design interview overview I have ever come across. Please keep doing it for us!

  • @davidoh0905
    @davidoh0905 12 days ago +1

    This is such a great example for any kind of data application that needs asynchronous processing! Widely applicable!

  • @alirezakhosravian9449
    @alirezakhosravian9449 11 days ago

    I'm watching your videos to get prepared for my interview 4 days later, I hope I'll be able to handle it :DDD , so far the best SD videos I could ever find on youtube.

  • @Global_nomad_diaries
    @Global_nomad_diaries 12 days ago +3

    Soo soo soo much thankful I am for all this content.

  • @krishnabansal7531
    @krishnabansal7531 3 days ago

    Suggestions:
    Please mention which clarifying questions should be asked for a specific problem. Even if the problem is well known, the panel still expects the candidate to ask a few clarifying questions, especially a senior candidate.
    Also, if you can cover company specific expectations (if any) for top MAANG companies, that would be excellent.

  • @FizuliValizada
    @FizuliValizada 11 days ago

    Nice, thanks for the content. I also really appreciated the videos from the mock interview. I found that much more useful and would love to see more of those.

    • @hello_interview
      @hello_interview  11 days ago +1

      Tougher there for privacy reasons. Requires explicit sign off from coach and candidate, but I'll see what I can do :)

  • @chongxiaocao5737
    @chongxiaocao5737 12 days ago

    Finally a new update! Appreciate it!

  • @dibll
    @dibll 5 days ago

    Hope you can create videos of the write ups done by other authors on HelloInterview in the near future. Love the content. Thank you!!

  • @zy3394
    @zy3394 7 days ago

    love your content , learned a lot, please keep updating more. ❤

  • @TheKarateKidd
    @TheKarateKidd 6 days ago

    One of the first things that came to mind in the beginning of this problem is dynamic webpages. Most websites don't display the majority of their content on simple HTML. To be honest if I was interviewing a senior or above level candidate, not mentioning dynamic content early on would be seen as a red flag. I'm glad you included it at the end of your video, but I do think it is important enough to be mentioned early on.

  • @CS2dAVE
    @CS2dAVE 11 days ago

    S Tier system design content! Another exceptional video 👏

  • @davidoh0905
    @davidoh0905 12 days ago +1

    Just in time!!!!

  • @zayankhan3223
    @zayankhan3223 10 days ago

    This is one of the best system design interview videos. Kudos to you. I would like to understand a little more about how we handle duplicate content. What if the content is 80% the same on two pages? A hash will work only when the pages are exactly the same.

  • @vamsikrishnabollepalli4908
    @vamsikrishnabollepalli4908 1 day ago

    Can you also provide system design interview flow and product design interview flow for each problem?

  • @sanketpatil493
    @sanketpatil493 9 days ago

    Can not thank you enough for all this valuable content. Just amazing work!
    Btw can you share some good resources for preparing for the system designs interview? Books, courses, engineering blogs, etc.
    A dedicated video would be much more helpful!

    • @hello_interview
      @hello_interview  9 days ago

      I'm certainly biased, but I think our content is some of (if not the) best out there, so I would start at www.hellointerview.com/learn/system-design/in-a-hurry/introduction.
      There are also some useful blogs on system design, depending on your level, which can be found at www.hellointerview.com/blog,
      all written by either me or my co-founder (ex-Meta sr. hiring manager).

  • @georgepesmazoglou4365
    @georgepesmazoglou4365 11 days ago +1

    Great design! I wonder why there was never a mention of doing the whole thing with Spark, using offline batch jobs rather than real-time services?

    • @afge00
      @afge00 11 days ago

      I was thinking about batch as well

    • @hello_interview
      @hello_interview  11 days ago

      Interesting. You know, as many times as I’ve asked this, no one has ever proposed it. Off the top of my head, I see no obvious reason why you couldn’t get it to work, especially for just a one-off.

    • @georgepesmazoglou4365
      @georgepesmazoglou4365 11 days ago +1

      @@hello_interview I do crawling for a large company. Typically you would do something like the video's design when you care about data freshness. If you don't care about that, as in the LLM use case, you would do a Spark-style job where you just split the work across a bunch of workers; you can have the HTML fetching and processing parts in different stages. Your inputs can be the URLs and the previously crawled pages, which you join so that you crawl only new URLs, or recrawl URLs only after some time has passed since their last crawl. The main disadvantage compared to your design is that you are not as fault tolerant, as you can't do much in terms of checkpointing. Also it is less fun to discuss :)
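
The join described above (crawl new URLs, recrawl old ones only after some time has passed) can be sketched in plain Python. This is only an illustration of the selection logic, not the commenter's actual pipeline: the function names, the 30-day recrawl window and the round-robin split across workers are all assumptions, and a real batch job would express the same join in a framework such as Spark.

```python
from datetime import datetime, timedelta


def select_urls_to_crawl(candidate_urls, last_crawled, now, recrawl_after):
    """Keep URLs that were never crawled or whose last crawl is too old."""
    selected = []
    for url in candidate_urls:
        previous = last_crawled.get(url)  # None means the URL is new
        if previous is None or now - previous >= recrawl_after:
            selected.append(url)
    return selected


def split_into_batches(urls, num_workers):
    """Round-robin the selected URLs across a fixed number of workers."""
    return [urls[i::num_workers] for i in range(num_workers)]


# Hypothetical inputs: a candidate URL list and a table of previous crawl times.
now = datetime(2024, 6, 28)
last_crawled = {
    "https://example.com/a": now - timedelta(days=40),  # stale, recrawl
    "https://example.com/b": now - timedelta(days=2),   # fresh, skip
}
candidates = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",  # never crawled
]

to_crawl = select_urls_to_crawl(candidates, last_crawled, now, timedelta(days=30))
batches = split_into_batches(to_crawl, num_workers=2)
```

Because each batch run recomputes the selection from the stored crawl times, a failed run can simply be restarted, at the cost of redoing work that was not yet recorded.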

  • @sonmanutd
    @sonmanutd 6 days ago

    Wow, the amount of depth here is absolutely insane. How did you compress so much information into a 1 hour interview? I learned so much from this video that I have never seen elsewhere, and it is all presented so elegantly and naturally. The speaker speaks clearly, no ums and ahs, no speed-up. You must be a great engineer at work!
    One thing I am a bit unsatisfied with is duplicated content. Is it even likely that we have completely duplicated content? Even when two web pages are different, I think they might differ in just a few places. That would completely break our hash function, right?
    Do you know of any hash function that would allow two webpages that are mostly similar to be close together? Do you see any role in word2vec or vector storage here?

    • @ronakshah725
      @ronakshah725 7 hours ago

      I think this is a great question! I want to attempt to answer this, but I’m no expert haha.
      As the goal of this particular system is to train language models, it’s nice to understand if optimizing for “similar” web pages is necessary for our top level goal.
      In general, it could be helpful to prioritize learning based on shards of text that appear in many pages. But we have to remember that connecting back to the source could also be required later, for things like citations. So we have to be a bit smart about this. TL;DR it’s a can of worms and I would try to better understand the priority of this compared to existing requirements of the system.

    • @ronakshah725
      @ronakshah725 7 hours ago

      This isn’t skirting off the question, but it’s a good step towards delivering our final solution.
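
On the question of a hash that keeps mostly similar pages close together: one well-known family is locality-sensitive fingerprints such as SimHash, where near-duplicate documents tend to differ in only a few bits. The sketch below is a minimal, unoptimized version using word shingles; it is not what the video proposes, and the shingle size and the 64-bit width are assumptions chosen for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; assumed for this sketch


def shingles(text, k=3):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)]
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text):
    """Each shingle votes +1/-1 per bit; the sign of each total sets the bit."""
    counts = [0] * BITS
    for shingle in shingles(text):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page = "the quick brown fox jumps over the lazy dog near the quiet river bank"
near_copy = page + " today"
distance = hamming_distance(simhash(page), simhash(near_copy))
```

Identical pages always give a distance of 0. For near-duplicates the distance is usually small, but the threshold that counts as "the same page" has to be tuned on real data.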

  • @letsgetyucky
    @letsgetyucky 12 days ago +2

    commenting for the algo. thanks for excellent and free content!

    • @hello_interview
      @hello_interview  12 days ago

      Legend 🫡

    • @letsgetyucky
      @letsgetyucky 12 days ago +1

      ​@@hello_interview Feedback: really enjoyed the video! Would love if future videos were also mostly skewed towards deep dives. Suggesting other topics to research yourself (or hash out with others in the comments) is also super valuable. Finally, calling out the anti patterns that are being regurgitated (e.g. bloom filters) is very valuable as well.

    • @davidoh0905
      @davidoh0905 12 days ago

      @@letsgetyucky are Bloom filters an anti-pattern!? Just curious!

    • @letsgetyucky
      @letsgetyucky 12 days ago

      @@davidoh0905 during the deep dive Evan says that Bloom filters are commonly used in interviews because they appear in the solutions in popular interview prep books. But the interview prep books don't do a great job of discussing the tradeoffs of using a Bloom filter vs. more practical solutions. It's a nice theoretical solution, but in a real-world system you could do something simpler and just brute-force the problem.
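
To make the tradeoff concrete, here is a minimal Bloom filter next to the "simpler" option of an exact set. The class, its sizing parameters and the example URLs are assumptions for illustration; in the real system the exact set would be something like an indexed database column rather than an in-memory `set`.

```python
import hashlib


class BloomFilter:
    """Compact membership test: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter(num_bits=10_000, num_hashes=4)
exact = set()  # the brute-force alternative: store every key exactly

for url in ["https://example.com/a", "https://example.com/b"]:
    bloom.add(url)
    exact.add(url)
```

The Bloom filter uses a fixed 1,250 bytes here no matter how many URLs are added, but `might_contain` can wrongly report an unseen URL as seen, which for a crawler means silently skipping a page. The exact set never does that, at the cost of memory proportional to the number of URLs.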

  • @dibll
    @dibll 7 hours ago

    Not related to this video in particular, but I have a question about partitioning. Let's say we have a DB with 2 columns, firstname and lastname. When we say we want to prefix the partition key, which is firstname, with lastname, does that mean all similar lastnames will be on the same node? If yes, what happens to the firstnames, and how will they be arranged? Thanks

  • @davidoh0905
    @davidoh0905 12 days ago

    If Kafka does not support retry out of the box, what exactly does that mean? If you do not commit, the offset does not get moved, which could potentially serve as a retry(?) Also, could you compare this with some other queueing service that allows for retry, like SQS maybe? A comparison of when to use Kafka vs. SQS would be really good too! Message broker vs. task queue might be their most frequent use cases, but it might be good to provide justifications in this scenario!
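
When the queue itself offers no retry policy, one common workaround is for the consumer to re-enqueue a failed message with an attempt counter and divert it to a dead-letter list once a limit is reached. The sketch below simulates that with an in-memory `deque` standing in for the queue or topic; it does not use any Kafka or SQS API, and `MAX_ATTEMPTS`, the message shape and the `fetch` handler are all hypothetical.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit


def process_with_retries(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Drain the queue, re-enqueueing failures until max_attempts is hit."""
    succeeded, dead_letter = [], []
    while queue:
        message = queue.popleft()
        try:
            handler(message["url"])
            succeeded.append(message["url"])
        except Exception:
            attempts = message["attempts"] + 1
            if attempts >= max_attempts:
                dead_letter.append(message["url"])  # give up on this URL
            else:
                queue.append({"url": message["url"], "attempts": attempts})
    return succeeded, dead_letter


# Hypothetical handler: one URL always fails, one fails only on its first try.
failures_left = {"https://example.com/flaky": 1}


def fetch(url):
    if url.endswith("/down"):
        raise ConnectionError(url)
    if failures_left.get(url, 0) > 0:
        failures_left[url] -= 1
        raise TimeoutError(url)


queue = deque(
    {"url": url, "attempts": 0}
    for url in [
        "https://example.com/ok",
        "https://example.com/flaky",
        "https://example.com/down",
    ]
)
succeeded, dead_letter = process_with_retries(queue, fetch)
```

A managed task queue can take over parts of this bookkeeping (redelivery counts, dead-letter routing), which is one axis along which the Kafka vs. SQS choice can be justified.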

  • @healing1000
    @healing1000 5 days ago

    Thank you!
    To avoid duplicate URLs, do we need to discuss using a cache, or is it OK to use only the database?

    • @hello_interview
      @hello_interview  3 days ago

      Same convo as the duplicate content. Cache is certainly an option. The DB index is enough imo.

  • @nanlala3171
    @nanlala3171 11 days ago

    I saw you used many AWS services during your design. Is it good practice to use specific products and their features (DLQ/SQS, GSI/DynamoDB) in the design? What if the interviewer has never used these products and has no concept of these services/features?

    • @hello_interview
      @hello_interview  11 days ago +2

      Depends on the company, but in general, yes. Importantly, though, don't just name the technology. The important part is that you understand the features and why they'd be useful. For example,
      Bad: I'll use DynamoDB here
      Good: I need a DB that can XYZ. DynamoDB can do this, so I'll choose it.

  • @Global_nomad_diaries
    @Global_nomad_diaries 12 days ago +1

    Can this be asked in product architecture interview at Meta or just system design?

    • @hello_interview
      @hello_interview  12 days ago

      Should be system design, not product architecture, in the Meta world. But, you never know, some interviewers go rogue.

  • @mdyuki1016
    @mdyuki1016 11 days ago

    What's the reason for not storing URLs in a database like MySQL? For retrying, just add a column like "retry times".

    • @hello_interview
      @hello_interview  11 days ago

      I mention this at some point, I believe, when discussing the alternate approach of having a "URL Scheduler Service." They have to get back on the queue somehow, so either directly or via a scheduler where state is in the DB.

  • @mularys
    @mularys 12 days ago +2

    Here are my concerns: your solution is so nice, but if everyone is going to talk about the same thing during the interview, especially when one is driving the process, will it raise any red flags on the hiring committee side as they might think candidates are referring to the same sources?

    • @hello_interview
      @hello_interview  12 days ago +5

      This is not meant to be a script. If your plan is to regurgitate this back to an interviewer I’d recommend not doing that. Instead it’s a teaching resource to learn about process, technologies, and potential deep dives. If you get this problem, then sure, talk about some of this stuff, but also let it be a conversation with the interviewer

    • @rostyslavmochulskyi159
      @rostyslavmochulskyi159 12 days ago

      But is there an issue if you answer all/most of the interviewer's questions correctly? I believe it is an issue if you memorise this but can’t go any further; if you can, there is nothing wrong.

    • @mularys
      @mularys 12 days ago

      @@hello_interview Yeah, makes sense. You present a good framework to structure the talking points that candidates can bring up. And I found it pretty useful. My system design question is the top-k video and I followed the key points you mentioned. My target is E5 and the interviewer just had a handful of follow-up questions (90% of the time I was talking). Eventually, I passed that round with a "strong hire". Of course, I added my points of view during the interview, but I feel like I was just taking something off the shelf.

  • @evalyly9313
    @evalyly9313 2 days ago

    So, to give the right back-of-the-envelope estimate, the person needs to know that an AWS instance's capacity is 400 Gbps. I don't have this memorized; is it OK to ask or look it up during the interview, or is this something we should keep in mind?

    • @hello_interview
      @hello_interview  2 days ago

      I think it’s useful to have some basic specs as a note maybe on your desk when interviewing. But it’s also ok to ask. The intuition that caches can hold up to around 100 GB and DBs up to around 100 TB is good to have though.
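
The 400 Gbps figure can be plugged into a quick estimate. A minimal sketch: the page count, average page size, utilization factor and machine count below are illustrative assumptions, not numbers taken from this thread.

```python
def crawl_days(num_pages, avg_page_bytes, link_gbps, utilization, num_machines):
    """Days needed to download num_pages at the given effective bandwidth."""
    total_bits = num_pages * avg_page_bytes * 8
    effective_bits_per_s = link_gbps * 1e9 * utilization * num_machines
    seconds = total_bits / effective_bits_per_s
    return seconds / 86_400  # seconds per day


# Assumed workload: 10 billion pages of 2 MB each, 30% usable bandwidth.
one_machine = crawl_days(
    num_pages=10_000_000_000,
    avg_page_bytes=2_000_000,
    link_gbps=400,
    utilization=0.3,
    num_machines=1,
)
four_machines = crawl_days(10_000_000_000, 2_000_000, 400, 0.3, 4)
```

Under these assumptions a single machine needs roughly 15.4 days and four machines roughly 3.9 days. The point of the exercise is the order of magnitude, so rounded specs from a desk note are enough.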

  • @Sandeepg255
    @Sandeepg255 8 hours ago

    I think at 39:03, you are saying that set the visibility timeout of the message to now - crawlDelay, but visibility timeout concept is for a queue, then how are you planning to set it at message level ?

    • @hello_interview
      @hello_interview  7 hours ago +1

      You can set them at the message level with SQS! From the docs, “Every Amazon SQS queue has the default visibility timeout setting of 30 seconds. You can change this setting for the entire queue. Typically, you should set the visibility timeout to the maximum time that it takes your application to process and delete a message from the queue. When receiving messages, you can also set a special visibility timeout for the returned messages without changing the overall queue timeout.”
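
As a sketch of the per-message setting, the helper below pushes a received message's visibility out by a crawl delay. It assumes `sqs_client` is a boto3 SQS client, whose `change_message_visibility` call takes `QueueUrl`, `ReceiptHandle` and `VisibilityTimeout`; the helper name and the clamping are illustrative, and the 12-hour ceiling is the SQS maximum for a visibility timeout.

```python
SQS_MAX_VISIBILITY_TIMEOUT_S = 43_200  # SQS caps visibility timeout at 12 hours


def defer_message(sqs_client, queue_url, receipt_handle, delay_seconds):
    """Hide one received message for delay_seconds before it is redelivered.

    sqs_client is assumed to be a boto3 SQS client (or anything exposing the
    same change_message_visibility method).
    """
    timeout = max(0, min(int(delay_seconds), SQS_MAX_VISIBILITY_TIMEOUT_S))
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=timeout,
    )
    return timeout
```

A crawler worker that receives a URL for a domain it must not hit yet could call `defer_message` with the remaining crawl delay instead of deleting the message, so the same message reappears on the queue once the delay has elapsed.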

  • @zfarahx
    @zfarahx 12 days ago +1

    Another bump for the algo!

  • @bobberman09
    @bobberman09 2 days ago

    can you post the 2nd top voted one (youtube) earlier? At least written version :) Also very interested in the stock exchange question, but I see that's further down.

    • @hello_interview
      @hello_interview  2 days ago

      The written version is coming this week or early next at the latest! Almost done :)

  • @bhaskardabhi
    @bhaskardabhi 9 days ago

    Won't there be a case where the HTML is different but the hash is the same? Is that even possible?

    • @hello_interview
      @hello_interview  9 days ago

      Not worth even considering. Hash collisions are so unlikely they’re not worth discussing
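
As a rough check on why collisions can be ignored, the birthday bound limits the probability of at least one collision among n pages hashed to b bits. The values n = 10^10 and b = 256 (for example SHA-256) are assumptions for illustration; the thread does not name a hash function or a page count.

```latex
P(\text{collision}) \le \binom{n}{2} \cdot 2^{-b} < \frac{n^2}{2^{b+1}}
= \frac{10^{20}}{2^{257}} \approx 4.3 \times 10^{-58}
```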

  • @trueinviso1
    @trueinviso1 11 days ago

    I wonder if the type of content we are scraping matters, i.e. should we ignore suspicious sites or offensive content?

  • @mohitaggarwal949
    @mohitaggarwal949 11 days ago

    If we store the hash in the URL table in DynamoDB, how does it handle copied webpages, which have different URLs and the same HTML?

    • @hello_interview
      @hello_interview  11 days ago

      Check the hash before storing in S3 and putting on the parsing queue

    • @shyamvani
      @shyamvani 10 days ago

      you need to store the hash of the page contents for the url and not the hash of the url itself.
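
The two replies above amount to hashing the page contents and checking that hash before the page is stored or queued for parsing. A minimal sketch, with an in-memory set standing in for the indexed hash column and a dict standing in for blob storage; the class name, method name and the choice of SHA-256 are assumptions.

```python
import hashlib


def content_hash(html):
    """Hash the page body, not the URL, so copies on other URLs match."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


class Deduplicator:
    def __init__(self):
        self.seen_hashes = set()  # stand-in for an indexed hash column
        self.stored = {}          # stand-in for blob storage: url -> html

    def handle_fetched_page(self, url, html):
        """Store the page only if its content has not been seen before."""
        digest = content_hash(html)
        if digest in self.seen_hashes:
            return False  # duplicate content: skip storage and parsing
        self.seen_hashes.add(digest)
        self.stored[url] = html
        return True


dedup = Deduplicator()
first = dedup.handle_fetched_page("https://example.com/a", "<html>same</html>")
copy = dedup.handle_fetched_page("https://mirror.example.org/a", "<html>same</html>")
other = dedup.handle_fetched_page("https://example.com/b", "<html>different</html>")
```

This only catches byte-identical pages; a single changed character produces a different hash.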

  • @krishnabansal7531
    @krishnabansal7531 8 days ago

    I hope someone asks me Web Crawler question.

  • @annoyingorange90
    @annoyingorange90 11 days ago

    really good video but please stop panning uselessly :D appreciate ur work!