Twitter Search/ElasticSearch Design Deep Dive with Google SWE! | Systems Design Interview Question 8

Поделиться
HTML-код
  • Опубликовано: 23 янв 2025
  • НаукаНаука

Комментарии • 61

  • @Snehilw
    @Snehilw 2 года назад +5

    Amazing content as always dude. Love how much in depth you go in all of your videos! My favorite channel of all by far! Have recommended this to several friends.

  • @mickeyp1291
    @mickeyp1291 Год назад +1

    as allways great videos - 18:00 forgot all that stuff about ESs caching so thanks for the reminder, gonna reread that part in ES docs. great job knowing about lucene most of my applicants have no clue about ES definitely not that lucene is not a db but a search engine (hate the json syntax, but what can you do) again, super fun to listen to your vids and watch this content

  • @yashagarwal8249
    @yashagarwal8249 11 месяцев назад +3

    Will Search Service pull the actual documents from the DB once it receives the documents Ids from cache/search index?

  • @cc-to2jn
    @cc-to2jn 2 года назад +2

    dude u along with neetcode are my goto. Great content, and clear explanations.

  • @Ms452123
    @Ms452123 2 года назад +14

    Sshhh Man's been hiding the gun show this whole time. Giga Chad on the low

  • @kamalsmusic
    @kamalsmusic 2 года назад +3

    If we use the local index (meaning each node stores term -> [doc id's] and multiple nodes can reference the same term), does this mean we need to query all the nodes to answer a search query? How do we know which nodes have the term we are interested in if we are not partitioning by term?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад

      Yes, you have to query them all and aggregate. It's unfortunate, but there's typically too much data to shard by term as opposed to document.

    • @axings1
      @axings1 Год назад

      @@jordanhasnolife5163 could we first partition by term, then further partition into multiple shards if a single term has too much data?

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +1

    16:00 we can also use debezium (for certain databases) and that would write to kafka and listen on that topic

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +1

      I'll have to look into this! Haven't had the privilege of using Kafka during my career so haven't heard of debezium

    • @RandomShowerThoughts
      @RandomShowerThoughts Год назад +1

      @@jordanhasnolife5163 it’s pretty cool, I used it at my last company. We used debezium to capture changes from the database using the WAL, it would then write to a Kafka topic and we can read off it.
      The one downside here is that it writes all the messages into a single topic, and a single partition to ensure ordering. So the approach you mentioned of writing directly to Kafka will allow us to write to multiple partitions if needed (allowing more parallelization)

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +1

    16:00 exactly, I was thinking the same thing. Typically write to the source of truth, and use a queue to send it out to the various locations

    • @mickeyp1291
      @mickeyp1291 Год назад

      today you assume the queue is the source of truth, then you spill into s3

  • @neethijoe
    @neethijoe 8 месяцев назад +1

    Don't we need a parser/lexer service between kafka and search index that parses the tweets, hashes it to the correct partitions of the search index ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 месяцев назад

      Something like elastic search will do this for us, hence why I don't explicitly include it.

  • @shivamsinha642
    @shivamsinha642 2 года назад +2

    liked solely for the description

  • @AAASHIVAM
    @AAASHIVAM 3 месяца назад +1

    Hey Jordan, Question here:
    1. Can we have hierarchical shards. Can we have a sharding strategy to segregate nodes by search terms. i.e. aa -> going to shard1, ab -> shard2, abc to abz -> shard3, ac -> shard4 and so on.
    2. With this strategy we will be writing document ids for terms to particular shards.
    3. In case of any hotspotting we can further split it.
    4. In case of extremely popular terms like trump, we can have it span over multiple shards.
    5. We can keep track of the shards using gossip protocol or zookeeper.
    From what I understood, in our current solution, we are doing a scatter gather where we are querying each partition and aggregating it. If we have hundreds of partitions, then we will have to filter out a lot of results. I understand that we can do it in parallel, but if we are only interested in top 2 results, then we will end up fetching top 2 results locally in every partition and then filter it out.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      1) Yes you can
      2) Agreed, but you have to be smart about how this is done, if you think of twitter you have a shit ton of tweets with the word "the" in it, how do you smartly partition these?
      3) See #2
      4) See #2
      5) Sure
      Agree with your general sentiment, and agree that it theoretically makes more sense. In practice, I believe the majority of search terms that are used are extremely popular anyways, and so just keeping a local index of recently published tweets with some scoring of each local document (keep in mind relative scoring is tough as well when doing global partitioning because you need access to the whole document to do TF-IDF on a particular search term, meaning the document itself needs to be sent to potentially hundreds of shards on write).

  • @SwapnilSuhane
    @SwapnilSuhane 8 месяцев назад +1

    great depth of core search design discuss with bit comedy ;)

  • @lv0320
    @lv0320 4 месяца назад +1

    Is it recommended to partition the search index by term or by the tweet_id/user-id?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 месяца назад

      I don't think by term is realistic at Twitter's scale to be honest

    • @lv0320
      @lv0320 4 месяца назад +1

      @@jordanhasnolife5163 I see, so when a user searches for a term, the request would hit all partitions to get subset results and then they would be aggregated?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 месяца назад

      @@lv0320 Ideally, there should be some partitions that correspond to recent data. That way you don't have to hit all of them.

  • @anupamdey4893
    @anupamdey4893 2 года назад +1

    Love your content ! Keep up the good work !!

  • @maxmanzhos8411
    @maxmanzhos8411 11 месяцев назад +1

    Wrote a long comment about how a posting list (documents containing a term) is implemented as a skip-list + encoding as per apache/lucene github repo Lucene99PostingsFormat. As I was wondering why we can't use similar idea for follower/following list storage in news feed problem (from System Design 2). But it's only viable if you either store the data in Lucene (I guess no one does that with this purpose in mind) or if you have a full control over DB code, so that you can do such advanced customization over a column (also not practical).
    nice guns

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 месяцев назад +1

      Interesting, haven't heard of that data structure but would agree that it may be an overoptimization.
      Thanks, I work hard on the guns haha

  • @vinaybabu2635
    @vinaybabu2635 2 месяца назад +1

    Hey Jordan, do you have any course on Udemy ? If not highly recommend u do that

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 месяца назад

      I do not! Everything that I post I want to be free! If I'm going to sell something, hopefully it can provide some real utility to you guys :)

  • @neek6327
    @neek6327 2 года назад +1

    Hey man, qq. I was wondering if you thought it would be important in an interview to mention how we know which machine holds which partition? I was thinking maybe we could have a distributed search/index service that maintains the mappings between the partition -> machine. And that mapping could be made consistent across the “search/index service” nodes via a consensus algo or maybe zk. Does this make sense at all or am I missing something? Maybe it’s the local secondary indexes that take care of the problem I’m describing and I just don’t understand 🤷‍♂️

    • @neek6327
      @neek6327 2 года назад

      Like rather than relying on local index, if we knew which machine held which partition couldn’t we just go directly to the correct shard and perform a binary search?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад +2

      Yes you would use zookeeper or a gossip protocol to keep track of which docs are held on which partition. Though this shouldn't really matter since we have to query each partition anyways.

    • @neek6327
      @neek6327 2 года назад +1

      Hmm sorry, maybe this is going over my head. Why is it that we need to query each partition if we know exactly the partition that contains the word we’re looking for?
      Like say someone is searching the word “gigachad” and we know that machine 1 holds the partition range with that word in it. Couldn’t we go directly to machine 1 and perform a binary search there rather than querying all the shards?
      Maybe my understanding is off?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад +2

      @@neek6327 We aren't partitioning that way here - we are partitioning by groups of document Ids, not term. While in theory, partitioning by term is optimal, the reality is that there are often too many document IDs associated with one term to fit on a given machine, and as a result we have no choice really but to use local indexes on a group of documents.

    • @neek6327
      @neek6327 2 года назад +1

      Got it, that makes sense. Thanks 🙏

  • @idobleicher
    @idobleicher Год назад +1

    I liked your videos, new sub!

  • @kyabia2333
    @kyabia2333 Год назад +1

    amazing, very helpful

  • @AmolGautam
    @AmolGautam 10 месяцев назад +1

    Thanks giga bro

  • @FarhanKhan-wu3fq
    @FarhanKhan-wu3fq Год назад +2

    Did you really just "NOPQRS"ed to figure out what comes after P?

  • @raj_kundalia
    @raj_kundalia 11 месяцев назад +1

    thank you!

  • @eudaimonian9473
    @eudaimonian9473 2 года назад +1

    Gigachad42 in da house

  • @DileepBC-r2x
    @DileepBC-r2x Год назад +2

    i dont understand most of the things. but thanks for the video.

  • @LucasMarianoCarmo
    @LucasMarianoCarmo 17 дней назад +1

    5:13 lol

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +1

    Grokking the system design sucks at this question ngl, searched for a solution right after reading it

  • @sumsha18
    @sumsha18 2 месяца назад +1

    Another great video. However, if possible, could you please merge this architecture diagram with the one your created on ruclips.net/video/bOhQLr7nbhQ/видео.html That way, users would be able to correlate how we extended our architecture to cover "search" use case. It seems we have shown "Log based MB" i.e. Kafka while on the other video, we mentioned both the "In-memory" as well as Kafka in the architecture and that may confuse people. Thank you very much for sharing your wonderful and quite practical designs.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 месяца назад

      I think you might find what you're looking for in the twitter video of the 2.0 series!

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +1

    00:40 lmaooooo the day in my life as a software engineer videos are cringey af

  • @art4eigen93
    @art4eigen93 2 года назад +2

    interviewee: Api design going to be pretty tiny
    Interviewer: How much tiny?
    Interviewee: You know....