22: Recommendation Engine (YouTube, TikTok) | Systems Design Interview Questions With Ex-Google SWE

  • Published: 21 Nov 2024

Comments • 55

  • @siddharthgupta6162 6 months ago +17

    Hey Jordan, I recently joined Google as an SSE and I wanted to express my sincere gratitude for your system design videos, especially the ones comparing multiple solutions. Those comparisons were exactly what the interviewers were looking for in my feedback.

  • @WaxPaxler 6 months ago +8

    Congrats on the sponsor bro! Keep up the good work

  • @brunoalfred 6 months ago +3

    Just found your channel a few days ago, and I'm watching most of your previous videos. Liked the one on message brokers.

  • @JLJConglomeration 6 months ago +6

    Incredible ad read 😂

  • @bhaveshupadhyay6657 4 months ago +2

    My boi getting sponsors. Well deserved. To the moon 🚀

  • @zhonglin5985 6 months ago +2

    TIL what embedding is! Congrats on the sponsor BTW!!!

  • @savy6682 1 month ago +1

    Hi @jordan, great video, a few questions:
    1. In the neighbour index sharded by vector hash, are we storing entity vector v1 : [neighbour vector 1, v2, v3...] in vector form? Why do we need a neighbour index sharded by vector hash at all?
    2. You explained storing an entity's close neighbours in a max heap by distance for easy updating. Where exactly would we be doing that, in the neighbour index sharded by vector hash? Can't we store it directly in the neighbour index sharded by entity ID?
    3. In the final diagram there is an arrow from the entity DB (which I am assuming stores the entity ID to embedding vector mapping?), but in what scenario would it be fetching an embedding from the entity DB? The recommendation service seems to be calling only the neighbour index cache and DB.

    • @jordanhasnolife5163 1 month ago

      1) Why do we need a neighbor index, or why do we need it sharded by vector hash? In retrospect, I shouldn't have said sharded by vector hash; it should probably just be sharded by entity ID. The reason we need a neighbor index in general is to speed up a query we'd run all the time: "tell me the nearest neighbors of this entity".
      2) Yes, agreed, we should store it directly in the neighbor index.
      3) Agreed, it depends on how much we denormalize our data in the neighbor index. If we don't at all, we'd have to hit the entity DB for some information about what we're showing the user.

    • @savy6682 1 month ago +1

      @@jordanhasnolife5163 Thanks. In 1, what I meant was: why do we need it in two forms, vector v1 : [neighbour vector 12, v2, v3...] and entity id1 : [neighbor entity 12, e2, e3...]? But now that I think about it, your original design in this video makes more sense lol. The reason being, if we shard by entity ID, then when adding a new entity, checking and updating other nearby entity vectors (for their neighbors) might become a cross-partition write, which it would not be if we sharded by vector hash. I think this might have been your reasoning 5 months ago when you made this video. Let me know if that makes sense.

    • @jordanhasnolife5163 1 month ago

      @@savy6682 Honestly I'm leaning towards just sharding by entity ID now lol. I don't mind the cross partition write too much since it's being done asynchronously or in a batch job and isn't latency sensitive.

    • @savy6682 1 month ago +1

      @@jordanhasnolife5163 hmm I see your point.. BTW thanks for taking time addressing each comment on each video!!

  • @marksun6420 4 months ago +1

    Great video! I did try to digest and understand what you are talking about :) Still got one question: why won't sharding the vector database by the vector hash result in a hot-partition problem, the same way sharding the neighbor index by the vector hash would?

    • @jordanhasnolife5163 4 months ago

      I think that in theory it could, but most of the additions to the vector database are being done asynchronously in the background, so we have more flexibility to temporarily stop all writes and rebalance as needed. We'd want to shard in a similar fashion though, where vectors in close proximity are near one another.

  • @kevinburke3941 2 months ago +1

    Great video! Appreciate all your hard work. I have two questions about your final design:
    1. What is the difference between the entity database indexed by entity ID and the neighbor index, which is sharded on entity hash range? Are they not both storing mappings from entities to their neighbors?
    2. The drawing shows the above mentioned databases pushing data to the recommendation service, but it sounds like the recommendation service is making requests to the neighbor index to retrieve neighbors for a particular entity? Basically just curious why you drew the arrows pointing from the neighbor index to the recommendation service instead of the other way around.
    Thanks!

    • @jordanhasnolife5163 2 months ago +1

      1) The entity database is just a mapping of the actual vector itself. It is not in fact indexed by the ID of the entity, but rather by its "geo hash" (generalized to n dimensions). The neighbor index is a small cache indexed on entity ID that lists the IDs of the other vectors that are its nearest neighbors.
      2) Ah probably just a typo on my end.
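
For readers wondering what a "geo hash generalized to n dimensions" could look like, here is a minimal sketch. It assumes one common way to do it: quantize each coordinate into a grid cell and interleave the bits (a Z-order or Morton code). The function name `grid_hash`, the coordinate range `[-1, 1]`, and `bits_per_dim` are illustrative assumptions, not details from the video.

```python
from typing import Sequence


def grid_hash(vector: Sequence[float], bits_per_dim: int = 4,
              lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a vector to an integer by interleaving quantized coordinates.

    Assumption: every coordinate lies in [lo, hi]; values outside are clamped.
    """
    cells = 1 << bits_per_dim  # number of grid cells along each dimension
    quantized = []
    for x in vector:
        clamped = min(max(x, lo), hi)
        q = int((clamped - lo) / (hi - lo) * cells)
        quantized.append(min(q, cells - 1))  # keep x == hi inside the last cell

    # Interleave bits, most significant first, so vectors in the same coarse
    # cell share a common prefix and sort next to each other.
    code = 0
    for bit in range(bits_per_dim - 1, -1, -1):
        for q in quantized:
            code = (code << 1) | ((q >> bit) & 1)
    return code


near_a = grid_hash([0.10, 0.10])
near_b = grid_hash([0.12, 0.11])
far = grid_hash([-0.90, -0.90])
```

Vectors that land in the same cell get the same code, so range-sharding on the code keeps many close vectors together. Two close vectors that straddle a cell boundary can still get very different codes, so this is an approximation rather than a guarantee.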

  • @RaviMenu 6 months ago +1

    Great video Jordan! Can you do one for an ACID-based system like a digital wallet or bank? Or a combination of both, like bank-to-wallet and wallet-to-wallet, maybe?

    • @jordanhasnolife5163 6 months ago

      Where do you see the challenge here? At least to me, this initially just feels like you'll need ACID databases, or two phase commit when making a transaction between two accounts on different partitions.

  • @priteshacharya 6 months ago +1

    Great video Jordan! Learned a lot from this video. One question on the Recommendation Service -> Neighbor index flow at 41:18. Since we are sharding the neighbor index by entity_id, the recommendation service, in case of a cache miss, has to scatter and gather, right? Entities 12, 13, 62 (examples in the slide) could be in different partitions.

    • @jordanhasnolife5163 6 months ago

      They would have to fetch the neighbors for their last x watched videos. So for each of those x videos, all of its neighbors will be on the same node, but otherwise we may have to hit up to x different partitions.
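
A rough sketch of the scatter-gather described above, assuming the neighbor index is hash-partitioned by entity ID. `NUM_PARTITIONS`, `partition_for`, and the in-memory `shards` list are hypothetical stand-ins for real partitions; the point is only that x watched videos cost at most x partition lookups.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(entity_id: str) -> int:
    """Assumed placement rule: stable hash of the entity ID modulo partitions."""
    return zlib.crc32(entity_id.encode()) % NUM_PARTITIONS


def plan_lookups(recently_watched):
    """Group entity IDs by the partition that owns them (the scatter step)."""
    plan = defaultdict(list)
    for entity_id in recently_watched:
        plan[partition_for(entity_id)].append(entity_id)
    return dict(plan)


def fetch_neighbors(recently_watched, shards):
    """Read each entity's neighbor list from its partition (the gather step).

    `shards` is a list of dicts standing in for the real partitions; in
    practice each group would be a single batched request.
    """
    results = {}
    for partition, ids in plan_lookups(recently_watched).items():
        for entity_id in ids:
            results[entity_id] = shards[partition].get(entity_id, [])
    return results


shards = [dict() for _ in range(NUM_PARTITIONS)]
shards[partition_for("12")]["12"] = ["40", "41"]
found = fetch_neighbors(["12", "13"], shards)
```

Each entity's full neighbor list lives on one partition, so the fan-out is bounded by the number of distinct partitions among the watched entities.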

  • @skullTT 2 days ago

    Some other videos talk about recommendation engines from the ML angle: content filtering, collaborative filtering. What is the relationship between those and this embedding approach?

  • @hakapuu 5 months ago +1

    Great video! I see that you used a heap for new entries into the closest-neighbor index. Isn't insertion into a heap the same O(log n) as it would be in a DB index that uses B+ trees? I do understand that with the index we might need to replace multiple rows, whereas with a heap that won't happen. Is that the optimization here? Trying to understand how this speeds things up.
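
To make the heap discussion concrete, here is a minimal sketch of a bounded max-heap that keeps an entity's k closest neighbors. It is an illustration under assumptions, not the video's implementation: the class name `NearestNeighbors` and its methods are made up, and Python's `heapq` is a min-heap, so distances are stored negated to keep the farthest neighbor at the root.

```python
import heapq


class NearestNeighbors:
    """Keeps the k closest neighbors seen so far for one entity."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []  # entries are (-distance, entity_id); root = farthest kept

    def offer(self, entity_id: str, distance: float) -> bool:
        """Consider a candidate neighbor; return True if it was kept."""
        entry = (-distance, entity_id)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        farthest_distance = -self._heap[0][0]  # O(1) peek at the root
        if distance < farthest_distance:
            heapq.heapreplace(self._heap, entry)  # evict farthest, O(log k)
            return True
        return False  # rejected without modifying the heap

    def neighbors(self):
        """Entity IDs ordered from closest to farthest."""
        return [eid for _, eid in sorted(self._heap, reverse=True)]


nn = NearestNeighbors(k=2)
kept = [nn.offer("a", 0.9), nn.offer("b", 0.5),
        nn.offer("c", 0.7), nn.offer("d", 0.95)]
```

The asymptotic insert cost is logarithmic either way. What the heap buys in this sketch is that a candidate farther than the current farthest neighbor is rejected with a single comparison, and an accepted one replaces exactly one entry.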

  • @OptimizingLiving 6 months ago +1

    Hey Jordan, great content. Thank you for making these in-depth videos. Quick question: do you think we can use a graph database such as Neo4j instead of the neighbor index for faster reads?

    • @jordanhasnolife5163 6 months ago

      I think that you could, but consider this - for every vector (which is an arbitrary set of points), you'd need to create an edge to other vectors, so that you can traverse the graph. How do you decide which ones to do that for? Even then, let's imagine you could - you'd still have to run a breadth first search to find the closest vectors. I'd think that pre-caching your answers here will just about always be the fastest option.

  • @sharad20073024 6 months ago +7

    can you please share your slides as well. it will be really helpful.

    • @jordanhasnolife5163 6 months ago +3

      Planning on doing this in bulk after finishing my current series, this will be in the next 1-3 months.

  • @aforty1 6 months ago +1

    Thank you for these videos!

  • @adw6579 6 months ago +1

    Hey Jordan. What books would you recommend I read? I have already finished DDIA.

    • @jordanhasnolife5163 6 months ago

      Hey! I'd probably start reading some white papers! As for which ones, there are like 10-20 tools on LinkedIn who only post links of other people's content on their pages, hopefully one of them is decent

  • @vipulspartacus7771 6 months ago +3

    Hi Jordan, can you please share your ipad notes. Maybe they are not perfect but they serve as some sort of reference to revise

    • @jordanhasnolife5163 6 months ago +1

      Planning on doing this in bulk after finishing my current series, this will be in the next 1-3 months.

    • @vipulspartacus7771 6 months ago +1

      @@jordanhasnolife5163 Hi, don't mean to rush you but I have some important interviews in coming weeks and having your notes will really help me prep better. Can you share them in any form. I understand there can be mistakes or typos in them but I want to be able to quickly revise all the overarching concepts and designs

    • @jordanhasnolife5163 6 months ago

      @@vipulspartacus7771 Hi Vipul - understand your rush here, it will take me a few hours to properly export everything, which is the reason for the delay. I haven't sat down and done it. Additionally, once I do, I'd like to publicize that a bit, as I hope that they can help me build my following if we're being fully transparent here. My original slides contain all of the same information.

    • @vipulspartacus7771 6 months ago +1

      @@jordanhasnolife5163 Sure Jordan, I understand, I look forward to it. Once again, really appreciate the content

  • @Anonymous-ym6st 2 months ago +1

    Quick question on the naive solution: why use Parquet here?

    • @jordanhasnolife5163 2 months ago

      It's pretty nice for column compression, if we can get any, and I do believe the files should be immutable once written.
      Do you have a different proposal?

  • @alphabeta644 4 months ago +1

    Is my understanding correct that there will be as many Bloom filters in the recommendation service as there are users that connect to it? Secondly, as I keep watching more and more videos, my Bloom filter would fill up within days or maybe months. How does the system deal with a Bloom filter whose slots have all been filled by the plethora of videos I might have seen over months?

    • @jordanhasnolife5163 4 months ago

      We can have as many or as few as we want since they're just an approximation. We'd have to experiment in practice. Eventually, you just clear it, and let it get filled back up again :)
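
A minimal sketch of the "clear it and let it fill back up" idea, assuming a per-user Bloom filter of already-seen video IDs. The class name `SeenFilter`, the sizes, and the `max_fill` threshold are illustrative assumptions that would need tuning in practice; the bit array uses one byte per bit purely for readability.

```python
import hashlib


class SeenFilter:
    """Approximate set of videos a user has seen; resets itself when too full."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3,
                 max_fill: float = 0.5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.max_fill = max_fill  # assumed threshold for clearing the filter
        self.bits = bytearray(num_bits)
        self.set_count = 0

    def _positions(self, video_id: str):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{video_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, video_id: str) -> None:
        if self.set_count / self.num_bits >= self.max_fill:
            self.clear()  # too many false positives expected; start over
        for pos in self._positions(video_id):
            if not self.bits[pos]:
                self.bits[pos] = 1
                self.set_count += 1

    def probably_seen(self, video_id: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(video_id))

    def clear(self) -> None:
        self.bits = bytearray(self.num_bits)
        self.set_count = 0


seen = SeenFilter()
seen.add("video-1")
```

Clearing means a few already-watched videos may be recommended again until the filter refills, which fits the point above that the filter is only an approximation.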

  • @skullTT 2 days ago

    Besides the vector DB, which databases should we choose for the other data, including the entity history DB and the neighbor index? Can we use Cassandra for the history DB, since it is write-heavy and append-only, and a KV store for the neighbor index?

  • @Anonymous-ym6st 2 months ago +1

    At 19:00, what do we mean by "add as an index entry"? Do we keep vector1 as the index key and v2:v3:v4 (nearby) as a column, or is v2:v3:v4 the index entry? (I know little about vector DBs, but I'm trying to understand whether each vector is represented as a geohash and can be indexed as a single vector.)

    • @jordanhasnolife5163 2 months ago

      When I say create an index I just mean create a database table for each entityId to its closest entity ids.

  • @midicine2114 6 months ago +1

    Techlead catching well deserved strays

  • @sandeepreddy6295 6 months ago +1

    Hey Jordan, why an 'In-memory-broker' instead of a broker like Kafka?

    • @jordanhasnolife5163 6 months ago

      Hey! For the sake of this video, it probably doesn't have to be, but I'd say check out my video on how to design youtube

  • @ava9xx3js9j 6 months ago +1

    Hey Jordan! Are you looking to adopt by any chance? Jk. Love your content ❤
    Are you a full-time YouTuber now?

    • @jordanhasnolife5163 6 months ago +1

      Nope, still working for better or for worse lol
      Happy to adopt - you any good at cooking?

  • @jamesliu551 3 months ago +1

    Dude getting a sponsor, leaving all us dumb ass behind

  • @htm332 6 months ago +1

    Does this design account for the popularity/trendiness of a given entity? For example, if a random video from an unknown creator suddenly becomes extremely popular (happens a lot on TikTok), it should be recommended, whereas an hour earlier it was unpopular and irrelevant and so should not have been recommended.

    • @jordanhasnolife5163 6 months ago

      It does not, and good point! I think for something like this you'd want to see the Top K video, and basically keep a cache of which videos are "trending" in the last x hours to apply a score boost.
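
One way the score boost could be applied, as a sketch under assumptions: candidates arrive with a similarity score from the neighbor lookup, and a separate cache holds the IDs currently trending. The function name `rerank` and the multiplicative `boost` factor are illustrative choices; the reply above only says to apply some boost.

```python
def rerank(candidates, trending, boost=1.5, limit=10):
    """Order candidate videos by score, boosting those in the trending cache.

    candidates: dict of video_id -> similarity score (higher is better)
    trending:   set of video IDs trending in the last x hours
    """
    scored = [
        (score * boost if video_id in trending else score, video_id)
        for video_id, score in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:limit]]


ranked = rerank({"a": 0.9, "b": 0.7, "c": 0.5}, trending={"b"})
```

Here "b" overtakes "a" because its boosted score (0.7 × 1.5) exceeds 0.9. A video that is trending but not among the user's candidates would still need to be injected into the candidate set separately.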

  • @philipjung-i9v 5 months ago +2

    agree tech lead is a sham that shafted his followers (as a millionaire)

    • @jordanhasnolife5163 5 months ago +2

      I might do it too (as a non millionaire)

    • @philipjung-i9v 5 months ago

      @@jordanhasnolife5163 yes jordan responded to me! it would be an honor to be your victim, sempai

  • @jordanhasnolife5163 6 months ago +1

    To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/Jordanhasnolife/ . You'll also get 20% off an annual premium subscription.