Hey Jordan, I recently joined Google as an SSE and I wanted to express my sincere gratitude for your system design videos, especially the ones comparing multiple solutions. Those comparisons were exactly what the interviewers were looking for in my feedback.
Legend!! Congrats man and enjoy the new role!
Congrats on the sponsor bro! Keep up the good work
Just found your channel a few days ago, and I'm watching most of your previous videos. Liked the one on message brokers.
Incredible ad read 😂
My boi getting sponsors. Well deserved. To the moon 🚀
TIL what embedding is! Congrats on the sponsor BTW!!!
Hi @jordan, Great video, a few questions:
1. In the neighbour index sharded by vector hash, are we storing entity vector v1 : [neighbour vector 1, v2, v3...] in vector form? Why do we need the neighbour index sharded by vector hash at all?
2. You explained storing an entity's close neighbours in a max heap by distance for easy updating. Where exactly would we keep that: in the neighbour index sharded by vector hash? Can't we store it directly in the neighbour index sharded by entity ID?
3. In the final diagram there is an arrow from the entity DB (which I am assuming stores the entity ID to embedding vector mapping?), but in what scenario would we be fetching the embedding from the entity DB? The recommendation service seems to only be calling the neighbour index cache and DB.
1) Why do we need a neighbor index, and why sharded by vector hash? In retrospect, I shouldn't have said sharded by vector hash; it should probably just be sharded by entity ID. The reason we need a neighbor index in general is to speed up the result of a query we'd run all the time: "tell me the nearest neighbors of this entity".
2) Yes, agreed, we should store it directly in the neighbor index.
3) Agreed; it depends on how much we denormalize our data in the neighbor index. If we don't at all, we'd have to hit the entity DB for some information about what we're showing the user.
@@jordanhasnolife5163 Thanks. In 1., what I meant was: why do we need it in two forms, vector v1 : [neighbour vector 12, v2, v3...] and entity id1: [neighbor entity 12, e2, e3]? But now that I think about it, your original design in this video makes more sense lol. Reason being, if we shard by entity ID, then when adding a new entity, checking and updating other nearby entities' neighbor lists might become a cross-partition write (whereas it wouldn't if sharded by vector hash). I think this might have been your reasoning 5 months ago when you made this video. Let me know if that makes sense.
@@savy6682 Honestly I'm leaning towards just sharding by entity ID now lol. I don't mind the cross partition write too much since it's being done asynchronously or in a batch job and isn't latency sensitive.
@@jordanhasnolife5163 hmm I see your point.. BTW thanks for taking time addressing each comment on each video!!
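The neighbor index being discussed in this thread can be sketched as a precomputed lookup table keyed by entity ID. This is a hypothetical illustration (the IDs, fields, and degree of denormalization are assumed, not taken from the video):

```python
# Hypothetical sketch: a neighbor index keyed by entity ID, with optionally
# denormalized metadata so the recommendation service can skip the entity DB.
neighbor_index = {
    # entity_id -> list of (neighbor_id, distance, denormalized_title)
    "video_12": [("video_13", 0.12, "Cat compilation"),
                 ("video_62", 0.31, "Dog compilation")],
}

def nearest_neighbors(entity_id, k=10):
    """One cheap lookup at query time instead of a full vector search."""
    return neighbor_index.get(entity_id, [])[:k]
```

The point of precomputing is that the expensive nearest-neighbor search happens asynchronously at write time, while reads stay a single key lookup.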
Great video! I did try to digest and understand what you are talking about :) Still got one question: why won't sharding the vector database by the vector hash result in a hot partitioning problem, the same way sharding the neighbor index by the vector hash will?
I think that in theory it could, but most of the additions to the vector database are done asynchronously in the background, so we have more flexibility to temporarily stop all writes and rebalance as needed. We'd want to shard in a similar fashion though, where vectors in close proximity are near one another.
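A minimal sketch of the proximity-aware sharding described above, assuming a simple grid-cell scheme (the cell size and shard count here are made up for illustration):

```python
# Sketch (assumed scheme): shard vectors by a coarse locality key, here the
# grid cell each coordinate falls in, so nearby vectors tend to colocate.
def shard_key(vector, cell_size=1.0, num_shards=8):
    cell = tuple(int(x // cell_size) for x in vector)  # coarse cell per dimension
    return hash(cell) % num_shards
```

Vectors in the same cell always land on the same shard, which is what makes the neighbor-list updates mostly local; the trade-off is exactly the hot-partition risk raised in the question.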
Great video! Appreciate all your hard work. I have two questions about your final design:
1. What is the difference between the entity database indexed by entity ID and the neighbor index, which is sharded on entity hash range? Are they not both storing mappings from entities to their neighbors?
2. The drawing shows the above mentioned databases pushing data to the recommendation service, but it sounds like the recommendation service is making requests to the neighbor index to retrieve neighbors for a particular entity? Basically just curious why you drew the arrows pointing from the neighbor index to the recommendation service instead of the other way around.
Thanks!
1) The entity database is just a mapping to the actual vector itself. It is not in fact indexed by the ID of the entity, but rather by its "geo hash" (generalized to n dimensions). The neighbor index is a small cache indexed on entity ID that lists the IDs of the other vectors that are its nearest neighbors.
2) Ah probably just a typo on my end.
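The "geo hash generalized to n dimensions" can be illustrated with bit interleaving, the same trick standard 2-D geohashes use. This is a sketch under assumed value ranges and precision, not necessarily the video's exact encoding:

```python
# Sketch of an n-dimensional geohash (assumed encoding): quantize each
# dimension to bits, then interleave the bits across dimensions so vectors
# that are close in space tend to share a common prefix.
def nd_geohash(vector, lo=-1.0, hi=1.0, bits_per_dim=8):
    quantized = [int((x - lo) / (hi - lo) * (2**bits_per_dim - 1)) for x in vector]
    code = 0
    for bit in range(bits_per_dim - 1, -1, -1):   # most significant bit first
        for q in quantized:                        # interleave across dimensions
            code = (code << 1) | ((q >> bit) & 1)
    return code
```

Note the usual geohash caveat applies: vectors near a cell boundary can be close in space yet differ early in the prefix.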
Great video Jordan! Can you do one for an ACID-based system like a digital wallet or a bank? Or maybe a combination of both, like bank-to-wallet and wallet-to-wallet?
Where do you see the challenge here? At least to me, this initially just feels like you'll need ACID databases, or two phase commit when making a transaction between two accounts on different partitions.
Great video Jordan! Learned a lot from this video. One question on the Recommendation Service -> Neighbor index flow at 41:18. Since we are sharding the neighbor index by entity_id, the recommendation service, in the case of a cache miss, has to scatter and gather, right? Entities 12, 13, and 62 (the examples in the slide) could be in different partitions.
They would have to fetch the neighbors for their last x watched videos. So for each of those x videos, all of its neighbors will be on the same node, but otherwise we may have to hit up to x different partitions.
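The scatter-gather above might look like the following, assuming a simple hash-mod partitioning of entity IDs (the partition count and IDs are illustrative):

```python
# Sketch: group the last x watched videos by partition. Each video's full
# neighbor list lives on one partition, but the x videos may hash to up to
# x different partitions, so we issue one request per partition and gather.
from collections import defaultdict

def plan_fetches(watched_ids, num_partitions=4):
    by_partition = defaultdict(list)
    for vid in watched_ids:
        by_partition[hash(vid) % num_partitions].append(vid)
    return dict(by_partition)   # partition -> ids to fetch from it
```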
Some other videos talk about recommendation engines from the ML aspect: content filtering, collaborative filtering. What is the relationship between those and this embedding approach?
Great video! I see that you used a heap for new entries into the closest-neighbor index. Isn't insertion time into a heap the same O(log n) as it would be in a DB index that uses B+ trees? I do understand that in the index we might need to replace multiple rows, whereas with a heap that won't happen. Is that the optimization here? Trying to understand how this speeds things up.
The optimization is that this is in memory
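A sketch of the in-memory bounded heap idea, using Python's `heapq` (a min-heap, so distances are negated to get max-heap behavior); the value of k and the tuple layout are assumed:

```python
# Sketch: keep only the k closest neighbors in an in-memory heap keyed on
# negated distance, so the root is always the farthest kept neighbor.
# Each update is O(log k) with no disk I/O, unlike a B+ tree index on disk.
import heapq

def update_neighbors(heap, neighbor_id, distance, k=3):
    if len(heap) < k:
        heapq.heappush(heap, (-distance, neighbor_id))
    elif -heap[0][0] > distance:            # new neighbor beats current farthest
        heapq.heapreplace(heap, (-distance, neighbor_id))
    return heap
```

Note the bound k also matters: the structure never grows past the k entries we actually serve, whereas a DB index keeps every row.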
Hey Jordan, great content. Thank you for making these videos in depth. Quick question: do you think we can use graph databases such as Neo4j instead of the neighbor index for faster reads?
I think that you could, but consider this: for every vector (which is just an arbitrary point in the embedding space), you'd need to create edges to other vectors so that you can traverse the graph. How do you decide which ones to connect? Even then, let's imagine you could; you'd still have to run a breadth-first search to find the closest vectors. I'd think that pre-caching your answers here will just about always be the fastest option.
Can you please share your slides as well? It would be really helpful.
Planning on doing this in bulk after finishing my current series, this will be in the next 1-3 months.
Thank you for these videos!
Hey Jordan. What books would you recommend I read? I have already finished DDIA.
Hey! I'd probably start reading some white papers! As for which ones, there are like 10-20 tools on LinkedIn who only post links to other people's content on their pages; hopefully one of them is decent.
Hi Jordan, can you please share your iPad notes? Maybe they are not perfect, but they'd serve as some sort of reference to revise from.
Planning on doing this in bulk after finishing my current series, this will be in the next 1-3 months.
@@jordanhasnolife5163 Hi, I don't mean to rush you, but I have some important interviews in the coming weeks, and having your notes would really help me prep. Can you share them in any form? I understand there may be mistakes or typos in them, but I want to be able to quickly revise all the overarching concepts and designs.
@@vipulspartacus7771 Hi Vipul - understand your rush here, it will take me a few hours to properly export everything, which is the reason for the delay. I haven't sat down and done it. Additionally, once I do, I'd like to publicize that a bit, as I hope that they can help me build my following if we're being fully transparent here. My original slides contain all of the same information.
@@jordanhasnolife5163 Sure Jordan, I understand, I look forward to it. Once again, really appreciate the content
Quick question on the naive solution: why use Parquet here?
It's pretty nice for column compression, if we can get any, and I do believe the files should be immutable once written.
Do you have a different proposal?
Is my understanding correct that there will be as many Bloom filters in the recommendation service as there are users that connect to it? Secondly, as I keep watching more and more videos, my specific Bloom filter would quickly fill up within days or maybe months. How does our system deal with the Bloom filter basically filling up all of its slots because of the plethora of videos I might have seen over months?
We can have as many or as few as we want since they're just an approximation. We'd have to experiment in practice. Eventually, you just clear it, and let it get filled back up again :)
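A minimal Bloom filter sketch showing the "just clear it" approach from the reply above; the sizing and the salted-hash scheme here are illustrative, not from the video:

```python
# Minimal Bloom filter sketch (assumed salted-hash scheme). "Dealing with"
# a saturated filter is simply resetting the bit array and letting it refill.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        return [hash((salt, item)) % self.size for salt in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def clear(self):                # periodic reset once too many bits are set
        self.bits = [False] * self.size
```

After a `clear()`, previously seen videos may be recommended again until the filter refills, which is an acceptable trade-off for an approximate structure.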
Besides the vector DB, which databases should we choose for the other data, including the entity history DB and the neighbor index? Can we use Cassandra for the history DB, since it is write-heavy and append-only, and a KV store for the neighbor index?
At 19:00, what do we mean by "add as an index entry"? Do we keep vector1 as the index and v2:v3:v4 (the nearby vectors) as a column, or is v2:v3:v4 itself the index entry? (I know little about vector DBs, but I'm trying to understand: each vector is represented as a geohash, and can be indexed on a single vector?)
When I say create an index, I just mean create a database table mapping each entityId to its closest entity IDs.
Techlead catching well deserved strays
Hey Jordan, why an 'In-memory-broker' instead of a broker like Kafka?
Hey! For the sake of this video, it probably doesn't have to be, but I'd say check out my video on how to design youtube
Hey Jordan! Are you looking to adopt by any chance? Jk Love your content ❤
Are you a full-time YouTuber now?
Nope, still working for better or for worse lol
Happy to adopt - you any good at cooking?
Dude getting a sponsor, leaving all us dumb ass behind
Does this design account for the popularity/trendiness of a given entity? For example, if a random video from an unknown creator suddenly becomes extremely popular (happens a lot on TikTok), it should be recommended, whereas an hour earlier it was unpopular and irrelevant and thus should not have been recommended.
It does not, and good point! I think for something like this you'd want to see the Top K video, and basically keep a cache of which videos are "trending" in the last x hours to apply a score boost.
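The trending score boost could be sketched like this; the threshold, boost factor, and cache shape are all hypothetical:

```python
# Sketch: a cache of view counts over the last x hours (refreshed by a
# separate Top-K / counting pipeline) used to boost "hot" videos' scores.
trending_views = {"video_42": 1_000_000, "video_7": 120}

def boosted_score(video_id, base_score, threshold=10_000, boost=1.5):
    if trending_views.get(video_id, 0) >= threshold:
        return base_score * boost       # video is trending: bump its rank
    return base_score
```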
agree tech lead is a sham that shafted his followers (as a millionaire)
I might do it too (as a non millionaire)
@@jordanhasnolife5163 yes jordan responded to me! it would be an honor to be your victim, sempai
To try everything Brilliant has to offer free for a full 30 days, visit brilliant.org/Jordanhasnolife/. You'll also get 20% off an annual premium subscription.