My girlfriend no joke asked if you were gonna steal me from her because of how much I talk about your channel. Keep it up
I'm Mr steal ya man
At 12:55 you should push onto the heap before you pop; otherwise, if the value you push is smaller than the value you pop, your result will be incorrect.
This is true, nice catch
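For anyone following along, here's a minimal Python sketch of the fix (toy code, not the exact code from the video): with a size-k min-heap, push before you pop, so a too-small new value immediately pops right back out.

```python
import heapq

def top_k(values, k):
    """Keep the k largest values using a size-k min-heap."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        else:
            # Push before popping: heappushpop adds v, then removes the
            # smallest element. If v is smaller than everything in the heap,
            # v itself pops right back out. Popping first and pushing after
            # would wrongly evict a larger value and keep v.
            heapq.heappushpop(heap, v)
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7], k=3))  # [9, 7, 5]
```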
Great vid as always! Would be cool to see how sentence suggestions are working, how words are connected to each other etc.
@ 21:08 a short is 2 bytes, so it's 16 bits, not 8, and hence we have like 65k terms (I have to point out this minor, insignificant mistake or I can't go to bed, since I'm the internet police)
Lmao, well done, you've owned me
Really loved how you were suggesting solutions and evaluating them. Most solutions out there just touch on a naive SQL query solution and jump directly to a trie.
Hi Jordan, yet another great video!
I had a few questions
1) My understanding is we have one node per letter in the alphabet - assuming a distributed load and no hotspots - and it stores all of its children in memory, as per 36:11. So basically, if the node responsible for 'a' went down, the tree under 'a' would be redistributed to another existing node. That node would have to reconsume all messages from Spark since the beginning of time, and then it's restored. This may take some time, and until it finishes, suggestions for this letter may not work - this can be prevented with some replication. Have I understood correctly?
2) We are using Hadoop here, but I guess 4TB of memory is quite a lot, so if we wanted to do some cost saving, we could push to S3 to avoid having Hadoop nodes just consuming resources for most of the day, and then pull the data on demand for the daily batch job? But anyway, it probably wouldn't be 4TB per day - it would be 4TB the first time it runs, and afterwards it's just the per-day delta?
Love your sense of humour btw :)
Agree with 1! But yeah I'd just opt for having a few replicas of each. If one goes down you can spin up another replica and load it up from our cached data. It's immutable so this makes it pretty easy to do.
For #2, also agreed S3 probably makes more sense in practice, you can load the data into some EC2s and run a batch job or something.
Wow, a video where the capacity estimates actually matter. Really nice to see you compare these to memory amounts of client / servers.
I perform capacity estimates every weekend when figuring out how much late night food I should eat to not explode the next morning
Question: @jordan @41:21, it is unclear to me how our queries are processed with the trie distributed across multiple servers. Let's say there are only two servers: one contains the trie for words [a, b, aa,... abc] and the other for [abcd,... zzzz]. In this case, when I start typing and have typed "a", I get connected to the first server via websocket, and then when I type b and c, I stay connected and traverse the trie down to character 'c' on the first server. Now, when I type 'd', what happens? Is it correct that: a) the first server (to which I am connected) will see that 'c' has no children and thus will break the websocket connection, therefore b) the client will reconnect saying "gimme stuff for abcd", and the load balancer directs me to the second server, and c) the second server's trie has a root node of "abc", so it goes down one step to 'd' and returns the values?
Is my understanding correct above?
I'd think that at the node where the partition boundary begins on system 1, it would say something like "no longer on this system," so that the client knows it needs to reach back out to the load balancer.
This is the best video I've seen explaining type ahead, thanks a lot for making great content!
Great video Jordan. Had this question to ask:
How does the design handle the server going down that was handling the "app" (from the word apple) partition? The trie data stored on that server would be lost, so I believe the data would be replicated across multiple nodes? Also, how would the switchover take place in this scenario? Zookeeper listening to all the active/replica nodes and switching over when it stops getting heartbeats? Is that the correct understanding?
Yes, we'd want all nodes to have replicas. Zookeeper can listen to them and round robin requests to these nodes. In this case, all nodes are replicas as we don't write to any of them, they're all read only.
Question: at 40:46, why do we need something like Flink to consume the messages from Kafka if we are already sharding (in Kafka) the incoming data based on the search term range? Can you also clarify why we need an aggregation service in front of Kafka?
Overall a great video, Jordan. Thank you!
You always need something to consume Kafka messages; they can't just go straight to HDFS. In this case, we're just using Flink to store messages in memory until we get enough of them, flush them to a Parquet file, and then send it to Hadoop.
In theory, we could also do some pre-aggregation/counting in Flink too if we're really optimizing.
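A rough Python sketch of that buffer-and-flush idea - to be clear, this isn't real Flink code; consume_events and the output path are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 100_000  # flush threshold, picked arbitrarily here

def buffer_and_flush(consume_events, out_dir):
    """Buffer (search_term, count) events in memory; flush each batch as Parquet."""
    terms, counts, file_no = [], [], 0
    for term, count in consume_events():  # hypothetical Kafka consumer generator
        terms.append(term)
        counts.append(count)
        if len(terms) >= BATCH_SIZE:
            # One big columnar file per batch: the kind of large sequential
            # write HDFS is actually good at.
            table = pa.table({"term": terms, "count": counts})
            pq.write_table(table, f"{out_dir}/searches_{file_no}.parquet")
            terms, counts, file_no = [], [], file_no + 1
    if terms:  # flush any leftover tail
        pq.write_table(pa.table({"term": terms, "count": counts}),
                       f"{out_dir}/searches_{file_no}.parquet")
```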
Elasticsearch uses a form of trie called an FST. FST memory usage is significantly better than a plain trie's.
Thanks for sharing!
Since there are 1 billion searches every day, maintaining a websocket connection looks like overhead to me, because search suggestions are short-lived interactions overall.
Instead, we could simply use a protocol like HTTP/1.1 with keep-alive, or HTTP/2, and send multiple requests on the same connection with a few ms of delay.
Need opinions on this - would it work?
Sure - I'm less familiar here, but if there's less overhead to set them up than websockets and they're bidirectional, that seems reasonable enough to me.
Another great video. Thanks for being consistent and helping us all.
One question: at 35:00, you mentioned we might have to re-establish the websocket when a user starts a query with a different character than the one they started with. If we use a Redis cache to store the trie, with the web server calling Redis to return the result, would that solve this problem? Is there any downside to it (other than the extra hop from server to Redis, though we could have local web server caching as well for very popular queries)?
So we have a websocket from user to server, and then server just makes normal requests to redis?
1) unclear if the trie can fit on a single Redis node; we need to switch connections because of sharding
2) I think the web server itself is now just sending a bunch of headers to Redis, so I'm not sure about the performance improvements we get here
Realistically, I'm overoptimizing, but who knows!
@@jordanhasnolife5163 Ah yeah, makes sense. Redis would need to be stateful as well and doesn't give us any benefit; instead it would make things more complicated. Thanks.
Another great video! Thanks for making it.
I am a bit confused about the update path.
1. It looks like we are creating a new trie from the logs (containing search terms with frequencies in Kafka) instead of updating the existing trie. Let's say we want to account for the last few days of searches - then to build the trie, shouldn't we feed a copy of the existing trie (along with the recent search logs) to HDFS to calculate the top suggestions for each prefix?
2. Instead of the app server just getting data about the top suggestions for each prefix from HDFS, is it possible for us to compute the trie offline as well and then load it into the server? If yes, can you also please suggest tools to use for computing the trie offline and loading it from offline into server memory?
1) HDFS already has the last few days of data available. It doesn't have to delete that just because we computed another trie from it. You wouldn't have to send the existing trie.
2) Considering that you can't really represent a trie in a text file like that, I'm not quite sure. I guess in theory, you could compute it on one server from the HDFS data, then serialize it to JSON or something, then send it out to all of the other servers. But even then, you're just building a trie from the JSON rather than from the frequencies, which frankly has a similar time complexity.
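For what it's worth, a dict-based trie does round-trip through JSON; here's a toy sketch (the structure is entirely made up), though as the reply says, rebuilding from the JSON isn't obviously cheaper than rebuilding from the frequencies:

```python
import json

def insert(root, prefix, suggestions):
    """Walk/create child dicts per character; store top suggestions at the node."""
    node = root
    for ch in prefix:
        node = node.setdefault(ch, {})
    node["$"] = suggestions  # "$" holds this prefix's precomputed top suggestions

def lookup(root, prefix):
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])

root = {}
insert(root, "app", ["apple", "appstore"])
blob = json.dumps(root)          # serialize on the builder server
restored = json.loads(blob)      # each app server rebuilds the dict trie
print(lookup(restored, "app"))   # ['apple', 'appstore']
```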
Why not something like Elasticsearch for prefix searching with the same range-based partitioning?
It's going to be slower: that's on disk, and now I have to perform a binary search for my word rather than just traversing down a trie
@@jordanhasnolife5163 I'm not sure I agree. Elasticsearch supports term and phrase suggestions as special use cases, and it gives users control over general relevance features. I work on a search team, and our design for this feature is centered around an Elasticsearch cluster w/ special typeahead indices, an ETL from BQ to that cluster, and a service to query the cluster. I don't know if our design is the industry standard, and it depends on exactly what you're trying to do, but I think this is definitely one of the ES use cases. (Typeahead isn't just about popularity either, there could be many different heuristics you need to use to rate which suggestions are the best. There may be machine learning models involved to help determine that as well.)
Hey Jordan, thanks for this video! I'm curious about how a trie can be stored in different databases - could you share something related? Additionally, I think we could have a cache in front of the suggestion service - or is that what you mean by saying the suggestion service already includes that? Thanks!
I guess it can't really, hence our issue. You can be cheeky and use the tactic I used to store it in Spark; however, then you lose some of the nice time complexity.
As for Q2, the suggestion service is effectively already a cache - it's in memory and has cached the top suggestions for each search term.
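To make the "it's already a cache" point concrete: conceptually, the service just holds a precomputed prefix -> top-k map in memory (sample data made up):

```python
# Hypothetical precomputed map, loaded into server memory at startup
# (e.g., from the batch job's output):
TOP_SUGGESTIONS = {
    "a":   ["amazon", "apple", "airbnb"],
    "ap":  ["apple", "appstore", "apartments"],
    "app": ["apple", "appstore", "application"],
}

def suggest(prefix: str) -> list[str]:
    # O(1) lookup; no separate cache tier needed because this IS the cache.
    return TOP_SUGGESTIONS.get(prefix, [])
```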
Pretty solid video. Really loved the optimisation you discussed.
Curious why we need stream processing (Kafka -> Flink -> HDFS) to upload newly entered words to HDFS? Why can't we upload them to HDFS directly?
HDFS stores full files, not individual strings of text. We need to aggregate the queries first.
@@jordanhasnolife5163 Is it better if we use a Spark Streaming consumer instead of Flink here? We can do batching with it and write a batch to HDFS.
I think the main point is that HDFS is designed for large sequential writes (GB level) and is not good for small writes/reads, so a message queue is better for holding events with small data sizes.
This is a great video, thanks for taking efforts to explain everything in such depth.
What are your thoughts on using GraphDBs like Neo4j to store the trie?
I think that if we can avoid storing this guy on disk, we should! It's a pretty inefficient operation to jump from random spot to random spot on disk.
You should probably replace Flink with Spark Streaming, since you're already planning on using Spark downstream.
Yeah in reality I think that's reasonable, but for the sake of the systems design interview I like to be idealistic.
34:04 I don't see how the server has to be stateful. If I'm at "ap" and type in another "p", the server doesn't need to know that I was at "ap" previously, because now "app" is a new prefix query. So the server can be stateless. Is this not correct?
Yes, but then you have to traverse the trie again and ruin your time complexity.
@@jordanhasnolife5163 I think traversing the trie won't be that bad: on average we assumed terms will be 20 chars, so only ~20 node traversals on the high side, which should be quick in memory. I think the overhead of maintaining websockets will be more than what we're gaining here. Though I understand latency will be reduced, mainly because of the persistent connection and the reduced overhead of headers.
@@krish000back Practically speaking agreed it isn't so bad. At that point though, you'd perhaps not even need a trie.
@@jordanhasnolife5163 Sorry, didn't get that - we need the trie for efficient storage as compared to a hashmap of all prefixes, right? Other solutions will definitely have higher latency.
@@krish000back I could see it being the case that the hashmap of all prefixes could still be stored in memory if each term is only, on average, like 20 characters in length. Could we fit it in like 256GB of RAM? (Or you can just partition it and not maintain websockets.)
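A quick back-of-envelope on that 256GB question - every input below is assumed for illustration, not taken from the video:

```python
# All numbers are assumptions for illustration.
terms = 100_000_000            # distinct search terms
avg_len = 20                   # avg characters per term (from this thread)
prefixes = terms * avg_len     # upper bound: one hashmap entry per prefix
top_k = 5                      # suggestions stored per prefix
bytes_per_suggestion = 20
bytes_per_entry = avg_len + top_k * bytes_per_suggestion  # ~120 B, ignoring hashmap overhead
total_bytes = prefixes * bytes_per_entry
print(f"{total_bytes / 1e9:.0f} GB")  # ~240 GB -- in the ballpark of that 256GB figure
```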
Your jokes make the grind slightly less terrible :))
Why can't we use AJAX for the query service instead of websockets? Wouldn't it be faster in this case, and our servers could also be stateless?
Our servers are inherently not stateless. They need to keep a pointer to their current location in the trie, or else we won't have constant time operations when a user types an additional character.
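A tiny sketch of that per-connection state (my own toy code, not the video's): the websocket session holds a cursor into the trie, so each keystroke is one O(1) hop instead of a re-walk of the whole prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}          # char -> TrieNode
        self.top_suggestions = []   # precomputed top-k for this prefix

class Session:
    """Per-websocket state: a cursor into the trie -- this is the statefulness."""
    def __init__(self, root: TrieNode):
        self.cursor = root

    def on_char(self, ch: str) -> list:
        # One hop per keystroke; without the cursor we'd re-traverse the
        # whole prefix (O(prefix length)) on every character typed.
        if self.cursor is not None:
            self.cursor = self.cursor.children.get(ch)
        return [] if self.cursor is None else self.cursor.top_suggestions
```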
Top tier videos! Can you do Design a Parking Lot?
I already have!
Use public transport, man - it's getting costly to park cars these days!
@@alphabeta644 😂
Thanks for making the video. It was interesting and helpful.
Great video! Keep up the good work, I really enjoyed it.
Merry Christmas 🎄🎁
Same to you!
The calculation of the runtime at 13:40 is wrong.
Runtime is O(log(n) + m), where n is the total number of words in the dictionary and m is the number of words matched by the query.
I assume you're using this to account for the actual computation cost of finding the start and end of where we're reading. Good point.
That being said, for the m piece I think it would be m * log(k), where k is the number of typeahead suggestions to return, as we need to formulate our heap.
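A sketch of where the log(n) and the m * log(k) come from, assuming the terms sit in a sorted array with a parallel frequency array (my toy setup, not necessarily the video's):

```python
import bisect
import heapq

def top_k_for_prefix(sorted_terms, freqs, prefix, k):
    # Binary search for the range of terms sharing the prefix: O(log n).
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")  # end of prefix range
    # nlargest keeps a bounded heap of size k over the m matches: O(m log k).
    idxs = heapq.nlargest(k, range(lo, hi), key=lambda i: freqs[i])
    return [sorted_terms[i] for i in idxs]

terms = ["app", "apple", "apply", "banana"]  # must be sorted
freqs = [50, 100, 30, 999]
print(top_k_for_prefix(terms, freqs, "app", k=2))  # ['apple', 'app']
```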
So many open questions left, need more clarity.
It's a good idea to leave a comment like this and then not ask any questions.
A follow up video with a few more concepts would be great
1. We may not always need to reconnect a new websocket. That layer can be decoupled from the suggestion service
2. How do we handle updating the trie without impacting read performance? I'd guess something like updating a replica async and pointing traffic to it when ready?
3. What can be cached on the client vs. always queried from the backend
4. Personalization / contextual results
@@jordanhasnolife5163 It's an even better idea to make a video titled "Type Ahead Suggestion" and then not explain any core concepts related to type ahead suggestion, just going over a few simple tree diagrams.
@@lv0320 1) agree 2) agree, store a second copy and switch over when done 3) pretty much anything and everything can hopefully be cached on the client, which speaks to 4) we can perhaps do some of that personalization piece locally (in the case of typeahead, likely tougher if we wanted to tailor search results)
@@crazeeealgorithms3236 I think you'd be looking for videos not related to systems design interviews in that event.