🔍 Full Write Up + Bonus Section (What Top Tech Interviewers Really Want to See) → www.techprep.app/system-design 🎯
Nice video. One correction: when you're showing the number of possible URLs (at around the 2:45 mark), you have shown it as 1^62, 2^62, etc. when it should be 62^1, 62^2 ... 62^7
Great spot thanks for that!
Lol yeah, 1^62 = 1
Yeah, that's very correct.
So, it's now like 62⁷, so that we get the total number of ways it can be written.
Still some insane number.
yeah this just confused me lol
Correction: Number of possible URLs (2:45 mark) should be:
- 62^1
- 62^2
...
- 62^6
- 62^7
S/O to @mohisham for spotting!
the way to calculate this, in my eyes, works like this. I will take a simple example: 3 characters, a,b,c (instead of 62), each URL will have 2 characters.
We have (ab) (ac) (ba) (bc) (ca) (cb). This is basically Arrangements of 3 taken by 2 => A(3,2) = 3!/(3-2)! = 6
If we accept also combinations like (aa) (bb) (cc) then we will add also those 3 unique possibilities, meaning we have 9 possible arrangements, thus we have the formula:
Total URL number=> A(n,k) + n
Where in your case n=62 and k=1...7.
?
Isn't it simpler with 62^n, meaning 62 * 62 * ... * 62, with n times 62? As an example, for one letter I have 62 choices, for two letters 62 * 62, so 62 choices for the first letter and for each letter 62 choices for the next letter, so we multiply, which is in short 62^2. And that can be extended for any n.
@@querela92 "Isn't it simpler...". Hmm... you want simpler or precise?
"And that can be extended for any n"... actually it can't.
Let's take a larger number and let's see where your mistake is:
We have 62 elements that will be taken by 7.
Your calculation will be: 62^7 = 3.521.614.606.208
But in reality it will be 62 + A(62,7) = 2.478.652.606.142
This means you will end up paying for space, transactions, cache, etc. on a server larger than you really need, by 1.042.962.000.066.
This is about 1.04 TB of data you will never need.
This is the difference between mathematics and "let's make it simple".
Exactly what I was thinking, thanks for the correction.
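For what it's worth, the multiplication argument above is easy to sanity-check by brute force for a tiny alphabet; a minimal Python sketch (the names here are just for illustration):

```python
from itertools import product

# Strings of length k over n symbols, repeats allowed, number n**k.
alphabet = "abc"                      # n = 3 stand-in for the 62-character set
for k in (1, 2, 3):
    count = sum(1 for _ in product(alphabet, repeat=k))
    assert count == len(alphabet) ** k
    print(k, count)                   # prints: 1 3, 2 9, 3 27
```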
I am not at a point in my career where I need to be an expert in system design, but by following such high quality content, I believe I'll definitely be an expert when the time comes. Amazing video, and thank you for providing this content for free!
I believe this specific problem is a perfect match for the Actor model. The way that I would solve this is by going with one of the popular actor frameworks (akka or orleans). Both of these frameworks provide out of the box the functionality that you get from the load balancer, zookeeper and distributed cache. All you need is your actor cluster and a persistent storage. The persistent storage can be a database (SQL or NoSQL) or a cloud storage account (e.g. aws bucket, azure blob etc.). I believe this approach will reduce the operational complexity and cost significantly while achieving similar results in terms of scalability and fault tolerance.
I really like the idea of a zookeeper to keep track of the short url identifiers. I have never thought of it. And now it's mine. :D
etcd can be another one, simpler than zookeeper.
it seems a bit overcomplicated to have to reach out to some service to get some count to generate a URL. There's a 1 in 3 trillion chance that some random shortened url would collide. I think it's reasonable to just generate the url at random and try to save it. If it exists in the db, just try again. The user can wait the extra amount of time the server needs to re-generate the url in the tiny chance that case arises
That, or pregenerate x urls in a shared queue and just pop one when needed..
that chance increases the more urls are stored. After only a year you'll have 31B urls stored. If you continue to generate a URL at random, you'll then have a 1/100 chance of a collision.
@@awepicness This is also assuming urls don't expire.
I suggest you look into the Birthday Paradox
Just adding a single character in length drops that probability to 1/620, so random URLs definitely seem to be the way to go here, simplifying the design. That's a 4E-9 probability that you won't be able to insert a URL within three attempts, basically negligible. You can opt to fail the request after 3 failed insertion attempts and it wouldn't measurably impact your service's reliability.
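A minimal sketch of the random-code-with-retry idea discussed above; `db.insert_if_absent` is a hypothetical conditional insert (e.g. backed by a unique-key constraint), not anything from the video:

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_letters   # 62 characters
CODE_LEN = 7
MAX_ATTEMPTS = 3

def random_code() -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LEN))

def shorten(long_url: str, db) -> str:
    """Generate a random code and retry on the (rare) collision."""
    for _ in range(MAX_ATTEMPTS):
        code = random_code()
        if db.insert_if_absent(code, long_url):   # hypothetical: fails if code already exists
            return code
    raise RuntimeError("could not allocate a short code after retries")
```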
Nice, I wasn't aware about the zookeeper component. Thanks for that! keep up with this good content!
This is the best TinyURL system design video. It's so clear and straight forward. keep up with this good content!!!
Appreciate it!
Zookeeper seems redundant given a database sequence would have accomplished the same, and would be a minuscule load compared to the R/W activity for the urls themselves.
This explanation also glosses over the non-functional requirements for HA and low latency, since the concern for geographically distributed and redundant deployments was not addressed.
instead of zookeeper why not simply have the distributed servers pick the starting letter, e.g. server 1 can pick tiny urls beginning with 0-9, server 2 urls beginning with a-f, server 3 urls beginning with g-m, etc. That way you don't have to deal with having another point of failure; if one of those servers dies it's fine, you'll simply not produce a certain range until that server comes back up.
What happens if you want to add additional servers at a later point?
@@mariusheller8231 Do the first two characters or more. You can update the existing servers to then also enforce the longer prefix
only scale those servers? @@mariusheller8231
I like this, but wouldn’t this mean the state of “last generated url” is stored in-memory in the application servers? What happens when the servers are redeployed? That state would be wiped out and there’s no way for each server to resume from where it left off
@@mariusheller8231 Presumably you'd have a way to change the ranges for each server later.
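A rough sketch of the prefix-partitioning idea in this thread, assuming a static prefix assignment per server; as the replies note, the in-memory counter is lost on redeploy unless it is persisted somewhere:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

class PrefixedGenerator:
    """Each server owns one leading character, so codes from different
    servers can never collide; the remaining six characters come from
    a local counter encoded in base62."""
    def __init__(self, prefix: str):
        assert prefix in ALPHABET
        self.prefix = prefix
        self.counter = 0              # in-memory only: wiped on redeploy

    def next_code(self) -> str:
        n, digits = self.counter, []
        for _ in range(6):            # fixed-width base62 body
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        self.counter += 1
        return self.prefix + "".join(reversed(digits))
```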
2:11 Don't overlook these estimates folks. This is likely the main reason you are being asked this question. There are lots of traps if you just pick a random hashing function and cut parts of it at that volume. The rest of the question is pretty straightforward
Besides service failure, there are regular service updates during development of new features, which will kill all instances of a replicated service and invalidate a potentially large number of address ranges. Personally, I prefer using hashing algorithms for such tasks and dealing with collisions on writes, since lower latencies are typically needed only during reads, and reads are much more frequent. For example, you can calculate a hash of the incoming long URL and cut it to the required length, which is secure, pretty robust and prevents duplicates in the database.
Once you cut the hash to the required length there is no longer any guarantee that it's unique. We are talking about 1000 new URLs per second, so "pretty robust" isn't going to cut it.
Dropping that many keys seems like a strange solution, but even if you reboot 1000 servers every day for a year, you still have more than enough keys left for years to come. But if it could be a problem, you could also ask for a range of 1 to 100.000 and lose a lot less. Also keep in mind that you don't actually lose the entire range, because some have already been used. You could also schedule a server to go down instead of just dropping it, so that it will reboot once all keys are released. But still, I don't think it's really an issue to lose those keys.
I do agree with the latency problem mostly being a factor on reads. That is when a user manually creates a new URL and has to wait a second after clicking post. But for generating links via an API the latency is important both on insert and read.
@@degosoft With a hash there is no uniqueness guarantee anyway, this is only about collision chance and key distribution. That's why I mentioned collision management. So, by "pretty robust" I mean "robust enough for most cases". Of course, there is no silver bullet, all solutions have pros and cons. I'm just talking about the way I prefer to solve the problem in real-world applications. And there is one more thing about using tiny URLs in real-world applications: they are often not intended to live forever. In some cases, a shorter URL with a lifetime of, for example, one year may better suit your needs, since the information at the source URL will no longer be valuable at that point. In this case, reclaiming space with hashed URLs will be a bit easier, just a matter of removing old entries from the database by creation date. Another advantage of this approach is the ability to set any tinyURL length at any time, even by request parameter. And, of course, the absence of a fail-proof zookeeper cluster reduces the structural complexity of the system.
@@configadd Nothing is guaranteed, but the chances of a sha256 collision are so incredibly small that it's practically non-existent. And the chances get even smaller when we only use URLs as the input. But if you discard 25 of the 32 bytes in the output to get to a 7-byte shortcode, then you increase the odds by a lot. And bytes aren't even useful, so we need to encode the output bytes to something readable first, which makes the default length even longer and thus you drop even more precision.
Don't get me wrong, hashing is great and I've used it a lot for similar projects. But if "SHORT" is the requirement then hashing is not the way to go.
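A hedged sketch of the hash-and-truncate approach being debated here (hash the long URL, keep 7 base62 characters, salt and retry on collision); the byte-to-alphabet mapping has a slight modulo bias and the db calls are hypothetical:

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def hash_code(long_url: str, attempt: int = 0, length: int = 7) -> str:
    """Hash the long URL (plus a retry counter acting as a salt) and keep `length` chars."""
    digest = hashlib.sha256(f"{long_url}#{attempt}".encode()).digest()
    return "".join(ALPHABET[b % 62] for b in digest[:length])   # slight bias, fine for a sketch

def shorten(long_url: str, db) -> str:
    for attempt in range(5):                 # retry with a new salt on collision
        code = hash_code(long_url, attempt)
        stored = db.get(code)                # hypothetical lookup
        if stored == long_url:
            return code                      # same long URL already shortened: reuse it
        if stored is None:
            db.put(code, long_url)           # hypothetical insert
            return code
    raise RuntimeError("too many collisions")
```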
thank you for the content, I remember going through that problem in an interview when I was a junior.
Interestingly enough, I got asked this a few weeks ago in an interview.
Thanks much for the design. Easy to understand and not much complexity in solving what the end user really needs..
Thanks!! That's exactly what the aim is 👍
if browser redirection is the only requirement, why not just keep all shortened urls as files on a CDN and expose that? You could have just one js line that tells the browser to change location, and if you want analytics you could just add another few lines to make a request to the analytics service. By picking a wide enough range you don't really need to worry about uniqueness, and you can pick the short urls at random and not check if they're already in use
Two suggestions:
1. Instead of a Counter or Zookeeper, you can use a hash function that takes the URL and produces a 7-char string. Thus you don't need to solve the problem of counters and Zookeepers, and it's more secure.
2. An additional point from the interviewer might be to add the ability to use custom short URLs. This is really a good point to discuss
Great suggestions!
How would you handle collisions?
@@sixoclock406 For MD5, the probability of just two hashes accidentally colliding is approximately 1.47*10^-29
That would not work due to collisions.
@@tejinder83 What's the probability of collisions of a good hashing function? (MD5 for example)
Would it make sense at all to derive the shortened URL from the long url (hashing and salting etc.)? It may increase the chances of collision with such short URLs, but it would resolve the count cache issue, because if we use those numbers as salts, it doesn't really matter if the count cache returns the same value twice.
In fact, if we can get away with a pure hash, you can reduce duplicates in the database, because multiple people shortening the same site always get the same result.
A salt could help with resistance, but cannot prevent collisions at a reasonable level. So you could never fulfill a request to create a new URL with an accidental collision.
then how about appending a random string to the url before hashing, and if it collides then just redo it
Instead of Zookeeper why not use a hashing algorithm to hash the long url down to the short url? This would not only make the short url (nearly) unique, it would also always give the same shortened url for the same long url, which avoids duplicate entries in the database
The only way to prevent collisions with the hashing method is to make the url longer which is counter-productive to the goals of the service.
Hashids would def. be a good idea to use :)
This data doesn't change, so the caching is a lot simpler. In fact, I would forego Redis altogether and just hold a cache on the web servers themselves. That's fewer computers you need to pay for and fewer points of failure.
This doesn’t work well when scaling very wide. If we had 100 webservers, we’d be caching the data up to 100 times. A centralized cache is a much more effective solution
@@drewbrown1534 That makes sense, I think it would also be better to handle the LRU strategy that we may adopt to keep just the most used URLs in cache.
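In case it helps, a small in-process LRU along the lines this reply suggests (the capacity is an arbitrary assumption); a centralized Redis cache would take its place when running many web servers:

```python
from collections import OrderedDict

class LRUCache:
    """Keep only the most recently used short->long mappings in memory."""
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, code: str):
        if code not in self._data:
            return None
        self._data.move_to_end(code)        # mark as recently used
        return self._data[code]

    def put(self, code: str, long_url: str) -> None:
        self._data[code] = long_url
        self._data.move_to_end(code)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```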
I think you got the correct answer for the number of unique combinations given the number of characters, but the formula is wrong. 1^62 and 2^62, for example, are incorrect formulas, but the answers are correct.
Excellent video! Could another single point of failure be each webserver? For example, during a GET request, if the webserver handling the range is down, the client won't be able to get the long url.
My suggestion is to use a cluster of webservers covering each range. Each cluster would manage itself with a consensus algorithm like Raft or Paxos. That would make each webserver more available while maintaining consistency and resistance to network partitions. This would require more webservers, but it could be argued that since we're using only 7 or so webservers now, having 3-4 webservers in each cluster won't make too much of a difference in cost. We could also increase the range of each cluster, but then we might be sacrificing throughput; it's just a tradeoff between the cost and throughput now.
Great comment!
A single point of failure (SPOF) is one that causes the entire system to go down. You're correct that if a server goes down while responding to a request, that request will fail. However, it is not a SPOF, as the system will still be up: if the user were to send another request, it would simply be routed to a different server.
Having clusters of servers could be a great extension to the system. However, it does introduce more complexity that could be argued is over engineering the solution without adding a significant improvement to the overall system. But as you already mentioned it’s all about tradeoffs!
That just sounds like DynamoDB with extra steps.
I think you made the assumption that web servers are limited to their own range for responding to GET short-url requests, but that isn't said anywhere. The zookeeper gives them a range of free IDs to use for shortening urls, but each web server can accept any existing short url and will contact a cache and if not in there, the database to get the long url.
So, if a web server dies, retrying the request would just be delegated to another web server that will return the correct long url. Even for shortening, if a web server dies and you retry the POST, you'll get a new shortened url from another web server (with an ID from another range)
So the zookeeper only serves to assign a range to the server? How does the server track which values in the range are used, by counting?
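As a guess at how that bookkeeping could look: each server only needs the bounds of its leased range plus a local counter, and asks the coordinator for a fresh range when the lease runs out (`lease_range` is a hypothetical client call, not ZooKeeper's actual API):

```python
import threading

class RangeAllocator:
    """Hand out sequential IDs from a leased [start, end) range."""
    def __init__(self, coordinator):
        self.coordinator = coordinator                       # hypothetical range-leasing client
        self._lock = threading.Lock()
        self._next, self._end = coordinator.lease_range()    # e.g. (3_000_000, 4_000_000)

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:                      # range exhausted: lease a new one
                self._next, self._end = self.coordinator.lease_range()
            value = self._next
            self._next += 1
            return value
```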
Really nice video, will definitely watch more of your videos, keep it up!
what is the best type of database to use for this? relational or non-relational?
Great explanation. I'm looking forward to more content from your channel!
Appreciate the support! Could I use this comment on the website?
@@TechPrepYTYes.
I don't get the count cache, what's the purpose of it?
It's not discussed really clearly in the video imo. Here was my guess: If two incoming POST requests come in at the same time, and there are multiple web servers handling requests, it is possible for both to generate the same key and cause a collision. At a TPS of 1000, there is a high likelihood of collision. Thus, the count cache and the zookeeper are ways to avoid this scenario of collisions between different webservers.
suggestion for alternative to zookeeper: you could have the create-url endpoint generate a random short URL to match to the long URL, to remove any count cache or zookeeper, so long as the space is big enough there should be no collisions
Could you please clarify how the conversion from a base-10 number to a base62, length-7 string would work?
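Not the video's exact code, but a plain repeated-division sketch of that conversion; with this digit order (0-9, A-Z, a-z), `encode_base62(1_000_000)` gives "4C92", and you can left-pad with "0" if you want a fixed 7-character code:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def encode_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)       # peel off the least significant base62 digit
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def decode_base62(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(1_000_000))                          # 4C92
print(encode_base62(1_000_000).rjust(7, ALPHABET[0]))    # 0004C92
```

Note how sequential counters yield adjacent codes, which is exactly the predictability concern raised further down the thread.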
Nice video! It would be nice if there were also basic videos explaining, for example, what a zookeeper is. Clicked on subscribe!
I would say for 1k writes per second this is kind of overkill.
A single RDS instance could handle this without breaking a sweat, instead of needing things like a zookeeper.
If you implemented this with zookeeper I would recommend no hire. Way over engineered. Hash the url, salt on collision, store in dynamo and cache with redis.
Thanks for the video. Would it be a valid assumption that you don't really need to create the short URL during the request? I.e. you could run your 7T numbers ahead of user requests, create the short URLs and store them, waiting for a long URL to come along and just plop that in? That would solve a few points immediately: 1. Faster processing 2. No need to keep the zoo of web servers and maintain them. 3. You could mix them non-sequentially, which would alleviate the attack issue
Thanks, just learning here! Curious what you think
I guess the only point of failure is that your speed of making those short URLs should be faster than your hypothetical speed of using those URLs, though I'd expect that there would be a very big head start initially. I guess this is where you would need to add more web servers
Thanks for the comment!
It’s an interesting proposition alright. In terms of generating the short URL, that takes a negligible amount of time and so precomputing that wouldn’t have any noticeable impact on performance.
Not sure I fully understand the second part of the comment. Are you suggesting storing all the short URLs on a single server?
Having all the short URLs on one server, while convenient, introduces a single point of failure, and that's why horizontal scaling is recommended. And then, once there is more than one server involved, a service for maintaining distributed synchronisation is typically recommended.
It’s all about trade offs but typically a single point of failure is something you always want to avoid.
What is the point of useless counting? Are you counting with fingers? Things are supposed to be extensible, right?
Hi! In the first naive solution, why are you using a load balancer if we only have one web server?
Nice, keep posting please!
It would be great if we got a hands-on coding tutorial based on this system design. Is there any chance you will make one?
It's on the roadmap!
For what it's worth, as a Staff+ software engineer, if you submitted to me a "URL Shortener" task as depending on Zookeeper, you're not getting the job, since you're introducing a huge amount of operational complexity which can be avoided by changing the algorithm you use for generating URLs (e.g., you could use a single bit for a server ID value - this is what Snowflake IDs do)
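To make the Snowflake reference concrete, here is a rough, hedged sketch of that style of coordination-free ID (the epoch and bit widths are arbitrary assumptions); note the ~58-bit IDs encode to roughly 10 base62 characters rather than 7:

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000   # assumption: custom epoch (2020-01-01)
SERVER_BITS = 5                # up to 32 web servers
SEQ_BITS = 12                  # up to 4096 IDs per server per millisecond

class SnowflakeLikeIds:
    """Unique IDs with no coordination service: timestamp | server id | sequence."""
    def __init__(self, server_id: int):
        assert 0 <= server_id < (1 << SERVER_BITS)
        self.server_id = server_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self._last_ms:
                self._seq = (self._seq + 1) % (1 << SEQ_BITS)
                if self._seq == 0:                 # sequence exhausted: wait for the next ms
                    while now <= self._last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self._seq = 0                      # note: ignores clocks moving backwards
            self._last_ms = now
            return (now << (SERVER_BITS + SEQ_BITS)) | (self.server_id << SEQ_BITS) | self._seq
```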
What about checking if the long URL already exists during a POST request?
What about hashing the input url and use that to create the short url?
You could take the md5 sum of the content and create a custom encoding for it using the 64 chars
md5 is 45 characters in hex and 25 chars in base64. The benefit of using a hash function would be to eliminate the zookeeper
the whole point is to have a short url, 25 chars would be way too long
A random suffix is a poor solution to the security aspect. You can have unpredictable URLs with a minor compute overhead simply by encrypting the range-based number to get the actual URL. This will never introduce a collision. Your only concern should be to avoid publicising the key you use for encryption.
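One concrete way to do what this comment describes is a keyed Feistel permutation over the ID space (my substitution for "encryption"; the key, bit split and round count are assumptions). It is a bijection, so no collisions, and cycle-walking keeps results inside the 62^7 code space:

```python
import hashlib
import hmac

KEY = b"server-side-secret"            # assumption: kept private
HALF_BITS = 21                         # 42-bit domain, covering the ~3.5T 7-char codes
MASK = (1 << HALF_BITS) - 1
MAX_CODE = 62 ** 7

def _round(value: int, rnd: int) -> int:
    digest = hmac.new(KEY, f"{rnd}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") & MASK

def permute(n: int, rounds: int = 4) -> int:
    """Bijectively scramble a 42-bit value with a balanced Feistel network."""
    left, right = n >> HALF_BITS, n & MASK
    for r in range(rounds):
        left, right = right, left ^ _round(right, r)
    return (left << HALF_BITS) | right

def scramble(n: int) -> int:
    """Map a sequential counter to an unpredictable value in [0, 62^7) without collisions."""
    m = permute(n)
    while m >= MAX_CODE:               # cycle-walking keeps the mapping a bijection on the range
        m = permute(m)
    return m
```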
This is a bad design.
You're teaching people to create bloated systems by adding zookeeper when it's not needed.
And some math is wrong.
This system is much more bloated and more expensive to maintain than it should be because of the bad design.
great video, great content :) subscribed
(but pls fix clicking sounds of laptop hearing on onboard mic)
It's fixed! Thanks for the recommendation!
My idea for generating URLs would be to hash the URLs first, convert the hexadecimal output to base 62, and then crop to however many characters you'd want in a URL. Then, servers can check if that short link already exists in the database. If it exists under a different long url (= a collision), then add a special character and re-do the hash, remembering to strip that special character later when entering it in the database. Is there any huge flaw in this method? I'm a data scientist, so this is absolutely not my area of expertise.
One advantage that I *can* see over the method in the video, if I've understood it correctly, is that my method is more secure: it's pretty much impossible for someone to move characters around and find other URLs. Also, not that it matters much, but it also avoids wasting database space. I'm sure as a public service tons of people would be shortening links like google or the rickroll, meaning potentially thousands or even millions of short links would be wasted in our database when a single one would suffice, potentially slowing down performance.
I think this solution is assuming a random distribution or close to random distribution for inputs. This probably doesn't hold true
@@Squirvel That's the beauty of hashing functions, it doesn't have to assume anything about the distribution. If you change a single character, the whole hash is different.
I have a question:
Instead of using 7 characters for 10 years, why not add one more character to make it work for 6 centuries? Is it too computationally expensive?
Great. Well, I have a doubt here. Most of them enforce authentication for the long url redirects, so it won't have any attacks from unauthorised users?
How about removing the DB step completely? We'd probably expect the URL entry to be gone after a configured TTL, high enough to let the user forget about the URL. This makes me wonder whether the db step is needed at all; possibly it would be better to ask another question -- How long should the URL entry live? :D
Imagine going through chat history and seeing that link had expired 😢. Or embedded link stops working.
why not shard the counters? each is doing a set of ranges? and if some are offline you just lose some ranges.
the counters could just be balanced by round-robin dns
Thanks a lot!
hi, I'm a junior and interested in learning these kinds of things. Where do you think I should start?
Depends on what you want to learn!
Ok I like baldy, great content
Why not use a timestamp with server id for url generation? My first thought is to use timestamp + server id hash. Can someone explain this to me
I'd separate the Website (URL Creation) and the routing. This would decouple the performance requirements for each type of server.
Great! Can you share the system design PDF book with everyone?
Yes, the text version (with the slides) of the tutorials will be available on the website soon!
Right off the bat - why not use BASE64 URL variant? Why only 62 chars? BASE64 is already there. Hash, encode as BASE64 - done. Without inventing new custom "standards".
You don't want special characters in your shortened URL, right? So base64 wouldn't work. Remove the special characters from base64 and you now have base62
@@blesswind584 There is a URL variant of BASE64
You forgot to check whether the created tiny url is unique or not
Isn't it better to use a PUT instead of a POST for idempotency?
Creating a shortened URL isn't necessarily idempotent.
great
You've failed the interview. It's not a "super simple REST API". It's just a HTTP API - the "create" in your URL is a big hint that you're not following REST design principles. There's nothing wrong with just doing a super simple HTTP API.
You should consider tech interviews to be like an exam and be precise in your usage of terminology. What is REST could easily be a warm up question. It's the sort of thing many people don't know!
I’m curious, I can’t find anything online that states that you can’t have “create” in a REST API. Have you got any good sources on REST APIs that I can read?
@@Ziimzy Fielding's original paper doesn't mention everything that is not REST, but it does tell you how. URLs are supposed to identify resources and you're meant to manipulate those resources through the verbs already provided. Beyond that, Googling for "verbs in url not rest" gives tonnes of info
@@Ziimzy You don't use the keyword create because, by definition, if it is a POST endpoint, the fact that it is a POST-type request means it is a creation.
@@alexiso2740 Ah I see, that makes sense
how does adding a random suffix protect the URLs from being predicted? Can you explain with an example?
Let's say you are making url number 1,000,000 and convert it into base62; then it would be 4C92. If you are going sequentially, I can tell that the next number is 1,000,001, which I could predict and try as the next link with 4C93. If you add a random suffix, it gets harder to do that.
0:45 seconds into the video and you don't know what a REST API is.
Sad.
Your audio has some low freqs
I think that’s the sound of his keyboard. He should probably invest in a microphone arm to keep it off the desk.
How is the collision happening, I don't get that part
The count caches are distributed. It's essentially a random number generator...
No more than 1 server instance can return a duplicate number
shkeama?
6:50 Cassandra is a row-oriented database, not column? Not sure why it has been used as an argument to emphasize the choice.
no bro. this is not REST. REST must NOT contain verbs. Dislike from soyjak
What is this math?
Szkima XD
I think it's called a map in a C++ program, running on a toaster hosted in your mom's basement: 10 trillion req a day.
What a waste of resources. All of that for what, handling like at most 500 clicks per second?
He mentioned in the beginning: 1000 new URLs created per second, 10000 queries per second.
Amazing crisp video!!
Thank you!
Do I need to become a Network Engineer to be a Web-developer??
No, just understand the key networking concepts like the OSI stack and troubleshooting.