Capacity Planning and Estimation: How much data does YouTube store daily?
- Published: 14 Jan 2020
- Back-of-the-envelope calculations are often expected in system design interviews. They help us logically state the parameters influencing a result, and estimating capacity requires multiple intermediate estimates along the way. They also let us state each of our assumptions explicitly.
Eg: Estimate the hardware requirements to set up a system like YouTube.
Eg: Estimate the number of petrol pumps in the city of Mumbai.
Chapters
00:06 Storage Requirements
01:20 Supplementary storage requirements
03:54 Back of Envelope calculations
05:38 YouTube caching estimation
08:58 YouTube video processing estimation
12:14 Conclusion
------STORAGE
Let's start with storage requirements:
About 1 billion active users.
Assume 1 in 1,000 users uploads a video each day.
Which means 1 million new videos a day.
What's the size of each video?
Assume the average length of a video to be 10 minutes.
Assume a 10 minute video to be of size 1 GB. Or...
A video is a sequence of images. 10 minutes is 600 seconds, and each second has about 24-25 frames, so a video has roughly 25 * 600 = 15,000 frames.
Each frame is about 1 MB, which means (1.5 * 10^4) * (10^6) bytes = 15 GB.
Even this estimate is wildly inaccurate (it ignores video compression entirely), so we must either revise it or hope the interviewer corrects us. A normal 10-minute video is about 700 MB.
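The gap between the frame-by-frame estimate and a real upload comes almost entirely from video compression. A quick sketch of both numbers, using the frame-rate and frame-size assumptions from above (the 700 MB "observed" figure is the typical upload size mentioned here, not a measured value):

```python
# Naive estimate: store every frame as an uncompressed ~1 MB image.
FPS = 25
SECONDS = 10 * 60      # a 10-minute video
FRAME_MB = 1

naive_mb = FPS * SECONDS * FRAME_MB   # 15,000 MB = 15 GB
observed_mb = 700                     # typical size of a real 10-minute upload

print(f"naive: {naive_mb / 1000:.0f} GB, observed: {observed_mb} MB")
print(f"implied compression factor: ~{naive_mb / observed_mb:.0f}x")
```

The two-orders-of-magnitude gap is why frame-counting is a poor way to size video storage.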
As each video is of about 1GB, we assume the storage requirement per day is 1GB * 1 million = 1 PB.
This is the bare minimum storage requirement to store the original videos. If we want to have redundancy for fault tolerance and performance, we have to store copies. I'll choose 3 copies.
That's 3 petabytes of raw data storage.
What about video formats and encodings? Let's assume a single encoding, MP4, and that alongside the 720p upload we also store 480p, 360p, 240p and 144p versions, each roughly half the size of the one above it.
If X is the original storage requirement = 1 PB,
We have X + X/2 + X/4 + X/8 ≈ 2X.
With redundancy, that's 2X * 3 = 6X.
That's 6 PB (processed) + 3 PB (raw) ≈ 10 PB of data per day. That is about 100 high-capacity hard drives, assuming roughly 100 TB per drive, and the cost of this system is about $1 million per day.
Over a 3-year plan, we can expect a storage bill on the order of $1 billion.
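The whole chain above can be written down in a few lines. Every constant is one of the assumptions stated in this section; this is a sketch of the estimate, not YouTube's real numbers:

```python
users = 1_000_000_000
upload_ratio = 1 / 1000                       # 1 in 1,000 users uploads per day
videos_per_day = int(users * upload_ratio)    # 1 million new videos/day

raw_gb_per_video = 1                          # assumed ~1 GB per 10-minute video
raw_pb = videos_per_day * raw_gb_per_video / 1e6   # 1 PB/day of originals

replication = 3                               # 3 copies for fault tolerance
encoded_factor = 2                            # X + X/2 + X/4 + X/8 ≈ 2X

raw_total = raw_pb * replication                        # 3 PB
encoded_total = raw_pb * encoded_factor * replication   # 6 PB
print(f"{raw_total + encoded_total:.0f} PB/day")        # ≈ 9 PB, i.e. ~10 PB/day
```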
Now let's look at the real numbers:
Video upload rate ≈ 3 * 10^4 minutes of footage per minute.
That's 3 * 10^4 * 1440 ≈ 4.5 * 10^7 minutes of footage per day.
Video encoding can reduce an hour of film to about 1 GB, so the requirement is about a million GB, i.e. roughly 1 PB.
So our original estimate is in line with what the real numbers say.
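Cross-checking the estimate against the reported upload rate (a sketch; the ~1 GB per encoded hour of footage is the assumption from the line above):

```python
upload_min_per_min = 3e4                       # reported minutes of footage uploaded per minute
minutes_per_day = upload_min_per_min * 1440    # ≈ 4.3e7 minutes of footage/day
hours_per_day = minutes_per_day / 60           # ≈ 7.2e5 hours

gb_per_hour = 1                                # assumed encoded size per hour of footage
pb_per_day = hours_per_day * gb_per_hour / 1e6
print(f"{pb_per_day:.2f} PB/day")              # ≈ 0.7 PB: same order as the 1 PB estimate
```

Being within one order of magnitude of the real figure is exactly the success criterion stated above.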
If we are off by one order of magnitude, that's fine. However, being off by three or more orders of magnitude is too much. In that case we should highlight:
where our assumption was wrong, or
which factor we didn't take into account.
References:
Designing Data-Intensive Applications - amzn.to/2yQIrxH
highscalability.com/youtube-ar...
• Seattle Conference on ...
Numbers everyone should know: • Building Software Syst...
• Scalability at YouTube
en.wikipedia.org/wiki/Back-of...
Capacity planning with AWS: • A quick how-to on capa...
System Design Course:
interviewready.io/
Along with video lectures, this course has architecture diagrams, capacity planning, API contracts and evaluation tests. It's a complete package.
Use the coupon code 'earlybird' for a 20% discount.
System Design Playlist: • System Design for Begi...
Become a channel member!
/ @gkcs
You can follow me on:
Facebook: / gkcs0
Quora: www.quora.com/profile/Gaurav-...
LinkedIn: / gaurav-sen-56b6a941
Twitter: / gkcs_
#CapacityPlanning #SystemDesign #YouTube
9:30 I am confused. 10^7 min / 60 converts into hours, right? Then dividing by 3 is wrong, because 10^4 * (1000/60) is what you want to compute, and 1000/60 is far from 1/3, so you should get 1000 processors, not 20.
Damn, I think this is it. Couldn't even find the bug during editing.
Thanks for this Prashant!
@@gkcs Pinned Comment? Can I apply for the job now? Gaurav Sen Pvt Ltd :D
@@prashantgupta6885 Hahaha 😁
@@prashantgupta6885 now you can't, your comment is now unpinned lol 😂
@@4n81t Dunno how that happened. Pinned it again 😁
I was interviewed at YouTube recently, and this was the exact question I was asked. I gave a similar reply. Love your solution, and the fact that you uploaded this video! Subscribed!
Thanks 😁
I was really looking for a way to calculate number of processors based on the bandwidth estimation. And there you have it. Thanks man! Love it. :)
This is great stuff...!!! So good to see these videos being accessible easily here.
Thanks for the upload gaurav :) , thanks for the tips to approach such problems
Your brainstorming videos on designing systems and infrastructures are really helpful.
Thanks!
[1:30, 2:00] Hi Gaurav - the part where you account a multiplier for the storage requirement due to replication across data centers is really smart! I haven't seen this mentioned in many books.
So nicely he explains concepts..!!
Thank you so much for gr8 info..!!
Seeing u after so long..!!
Thank you 😁
Gaurav, thank you for your elaborate work! Cheers 😌
This is one of the best video 😍
In terms of system designing 🙏
Great. Keep up. I like your way of expressing things
Great work. Keep up doing good work like this.
Your uploads are informative, good job man.
😁
Legend!🙇♂️🙇♂️, You are inspirational gaurav, Thank you for the Amazing content!!❤
Love you bro. you are always there with something new and different from other youtubers. you are real. ❤❤❤❤
Great estimation Gaurav Sir
@10:19 - You mentioned the processor has to read data from somewhere and write back to some place, right ? Reading happens from the same storage ( 30 TB without HA ) in to cache and then back to the same storage, if not , would it require more storage than the number you came up with before ? I could be assuming wrong here.
Wow! Very impressive explanation
that was Incredible explanation. GJ
It is a really a good conceptual video, Always like the concepts you pick and showcase
Thanks 😁
Awesome video! Binge-watching all your videos, Gaurav bhaiya. Can you please do a video on creating a good resume for students in Tier 1 and Tier 2 engineering colleges who want to join product-based companies? I mean, what type of projects do we need on the resume, etc.
Needed this so badly, Thanks again for the awesome video..:)
Thanks for this video Gaurav. Could you also help us understand why SQL was chosen as the DB for YouTube, considering this large scale of data and the performance requirements?
Congratulations on 200K gaurav....
your 100K to 200K journey was pretty fast don't you think?..
Your videos are fun and informative and they help us alot. Love your content.
Keep going like this...All the best!!
In the above situation we are taking 10^7 minutes, which assumes each video is played only once a day. But multiple users play a video at the same time, so we have to multiply by an average concurrency. Say each video is played by 1,000 users simultaneously; then the total is 10^7 * 1,000, and the processor count changes accordingly.
By the way Great explanation. Thanks
Bang on, this capacity estimation is very accurate and detailed. However, I would avoid it in an actual system design interview, since the estimation will take almost 10-15 minutes of your time.
That said, it's important to go through the whole video to capture the essence and use the required details in your interview.
True, I would only estimate the capacity if I had to justify my architecture or if the interviewer specifically asked me to.
Great video Gaurav!!
And yeah, it would be 1000 processors, as 10^7 minutes is 166,666 hours and not 10^4/3 hours.
Thanks Nishit!
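A hedged sketch of the corrected processor math from this thread. The 10^7 minutes/day of footage and the ~1 GB per hour of footage are the video's figures; the 2 MB/s per-machine processing rate is purely an illustrative assumption chosen to show the order of magnitude:

```python
footage_min_per_day = 1e7
footage_hours = footage_min_per_day / 60       # ≈ 166,666 hours/day (not 10^4/3)
gb_per_hour = 1                                # assumed size per hour of footage
gb_per_day = footage_hours * gb_per_hour

seconds_per_day = 86_400
mb_per_second = gb_per_day * 1000 / seconds_per_day   # ≈ 1,900 MB/s to keep up

per_machine_mb_s = 2          # hypothetical per-machine processing throughput
machines = mb_per_second / per_machine_mb_s
print(round(machines))        # ≈ 965: on the order of 1,000 machines, not 20
```

Under these assumptions the count lands near the ~1000 processors the corrected comments arrive at; a different per-machine rate shifts the count proportionally but not the order of magnitude.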
I am in awe!
Please upload more such videos! It's much better to calculate, make mistakes and reach the answer than to cram and google for the answers to questions like this!
😁
Thanks for all the system design videos. Wanted to suggest a topic if you get the chance: how to design Zoom or Facebook/YouTube Live video.
I can see that your Math is spot on.
At 9:30, why does the 10^7 get split into 1000 and 10^4, and then you seem to just drop the 1000 portion? I understand that 1000 * 10^4 = 10^7, so why was the 1000 (or 10^3) dropped?
Hey, big fan here, thanks for such amazing tech concept videos. Just wanted to ask: how do you gain so much in-depth knowledge of every technology in a short span of time? Do you go with books or some other resources? If possible, can you share some resources or links? It would be much appreciated. Thanks!
They are based on my experience and highscalability blogs 😁
I've mentioned my sources here: ruclips.net/video/bBPHpH8aKjw/видео.html
Looks like answering questions for a job interview with YouTube.
Very well explained.
9:30 What were you thinking while dividing 10^7 by 3? Just want to know your thought process, though that's wrong.
Thoughtful .. great
Thanks for video!
You are Welcome!
Awesome insight! You should get a job easily in silicon valley
Good one 👍 Do we really need to store all quality videos? Can't we store only the high quality one and convert it on the fly as reads come in? Though I am not sure if it is possible, I was curious to know.
Doing that would require lots of ad-hoc CPU or GPU for transcoding the video depending on the quality, so it's only feasible if we have all the different resolution versions transcoded beforehand.
It's mentioned 'That's 6 PB(processed) + 3PB (raw) == 10 PB of data. About 100 hard drives.' Is it ok to assume a hard drive that has 100TB storage capacity?
The way you think is superb. How can I develop thinking skills like yours? Please give some tips.
Hi Gaurav, the content is really informative. I have a small confusion about the cache requirement calculation for each thumbnail @7:12: you multiplied by 1M (10 KB * 90 * 1M). What is the 1M signifying?
1B users / 1000, since 1 in every thousand uploads a video, as explained at the start of the video.
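Putting this thread's cache numbers together (10 KB per thumbnail, 90 days of retention, and 10^6 new videos per day are the video's assumptions; the 16 GB cache-node size is the one mentioned elsewhere in the comments):

```python
thumb_kb = 10
days_cached = 90
videos_per_day = 1_000_000         # 1B users / 1000 uploaders

cache_bytes = thumb_kb * 1000 * days_cached * videos_per_day
cache_gb = cache_bytes / 1e9
print(f"{cache_gb:.0f} GB")        # 900 GB of thumbnails to keep hot

node_gb = 16                       # assumed RAM per cache node
print(f"≈ {cache_gb / node_gb:.0f} nodes minimum, before replication")
```

With replication and headroom the node count grows by a small constant factor, but the estimate stays under a couple of terabytes of cache.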
The computational power and storage you estimated are just for uploading. If you take into account delivering the videos, serving ads and providing recommendations, the calculation is off by several orders of magnitude.
Yes. I've kept it simple for the interview. There's a lot more than we can talk about in an hour (recommendations, trending tab, analytics etc...)
Hello Gaurav,
Thanks for great explanation, very clear and informative. 👍
Though I have one question: when we want 20 seconds/second, we moved directly to 20 processors (actually it will be in the thousands 😉), but the CPU core count and hard disk type are not considered, and those will also impact the count, right?
Let's say CPU cores help us with concurrent connections, threads and processing power, and with multiple HDDs/SSDs we can read data in parallel.
Can you share your thoughts on the impact of CPU cores, HDDs and SSDs on the number of processors?
Yes, CPU cores will have an effect on the system. If we use 4-core processors, we could use 20/4 = 5 processors.
A GPU would also have a similar effect to the calculations.
I was wondering when a creator would upload this video... thank you!
Got lots in progress :)
Hi @Gaurav Sen,
Your system design content is really amazing. Can you please create a video on game system design (e.g. PUBG, Clash of Clans, Pokemon GO) and how they manage millions of users at the same time?
I didn't understand the 500 nodes thing. What was the 64 * (3 * 2) about? Can someone explain it to me?
assumptions: key to progress further. Really helpful.
😁
Little confused: the video started with the estimation of "how much storage per day" and you calculated 10^7 mins per day, but at 5:37 you mention the actual number is 10^7 mins of video per minute. That's off by a factor of about 1500. So 600 hrs of new video per minute comes to almost 180 PB (1500 * 120 GB/min), which seems way off from the 1 PB assumption. Am I missing something?
Which books do you read related to computer science?
3:54 it's written as the daily video total (per day)... 5:30 it's mentioned per minute. Am I reading that correctly, or am I wrong?
Do they store different quality videos separately, or is there a technique like: store only the highest quality, and when the user needs low quality (in case of slow internet) reduce the resolution of a copy and send that (2:34)? I don't have much knowledge in this field.
They store different qualities and resolutions separately, although Zoom works similarly to your idea. Have a look at "scalable video encoding".
@@gkcs ok thank you for replying : ) this course really very good
Little confused. At 5:36, we had assumed 10^7 min for 1 day, not per minute, right?
GOAT!
Hello Gaurav,
For the Third part:
When we had already estimated 30 TB per day to be stored in the first section of the video, why do we again estimate the data to be processed per second?
It can just be 30 * 10^6 MB / (24 * 60 * 60) ≈ 350 MB/sec.
That would have been a faster method, good catch 😁
Also would have avoided me making the mistake, probably.
@@gkcs @Manish But I noticed one difference: the 30 TB storage he calculated for storing videos assumes the videos are already processed, at 200 MB/hour, while for counting processors he deals with unprocessed videos at 1 GB/hour. So you would see a difference of 5 times.
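The shortcut from this thread, written out. The 30 TB/day figure is from the storage section, and the 5x raw-vs-processed factor is the one the reply above points out (200 MB/hour processed vs 1 GB/hour raw):

```python
daily_storage_mb = 30e6            # 30 TB/day of processed video
seconds_per_day = 24 * 60 * 60

mb_per_second = daily_storage_mb / seconds_per_day
print(f"{mb_per_second:.0f} MB/s")                 # ≈ 347 MB/s of processed output

# Raw input is assumed ~5x larger (1 GB/hour vs 200 MB/hour processed).
print(f"{mb_per_second * 5:.0f} MB/s of raw input")
```

Dividing a daily total by 86,400 seconds is often the fastest sanity check on any per-second throughput estimate.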
I would like to watch the system design of Gmail. Will you do it?
One mistake that I could find in the description:
What's the size of each video?
Assume the average length of a video to be 10 minutes.
Assume a 10 minute video to be of size 1 GB. Or...
A video is a bunch of images. 10 minutes is 600 seconds. Each second has 24 frames. So a video has 25*600 = 150,000 frames.
Each frame is of size 1 MB. Which means (1.5 * 10^5) * (10^6) bytes = 150 GB.
Here, a video will have 25*600 = 15,000 frames, not 150,000. Hence, the total size would come to around 15 GB.
Moreover, you failed to take compression into account.
I believe compressing images and videos can greatly help save storage space; plus, YouTube will definitely have figured out an optimized way of compressing, storing and extracting the original file at lower cost.
That could change the whole scenario.
For interviews, it can be safe to assume a compression ratio of 0.7.
🔥🔥🔥
It gives some estimate of how many resources are required. Excellent! I'm thinking of calculating Instagram's or Facebook's resources 🤓
Great 👍
Dishant Kapoor are you a professional developer?
Can you tell me, sir: if I search any word on Google, how does Google know what kind of word was searched and show the exact result within a second?
This question was asked by an interviewer.
Can you help?
Hi Gaurav, must say, amazing video tutorials. In this estimation case I have a doubt: I feel that while calculating estimates we should consider formats (MOV, MPEG4, AVI, WMV, MPEG PS, FLV, 3GPP and WebM) * resolutions (1028, 520, ...). My assumption is that for each format and resolution YouTube will store one video: No. of videos = No. of formats * No. of resolutions. Let me know if you feel it's a right assumption.
That's a good point.
I have mentioned the different resolutions, but there may also be different formats similar to how Netflix processes videos.
Without any writing down I estimated 1PB before watching your solution, seems to be roughly in the correct order of magnitude.
Nice :)
7:17 Why did you multiply by 1M here?
Edit: From 9:31 onwards it was so confusing, I didn't get anything.
Please create one on CAP theorem and explain some non relational db design like Mongo including its drawbacks.
I have one on CAP theorem coming up soon :)
I didn't get where that 1M came from at 7:38, can anyone please help me understand?
The total cache requirement for thumbnails should equal the videos uploaded in the last 90 days plus evergreen videos, and we are assuming 1 thumbnail to be 10 KB, so it should be 10 KB * (number of videos in the last 90 days). Is it because we assumed 10^6 videos to be uploaded per day?
Maybe this is covered in one of your videos but what's the most efficient way to check which cache in the 160 nodes of 16GB data has the actually cached stuff. Can there be sharding or something similar inside the cache or like a loadbalancer for the cache?
Horizontal partitioning on caches is a good idea. Have a look at consistent hashing: ruclips.net/video/zaRkONvyGr8/видео.html
@@gkcs Got it! Thanks! And thanks a lot for the quick response!!
You know what, you should launch a full-fledged course on system design on Udemy or another platform. How many agree?
It's here 😛
get.interviewready.io/courses/system-design-interview-prep
Is it for absolute beginners? If not, could you suggest a great course for beginners?
90% savings is a lot to assume, I think; on average, for video files this number should be around 50-75%.
Hello sir,
Your videos are really informative for interviews. I would request you to make one on an end-to-end pipeline with the big data ecosystem: data storage issues, handling streaming data, breaking up microservices etc., to get a clear concept of where in reality we can use all this stuff.
Thanks a ton for such amazing videos.
Good idea, I'll add this to my list 😁
Hey Gaurav, loved your content. How do we get in touch with you? Maybe an email would be more than enough.
Sir is this series for freshers?
Hey Gaurav, At 7:17, what are you adding the 1M for? Aren't we getting 10 KB times 90 days of videos in the thumbnail?
The million is for the number of videos per day. 0:15
How did you come up with 1 billion users at the start?
Another thing we should learn is how to estimate the number of users for the software you are developing, or for the system you are designing in an interview.
Could you make video about uber-eats system design,please?
Why do we need to store lower format resolution explicitly separate from high resolution. Can't it be generated or sampled down during streaming using a filter ?
Well, not yet. Currently, multiple resolutions is the way to go. Variable scale encoding is advancing fast though.
How do you stay motivated all the time?
And energetic!
You are a great teacher.
Could you please turn on the auto-generated caption functionality ?much appreciated
Hey Ken, which language are you comfortable with?
@@gkcs English. I'm sorry, I just sometimes can't understand what you say because of your accent. I don't mean to discriminate, and I tried hard to understand. No offense, sorry, I have to be honest.
Could you please upload a video about string matching algorithms?
I have: ruclips.net/video/XJ6e4BQYJ24/видео.html
8:49 : why are we multiplying 64 with 3 * 2?
At 9:44, how does 10^7 become 10^4 GB?
Thanks for the videos with precise explanation. Can you also cover yugabyte, docker and kubernetes in system design videos in the future
Working on them 😊
can someone explain where the 1M comes from in 7:18?
Sir can you explain how Zomato app works
Is there a particular reason why you chose memory caching (RAM)? In another video (ruclips.net/video/U3RkDLtS7uY/видео.html), you mentioned that a cache is "usually on SSD". Or is caching on RAM the same as caching on SSD? Anyhow, thank you very much for the useful video.
About TikTok system design -> next video please
TikTok...I have to install the app first :)
Don't do it @ Gaurav Sen
@Gaurav Sen, could you please post the correct calculation for 9:30 onwards in the video?
I'll leave that as an exercise to you. The answer is in the pinned comment btw
Hey! Are you really coming to MIT ADT UNIVERSITY on 24th January?
Yup 😁
You are calculating 1 min of video using the per-frame image size, which is a wrong estimate, as videos are not stored in an uncompressed format. With H.264, compression is very high, so 1 min of video can be stored in 2-5 MB.
Google is coming with website for coronavirus crisis! Please please do system design of that! It will be super hot topic I am predicting!
2:39 Why 2X? I get that by combining all the possible quality sizes we get X, but as we are keeping 3 copies, shouldn't it be 3X?
I don't get it either, I got confused here too 0_0, the rest was good.
I think each second has 30 or 60 frames. I never heard 24 🤔. By the way, love your videos, and correct me if I am wrong.
Thanks!
You should Google this instead of commenting :P
You haven't heard that most movies are shot at 24 fps by default?
You are probably thinking of 30-60 fps in games. Movies and videos are usually 24 fps.
Taking into account compression format like h.264 and h.265 can improve the estimation too
To be honest, the videos sizes I have after editing are about 700 MB for a 10 minute video.
The 400 MB per hour estimate is a bit risky, but passable in an interview where we are estimating everything anyway :P
I think the interviewer should fix the maximum resolution at least
Naah, defeats the purpose of estimating in the real world.
5:00 Using image size to estimate video size is wrong, as compression algorithms don't store each image completely; rather, they store something like the difference between consecutive images.
How does run-time video conversion work? If we store a single video format in the highest quality, then each time a request for a specific resolution comes from the user side, how does the conversion work? If I store the same video in different resolutions, it is not good for storage.
Check the other comments before posting. This has been answered already :)
@@gkcs thanks for making videos on such topics 👍
@@gkcs provide the link, there is nothing i found like that.
Thanks for the great video. I think it could be improved if you went a bit slower. Also, the changes in the video scenes (which you did to shorten the video) are a bit distracting. Never mind :)
Thanks for the tips!
By any chance, are you on instagram? i've been looking for your account couldn't find anything
It's applepie404 :P
Your calculation (for storage requirements) is for daily uploads, while the actual report is per minute. :p
Sir, please tell me the system design of apps like TikTok, Hello, ShareChat and the like.
Check out the system design playlist in the description :D