My 2¢. AWS is very popular so using those references will have the broadest alignment. But also, when you work at Google, unless you're in the Google Cloud org, you generally don't deal with picking services... Google internally is a dream of tools and frameworks, all perfectly interconnected. So much that you don't think about where things are going, they are all handled quite nicely without worrying about scale. Most apps there are built for internal use (audience at most about 200K users).
This video is a perfect example of how things should be explained. The way Mark has explained entire design is commendable. Kudos to the guy interviewing for being so patient and polite.
Just the fact that the interviewer could shut up and listen to the answer makes this interview great. There is nothing otherworldly about design interviews; not much has changed or been invented in recent decades. The only issue in my experience is that people can't just sit and listen: they'll be constantly asking questions, breaking up the train of thought. I'd say it's a tutorial for the interviewer and not the other way around.
Are you talking about the interviewer asking questions, or the interviewee? It's great for the interviewee to continually ask clarifying questions - it's more annoying if the interviewer is constantly asking questions, but there still needs to be a dialogue.
I think they will cut you off if you are rambling about something they are not interested in, or rabbit-holing, and they are pulling you out of it and redirecting you where they want you to go. It's useful to take cues when they cut you off and listen for hints about what exactly they want. Also, the more they talk the better for you, kind of, unless you are screwing up and not talking at all.
A few things I did alongside while understanding Mark's POV: 1. I would usually introduce DNS geo routing earlier, to route to the nearest LB. 2. It's also worth having a metrics collector that keeps track of hits for the top 100 songs (or emerging hits, depending on BI) per region in some form of max heap, and then a scheduler that periodically walks through them to ensure the nearest CDNs are hot-loaded / prewarmed with them. Reading from S3 is very slow, and I would usually look for alternatives to chunk-reading into an instance's memory. Packet round trips can be costly, especially for streaming. 3. I also split the durable storage into two: user data storage (less frequently used in comparison) and song metadata storage. This way the DBs can be fine-tuned for their workloads. 4. If I mentioned S3, I would also mention cross-region replication, just to touch on it and indicate that I was thinking about a DC going down entirely.
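A rough sketch of the metrics collector from point 2, assuming plays arrive as (region, song id) events. The class and function names are made up for illustration, and `heapq.nlargest` stands in for the max heap; a real collector would also need time windows and persistence.

```python
import heapq
from collections import Counter, defaultdict


class RegionalHitTracker:
    """Counts plays per region and reports the hottest songs in each."""

    def __init__(self):
        # region -> Counter mapping song_id -> number of plays
        self._hits = defaultdict(Counter)

    def record_play(self, region, song_id):
        self._hits[region][song_id] += 1

    def top_songs(self, region, k=100):
        # Heap-based selection of the k most played songs.
        # Ties on play count are broken by song id so the order is stable.
        counts = self._hits[region]
        best = heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], kv[0]))
        return [song_id for song_id, _ in best]


def prewarm_plan(tracker, regions, k=100):
    """What a periodic scheduler job might compute: songs to push to each region's CDN."""
    return {region: tracker.top_songs(region, k) for region in regions}
```

The scheduler would call `prewarm_plan` on a timer and hand each list to whatever prewarming mechanism the CDN offers.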
Hi @jai_ver_rb17, in my interviews, trying to add more details (your point 2) sometimes took me longer, and after 35 minutes I felt a time crunch. That was handled nicely here, and I've learned it over time. Do you have any tips on how you balance detail without wrapping up in a hurry? It's a fine balance.
I'm still a junior, but I remember some classes that featured system design, and watching this interview brings back a lot of memories. What I also love about it is the "doing things from scratch" part. When you're dealing with system design, it usually means you're creating something new (a new app, a new service, etc.), and that's always an exciting endeavour.
@adennis200 congrats on our most liked comment! We're actually looking for a new Host, it would be about 15hrs work per month, would you be interested?
It's always best to start the interview answer by defining the "Functional Requirements" (FR) and the "Non-Functional Requirements" (NFR) needed for the design. NFRs could look like this for this design: 1. low latency 2. high availability 3. secured connection, etc. This helps to flush out points of failure and bottlenecks early in the design.
Pretty sure most interviewers asking you to design a system are going to all expect the same NFRs because that's just the way of the world. If it's not low latency and high availability, then it's just not going to be a good product.
@bradfordsuby8064 That's not always true. It's trade-offs. Maybe consistency is more important than actual HA and super low latency, and then you have to figure out what HA means to the interviewer. The CAP theorem, and the superior PACELC, speak to this.
Agree. The non functional requirements are the most important ones for mid-senior/senior level developers as they are the ones that make a large scale system work flawlessly.
It was nice that the interviewer just listened, and the interviewee presented a simple design. However, in real FAANG interviews, especially for senior roles, you're expected to go into more detail, and the interviewer usually challenges your decisions.
@@noobgam6331 why consider DRM? Our concrete use case was finding and playing music. It has nothing to do with DRM. But if we were talking about the upload part, then it would make sense.
What this interview really shows is that you don't need to know every detail of the eventual solution (Spotify is much more complex than this), but for the solutions you do choose to introduce, you should be able to explain why they are needed in the most understandable way.
That's interesting to watch. The design looks very similar to the one I produced during an Amazon Interview with the Load Balancing, Cache and Server Geo Localization. I was feeling good as well about my interview however I have failed it. The most dramatic part about failing the interview is that we do not have any feedback on our mistake to improve on. The only mistake I see is that at the last test, I did not write down one of the requirements and when I finished coding the interviewer told me he said the opposite about this particular requirement and I had nothing to back up / verify who was right. So if you are about to go through an interview, lay down on paper all the requirements, validate them and then proceed to the coding part. Good luck out there.
System design is more about communication and collaboration rather than coming up with a solution that 100% works. Most likely you were not communicating as well as you should be, or not collaborating as well as you should be.
Like many others I was designing my own version alongside Mark's, and I think the area he was a little weaker in (which he himself admits) was load balancing. My background is systems administration, so I may have a different perspective on this. Going back a few steps, I think chunking the data also serves an important role in the load balancing process. I would have songs chunked, from every retrieval source, so that as soon as the user presses Play, the song begins playing. Playback should always be an instantaneous process unless the servers are over capacity, which can occur because some song or album has gone viral. I would structure the web server load balancing so that client apps attempt to contact the server geographically closest to them first, and utilize GSLB (global server load balancing), which combines layer 4 and layer 7 load balancing, as I/O capacity or concurrent connections (the two metrics I would prioritize) reach a threshold. Again, when talking about load balancing, it's important to determine what happens when maximum capacity on all servers is exceeded. When this happens in my design, the system will issue "tickets" for data chunks, served in the order they arrive. This is where song chunking comes into play. Because we are chunking the MP3 data, we can still serve the first chunk of the song from the nearest available server as soon as that I/O is available, further ensuring near-instantaneous playback upon pressing the Play button. The rest of the song then has some time to download and cache to the client device, reducing the number of interruptions and pauses in playback due to bandwidth and concurrent connection overages.
@@kento8453 Yeah, think of surge queues. Surge queues are essentially placeholders for pending connections that build up when load-balanced services are overloaded. Amazon's Classic Load Balancers, for example, hold pending requests in a surge queue and reject ("spill over") further requests once that queue is full. With a combination of chunking and surge queues with spillover protection, you can continue servicing requests and the impact is only mildly noticeable from a client perspective.
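The "tickets" and surge queue idea from the two comments above could be sketched roughly as below. This is only an illustration: `SurgeQueue`, its parameters, and the three status strings are invented here and do not mirror any load balancer's actual API.

```python
from collections import deque


class SurgeQueue:
    """Bounded waiting line in front of a server with limited chunk capacity."""

    def __init__(self, capacity, max_waiting):
        self.capacity = capacity        # chunk requests served at the same time
        self.max_waiting = max_waiting  # tickets allowed to wait before spillover
        self.active = 0
        self.waiting = deque()          # tickets in arrival order
        self.spilled = 0                # requests rejected because the queue was full

    def submit(self, request_id):
        """Return 'serving', 'queued', or 'rejected' for a new chunk request."""
        if self.active < self.capacity:
            self.active += 1
            return "serving"
        if len(self.waiting) < self.max_waiting:
            self.waiting.append(request_id)
            return "queued"
        self.spilled += 1
        return "rejected"

    def finish(self):
        """A chunk finished; hand its slot to the oldest ticket, if any."""
        if self.waiting:
            return self.waiting.popleft()  # slot stays occupied by the next ticket
        if self.active > 0:
            self.active -= 1
        return None
```

Because each request is a single chunk, slots free up quickly, so a queued first chunk should not wait long before playback can start.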
Why do you need to chunk the data instead of just streaming it? Streaming already sends data over time in minuscule chunks, so you can play the song immediately and don’t need to “find and assemble” the other chunks. Especially since each chunk takes time to download, but stream bits are instant one after next, how would chunks be a better solution here?
@@simvo7802 Uhh streaming isn't a zero I/O operation that spends no time finding and assembling. Quite a bit of resources are involved in streaming, unless someone has created a perpetual motion machine already that I wasn't aware of. In this house we obey the laws of thermodynamics! Please don't delete your comment btw. You can reference it later sheepishly. We all are at different stages in the learning process, and a bit of retrospection can be refreshing.
15:00 It's important to mention that the reason for separating the storage is not really data im/mutability (you can actually update data in blob storage too). It is primarily about how inefficient it would be to store 5 MB blobs in a general-purpose OLTP database, which will cut each blob into pieces of about 2 KB and build a separate table (TOAST, in PostgreSQL's case) for them with an index over each piece. On top of that, you want more efficient streaming and completely different kinds of local and global caching. So the separation makes a lot of sense simply because one kind of data comes in small pieces and the other in big ones.
Most RDBMSes have special blob support where they do not store the blob in the typical buffer pool with those small 2KB-16KB pages. But your point is valid in general. So is the interviewee's. Under normal circumstances, immutability means the blob would not take part in any transformations by transactions/queries in the RDBMS (even if it was stored in it). It would just be dangling as a reference to an opaque entity that never gets transformed. So if we move the opaque/large/immutable item to an external blob store, you really do not lose anything (you still have references to it that take part in the RDBMS queries/transactions).
@@lagneslagnes Why not just install them in the filesystem and have links to the blobs? The filesystem is extremely optimized for that. Maybe put 1000 songs in each directory. Any database has a cost. You could also memcache the top 1000 most popular songs.
Also, blob storage uploads different parts of a file into multiple machines in parallel. With RDBMS, to achieve the same you could split a file manually and do some kind of sharding - but too much manual stuff. Though it’s not that relevant for Spotify with low load of writes, but in general it’s a good reason why rdbms are not good for blobs.
@johnkoepi thanks for your contribution, you make a strong point. We're actually looking for someone to help me Host, it would be about 15hrs work per month, would you be interested?
I knew every single measure and strategy that Mark presented here. But I don't think I would have been able to present it the way he did, with a gradual, continuous increase in complexity. Awesome answer, Mark. I wish I could get interviews where I could deliver these answers; I'm good at that.
Love the video! Before playing the whole video I played around with a design of my own and ended up with pretty much the same design, with some variations that I'll add below. I think Mark and the interviewer missed digging a bit deeper into one of the main requirements: finding music. Mark talks about performing the search operation directly against RDS. Taking into consideration the scale of the system, that would have been a terrible decision. With millions of users, the search function would hit the DB constantly, generating read queries against an RDS instance that stores its data on disk, resulting in overuse of the DB and high latency. In my design, I went for a dedicated search service powered by a search engine such as Elasticsearch. This service is populated asynchronously in the background by a Consolidator service. Essentially, each time data is added to RDS (new songs, etc.) an event or message is sent to a queue; the Consolidator service picks up the new data and pushes it to ES. Users can then search for songs very fast using a highly optimised search engine.
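The queue-plus-Consolidator flow described above might look something like this in miniature. To keep it runnable on its own, a tiny in-memory inverted index stands in for the real search engine and `queue.Queue` for the message queue; `SearchIndex` and `consolidate` are hypothetical names, not any Elasticsearch API.

```python
import queue
from collections import defaultdict


class SearchIndex:
    """Tiny in-memory inverted index standing in for a real search engine."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> ids of songs containing it

    def add(self, song_id, title):
        for token in title.lower().split():
            self._postings[token].add(song_id)

    def search(self, text):
        tokens = text.lower().split()
        if not tokens:
            return set()
        # A song matches only if it contains every query token.
        matches = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*matches)


def consolidate(events, index):
    """Drain pending 'song added' events into the index; returns how many were applied."""
    applied = 0
    while True:
        try:
            song_id, title = events.get_nowait()
        except queue.Empty:
            return applied
        index.add(song_id, title)
        applied += 1
```

The write path only enqueues an event after the database insert, so search reads never touch the primary database; the cost is that new songs become searchable a little later.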
Yeah, like the "finding music" part pretty much implied an efficient search system. In general, I don't think this is a good video to prepare for a system design interview because the interviewer didn't challenge the interviewee about anything. The hard part is being able to justify your choices, explain tradeoffs, admit limitations and make major optimizations on the spot.
@carlosluque4753 great input, makes a lot of sense. We're actually looking for someone to help me Host, it would be about 15hrs work per month, would you be interested?
@@ThomasSchneider-l1v good point, depends on the usecase, such as voice to search, a recommendation system. Maybe you need a vectorDB to do similarity search?
Tbf, he introduces a CDN in the design, so it's not accurate to say the search logic will go directly to RDS alone, and he mentions a lot of caching mechanisms in different parts of the system, which greatly reduce interactions with the database.
What a pleasure it's to listen to this kind of people and the way they design solutions, they make it look easy but it takes years of experience to abstract like that
I have an interview at Swiggy and I went through several YouTube videos to understand how to design a system at a high or low level, but this is what I was searching for. It made me confident about answering in the low-level design interview how things work and function. Thank you so much!
For load balancing, you'd also want to think about having certain webservers marked for specific tasks. Though I suppose that would be more like having 2 services - your lookup service and your streaming service. That way you don't have to worry about the weight/priority of IO based balancing vs CPU. Then your lookup services are CPU based and your streaming services are IO based.
This is a great session. This format works when the interviewer is a good listener and allows Mark to finish what he has to say and reach a logical end before transitioning to the next stage of the SD or asking questions, and that is great. Can you do a session where the interviewer is constantly interrupting? You neatly define the stages of the SD interview, and it's a flow we (interviewees) would like to get into. But more often than not, the interviewer doesn't wait for you to finish a topic (usually at non-FAANG companies); they just want to get into the details of a component, or keep asking "why". I personally find it hard to transition from answering their question (which could easily take 2-5 minutes) back to the original format I had planned for the session. And because I'm unable to complete it logically, I fail the interview.
The audio streaming would work better with 30-second chunks of audio instead of loading the full track, which can vary in length from a minute-long track to one 20 minutes long. Also, ordering the artists and songs based on relevance to the search terms, popularity, and the user's personal listening habits and preferences should make sense. Artist, song, and user metadata are all connected by relations on multiple vectors, like genre, mood, country of origin, and lots of unknown relations (aspects) that come up from machine learning, etc.
Great suggestions. An addition to the first suggestion: I would split the audio into chunks ONLY when the length of the song is above a threshold. For example, if a song is 2 minutes (say ~2.5 MB), it would make more sense to download it all with a single query rather than hitting the audio DB four times.
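As a small sketch of the threshold idea, assuming chunks are addressed by byte ranges (the function name and parameters are illustrative, and the sizes in the usage note are just round numbers):

```python
def plan_chunks(total_bytes, chunk_bytes, single_request_limit):
    """Return the (start, end) byte ranges to request, with end exclusive.

    Files at or below single_request_limit are fetched in one request;
    larger files are split into chunk_bytes-sized pieces.
    """
    if total_bytes <= single_request_limit:
        return [(0, total_bytes)]
    return [
        (start, min(start + chunk_bytes, total_bytes))
        for start in range(0, total_bytes, chunk_bytes)
    ]
```

With a 3 MB limit, a 2.5 MB song comes back as the single range `(0, 2500000)`; with a 1 MB limit and 1 MB chunks, the same song is split into three ranges.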
how would it work when the user wants to seek to a part of the song ? i am not familiar with networking so i'm curious how the connection stays for example during a 1 hour song. if it makes a new connection it'd be slow i guess but if the connection isn't severed then the server might get too occupied ? how do we balance these ?
Yeah, if you think about how Netflix works, the media is encoded to multiple screen sizes and resolutions to deal with varying network conditions, and then chunked. The client then retrieves the next chunk of the stream from the nearest edge server. So just encoding the media to the multitude of client conditions and then disseminating the chunked content to edge servers is a hugely interesting engineering case and solution.
@@深夜-l9f Very interesting question. I like the idea of loading less than the full song to start playing and then continue to "read ahead" while playing. This is a common practice also for videos and increases the chances that the full song is loaded by the time you start seeking around. Still, it's not perfect, and you can imagine a scenario where the user seeks to the end of a 5-minute song right away, resulting in a delay.
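One way to picture seeking with read-ahead, assuming the 30-second chunks suggested earlier in the thread and a made-up read-ahead depth of two chunks: the client maps the seek position to a chunk index and requests that chunk plus the next few.

```python
import math


def chunks_for_seek(seek_seconds, track_seconds, chunk_seconds=30, read_ahead=2):
    """Chunk indexes to fetch when the user seeks: the target chunk plus read-ahead."""
    if not 0 <= seek_seconds < track_seconds:
        raise ValueError("seek position is outside the track")
    total_chunks = math.ceil(track_seconds / chunk_seconds)
    first = int(seek_seconds // chunk_seconds)
    # Never request chunks past the end of the track.
    return list(range(first, min(first + 1 + read_ahead, total_chunks)))
```

Each chunk is an independent request, so a one-hour track does not need one long-lived connection; seeking to an unloaded position still costs one chunk fetch before playback resumes, which is the delay described above.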
I'm not a system designer (yet), but from my work in my bachelor's and master's - while it's a good idea and most probably how it is implemented, this is 'getting lost in details'. This is the specifics as to how the streaming gets optimized; and if you have time to talk about that after the system is fully designed, sure, that's good. But with ideas like this, it's easy to go 'so there's an app, and it talks to a server, which talks to a database that stores.. and by the way, the database does this, and this, and this' - and then one hour is up and the rest of your system is underdeveloped.
My first thought was "the level of confidence to question a senior ex-Googler". Then, I remembered that Google has put out some less than stellar solutions. All in all, Mark explained it beautifully and it must have been a joy to work with him.
Can you give an example of some "less than stellar solutions" that Google has put out? What specific Google products do you think suffered from poor infrastructure design choices?
I would have gone deeper on API specs (some endpoints, how would they work?), the searching algorithm (roughly, db indexes? some middle caches?), and audio service (streaming, shared cached besides CDN, loading all in mem takes time where the user hears nothing, and is costly in RAM, discuss alternatives). A way to deal w/ metrics (data pipeline, no need for too many details). Also, mention CAP, what would u choose and why. Normally, you will forget to mention things, and the interviewer will ask accordingly; but as mentioned, it is usually better to have your key points exposed w/o the interviewer needing to question you.
Those are all excellent followups that the interviewer could have asked, but this does a good job at iteratively building a solution while communicating
Hmm. From my point of view there are some things missing that I would expect you to mention during a system design interview.
First, and most impactful on system design, are service metrics like reliability, responsiveness, availability, and so on. I do understand you kind of included 'apparent expectations' in the form of the initial question (we all have an idea of what to expect from a 'Spotify' service), but at the same time you have to quantify those, because an idea is just an idea, and different people (stakeholders, clients) can have different expectations for the same idea. A few basic examples:
- median response time (for every use case)
- uptime/availability requirements
- RTO/RPO
Exact numbers for those will make a tremendous impact on any high-level system design.
Second, constraints (you guys went through a few of them, but kind of missed the usual ones). I can't stress enough how valuable it is to understand your project (system) constraints at the beginning. It could be money, time, government requirements, technology requirements/constraints, anything. The main thing is that you can't design the ideal system (in its 'final' form) and try to get there from the start. There will be iterations, growth, compromises, technical debt, and as a system architect you have to plan around all of that. This goes hand in hand with my first point. Examples:
- do we have enough time/money to provide 99.9% service availability?
- how, and at what time/cost, can we add additional features?
There are more theoretical ones, but that would be nitpicking at this point. Some practical issues in the final design I would point out:
- pretty sure there's an authorization service missing (you absolutely don't want to handle that in your main app).
- you don't want to handle both search and playback in the same service, not with those numbers.
- you have to use an LB at least for the metadata load
- there will be a lot more metadata; I would split it in two (at least): user-defining and content-defining.
- you have to add metadata to your CDN; it's a part of the core user stories
P.S. storing/accessing/updating your data fast at this scale is quite a loaded question by itself
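To make the "99.9% service availability" question concrete, here is a quick back-of-the-envelope helper. The function name is made up, and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Downtime budget, in minutes, implied by an availability target over a period."""
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * period_days * 24 * 60
```

A 99.9% target allows roughly 525.6 minutes (about 8.8 hours) of downtime per year, while 99.99% allows only about 52.6 minutes, which is the kind of difference that changes how much redundancy the design needs.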
Uptime/RTO questions would add design regarding the HA/DR architecture which would be an extension of this design right? What impact would response time requirement have? Maybe add more caches? Understanding time/money restriction is important when scaling but would not have impacted design in video right?
@@nukeu666 You're right. I tried to say that in this exact case you can kind of assume those answers, and his design wasn't wrong. My point was that in this kind of real-life interview I would love to hear about constraints and service metrics in a bit more detail (or at least a skim over them) and how they impact system design at all. About response time: there is a big difference in architecture between 500, 50, 5, 0.5 and 0.05 seconds of response time (I would guess you know this as well). Until you propose/discuss exact values/ranges, you can't really make a decision on "how to make my system".
Wow! Thank you for so much careful feedback. This is really good. Your comment "I can't stress enough how valuable to understand your project (system) constrains at the beginning" really resonated with me. That is absolutely true. One of the things I look for when I'm doing mock interviews if the topic of response time comes up is percentiles (50th aka median, 90th, 95th, 99th), but I clearly missed following my own advice there. :) Also good call on splitting the metadata - I think they have different purposes, so it would make sense to keep them separate.
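For anyone unfamiliar with the percentiles mentioned above, a minimal nearest-rank calculation looks like this. It is only a sketch with invented sample data; monitoring systems typically use histogram or sketch-based approximations instead of sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
```

Here the median (`percentile(latencies_ms, 50)`) is 50 ms, but the 99th percentile is 1000 ms, which is why the tail percentiles are worth stating as requirements alongside the median.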
This is a great interview in general, as the interviewer was not intrusive, but there are a few things to note for other viewers. 1. The level the candidate was interviewing for was not mentioned. This makes a huge difference: depending on the level, the signals the interviewer looks for change accordingly. The following points are based on this. Given the answer, I would say the candidate was a young senior engineer or a great mid-level engineer. 2. Although this is a system-level design, we need to think about some low-level aspects, mostly the models in the application, the DB schema, and the protocols between the various components. 3. The candidate did not consider any security implications at all. Especially with a CDN in place, how would we make sure only authenticated users listen to the songs? 4. The consistency pattern between the metadata and the file upload: is the file uploaded at the same time as the metadata is saved in the DB? What happens when the metadata is available in search but no song has been uploaded yet? 5. Filtering songs and finding songs are quite different. One could filter metadata just by using an RDBMS, but free-text search, as we often do in many apps, cannot happen in the same database that stores the metadata. Therefore we need another database to deal exclusively with free-text search; this could be an RDBMS with such functionality or something like Elasticsearch. We should also consider consistency patterns here.
this was great . maybe it didn't add a lot to me in terms of technical aspects but the way mark was connecting the dots was really interesting that's exactly what you expect from an top notch engineering manager
24:00 The client app should be able to get the MP3 chunk directly using some sort of token (expiring in, for example, 15 minutes, issued per ID or user, etc.). The web server should not fetch the MP3 chunk data for the client app; it should only generate the access token.
My thought exactly! The app could get the MP3 URL (with the token embedded in the URL or similar) from the DB/cache via the web server, and then it could read the data from S3 directly...
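A hand-rolled sketch of the expiring token idea, using an HMAC over the chunk path, user, and expiry time. The URL layout, parameter names, and `SECRET` are all made up for illustration; in practice you would more likely use the storage provider's own presigned URL feature than roll your own.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # placeholder; a real key would come from a secret store


def _signature(path, user_id, expires_at, secret):
    message = f"{path}|{user_id}|{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_chunk_url(path, user_id, expires_at, secret=SECRET):
    """Web server side: issue a URL the client can use until expires_at (Unix seconds)."""
    sig = _signature(path, user_id, expires_at, secret)
    return f"{path}?user={user_id}&expires={expires_at}&sig={sig}"


def verify_chunk_url(path, user_id, expires_at, sig, now, secret=SECRET):
    """Storage/edge side: accept only unexpired requests whose signature matches."""
    if now > expires_at:
        return False
    expected = _signature(path, user_id, expires_at, secret)
    return hmac.compare_digest(expected, sig)
```

The web server only does the cheap signing step; the bytes themselves flow from storage or the CDN straight to the client.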
I feel that with a humungous list of 100 million songs, we can implement a separate search server for the search functionality like a Solr search. It will reduce the searching time by a huge margin.
Really nice video. Another point is to dress up during the interview. Mark looks like a CTO-level person. That first impression is really important when leveling.
Also, I definitely wouldn't go with streaming audio from the web servers, for both scalability and separation of concerns. A finely tuned CDN (with price constraints in mind) would do the job.
Great and interesting interview! AWS Cloudfront with S3 backend automatically pulls a file from S3 if it is not cached already so the webserver could return the mp3_link at the Cloudfront distribution endpoint and Cloudfront would take care of everything else.
Love how some of the comments are saying how they would have heaps more detail. Being in a test environment makes it much much harder to think of those things and you only have 45 mins. I have interviewed over 100 people in white boarding sessions and thought he did pretty good for someone that seems like doesn't have experience in that exact type of system. Of course if you have time to think on it, you could do a lot more in detail as the solution is much more complex. His communication was good and why made sense. How was more vague as it was obvious he has never actually done an audio streaming service. If he has I think he would have drilled down faster. Overall a good example of test where you have not done a specific design before...
It's funny that the interviewer is trying so hard to nitpick everything that the more experienced guy is saying. "He should have said up front why he was splitting the databases into two." There's so much going on and you're working through a problem you were just given 10 minutes ago, no interviewer is going to care about if he addresses it up front or if they have to ask for clarification. It's all part of the process
This was very interesting to watch. I am currently a Senior Software Engineer, and will probably end my career at this level as I'm quickly approaching retirement age. I've always loved getting my hands dirty writing code, and have never had any aspirations to advance to the level of an Engineering Manager (or Development Director, etc.). But while watching this interview, I found that my thinking was in lock-step with Mark's, and I found myself answering the interview questions with essentially the same responses. I even blurted out several of the same responses _before_ Mark answered in the same way.
How old are you? I have a similar thought process. I don't want to go beyond Senior Software Engineer, as I think it's too much stress. But that would mean I'll have to retire late.
Thanks mark! Very helpful to basically see how to communicate effectively calmly and enhance the design step by step. I would've added couple of more things here though 1. Separate the application servers for Querying the songs vs playing the songs (As you mentioned the load can be very different and the servers which are playing the songs will have high network bandwidth usage) 2. Add cache to the metadata server also (Songs metadata to maybe cache the songs which are recently, from some famous genre etc)
@@michaszewczak7392 There will be a lot of dynamic tagging involved for the songs, so simple text search would not suffice here. Some sort of Lucene-based index (Elasticsearch, Solr, etc.) would really help here for full-text search.
My main takeaway from this is that some companies have absolutely absurd standards on what to expect in short interviews. I had one interview a couple of years ago that was 1,5 hours of technical questions directly followed by only 30 minutes of this type of interview, where the task was to design uber eats, and their standards were absurd. I ended up 2nd out of all applicants, didnt get an offer for the position I applied for but got one for a similar position in the same company some weeks later. I ended up declining it and honestly primarily because how the managers acted and had set up the last part of that interview. It seemed so out of touch with the realities of software engineering.
A great video to explain how solution architects work and what knowledge they need. Actually if you need to stream it would be difficult through cdn that will send the whole file. If you have own servers close to users I would just make some large cache and a small streaming server from local file system. As RAM is not that expensive now I would even suggest RAM disk for songs. So when a user needs a song it is read from some cdn(just to minimize hops for geo regions of own local servers), then file reader marks access and starts streaming just using file read and write to socket or the file is passed to the end user. Such a simple streamer/reader will be able to handle tens of thousands of connections on a single server. At end of day or some percent of disk full a job should just delete files ordered by last access. Some small local rdb can help for the marking as you will not have 1 billion songs on the local disk. This may even be better than commercial cdn as it is your own one and price is lower.
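The "delete files ordered by last access" job from the comment above is essentially LRU eviction with a byte budget. A minimal in-memory sketch follows; the class name is hypothetical, and real file I/O and the small local database are left out.

```python
from collections import OrderedDict


class LocalSongCache:
    """Tracks cached song files by last access and evicts the stalest when over budget."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files = OrderedDict()  # song_id -> size, least recently accessed first

    def touch(self, song_id, size_bytes):
        """Record an access. Returns True on a hit, False when the file had to be added."""
        if song_id in self._files:
            self._files.move_to_end(song_id)  # mark as most recently accessed
            return True
        self._files[song_id] = size_bytes     # on a miss the file would be fetched upstream
        self.used_bytes += size_bytes
        return False

    def evict(self):
        """Cleanup job: drop least recently accessed files until back under capacity."""
        evicted = []
        while self.used_bytes > self.capacity_bytes and self._files:
            song_id, size = self._files.popitem(last=False)
            self.used_bytes -= size
            evicted.append(song_id)
        return evicted
```

Running `evict` on a schedule or when the disk passes some fill percentage matches the end-of-day job described above.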
I love seeing some of these creative ideas to balance scale and cost. Putting on my manager hat, I could see this being an optimization added to the system after getting it up and running and stable using an off-the-shelf CDN. Time-to-market is often more important at the beginning, and cost becomes more of an issue at scale, at which point adding complexity may be worth it.
Interesting, but I'm skeptical here. How would you set up lots of servers that are close to the end user? And wouldn't your caches fill up very quickly, with lots of data stored redundantly in each server's large cache? If we assume that songs are accessed randomly, the caching would be useless and we would fetch from the CDN every time.
all this shows is that even a seasoned engineer at a top company can struggle depending on how a question is not properly qualified. Here technical knowledge is one aspect, communication is a whole other.
I think it's a fair design to use 3 layers of caching between the user and S3. 1. On-device caching that remembers play/pause/stop state (while also syncing it with the web servers). 2. A Redis-like persistent caching solution for the API / streaming service workers. 3. The above two can be treated as the FE and BFF layers. The final layer should cache directly from storage via a CDN, chosen by geo proximity to the worker service (in case of a BFF cache miss). Akamai and CloudFront have edge locations, and you can also replicate your buckets in S3 (multi-region replication is even better).
I am a bit surprised that the numbers he asked for at the beginning, and jotted down on the board, did not seem to really influence the design quantitatively or qualitatively. So, did those numbers really serve any purpose? For example, what if the interviewer said the number of potential users is 10,000 as opposed to a billion? That's a MASSIVE difference in order of magnitude, yet would Mark propose the same design (just with smaller databases or S3 buckets), or would it be overkill? Another question is about the typical system designer's job scope and journey before becoming the architect. Are they promoted from among developers? Mark (the architect) will probably not code up the service, but are architects supposed to know in sufficient detail how to do it, or the frameworks, languages, and libraries if necessary? Or is figuring that out left to a senior developer or someone like that?
He used the numbers to show the size of the storage used but not much beyond that. I think it might be a bit too much detail for this sort of interview question to expect someone to be able to translate 100 million users for this use case == 30 web servers or something like that. In reality I think a lot would depend on the complexity and efficiency of the database and query design and what frameworks you're using etc, which would need some proper time sitting down and thinking about. Also when it comes to the real world and you start to get your metrics in, that's when the bottlenecks would start to show and you'd refactor the design to alleviate them.
I think if the users were ~10K, the load balancer and CDN can be eliminated. Overall, the number of users did impact the design choices. As for the second question, from my little SDE experience I have observed that Project Architects usually climb the SWE ladder and are fully aware of the technologies used.
If it was 10000 users you would need just one server(well, for backup 2 or 3), cdn could be skipped, all songs could be present locally, the db even, you would not even need a separate songs, artists, playlists database server. Just lb+full app server, some db replication for backup and that's it.
Point taken - I did not make it clear why I asked about number of users and how that influenced my design. What I was ultimately going for was an upper limit on data size (1B users * 1KB per user = 1TB) to figure out whether this would fit into a relational database. It's big, but I believe this is doable with modern hosted relational DBs (like RDS - maximum limit is listed as 16 TiB). If we had 10,000 users, that part probably wouldn't change (metadata). I might still use S3 just because it's easy, reliable, etc. I am flattered, btw, that you would consider me an architect. I have been an engineering manager for most of my career and worked with some really great tech leads, SWEs, architects, etc., but I have not been in that role in a long time. So ... thank you! :)
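A quick back-of-the-envelope check of the estimate above, as a rough sketch. The 1B users, 1 KB per user, and 16 TiB figures come from the comment; treating 1 KB as 1,000 bytes and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope sizing for user metadata (figures from the comment above).
USERS = 1_000_000_000          # 1B users
BYTES_PER_USER = 1_000         # ~1 KB per user (assumed decimal kilobyte)
RDS_LIMIT_BYTES = 16 * 2**40   # 16 TiB, the limit quoted in the comment

total_bytes = USERS * BYTES_PER_USER
fraction_of_limit = total_bytes / RDS_LIMIT_BYTES

print(f"total: {total_bytes / 10**12:.1f} TB")
print(f"share of the 16 TiB limit: {fraction_of_limit:.1%}")
```

Under these assumptions the metadata takes up only a small share of the quoted limit, which is consistent with the "big, but doable" conclusion. It says nothing about query load, which is a separate question.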
@@MarkKlenk whoa, Mark, could never imagine you would come back and respond personally. Made my day. Thanks, for the response, and it was really educational to listen to your thought process nonetheless.
It is interesting to see that even a 13 year Google Engineering lead (guy's a BIG shot) has to think about an approach. Makes my own work so much more relatable. I like the fact that he was not given the question beforehand
What I noticed missing was TTL (time to live), i.e. file expiration. That should be part of the API call, as we don't want to store songs indefinitely in our CDN or in cache. Also missing was any real reference to APIs or tracking of session state to be able to continue where a user left off.
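To make the TTL idea concrete, here is a minimal in-memory sketch. The `TTLCache` class, its method names, and the 60-second TTL are all hypothetical; a real CDN or cache would expose expiry through its own configuration rather than code like this.

```python
import time

class TTLCache:
    """Tiny cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be demonstrated
        self.entries = {}           # key -> (value, expires_at)

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]   # expired: evict and report a miss
            return None
        return value

# Demonstration with a fake clock instead of real waiting.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("song-42", b"audio bytes")
fresh = cache.get("song-42")      # still within the TTL
now[0] = 61.0
expired = cache.get("song-42")    # past the TTL, treated as a miss
```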
I would prefer a location based load balancer as a primary way and spread web-servers proportional to users geographically considering that spotify is used by almost all countries.
Thats what the CDN is for routing cached items that are stored geographically that reference the database when needed in order to complete the users request
@@Sim_baah agreed, our CDN will cache most played or requested songs, but I meant from perspective of load balancer and web servers which are much in our control in case of requests which are not cached into CDN and have to fetch it from main database.
Recently had an interview with the same, hadn't come across this video then. I wish my design was as neat as it is here. The simplicity does help explain the data flow a lot better.
When Spotify started, their main USP was to load the song within a few milliseconds, and they did that with the help of a p2p network, which they abandoned a few years later in favor of a more traditional approach like the one Mark explained.
Really nice mock interview, fantastic. Another improvement in this scenario is saving the songs/albums in client's device, sorting and removing by frequency its being played. What do you think of these guys?
I'm wondering how naturally Mark arrived at the idea to use S3 without even asking a question if it is okay to use a 3rd party cloud storage (with potential lock-in) or the storage itself should be the part of the architecture. He didn't ask are there any licensing restrictions on storing MP3 data and so on. I guess the selection of the cloud is an important decision and should be justified even more than the selection of the components.
Pretty good one, I would add: - No SQL for songs meta and keep users data in the relational database - mention encryption at rest (object storage and DB) and on transit (SSL on API call) - As the system is read-only so read-replica
thanks for your efforts, as a mid-level software developer I would like to share my conclusions list from this interview as a list of steps: 1- List of all functionally depends on the requirements. 2- Database. 3- use case diagram for all features. 4- objects definitions. 5- functions details. 6- requests and responses. 7- list of the system restrictions. NOTE: this list according to my understanding for this video, it's not very accurate :)
Quick question: why is this so technical/quantitative and in-depth? There was a mock interview on a different channel about building tiktok where the interviewee i believe was a google TPM. That interview was much more high level and didn’t really even go into scaling, the metrics, data replication. You might know what I'm on about. They made it seem so easy, but this mock interview is much more sensible.
honest interview...just missed some chunking idea for songs at my opinion...btw great interviewer....always acknowledging with positive body expression 😊
While I liked the overall structure and recommendations, it would have been nice if this had been an example of a successful interview. He would fail hard in any somewhat senior interview. While I will not pick on the details, some tips. * If you establish numbers, use them later on to verify your use cases. * This goes in line with not taking the use cases lightly, but establishing their true meaning for the architecture. Searching songs is a string-based approximate search: mistyping, obscure band names and more. Having 1 billion users search for text on an RDS is probably not the right technology. I mean, you should check it. * If you introduce new concepts, review your previously established designs to see if they still fit. We could see multiple times in the interview that new concepts were introduced, making old assumptions invalid. * Stay in areas where you are comfortable. It is good to understand and state that you know your limits, but when you in turn make multiple wrong assumptions and create a bad design, this will not be to your advantage. Overall a good example of how you fail.
The design is basically fine other than the search index. The only other key question to my eye is at what point a traditional RDBMS would no longer work, and how that might be sharded in a global app
@@alexs591 I disagree that this would be a definite failure for a senior role but yeah it had mistakes. I think the lack of caching over the metadata is also a mistake. The metadata is the intermediate step to the songs. The user has to get the song name or the artist. So, we need that cached in a local cache and a distributed cache for hot artists. CDN is nice for multi-region but won't be enough. Speaking of multi-region, I would have multi-region servers as well
This is probably for some junior engineers. It has very basic concepts. The questions were not too technical to kinda push the interview towards high decision making skills. This is just list all technologies in any saas , and connect them,
Indeed, my solution was about assembling SaaS "Lego blocks" to solve a problem. I think that judgment calls on which solutions to assemble carry some weight. I definitely value that in interview candidates when I'm doing mocks, but I may also ask them to go a bit deeper in certain areas if we have time.
Some observations: The beauty of system design is there's no right or wrong answer. I would've tackled these very differently(not mentioning replication until we are optimizing the design,...) but both methods would work as long as it makes sense I love his sense of detail, describing blob storage as linearly scaling, the songs being immutable are read only, the storage needed for various encodings... These make total sense but he really spells things out clearly.
HI Anoop! That's a really good point. There are, indeed, many compute options out there to choose from. I am not an expert by any means, and I would love to hear suggestions from you and others here. I think I would learn something from you all.
The music search part might not be sufficient at that scale. Indexing at least the metadata we would need for searching songs (artist, song name, etc.) into a key-value database or a solution such as Elasticsearch would be way better. It would also help with faceted search (genre, etc.). Queries to RDS, at that scale, would be too expensive. What do you think? Anyway, great video!
this is so easy, I don't know why some make it sound like a big deal. I am not even a native english speaker and I can understand this fully and can do same thing with any other system design requested in an interview. The only problem is getting the interview lmao
The solution is quite simple; I think most interviewers would fail the candidate if this became the final solution, based on my experience. But interviewers are not so great themselves, because they have already looked up the answers and had ample time to prepare. This is only 45 mins, so the scope should be limited. If interviewers expected something that seems to be missing, they could guide the interviewee a little bit just to stay on topic and make sure the final outcome covers the important use cases. Did they want the interviewee to cover high QPS loads? Caching? Security? Malleable architecture? With a broad opening like "create a spotify app for me", I don't think you should be very anal about some missing features. It felt more like the interviewer was asking "what do you know about general architecture components?". People on the internet here are unreasonable.
I mean, I would guess this comes more down to getting an interviewer with a bug up their butt looking to filter someone out because they didn't magically think of the aspect they find most important. At the end of the day, you're looking for a great conversation about major design considerations that show experience and breadth of awareness. And most importantly, an engineer you feel will communicate clearly and not be afraid to speak up if they see an issue or think of an issue that hasn't been addressed.
I think concurrency and fault tolerance is a big design consideration. If a web server goes down will it take down N users . I’d probably look at adopting something with Erlang. Great content and appreciate the input . Joined as a sub
thank you very much for the video! I was looking for something like this. I am not the best solution architect ever but I would design Spotify by very similar way but I am grateful for design ideas
Great job done. I only have one question to understand. Did you miss talking about the security (authorization & authentication ) of data (music)? Or it is out of scope for this interview?
The way he talks about CDN is only about cache, but you also don't want to provide a direct link to the source that can be hijacked or abused in any other way
I was asked once a question to design a system. My interviewers liked the question from my side. I asked them the following. Am I leading this project or, am I just an expert in a certain part of it? They liked it because I asked if I should focus on general architecture or leave more space for a particular component. Which also has to be properly designed. Anyhow, this interview was fantastic. Thank you.
would've also loved to have seen a bit more expression of skill, it felt a bit like watching gordon ramsay being asked to prepare a breakfast meal, and he'd make cereal.
It’s the standard “right answer” for these kinds of interviews, but fails to address any problem specific to media streaming, doesn’t touch on many networking details, and doesn’t explain how the design of the system would enable some well-known features of Spotify. But honestly, his answer was pretty good, and it probably would have been better had the interviewer challenged him more.
I think it would be great if this would have been broken down into a search service and also music service. the main goal is to search for a song and then play the song. I believe we could focus on making search faster with fuzzy search support and may be OpenSearch integration. Based on search we could also integrate Spark and a Flink as a data aggregator for downstream audit to see which region/genre/songs are most played so we can optimize this in the CDN or when we shard the db. I believe we could really optimize this design.
Pleasantly surprised he came up w the example of european punkrock, as I’ve been playing in european punkrock bands for a while 😊 nice choice!! (And it really is a bit of a niche)
These types of questions are really funny: "design tiktok", "design youtube"... "design x.." for a senior/mid role. Dude, if a person could design those companies, he should be applying for an investment from Sequoia or Y Combinator.
For worldwide scaling, there's no need to save favorite songs in a local replica. We have Content Delivery Network (CDN) already serving that function.
ex google cause he used AWS instead of Google Cloud
😂 nice catch
😂😂😂😂😂😂😂
Well, maybe, ex google and current amazon ? :P
My 2¢. AWS is very popular so using those references will have the broadest alignment. But also, when you work at Google, unless you're in the Google Cloud org, you generally don't deal with picking services... Google internally is a dream of tools and frameworks, all perfectly interconnected. So much that you don't think about where things are going, they are all handled quite nicely without worrying about scale. Most apps there are built for internal use (audience at most about 200K users).
😂😂
This video is a perfect example of how things should be explained.
The way Mark has explained entire design is commendable.
Kudos to the guy interviewing for being so patient and polite.
Just the fact that the interviewer could shut up and listen to the answer makes this interview great.
There is nothing otherworldly about design interviews; not much has changed or been invented in recent decades. The only issue in my experience is that people can't just sit and listen: they'll be constantly asking questions, breaking up the train of thought. I'd say it's a tutorial for the interviewer and not the other way around.
Are you talking about the interviewer asking questions, or the interviewee?
It's great for the interviewee to continually ask clarifying questions - it's more annoying if the interviewer is constantly asking questions, but there still needs to be a dialogue.
@@CommentGeneric awesome username you have there 👍
Yep dialogue is the key
I think they will cut you off if you are rambling about something they are not interested in or rabbit-holing, and they are pulling you out of it and redirecting you in the direction they want you to go. It's useful to take cues from them when they cut you off, and to listen for hints about what exactly they want. Also, the more they talk the better for you, kind of, unless you are screwing up and not talking at all.
@@bombrman1994 what you are talking about is the best case scenario and I am sure some are like that.
@@CommentGenericDialogue for the sake of it or to make the discussion fruitful? Most of my exp has been the former.
A few things I did along side while understanding Mark's POV:
1. I would usually introduce DNS geo routing earlier in the stage to route to the nearest LB
2. Also worth to have a Metrics collector that can always keep track of HITs of top 100 (or emerging hits depending on BI) per region basis in some form of a max heap and then have a scheduler to periodically walk through them to ensure that nearest CDNs are hot loaded / prewarmed with them. Reading from S3 is very slow and I would usually find other alternatives instead of chunk reading in an instance memory. Packet roundtrips can be costly especially in use case of streaming.
3. I also split the durable storage into two - user data storage (less frequently used in comparison) and songs metadata storage - this way DBs can be fine tuned for workloads.
4. If I told S3, I would also mention cross region replication just to touch it a bit and indicate that I was thinking of a DC going down entirely.
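Point 2 above (tracking the top hits per region so a scheduler can prewarm nearby CDNs) could be sketched roughly as follows. The function name, the `(region, song_id)` event shape, and the sample data are assumptions for illustration. Note that `heapq.nlargest` keeps a small heap internally rather than a literal max heap.

```python
import heapq
from collections import Counter, defaultdict

def top_k_per_region(play_events, k=100):
    """Return the k most-played song IDs per region, most played first."""
    counts = defaultdict(Counter)
    for region, song_id in play_events:
        counts[region][song_id] += 1
    return {
        region: [song for song, _ in
                 heapq.nlargest(k, c.items(), key=lambda kv: kv[1])]
        for region, c in counts.items()
    }

# Hypothetical play events; a scheduler would periodically feed the
# result to whatever prewarms the regional CDN edges.
events = [("eu", "s1"), ("eu", "s2"), ("eu", "s1"), ("us", "s3")]
hot = top_k_per_region(events, k=1)
```

In a real system the counts would likely be windowed or decayed over time so that emerging hits can displace old ones.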
You just said all what I thought about during this video 😅
Good point
Hi @jai_ver_rb17,
What I experienced from my interviews, sometime trying to add more details (point 2.) took me longer and after 35 mins I felt a little crunch of time. Which handled nicely here and I learned over the time.
If you have any tips to share how you balance details and not wrap in hurry? Its a fine balance.
Im still a junior but I remember some classes that featured system design and watching this interview brings up a lot of memories. What I also love about that is the "doing things from scratch" part. When you're dealing with system design, it usually means you're creating something new, a new app, a new service etc and that's always an exciting endeavour
@adennis200 congrats on our most liked comment! We're actually looking for a new Host, it would be about 15hrs work per month, would you be interested?
it's always best to start the interview answer and define the "Functional requirements"(FR) and the "Non-Functional Requirements" (NFR) that are needed for the design.
NFR could look like this for this design:
1. low latency
2. high availability
3. secured connection
etc....
this helps to flush out point of failures, and bottlenecks early in the design.
Pretty sure most interviewers asking you to design a system are going to all expect the same NFRs because that's just the way of the world. If it's not low latency and high availability, then it's just not going to be a good product.
@bradfordsuby8064 That's not always true. It's trade offs. Maybe something being consistent is more important than actual HA and super low latency, and then you have to figure out what HA means to the interviewer.
CAP theorem, and the superior PACELC talks to this
Knowing the trade offs is the key. And in real world you will always have budget limit too.
Agree. The non functional requirements are the most important ones for mid-senior/senior level developers as they are the ones that make a large scale system work flawlessly.
It was nice that the interviewer just listened, and the interviewee presented a simple design. However, in real FAANG interviews, especially for Senior roles, you're expected to go into more detail, and the interviewer usually challenges your decisions.
Yeah all of this teaches me nothing
Try InterviewJARVIS
To be fair would you trust your EM to clear a system design interview?
@@noobgam6331 what is DRM?
@@noobgam6331why consider DRM? Our concrete use case was finding and playing music. It has nothing to do with DRM. But if we were speaking about upload part then it would make sense
This is pure gold, explains almost everything when you need to learn what a system is and how it functions...very very useful !!! Thanks man !
Great to hear!
What this interview really shows is that you don't need to know every detail of the future solution (Spotify is much more complex than this), but for those solutions that you choose to propose, you should be capable of explaining why they are needed in the most understandable way.
That's interesting to watch. The design looks very similar to the one I produced during an Amazon Interview with the Load Balancing, Cache and Server Geo Localization. I was feeling good as well about my interview however I have failed it. The most dramatic part about failing the interview is that we do not have any feedback on our mistake to improve on.
The only mistake I see is that at the last test, I did not write down one of the requirements and when I finished coding the interviewer told me he said the opposite about this particular requirement and I had nothing to back up / verify who was right. So if you are about to go through an interview, lay down on paper all the requirements, validate them and then proceed to the coding part.
Good luck out there.
That's the pain of today's society with development interviews - no freaking feedback. Just "we moved forward with someone else".
System design is more about communication and collaboration rather than coming up with a solution that 100% works. Most likely you were not communicating as well as you should be, or not collaborating as well as you should be.
@@bradfordsuby8064 it's a companies' market and not a developers' one, so I think interviews have become more and more difficult
Like many others I was designing my own version alongside Mark's and I think the area he was a little weaker in (which he himself admits) was load balancing. My background is systems administration so I may have a different perspective on this. I think going back a few steps, chunking the data also serves an important role in the load balancing process. I would have songs chunked, from every retrieval source, so that as soon as the user presses Play, the song begins playing, and playback should always be an instantaneous process unless the servers are over capacity, which can occur because some song or album has gone viral.
I would structure the web server load balancing so that client apps attempt to contact the server geographically closest to them first, and utilize GSLB (global server load balancing), which combines layer 4 and layer 7 load balancing, as I/O capacity or concurrent connections (the two metrics I would prioritize) reach a threshold.
Again, when talking about load balancing, it's important to determine what happens when maximum capacity on all servers is exceeded. When this happens in my design, the system will issue "tickets" for data chunks, served in the order they arrive in. This is where song chunking comes into play. Because we are chunking the MP3 data, we can still serve the first chunk of the song from the nearest available server as soon as that I/O is available, further ensuring near-instantaneous playback upon pressing the Play button. The rest of the song then has some time to download and cache to the client device, reducing the number of interruptions and pauses in playback due to bandwidth and concurrent connection overages.
Can you explain more about these “tickets” in LB
@@kento8453 Yeah, think of surge queues. Surge queues are essentially placeholders for a pending connection that occur when load-balanced services are overloaded. Amazon's elastic load balancers (ELBs) for example have a spillover configuration that allows excessive requests to be dropped. With a combination of chunking and surge queues with spillover protection, you can continue servicing requests and the impact is only mildly noticeable from a client perspective.
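A toy illustration of the surge-queue-with-spillover idea described above. This is not how any particular load balancer is implemented; the `SurgeQueue` class, its capacity, and the request shape are hypothetical.

```python
from collections import deque

class SurgeQueue:
    """Bounded FIFO of pending requests; overflow is dropped and counted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.spillover_count = 0    # requests rejected because the queue was full

    def offer(self, request):
        if len(self.pending) >= self.capacity:
            self.spillover_count += 1
            return False            # spillover: caller should retry or fail fast
        self.pending.append(request)
        return True

    def next_request(self):
        # Served in arrival order, like the "tickets" in the comment above.
        return self.pending.popleft() if self.pending else None

q = SurgeQueue(capacity=2)
accepted = [q.offer(("song-1", chunk)) for chunk in range(3)]
```

Because each request is for a single chunk, a dropped or delayed request only affects a small slice of the song rather than the whole download.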
Yeah, I didn’t understand what are tickets.
Why do you need to chunk the data instead of just streaming it? Streaming already sends data over time in minuscule chunks, so you can play the song immediately and don’t need to “find and assemble” the other chunks. Especially since each chunk takes time to download, but stream bits are instant one after next, how would chunks be a better solution here?
@@simvo7802 Uhh streaming isn't a zero I/O operation that spends no time finding and assembling. Quite a bit of resources are involved in streaming, unless someone has created a perpetual motion machine already that I wasn't aware of. In this house we obey the laws of thermodynamics! Please don't delete your comment btw. You can reference it later sheepishly. We all are at different stages in the learning process, and a bit of retrospection can be refreshing.
The elegance with which Mark explained it 🤌🤌. Exquisite!!!
15:00 it’s important to mention that the problem with the storage separation is not about the data im/mutability (actually you can update data even in the blob storage). Primarily it is about how inefficient it would be to store 5MB blobs in any general kind of OLTP database that will cut each blob into pieces of 2KB sizes, build a separate table (toast) for it with index over each piece. And only then you would want have more efficient streaming and completely different types of local and global caching. So separation makes lots of sense just because one data is in small pieces and another is in big.
Most RDBMSes have special blob support where they do not store the blob in the typical buffer pool with those small 2KB-16KB sized pages.
But your point is valid in general.
So is the interviewee's. Under normal circumstances, immutability means it would not take part in any transformation functions of transactions/queries in the RDBMS (even if it was stored in it). It would just be dangling as a reference to an opaque entity that never gets transformed. So if we move the opaque/large/immutable item to an external blob store, you really do not lose anything (you still have references to it that take part in the RDBMS queries/transactions).
@@lagneslagnes Why not just install them in the filesystem and have links to the blobs? The filesystem is extremely optimized for that. Maybe put 1000 songs in each directory. Any database has a cost. You could also memcache the top 1000 most popular songs.
Also, blob storage uploads different parts of a file into multiple machines in parallel. With RDBMS, to achieve the same you could split a file manually and do some kind of sharding - but too much manual stuff.
Though it’s not that relevant for Spotify with low load of writes, but in general it’s a good reason why rdbms are not good for blobs.
@johnkoepi thanks for your contribution, you make a strong point. We're actually looking for someone to help me Host, it would be about 15hrs work per month, would you be interested?
@@IGotAnOffer-Engineering I'd be happy to connect about this
Really like how Mark communicates so effectively, and designs iteratively.
I knew every single measure and strategy which Mark presented here. But I dont think I would have been able to present it the way he did with a gradual continuous increase of complexity. Awesome answer Mark.
I wish I could get interviews to be able to deliver these answers, Im good at that.
Mark does a great job of explaining the different aspects of design in a clear and concise way.I really enjoyed this video,keep going on man 🤟
Glad you found it useful :)
Mark is only 24
To be able to watch this for free is just amazing. Thanks so much
our pleasure, Big Poppa. Hope you enjoy the rest of the videos on the channel (plus more coming in a few weeks)
Love the video! Before playing the whole video I played around with a design of my own and ended up with pretty much the same design, with some variations that I'll add below.
I think Mark and the interviewer missed digging a bit deeper into one of the main requirements: finding music. Mark talks about performing the search operation directly from RDS. Taking into consideration the scale of the system, that would have been a terrible decision. With millions of users, the search function would hit the DB constantly, generating read queries against the RDS instance, which stores its data on disk. The result would be overuse of the DB and high latency.
In my design, I went for a dedicated search service that is powered by a Search Engine such as ElasticSearch. This service is populated in the background asynchronously by a Consolidator service. Essentially, each time a data is added to the RDS (new songs, etc) an event or message is sent to a queue, the Consolidator Service would get the new data and push it to ES. Then the users can search very fast for songs using a highly optimised Search Engine.
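The flow described above (write to the DB, emit an event, let a consolidator update the search index) can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: `queue.Queue` stands in for the message queue, a dict of token sets stands in for the search engine, and none of the names correspond to a real ElasticSearch API.

```python
import queue
from collections import defaultdict

song_events = queue.Queue()        # stand-in for the message queue
search_index = defaultdict(set)    # stand-in for the search engine: token -> song IDs

def publish_song_added(song_id, title, artist):
    """Called after the song row is written to the relational DB."""
    song_events.put({"id": song_id, "title": title, "artist": artist})

def run_consolidator():
    """Drain pending events and index each song's title and artist tokens."""
    while not song_events.empty():
        event = song_events.get()
        for token in f"{event['title']} {event['artist']}".lower().split():
            search_index[token].add(event["id"])

def search(term):
    """Exact-token lookup; a real engine would add fuzzy matching and ranking."""
    return sorted(search_index.get(term.lower(), set()))

publish_song_added(1, "Blue Monday", "New Order")
publish_song_added(2, "Blue Train", "John Coltrane")
run_consolidator()
results = search("blue")
```

The point of the indirection is that user searches never touch the relational DB; the index is updated asynchronously and may lag slightly behind new writes.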
Yeah, like the "finding music" part pretty much implied an efficient search system. In general, I don't think this is a good video to prepare for a system design interview because the interviewer didn't challenge the interviewee about anything. The hard part is being able to justify your choices, explain tradeoffs, admit limitations and make major optimizations on the spot.
@carlosluque4753 great input, makes a lot of sense.We're actually looking for someone to help me Host, it would be about 15hrs work per month, would you be interested?
@@ThomasSchneider-l1v good point, depends on the usecase, such as voice to search, a recommendation system. Maybe you need a vectorDB to do similarity search?
Tbf, he introduces a CDN in the design, so it's not accurate to say the search logic will go directly to RDS alone, and he mentions a lot of caching mechanisms in different parts of the system, which greatly reduce interactions with the database.
@@markiel55 the CDN is storing the raw audio in his answer, not the metadata. Searching requires the metadata. Carlos is right.
What a pleasure it's to listen to this kind of people and the way they design solutions, they make it look easy but it takes years of experience to abstract like that
I'm having an interview at swiggy and i went through out several of youtube videos to understand that how to design an system at high level or low level but this is what I'm searching for , which made me confident to make answer in the low level design interview that how things work and function. Thank you so much
The fact that he's bringing up specific kpop groups makes my day.
For load balancing, you'd also want to think about having certain webservers marked for specific tasks. Though I suppose that would be more like having 2 services - your lookup service and your streaming service. That way you don't have to worry about the weight/priority of IO based balancing vs CPU. Then your lookup services are CPU based and your streaming services are IO based.
This is a great session. This format works when interviewer is a good listener and allows Mark to finish what he has to say, put a logical end before transitioning to next stage of SD or asking questions. and that is great.
Can you do a session where the interviewer is constantly interrupting? you neatly define stages of the SD interview and its a flow we (interviewees) would like to get into. but more often than not, interviewer doesn't wait for you to finish a topic.(usually non FAANG companies) they just want to get into details of a component. or more often than not ask "why". I personally find it hard to transition from "answering their question"(which could take easily 2-5mins) to getting back into original format I had planned for the session. and because I'm unable to logically complete i fail the interview.
This was the most realistic System Design interview video I've watched.
The audio streaming would work better with 30-second chunks of audio instead of loading the full track, which can vary in length from a minute-long track to 20 minutes long. Also, ordering the artists and songs based on relevance to the search terms, popularity, and the user's personal listening habits and preferences should make sense. Artist, song, and user metadata are all connected with relations on multiple vectors, like genre, mood, country of origin, and lots of unknown relations (aspects) that come up from machine learning, etc.
Great suggestions, addition to first suggestion -> I would split the audio into chunks ONLY in cases when the length of the song is above a threshold, example if a song it's 2 minutes (say ~ 2.5mb), it would make more sense to download it all with single query rather than hitting the Audio DB four times.
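The threshold idea above could look something like this sketch, which plans byte ranges to request for a track. The function name, the 640,000-byte chunk size, and the 3,000,000-byte threshold are made-up values for illustration, not numbers from the video.

```python
def plan_fetches(size_bytes, chunk_bytes=640_000, threshold_bytes=3_000_000):
    """Return inclusive (start, end) byte ranges to request for one track.

    Small tracks are fetched in a single request; larger ones are split
    into fixed-size chunks so playback can start after the first chunk.
    """
    if size_bytes <= threshold_bytes:
        return [(0, size_bytes - 1)]
    return [
        (start, min(start + chunk_bytes, size_bytes) - 1)
        for start in range(0, size_bytes, chunk_bytes)
    ]

short_plan = plan_fetches(2_500_000)    # below the threshold: one request
long_plan = plan_fetches(20_000_000)    # above the threshold: many chunks
```

Byte ranges like these also make seeking straightforward in principle, since the client can jump to the range that contains the target position.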
how would it work when the user wants to seek to a part of the song ? i am not familiar with networking so i'm curious how the connection stays for example during a 1 hour song. if it makes a new connection it'd be slow i guess but if the connection isn't severed then the server might get too occupied ? how do we balance these ?
Yeah, if you think about how Netflix works, the media is encoded to multiple screen sizes and resolutions to deal with varying network conditions, and then chunked. The client then retrieves the next chunk of the stream from the nearest edge server.
So just encoding the media to the multitude of client conditions and then disseminating the chunked content to edge servers is a hugely interesting engineering case and solution.
@@深夜-l9f Very interesting question. I like the idea of loading less than the full song to start playing and then continue to "read ahead" while playing. This is a common practice also for videos and increases the chances that the full song is loaded by the time you start seeking around.
Still, it's not perfect, and you can imagine a scenario where the user seeks to the end of a 5-minute song right away, resulting in a delay.
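To make the seek case concrete, here is a small Python sketch, assuming the track is cut into fixed-length segments that the client fetches one at a time over ordinary short requests (so no connection has to stay open for the whole song). The 30-second segment length and all function names are assumptions for illustration only.

```python
import math
from typing import List


def segment_count(track_seconds: float, segment_seconds: float) -> int:
    """Number of segments needed to cover the whole track."""
    return math.ceil(track_seconds / segment_seconds)


def segment_for_seek(seek_seconds: float, segment_seconds: float, track_seconds: float) -> int:
    """Zero-based index of the segment that contains the seek position."""
    if not 0 <= seek_seconds < track_seconds:
        raise ValueError("seek position is outside the track")
    return int(seek_seconds // segment_seconds)


def read_ahead(current_segment: int, lookahead: int, total_segments: int) -> List[int]:
    """Indices of the next segments to prefetch while the current one plays."""
    last = min(current_segment + lookahead, total_segments - 1)
    return list(range(current_segment + 1, last + 1))
```

Seeking to 4:50 in a 5-minute song maps to segment 9 of 10, so the client asks for just that segment instead of waiting for everything before it. The delay mentioned above is then one segment fetch rather than the whole file.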
I'm not a system designer (yet), but from my work in my bachelor's and master's - while it's a good idea and most probably how it is implemented, this is 'getting lost in details'.
This is the specifics as to how the streaming gets optimized; and if you have time to talk about that after the system is fully designed, sure, that's good. But with ideas like this, it's easy to go 'so there's an app, and it talks to a server, which talks to a database that stores.. and by the way, the database does this, and this, and this' - and then one hour is up and the rest of your system is underdeveloped.
Had an incredible experience! The learning was truly insightful, and the conversation was engaging throughout.
My first thought was "the level of confidence to question a senior ex-Googler". Then, I remembered that Google has put out some less than stellar solutions. All in all, Mark explained it beautifully and it must have been a joy to work with him.
Can you give an example of some "less than stellar solutions" that Google has put out? What specific Google products do you think suffered from poor infrastructure design choices?
Google engineer and still uses as reference AWS lol, poor GCP. Nice vid BTW.
So what ???
Emotional damage
@@just_A_doctor makes you think whether GCP is inferior
@LuisRuizHalo it's most likely because he knew Spotify runs on AWS, so it was the most relevant cloud for the context
Spotify uses gcp, I used to work there
usually I get bored by tech teaching videos but this is the first one that I am still watching.
I would have gone deeper on API specs (some endpoints, how would they work?), the searching algorithm (roughly, db indexes? some middle caches?), and audio service (streaming, shared cached besides CDN, loading all in mem takes time where the user hears nothing, and is costly in RAM, discuss alternatives). A way to deal w/ metrics (data pipeline, no need for too many details).
Also, mention CAP, what would u choose and why.
Normally, you will forget to mention things, and the interviewer will ask accordingly; but as mentioned, it is usually better to have your key points exposed w/o the interviewer needing to question you.
Those are all excellent followups that the interviewer could have asked, but this does a good job at iteratively building a solution while communicating
That was a nice one. I like how Mark evolved the design.
Redis for heatmap. Timescale DB for stats and playback history. Async via a message queue.
Hmm.
From my point of view there are some things missing that I would expect you to mention during system design interview.
First and most impactful on system design are service metrics, like reliability, responsiveness, availability, and so on.
I do understand you kind of included 'apparent expectations' in form of initial question - we all have an idea about 'what you're expecting from 'Spotify' service, but at the same time you have to quantify those, because an 'idea' is just an idea, different people (stakeholders, clients) can have different expectations for the same idea.
Few basic examples:
- median response time (for every use case)
- uptime/availability requirements
- RTO/RPO
Exact numbers for those will make a tremendous impact on any high-level system design.
Second - constraints (you guys went through a few of them, but kind of missed the usual ones). I can't stress enough how valuable it is to understand your project (system) constraints at the beginning - it could be money, time, some government requirements, technology requirements/constraints, anything. The main thing is - you just have to understand that you can't design an ideal system (in its 'final' form) and try to get there from the start. There will be iterations, growth, compromises, technical debt - and as a system architect you have to plan things around all of that.
This is going in hand with my first point, examples:
- do we have enough time/money to provide 99.9% service availability?
- how and in what time/cost we can add additional features?
There are more theoretical ones, but it would be nitpicking at this point.
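As a quick illustration of why the exact availability number matters, this Python snippet turns a percentage into a downtime budget. The function name is hypothetical and the calculation ignores things like planned maintenance windows.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Minutes of allowed downtime over the period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 60
```

At 99.9% that works out to roughly 525.6 minutes (about 8.8 hours) per year, and at 99.99% to roughly 52.6 minutes, which is the kind of gap that changes the HA/DR design and the budget.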
Some practical issues in final design I would point out:
- pretty sure there's authorization service missing (you absolutely don't want to handle them with your main app).
- you don't want to handle both search and playback on same service, not with those numbers.
- you have to use LB at least for metadata load
- there will be a lot more metadata, I would split it in two (at least) - user-defining and content-defining.
- you have to add metadata to your CDN, its a part of core user stories
P.S. storing/accessing/updating your data fast on this scale is quite a loaded question by itself
Uptime/RTO questions would add design regarding the HA/DR architecture which would be an extension of this design right?
What impact would response time requirement have? Maybe add more caches?
Understanding time/money restriction is important when scaling but would not have impacted design in video right?
@@nukeu666 You're right. I tried to say that in this exact case you can kind of assume those answers, and his design wasn't wrong.
My point was - in this kind of real-life interview I would love to hear about constraints and service metrics in a bit more detail (or at least skimmed over) - how they impact system design at all.
About response time - there is big difference in architecture between 500, 50, 5, 0.5 and 0.05 seconds in response time (I would guess you know this as well). Until you propose/discuss exact values/ranges, you can't really make a decision on "how to make my system".
Absolutely. There are a lot of really important pieces missing in the design I would expect an architect to include.
Wow! Thank you for so much careful feedback. This is really good.
Your comment "I can't stress enough how valuable to understand your project (system) constrains at the beginning" really resonated with me. That is absolutely true.
One of the things I look for when I'm doing mock interviews if the topic of response time comes up is percentiles (50th aka median, 90th, 95th, 99th), but I clearly missed following my own advice there. :)
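For readers who have not worked with latency percentiles, a minimal Python sketch using the nearest-rank method follows. This is one of several common percentile definitions, and the function name and sample data are illustrative assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For 100 response times where 90 take 10 ms, 9 take 100 ms, and 1 takes 2000 ms, the median is 10 ms while p95 and p99 are both 100 ms and the maximum is 2000 ms. A median-only requirement would hide that tail entirely.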
Also good call on splitting the metadata - I think they have different purposes, so it would make sense to keep them separate.
@@MarkKlenk Thanks for responding, I'm happy there were some useful bits I could provide!
This is a great interview in general as the interviewer was not intrusive, but there are a few things to note for the other viewers,
1. The level the candidate was interviewing for was not mentioned. This makes a huge difference. Depending on the level, the signals that the interviewer looks for change accordingly. The following points are based on this. Given the answer, I would say the candidate was a young senior engineer or a great mid-level engineer
2. Although this is a system level design, we need to think about some low level aspects, mostly the models in the application, db schema, protocol between various components
3. The candidate did not consider any security implications at all, especially with CDN in place, how would we make sure only authenticated users listen to the songs
4. Consistency pattern between the metadata and the file upload: is the file uploaded at the same time as the metadata is saved in the DB? What happens when metadata is available in search but no song has been uploaded?
5. Filtering songs and finding songs are almost different problems. One could filter metadata just by using an RDBMS, but free-text search, as we often do in many apps, cannot happen in the same database that stores the metadata. Therefore we need another database to deal exclusively with free-text search; this could be an RDBMS with such functionality or something like Elasticsearch. We should also consider consistency patterns here
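To show what a dedicated free-text search store does differently from a plain metadata filter (point 5), here is a toy inverted index in Python. It is only a sketch of the general idea behind Lucene-style engines, not how Elasticsearch is implemented, and the function names and sample titles are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(songs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each token to the set of song ids whose text contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for song_id, text in songs.items():
        for token in tokenize(text):
            index[token].add(song_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of songs that contain every token in the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because the index is built separately from the metadata rows, it can lag behind them, which is where the consistency question in point 4 and point 5 comes from.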
this interview is much more useful than my 3 months university course 😜
this was great . maybe it didn't add a lot to me in terms of technical aspects but the way mark was connecting the dots was really interesting that's exactly what you expect from an top notch engineering manager
24:00 The client app should be able to get the chunk of the mp3 directly with some sort of token (expiring in, for example, 15 minutes, per ID or user, etc.). The web server should not fetch the mp3 chunk data for the client app but only generate the access token.
My thought exactly!
The app could get the mp3Url(token embeded in url or similar) from the db/cache via the web server then it could read the data from s3 directly...
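A minimal Python sketch of the expiring-token idea, using an HMAC signature in the URL. This is NOT the actual S3 or CloudFront presigning scheme; the secret, the query parameter names, and both function names are assumptions made up for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical shared secret between web server and storage edge


def sign_url(path: str, expires_at: int, secret: bytes) -> str:
    """Web server side: attach an expiry timestamp and a signature to the path."""
    message = f"{path}:{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires_at, 'sig': signature})}"


def verify_url(url: str, now: int, secret: bytes) -> bool:
    """Storage/edge side: accept only unexpired URLs with a valid signature."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    message = f"{parsed.path}:{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)
```

The web server only does the cheap signing step, and the audio bytes never pass through it. In practice you would lean on the cloud provider's own signed URL feature rather than rolling this by hand.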
I feel that with a humungous list of 100 million songs, we can implement a separate search server for the search functionality like a Solr search. It will reduce the searching time by a huge margin.
Really nice video. Another point is to dress up during the interview. Mark looks like a CTO-level person. That first impression is really important when leveling.
great video to watch before an interview for any position in computer science fields...
Also, I definitely wouldn't go with streaming audio from the web servers, for both scalability and separation of concerns. A finely tuned CDN (with price constraints in mind) would do the job.
Great and interesting interview!
AWS Cloudfront with S3 backend automatically pulls a file from S3 if it is not cached already so the webserver could return the mp3_link at the Cloudfront distribution endpoint and Cloudfront would take care of everything else.
Yea I think he overcomplicated this part
Won’t it be an issue that CDN does not authenticate the downloader?
He is just genius!! The way he is explaining is REMARKABLE!
Love how some of the comments are saying how they would have heaps more detail. Being in a test environment makes it much much harder to think of those things and you only have 45 mins. I have interviewed over 100 people in white boarding sessions and thought he did pretty good for someone that seems like doesn't have experience in that exact type of system.
Of course if you have time to think on it, you could do a lot more in detail as the solution is much more complex.
His communication was good and why made sense. How was more vague as it was obvious he has never actually done an audio streaming service. If he has I think he would have drilled down faster.
Overall a good example of test where you have not done a specific design before...
It's funny that the interviewer is trying so hard to nitpick everything that the more experienced guy is saying. "He should have said up front why he was splitting the databases into two." There's so much going on and you're working through a problem you were just given 10 minutes ago, no interviewer is going to care about if he addresses it up front or if they have to ask for clarification. It's all part of the process
This was very interesting to watch. I am currently a Senior Software Engineer, and will probably end my career at this level as I'm quickly approaching retirement age. I've always loved getting my hands dirty writing code, and have never had any aspirations to advance to the level of an Engineering Manager (or Development Director, etc.). But while watching this interview, I found that my thinking was in lock-step with Mark's, and I found myself answering the interview questions with essentially the same responses. I even blurted out several of the same responses _before_ Mark answered in the same way.
How old are you? I also have a similar thought process. I don't wanna go beyond Senior Software Engineer as I think it's too much stress. But that would mean I'll have to retire late.
Thanks mark! Very helpful to basically see how to communicate effectively calmly and enhance the design step by step.
I would've added couple of more things here though
1. Separate the application servers for Querying the songs vs playing the songs (As you mentioned the load can be very different and the servers which are playing the songs will have high network bandwidth usage)
2. Add a cache to the metadata server as well (for song metadata, maybe caching songs that were recently played, are from some popular genre, etc.)
I would expect a lucene based technology for search in there. For an app as large as spotify search for songs in RDS will not scale
Most likely. Also metrics like play count will go into a time series database like timescale.
I think you don't need a WebSocket connection for chunk loading. Both HLS and MPEG-DASH work over HTTP for this purpose.
I didn't get bored. Very informative interview.
This will help me in my upcoming amazon berlins interview (L5).
Thanks Mark 👍
Elastic search might work really well for the metadata db. It should cover the storage as well as the search functionality.
10gb-100gb of data in DB is not that much. Indexes will do a trick there
@@michaszewczak7392 there will lot of dynamic tagging involved for the songs, simple text search would not suffice here. Some sort of lucene index Elasticsearch/Solr etc would really help here for full text search.
My main takeaway from this is that some companies have absolutely absurd standards on what to expect in short interviews.
I had one interview a couple of years ago that was 1,5 hours of technical questions directly followed by only 30 minutes of this type of interview, where the task was to design uber eats, and their standards were absurd. I ended up 2nd out of all applicants, didnt get an offer for the position I applied for but got one for a similar position in the same company some weeks later. I ended up declining it and honestly primarily because how the managers acted and had set up the last part of that interview. It seemed so out of touch with the realities of software engineering.
A great video to explain how solution architects work and what knowledge they need.
Actually if you need to stream it would be difficult through cdn that will send the whole file. If you have own servers close to users I would just make some large cache and a small streaming server from local file system. As RAM is not that expensive now I would even suggest RAM disk for songs. So when a user needs a song it is read from some cdn(just to minimize hops for geo regions of own local servers), then file reader marks access and starts streaming just using file read and write to socket or the file is passed to the end user. Such a simple streamer/reader will be able to handle tens of thousands of connections on a single server. At end of day or some percent of disk full a job should just delete files ordered by last access. Some small local rdb can help for the marking as you will not have 1 billion songs on the local disk. This may even be better than commercial cdn as it is your own one and price is lower.
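The "delete files ordered by last access" part of this idea is essentially an LRU cache bounded by size. A small in-memory Python sketch follows; the class name is invented, and it stores bytes in a dict instead of on a local disk or RAM disk as the comment proposes.

```python
from collections import OrderedDict
from typing import Optional


class LruSongCache:
    """Size-bounded cache that evicts the least recently accessed songs first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, song_id: str) -> Optional[bytes]:
        """Return the cached song and mark it as most recently used."""
        data = self._items.get(song_id)
        if data is not None:
            self._items.move_to_end(song_id)
        return data

    def put(self, song_id: str, data: bytes) -> None:
        """Store a song, evicting older entries until it fits."""
        if len(data) > self.capacity_bytes:
            return  # too large to cache at all
        if song_id in self._items:
            self.used_bytes -= len(self._items.pop(song_id))
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)  # oldest access first
            self.used_bytes -= len(evicted)
        self._items[song_id] = data
        self.used_bytes += len(data)
```

This version evicts on every insert rather than in an end-of-day job, which is a different trade-off from the one described above; a batch cleanup would touch the disk less often but can overshoot the size limit between runs.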
Woah, did you just build your own CDN? Great comment!
I love seeing some of these creative ideas to balance scale and cost.
Putting on my manager hat, I could see this being an optimization added to the system after getting it up and running and stable using an off-the-shelf CDN. Time-to-market is often more important at the beginning, and cost becomes more of an issue at scale, at which point adding complexity may be worth it.
Interesting, but I'm skeptical here. How would you set up lots of servers that are close to the end user? And wouldn't your caches fill up very quickly, and wouldn't you be duplicating lots of data within each server's large cache? If we assume that songs are accessed randomly, the caching would be useless and we would fetch from the CDN every time.
all this shows is that even a seasoned engineer at a top company can struggle depending on how a question is not properly qualified. Here technical knowledge is one aspect, communication is a whole other.
wow, love this mock interview ! I think a lot of this is covered in AWS Cloud Certifications !!!
In real world you will be bombarded with a lot of questions.
I think it's a fair design to use 3 layers of caching between the user and s3.
1. In device caching and remembering play pause stop states (while also syncing it with web servers)
2. A Redis-like persistent caching solution for the API and the streaming service workers.
3. The above two can be treated as fe and bff layers. The final layer should be caching directly from storage via a cdn depending on geo proximity of the worker service (in case of bff cache miss).
So Akamai and Cloudfront have edge locations and you can also replicate your buckets in s3 (multi region replication is even better)
I am a bit surprised that the numbers he asked for at the beginning, and that he jotted down on the board, did not seem to really influence the design quantitatively or qualitatively. So, did those numbers really serve any purpose?
Example, what if the interviewer says the number of potential users is like 10,000 as opposed to a billion? That's a MASSIVE difference in order, yet, would Mark propose the same design (just smaller databases or S3 buckets), or would it be an overkill?
Another question is about the typical system designer's job scope and journey before he becomes the architect. Are they promoted from among developers? Probably mark (the architect) will not code up the service, but are they supposed to know up to a sufficient detail on how to do it or the frameworks, languages, libraries if necessary? Or is figuring that out is left to senior developer or someone like that?
He used the numbers to show the size of the storage used but not much beyond that. I think it might be a bit too much detail for this sort of interview question to expect someone to be able to translate 100 million users for this use case == 30 web servers or something like that. In reality I think a lot would depend on the complexity and efficiency of the database and query design and what frameworks you're using etc, which would need some proper time sitting down and thinking about. Also when it comes to the real world and you start to get your metrics in, that's when the bottlenecks would start to show and you'd refactor the design to alleviate them.
I think if the users were ~10K, the load balancer and CDN can be eliminated. Overall, the number of users did impact the design choices. As for the second question, from my little SDE experience I have observed that Project Architects usually climb the SWE ladder and are fully aware of the technologies used.
If it was 10000 users you would need just one server(well, for backup 2 or 3), cdn could be skipped, all songs could be present locally, the db even, you would not even need a separate songs, artists, playlists database server. Just lb+full app server, some db replication for backup and that's it.
Point taken - I did not make it clear why I asked about number of users and how that influenced my design. What I was ultimately going for was an upper limit on data size (1B users * 1KB per user = 1TB) to figure out whether this would fit into a relational database. It's big, but I believe this is doable with modern hosted relational DBs (like RDS - maximum limit is listed as 16 TiB).
If we had 10,000 users, that part probably wouldn't change (metadata). I might still use S3 just because it's easy, reliable, etc.
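The back-of-the-envelope sizing above can be written out as a few lines of Python. The 1 KB per user and the 16 TiB limit are the figures quoted in this comment, not independently verified, and the function names are illustrative.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def metadata_size_bytes(users: int, bytes_per_user: int) -> int:
    """Upper-bound estimate of user metadata size."""
    return users * bytes_per_user


def fits_in_limit(size_bytes: int, limit_bytes: int) -> bool:
    """True if the estimated size is within the storage limit."""
    return size_bytes <= limit_bytes
```

With 1,000,000,000 users at 1,000 bytes each the estimate is 10^12 bytes (1 TB), which is under 6% of a 16 TiB limit. With 10,000 users it drops to 10 MB, so the same check passes trivially and the metadata choice would not need to change.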
I am flattered, btw, that you would consider me an architect. I have been an engineering manager for most of my career and worked with some really great tech leads, SWEs, architects, etc., but I have not been in that role in a long time. So ... thank you! :)
@@MarkKlenk whoa, Mark, could never imagine you would come back and respond personally. Made my day. Thanks, for the response, and it was really educational to listen to your thought process nonetheless.
It is interesting to see that even a 13 year Google Engineering lead (guy's a BIG shot) has to think about an approach. Makes my own work so much more relatable. I like the fact that he was not given the question beforehand
What I noticed missing was TTL or time to live or file expiration. That should be part of the API call as we dont want to indefinitely store songs in our CDN or in Cache.
And really any reference to APIs or tracking of session state to be able to continue where a user left off.
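For the TTL point, a tiny Python sketch of a cache whose entries expire after a fixed time to live. The class name and the injectable clock are assumptions for illustration; real CDNs and caches usually take the TTL from response headers or per-key settings instead.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TtlCache:
    """Cache whose entries stop being served once their time to live has passed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: Dict[Hashable, Tuple[float, Any]] = {}

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value together with its expiry time."""
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the value if present and not expired; drop it if expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return value
```

Expired entries are only removed when someone asks for them, so a rarely requested song could sit in memory past its TTL; a periodic sweep would be needed to reclaim that space.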
I liked the interviewer’s zero cross question policy.😂
I would prefer a location based load balancer as a primary way and spread web-servers proportional to users geographically considering that spotify is used by almost all countries.
Thats what the CDN is for routing cached items that are stored geographically that reference the database when needed in order to complete the users request
@@Sim_baah agreed, our CDN will cache most played or requested songs, but I meant from perspective of load balancer and web servers which are much in our control in case of requests which are not cached into CDN and have to fetch it from main database.
Great Job!!!. I really liked the way the very high level initial blocks are made and add details whenever we dive deep into an area. thanks
Thank you very much!
Recently had an interview with the same, hadn't come across this video then. I wish my design was as neat as it is here. The simplicity does help explain the data flow a lot better.
Very informative, thank you! Start with simpler design and get buy in in order to avoid going on a tangent into details.
When Spotify started, their main USP was loading the song within a few milliseconds, which they did with the help of a P2P network that they abandoned a few years later in favor of a more traditional approach like the one Mark explained.
Imagine u walk into an interview and they just tell u to design spotify
Really nice mock interview, fantastic. Another improvement in this scenario is saving the songs/albums in client's device, sorting and removing by frequency its being played.
What do you think of these guys?
Great interview style! Q. Which tool is Mark using for drawing diagrams/texts?
Google Drawings.
You can run queries, and partitioning in S3 lets you use O(1) operations. Just wanted to clarify that.
Amazing video, such great detail and simple breakdowns!
Why not using pre signed uri and serve the audio directly from the bucket/cdn instead of loading into memory of web servers?
I'm wondering how naturally Mark arrived at the idea to use S3 without even asking a question if it is okay to use a 3rd party cloud storage (with potential lock-in) or the storage itself should be the part of the architecture. He didn't ask are there any licensing restrictions on storing MP3 data and so on. I guess the selection of the cloud is an important decision and should be justified even more than the selection of the components.
Pretty good one, I would add:
- No SQL for songs meta and keep users data in the relational database
- mention encryption at rest (object storage and DB) and on transit (SSL on API call)
- As the system is read-only so read-replica
Why won't you use no-sql for user data as well?
NoSQL database sucks hard. I'd much rather settle for elastic search here
thanks for your efforts, as a mid-level software developer I would like to share my conclusions list from this interview as a list of steps:
1- List of all functionally depends on the requirements.
2- Database.
3- use case diagram for all features.
4- objects definitions.
5- functions details.
6- requests and responses.
7- list of the system restrictions.
NOTE: this list according to my understanding for this video, it's not very accurate :)
Quick question: why is this so technical/quantitative and in-depth?
There was a mock interview on a different channel about building tiktok where the interviewee i believe was a google TPM. That interview was much more high level and didn’t really even go into scaling, the metrics, data replication. You might know what I'm on about. They made it seem so easy, but this mock interview is much more sensible.
honest interview...just missed some chunking idea for songs at my opinion...btw great interviewer....always acknowledging with positive body expression 😊
While I liked the overall structure and recommendations, it would have been nice if this would have been an example of a successful interview. He would fail hard in any somewhat senior interview. While I will not pick on the details, some tips.
* If you establish numbers, use them later on, to verify your use-cases.
* This goes in line with not taking the usecases lightly, but establishing their true meaning for the architecture.
Searching songs is a string-based approximate search: mistyping, obscure band names, and more. Having 1 billion users search for text on RDS is probably not the right technology. I mean, you should check it.
* if you introduce new concepts, review your previous established designs, if they still fit.
We could see it multiple times in the interview that new concepts were introduced, making old assumptions invalid.
* stay on areas where you are comfortable.
While it is good to understand and state that you know your limits, but when you in turn make multiple wrong assumptions and create a bad design, this will not be to your advantage.
Overall a good example how you fail.
The design is basically fine other than the search index.
The only other key question to my eye is at what point a traditional RDBMS would no longer work, and how that might be sharded in a global app
@@alexs591 I disagree that this would be a definite failure for a senior role but yeah it had mistakes. I think the lack of caching over the metadata is also a mistake. The metadata is the intermediate step to the songs. The user has to get the song name or the artist. So, we need that cached in a local cache and a distributed cache for hot artists. CDN is nice for multi-region but won't be enough. Speaking of multi-region, I would have multi-region servers as well
This is probably for some junior engineers. It has very basic concepts. The questions were not too technical to kinda push the interview towards high decision making skills. This is just list all technologies in any saas , and connect them,
Indeed, my solution was about assembling SaaS "Lego blocks" to solve a problem. I think that judgment calls on which solutions to assemble carry some weight. I definitely value that in interview candidates when I'm doing mocks, but I may also ask them to go a bit deeper in certain areas if we have time.
It is a system-design interview, that's kind of its purpose.
Some observations:
The beauty of system design is there's no right or wrong answer. I would've tackled these very differently(not mentioning replication until we are optimizing the design,...) but both methods would work as long as it makes sense
I love his sense of detail, describing blob storage as linearly scaling, the songs being immutable are read only, the storage needed for various encodings... These make total sense but he really spells things out clearly.
It would have been great if he could explain about what compute options to use. Some like gke gce or app engine etc
HI Anoop! That's a really good point. There are, indeed, many compute options out there to choose from. I am not an expert by any means, and I would love to hear suggestions from you and others here. I think I would learn something from you all.
Great content, thank you! This channel should have more subs.
The part of searching music could be not sufficient at that scale.
Indexing at least the metadata we would need for searching songs (artist, song name, etc.) into a key-value database or a solution such as Elasticsearch would be way better. It would also help with faceted search (genre, etc.). Queries to RDS, at that scale, would be too expensive. What do you think?
Anyway..great video !
This was super interesting! Thanks for bringing this content to us!
Thanks for watching Victor!
this is so easy, I don't know why some make it sound like a big deal. I am not even a native english speaker and I can understand this fully and can do same thing with any other system design requested in an interview. The only problem is getting the interview lmao
The Solution is quite simple, I think most interviewers would fail the candidates if this becomes the final solution based on my experience. But Interviewers are not so great themselves because they already looked up the answers and had ample time to prepare for it.
This is only 45 mins and so the scope should be limited. For interviewers if they expected something that seems to be missing, they could guide the interviewee a little bit just to stay on topic and make sure the final outcome covers the important use cases.
Did they want the interviewee to cover high QPS loads? Caching? Security? Malleable architecture? With a broad opening like "create a Spotify app for me", I don't think you should be very anal about some missing features. It felt more like the interviewer was asking "what do you know about general architecture components?".
People on the internet here are unreasonable.
I mean, I would guess this comes more down to getting an interviewer with a bug up their butt looking to filter someone out because they didn't magically think of the aspect they find most important.
At the end of the day, you're looking for a great conversation about major design considerations that show experience and breadth of awareness.
And most importantly, an engineer you feel will communicate clearly and not be afraid to speak up if they see an issue or think of an issue that hasn't been addressed.
100%. Certainly it would have been a Reject in the real world.
such answers might be ok for just an average middle developer. possibly guest might give proper answer if interviewer would ask them.
I think concurrency and fault tolerance is a big design consideration. If a web server goes down will it take down N users . I’d probably look at adopting something with Erlang. Great content and appreciate the input . Joined as a sub
thank you very much for the video! I was looking for something like this. I am not the best solution architect ever but I would design Spotify by very similar way but I am grateful for design ideas
Great job done. I only have one question to understand. Did you miss talking about the security (authorization & authentication ) of data (music)? Or it is out of scope for this interview?
The way he talks about CDN is only about cache, but you also don't want to provide a direct link to the source that can be hijacked or abused in any other way
Does anybody know what is the software Mr Mark was using while making the diagram ?
Hi Ahmad, he was using Google Drawings.
I was asked once a question to design a system. My interviewers liked the question from my side. I asked them the following. Am I leading this project or, am I just an expert in a certain part of it? They liked it because I asked if I should focus on general architecture or leave more space for a particular component. Which also has to be properly designed. Anyhow, this interview was fantastic. Thank you.
This is a junior-level design. I don’t understand why there are so many compliments.
Would've also loved to see a bit more expression of skill. It felt a bit like watching Gordon Ramsay being asked to prepare a breakfast meal and making cereal.
@emdeevy lol good one
It’s the standard “right answer” for these kinds of interviews, but fails to address any problem specific to media streaming, doesn’t touch on many networking details, and doesn’t explain how the design of the system would enable some well-known features of Spotify. But honestly, his answer was pretty good, and it probably would have been better had the interviewer challenged him more.
"High level". In-depth would probably have taken about two days.
Great interview. It's interesting how differently I would have approached it with more focus on how the webserver would be structured.
I think it would have been great if this had been broken down into a search service and a music service.
The main goal is to search for a song and then play it. I believe we could focus on making search faster with fuzzy search support and maybe an OpenSearch integration. Based on search, we could also integrate Spark and Flink as data aggregators for a downstream audit to see which regions/genres/songs are most played, so we can optimize for this in the CDN or when we shard the DB. I believe we could really optimize this design.
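As a rough illustration of the "which songs are most played per region" aggregation, here is a small in-memory sketch. It only shows the shape of the computation; a real pipeline would run something like this as a Spark or Flink job over a play-event stream. The `(region, song_id)` event format and the function name are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def top_songs_by_region(play_events, n=3):
    """Count plays per (region, song) and return the n most played songs per region.

    play_events is an iterable of (region, song_id) tuples, a hypothetical
    event shape chosen for this sketch.
    """
    counts = defaultdict(Counter)
    for region, song_id in play_events:
        counts[region][song_id] += 1
    return {
        region: [song for song, _ in counter.most_common(n)]
        for region, counter in counts.items()
    }


events = [
    ("eu", "a"), ("eu", "b"), ("eu", "a"),
    ("us", "c"), ("us", "c"), ("us", "d"),
    ("eu", "b"), ("eu", "a"),
]
hot = top_songs_by_region(events, n=2)
```

The per-region lists in `hot` are the candidates you would push to the nearest CDN nodes ahead of time, or use as a signal when deciding how to shard the song metadata.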
Pleasantly surprised he came up w the example of european punkrock, as I’ve been playing in european punkrock bands for a while 😊 nice choice!! (And it really is a bit of a niche)
This was so damn cool, as a rookie CTO this is a great transfer application of SD concepts to learn from. Definitely coming back for more!
What is a rookie CTO?
@jialx Chief Technology Officer
@hamzaf19 'rookie'
These kinds of questions are really funny: "design TikTok", "design YouTube"... "design X", for a senior/mid role. Dude, if a person could design those companies, they should be applying for an investment from Sequoia or Y Combinator.
For worldwide scaling, there's no need to save favorite songs in a local replica. We already have a Content Delivery Network (CDN) serving that function.