Google Systems Design Interview With An Ex-Googler

  • Published: Sep 29, 2024

Comments • 869

  • @clem
    @clem  4 years ago +142

    I hope you all find this mock systems design interview insightful! And if you're in the business of acing your systems design interviews (who isn't? 😎), be sure to check out www.systemsexpert.io where we have more questions like this one as well as a full course on all of the fundamental systems design topics that you need to understand to tackle these types of interviews.

    • @ajr1791ze
      @ajr1791ze 4 years ago +5

      Who is the guy conducting the interview? Where does he work?

    • @Andrew-ez9ft
      @Andrew-ez9ft 4 years ago

      Clement, I have some questions.
      What is the fastest way to become a junior developer in 2020???
      What is the best way to learn programming to build advanced projects???

    • @KeyrptNetwork
      @KeyrptNetwork 4 years ago +3

      It would be very useful if you made another Systems Design Interview.

    • @VishalSingh-kc1jn
      @VishalSingh-kc1jn 4 years ago +3

      Dude grow some beard. Would look good🔥

    • @techwithwhiteboard3483
      @techwithwhiteboard3483 4 years ago +2

      make sure you buy other domains with the same name
      🤔✌

  • @nishantt6419
    @nishantt6419 4 years ago +188

    wow...simply amazing.....this is like a real world experience for the system design interviews
    most of the videos on youtube are just about explaining the system design about some popular applications like messenger, tinyurl, dropbox etc...
    all of them have one thing in common....all of them explain the design very well in only one aspect....they are all well structured, there are no hiccups, there is not much thinking to do.....they design the system without any flaws...why? because they know the design and there is no interviewer who is asking for in depth details or looking for more clarifying answers...they just explain the system in only one direction
    This is totally different and the one which we actually need.
    This is by far the best for multiple reasons
    1. It shows what a real-world interview looks like, with all the whiteboarding sessions
    2. the question given by the interviewer is very abstract....the interviewee may know the design in one particular aspect, but the interviewer is looking for a different aspect and asking for more in-depth details at every stage
    3. the interviewee actually asks a series of questions at the start and also keeps asking different questions from time to time
    4. at times the interviewer stops the interviewee because he missed explaining a few details, including details about fault tolerance, and keeps asking for more in-depth details about how the interviewee is approaching his design
    5. it's more of a collaborative brainstorming session, which is what actually happens in the real world
    6. it also showed how the interviewee picked some technologies like SQL, GCS, KVS etc. and validated each with his explanation, getting the approval of the interviewer and proceeding to the next stage at every point in the interview.
    Totally worth watching the whole video.....thank you Clement and kudos for creating such valuable content
    hope to see more such videos

    • @elliemay1748
      @elliemay1748 4 years ago +2

      Yeah this is exactly how this goes in real interviews. When I interviewed at Amazon for engineering manager, I was asked to write an analytics system with 1 million writes per second. It was a bit more challenging than this example in terms of scale, but still similar in concept.

  • @khoavo5758
    @khoavo5758 4 years ago +805

    "Ex-Googler" means that you now use Bing or what? /s

    • @444haluk
      @444haluk 4 years ago +24

      hahahahh underrated joke

    • @DarkH4X0
      @DarkH4X0 4 years ago +35

      Yep, now he uses Bing to see StackOverflow answers

    • @atharvparlikar8765
      @atharvparlikar8765 3 years ago +3

      maybe duckduckgo

    • @amazingvipul8392
      @amazingvipul8392 3 years ago +1

      He still uses YouTube though 🙂

    • @carlhsieh2615
      @carlhsieh2615 3 years ago

      It means he's not evil anymore lol

  • @suboii2978
    @suboii2978 4 years ago +14

    First of all, thanks for making these available Clem. I've seen a lot of videos about the systems design interview in preparation for my interviews and I'm a SystemsExpert customer, but I'm wondering if you could explain the difference between novice, intermediate, and expert level performance. What criteria can be used to determine what level someone is at and how good they are at designing systems? Thanks!

  • @arpanmukherjee4625
    @arpanmukherjee4625 2 years ago

    ACID alone is not enough to give you the concurrency guarantee that you actually need in this case. It really depends on your isolation level. You might also need application-level concurrency handling, using things like optimistic locking.

  • @occo5877
    @occo5877 3 years ago +1

    Isn’t it better to first find the numbers before jumping into the design? Without knowing how many workers you have, you cannot say whether MySQL is a good option or not. Also, instead of SQL, I think persistent queues would serve this problem better.

  • @ArjunGovind
    @ArjunGovind 2 years ago

    13:06 - Why do we need a queue to maintain the commits? You always pick the latest commit if there is one. If there are multiple commits, you pick the latest one, since you don't want to build an intermediate commit which is going to be replaced anyway in the next 30-minute cycle. Right? If it is deploying multiple applications to multiple systems, the queue makes sense.

  • @ohanfillbach7106
    @ohanfillbach7106 4 years ago +16

    I might have missed something, admittedly the video was pretty long, but you asked for feedback in the comments so here we go:
    Given how in depth a lot of the solution was, it seemed that how the binaries get into the regional peer to peer networks was missing. You covered that GCS will magically replicate them for you, but then you jumped right into those binaries going from host to host. How do they get into the network? I am assuming that it would just be the first host to poll KVS and see a new state, then poll the network, see that nobody else has the binary, and go fetch it from regional GCS.
    Why not expand the enum in the jobs table so that you can have a single source of truth with regards to deployment? Something like: Queued, Building, Replicating, Deployable, Complete, Failed. Also with that DB, why not just use the last_h column as a phase_started time? Since you are still benchmarking against expected worst runtime, you can keep your query basically the same but also track failures across the whole process (and not just building), without needing the heartbeat system for builders/workers.
    I probably also would have asked a little more about the code that we are deploying. Is it only one codebase, or should this work for multiple applications? Do we only need to support one deployed version at a time? Especially if the answer to the second question is yes, then the whole goal state solution would likely need to be changed.
    Overall great interview, I think the most important part was watching how you took your time and thought through everything slowly and deliberately while asking for clarification at basically every step
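
A minimal sketch of the expanded-enum idea from the comment above, in Python. The phase names come from the comment; the `MAX_PHASE_DURATION` limits and the `is_stalled` helper are hypothetical and only illustrate how a single `phase_started` timestamp could replace per-worker heartbeats.

```python
from datetime import datetime, timedelta
from enum import Enum


class Phase(Enum):
    """Job phases, as proposed in the comment (single source of truth)."""
    QUEUED = "queued"
    BUILDING = "building"
    REPLICATING = "replicating"
    DEPLOYABLE = "deployable"
    COMPLETE = "complete"
    FAILED = "failed"


# Hypothetical worst-case durations for the phases that can stall.
MAX_PHASE_DURATION = {
    Phase.BUILDING: timedelta(minutes=15),
    Phase.REPLICATING: timedelta(minutes=10),
}


def is_stalled(phase: Phase, phase_started: datetime, now: datetime) -> bool:
    """Return True if a job has been in a time-limited phase for too long."""
    limit = MAX_PHASE_DURATION.get(phase)
    return limit is not None and now - phase_started > limit
```

A periodic sweeper could call `is_stalled` for every in-flight job and move stalled ones to `Phase.FAILED`. Phases without an entry in the table (for example `QUEUED`) are never treated as stalled in this sketch.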

    • @StupidNub
      @StupidNub 4 years ago +2

      I'm not getting how the 10gb binary gets to all the regional GCS either. Or how p2p networks work. I would think a 10gb file going from country to country would take a while, especially if it's done ad hoc at the time of the app release? What if it's a 5TB binary? My guess at what was meant in this video is that regional host 1 (rh1) sends to host 2 (rh2), then rh1 and rh2 send to rh3 and rh4, so the number of hosts able to send the binary grows exponentially with each round?
      Alternatively, with an entirely different take: I heard that applying incremental patches to source code is very quick. So, why can't the version control system live in each region, and we just ensure they all have the latest commit? Then, each regional host just builds the binary prior to the scheduled release time of the app, and there wouldn't be a need to download a huge binary across countries.
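
The doubling guess in the reply above (rh1 sends to rh2, then both send to rh3 and rh4) can be sketched as a small calculation. This is a simplified model, assuming one host starts with the binary and every host that already has it sends one full copy to one new host per round. Real peer-to-peer systems typically exchange chunks in parallel, so this only illustrates the growth rate. The function name `rounds_to_reach` is made up for this sketch.

```python
def rounds_to_reach(hosts: int) -> int:
    """Rounds needed until `hosts` machines hold the binary, doubling each round."""
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    have = 1      # hosts that already hold the binary
    rounds = 0
    while have < hosts:
        have *= 2  # every holder sends one copy to one new host
        rounds += 1
    return rounds


print(rounds_to_reach(4))        # 2
print(rounds_to_reach(100_000))  # 17
```

Under this model the number of rounds grows with the logarithm of the host count, which is why the transfer time per round matters far more than the number of machines.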

  • @abhishekkadam2999
    @abhishekkadam2999 4 years ago +3

    This is great!
    There is some weird satisfaction in free content which wasn't supposed to be free.
    Feels like I am stealing.

    • @heraldo623
      @heraldo623 4 years ago +1

      ??? This video is a sample of their new platform, systemsexpert.io
      You are free to buy the course. So there is nothing for free, the video itself is an ad.

  • @pranavdeshpande8402
    @pranavdeshpande8402 1 year ago +1

    The best Systems design video I have seen so far! Great work Clement! Also, the only thing which still confuses me about these interviews is to what depth you should go and how expectations differ based on what level you are interviewing for. I think it would be helpful if you could also compare and contrast how a Senior Software Engineer would approach this interview vs Staff vs Principal, or maybe at least give a general idea of how these levels would be evaluated in a systems design interview.

  • @SameerSrinivas
    @SameerSrinivas 4 years ago +1

    Great job. Very informative. Thanks for the effort. Could you have used Kubernetes for orchestrating deployment in this scenario?

    • @humann5682
      @humann5682 4 years ago +1

      You could potentially, but over-engineering is a red flag in Sys/Ops interviews. People like efficient, scalable solutions that have no more complexity added to them than necessary. Kubs was not necessary to answer the questions here.

  • @mathewrajan8514
    @mathewrajan8514 3 years ago

    This really helped me in my interview. Thank you 😊

  • @codyelhard5779
    @codyelhard5779 2 years ago

    He must've downed all 4 monsters before the video

  • @charliel.5794
    @charliel.5794 4 years ago +482

    I'd like to hear the feedback from the interviewer, I think that's the important part on distinguishing what went well and what didn't

    • @bluespeckcf5949
      @bluespeckcf5949 4 years ago +132

      The feedback is probably that he's missed a lot of corner cases (interviewer mentions at least 2 of them in the video), and I don't know why he's not using a messaging queue instead. In short, this is probably a weak interview.

    • @vishnuv4813
      @vishnuv4813 4 years ago +7

      Normally, the interviewers are not sure about the solution themselves, or they have a template solution which they feel would fit every problem. So, they would not be able to comment on our design.

    • @gavindmello
      @gavindmello 4 years ago +15

      @@bluespeckcf5949 I thought he could have used a message queue too, but, it wouldn't allow monitoring workers which have gone awol.

    • @cloud5887
      @cloud5887 4 years ago +2

      @@bluespeckcf5949 can you elaborate a bit more? What exact corner cases did he miss?

    • @SpencerYork1534
      @SpencerYork1534 4 years ago +56

      @@cloud5887 For example, one that I caught was the worker heartbeat: what if the job is successfully completed but the worker didn't send a heartbeat due to a network error? You wouldn't want to run the build again. Another is that he never mentioned storage for the build artifacts, or how to resume a build from the artifacts or reuse them. If you are building a build system, you definitely need to cover some sort of distributed storage for build artifacts: if you are building 100k times a day, you don't want to have to download JS or .NET 100k times, and even if you are pulling from your own repository you want to eliminate as much bandwidth usage as possible. BUT overall this is a good interview, maybe not perfect, but that is typical for most interviews. I think it was more important to show how he interacted with the interviewer, asking questions and clarifying requirements. Most companies want to see how well you handle overall project communication so as to best deliver according to the stated requirements.

  • @crankerson
    @crankerson 3 years ago +7

    The SQL transaction would ensure that all of the SQL statements run together or not at all (rollback), but that by itself doesn't isolate concurrent workers from each other. In this design, to prevent multiple workers from running the same job concurrently, wouldn't you need to specify the isolation level on the transaction in order to prevent a non-repeatable read? Example: worker 1 selects the oldest "queued" job. worker 2 selects the oldest "queued" job. worker 1 updates the job status to "running". worker 2 then does the same, and both end up running the job.
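
The interleaving in the comment above can be replayed deterministically. The sketch below is only an illustration: it uses Python's built-in `sqlite3` with a hypothetical `jobs` table (not the schema from the video) and runs the two "workers" one after another on a single connection, so it shows the logical flaw of select-then-update rather than real concurrency. How an actual database behaves depends on its isolation level and locking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'queued')")
conn.commit()


def select_oldest_queued(conn):
    """Step 1 of the naive dequeue: read the oldest queued job id."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_running(conn, job_id):
    """Step 2 of the naive dequeue: unconditional status update."""
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    conn.commit()


# Both workers read before either one writes.
job_for_worker_1 = select_oldest_queued(conn)
job_for_worker_2 = select_oldest_queued(conn)
mark_running(conn, job_for_worker_1)
mark_running(conn, job_for_worker_2)

double_claimed = job_for_worker_1 == job_for_worker_2
print(double_claimed)  # True: both workers believe they own job 1
```

Neither update fails, so nothing tells the second worker that the job was already taken.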

  • @jlove1296
    @jlove1296 2 years ago +30

    I used to have zero understanding of the importance of clarifying requirements and used to ask typical dimension-clarifying questions without knowing how they could actually help me solve the problem at hand. You really showed how to incorporate the information gained by asking clarifying questions through this mock interview - appreciate it

  • @lasophistique3513
    @lasophistique3513 4 years ago +93

    What I loved most is, it actually went on like a real interview, where candidate doesn't know all the answers, and kept building on the assumptions, kept extending the designs. Thank you for making it look realistic!

  • @charan775
    @charan775 4 years ago +129

    Every time clement was the interviewer but now for a change clement is the interviewee.

    • @clem
      @clem  4 years ago +62

      Gotta flex my own skills _sometimes_ ! 😎

    • @kevinsheng8775
      @kevinsheng8775 4 years ago +1

      Lol clement does sound like an interviewee

    • @kevinsheng8775
      @kevinsheng8775 4 years ago +6

      It seems like the interviewer is getting educated from clement lol

    • @yunjiehong4649
      @yunjiehong4649 4 years ago +8

      Chinglin Sheng, an interviewer has no motivation to speak a lot, while an interviewee needs to think out loud

  • @AviralBajpai
    @AviralBajpai 3 years ago +87

    Drink every time he says interview.

  • @achilles165
    @achilles165 3 years ago +23

    Most of the time was spent on the build system; not many details were covered about the deployment system, like safety of deployment, health checks, configurable deployment speeds, rollbacks, global service health/availability checks to pause a rollout, and incident management. These things are the highest-order bit for any large-scale company, as no one wants a global or regional outage.

  • @zss123456789
    @zss123456789 4 years ago +30

    This is VERY helpful. This is literally better than any video that just tells you general tips that are hard to follow in an interview setting.

  • @ramkrishnakulkarni8289
    @ramkrishnakulkarni8289 4 years ago +33

    Clement you nailed it man. Fabulously awesome content. I’ve learned a lot from this. Please make more videos like this. 👏👏👏

  • @danielboteroc
    @danielboteroc 4 years ago +15

    Clement, nice interview, I liked it. Only one thing: I am not sure that two servers won't pick up the same job using the code you've written. You might need more complicated logic to pick up that job. If two workers do the select at the same time, both will get the same job unless you are locking the table.

    • @FlaviusAspra
      @FlaviusAspra 2 years ago +7

      The interviewee was quite weak I have to say.
      He could have done at least a SELECT FOR UPDATE SKIP LOCKED or equivalent.

  • @l_karuhanga
    @l_karuhanga 4 years ago +92

    No one:
    Clément: Thinking back to my experience with code deployment systems.

  • @AshutoshAnandashu
    @AshutoshAnandashu 4 years ago +47

    This was so good. Thank you @Clement for giving us an insight on what goes on in a real system design interview. Kudos to the interviewer as well. He asked brilliant questions and put up some interesting extra questions to make it difficult yet understandable.
    Subscribed ✅

  • @illuminek
    @illuminek 4 years ago +17

    @clement which tool do you use for white boarding and drawing for system design?

    • @nobytes2
      @nobytes2 3 years ago

      There are literally thousands of drawing apps.

  • @sergiiolshanetskyi5488
    @sergiiolshanetskyi5488 4 years ago +11

    Clement, thanks for the video. It would be nice to have a summary at the end - what the interviewer thinks you did well/what could have been better. Or the assumption is that you did everything perfectly fine?
    One specific question/concern I have is - the interviewer asked you if you can have concurrency problems when querying your queue log table from 100 workers. You said 'no' and wrote a query you would use. What transaction isolation level did you have in mind when you were creating a transaction in your query?
    If you use MySQL, for instance, with REPEATABLE READ (the default one), two or more workers can read the same record as the 'oldest job' at the same time (as a read is a non-locking operation by default), and all such workers will update the same record to 'running' and will pick up the same job.
    If you use SERIALIZABLE, you will run into a bunch of deadlocks: two or more workers can read the same record and then wait on each other to unlock the record so it can be updated.

    • @heraldo623
      @heraldo623 4 years ago +1

      Each Worker should own a private Job Queue. Make a Scheduler push new Jobs and balance the load between Queues. At any point in time the max number of concurrent queries over the same queue is 2: the Scheduler and the Worker. In most cases the Scheduler will query for non-running jobs to move them around and balance the load, so I think queries which change rows should lock those rows, because the Worker may be querying for the same rows at the same time. This way just one Worker is blocked while the remaining ones keep working on their own queues. I made a comment explaining my solution.

    • @sergiiolshanetskyi5488
      @sergiiolshanetskyi5488 4 years ago +3

      @@heraldo623 Thanks for the reply. I was not asking for an alternative implementation description (I can come up with several myself). I was specifically mentioning that
      1) there was no feedback at the end
      2) there seems to be an apparent concurrency problem with the existing approach. Unless I am missing something and it might still work. That is why I was asking in the first place.

    • @heraldo623
      @heraldo623 4 years ago

      @@sergiiolshanetskyi5488 Clement made a very high level design of that system. There was a time constraint for the entire process, 30 min. But they did not show that the system does its job in 30 min. They could have described the constraints in more detail (like the available network bandwidth of the workers and target machines) and details about the builder itself. How much time does the builder take to convert the code into a binary? With such a vague description of the system's responsibilities and constraints, one can't tell whether it works or not.

  • @alexfeguson1
    @alexfeguson1 3 years ago +11

    Great job. I would like to see feedback at the end of the interview, but that is just my humble opinion. Thanks

  • @AmrMostafaY
    @AmrMostafaY 4 years ago +13

    I loved this. Thanks so much for taking the time to put together this high quality video! I have a small comment that I can't get out of my head: the SQL transaction that workers use to grab jobs seems to have a race condition that might make 2 or more workers grab the same job. I don't think ACID here protects from that. Locking needs to be done somehow to prevent that, and the first option I'd explore, as it's the simplest, is to change the SELECT to SELECT FOR UPDATE (assuming the underlying SQL engine supports it), which is guaranteed to lock the resulting row for a single worker until it has had a chance to perform the update on the status. If SELECT FOR UPDATE is not supported, we can explore other options such as manual row-level or table-level locking.
    Great stuff, thanks again!

    • @ihorhula
      @ihorhula 4 years ago +3

      Was thinking the same :)

    • @jiaruitian677
      @jiaruitian677 3 years ago +1

      Same thought. The transaction's atomicity does not mean only one of the concurrent update requests will succeed.

    • @nameunknown007
      @nameunknown007 2 years ago +1

      Exactly my thought. Plus I was thinking why he wouldn’t use a tool built for queues instead of making sql fit into this use case. Especially when the interviewer asked him do you think this will serve the purpose and he doubled down on it lol. Note to myself: listen carefully when such a question comes up haha

    • @JohnTowell
      @JohnTowell 2 years ago

      Came here to say this

    • @swoogan
      @swoogan 2 years ago +2

      You can actually build a queue in SQL without SELECT FOR UPDATE, or any implicit locks. You do an update on the row, setting its state to RUNNING, where the id matches and the row is in the QUEUED state. If the affected rows is 0, the task has already been picked up. If it is 1, you have picked up the task. Note that autocommits need to be enabled on the UPDATE, or there is no difference from SELECT FOR UPDATE.
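
The conditional UPDATE described in the reply above can be sketched with Python's built-in `sqlite3`. The `jobs` table and the `claim_job` helper are hypothetical names for illustration, not the schema from the video. The point is that the affected-row count tells each worker whether it won the claim.

```python
import sqlite3


def claim_job(conn, job_id):
    """Try to claim a job; True only if this call flipped it from queued to running."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 affected rows means someone else got there first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'queued')")
conn.commit()

first_attempt = claim_job(conn, 1)   # the first worker wins the job
second_attempt = claim_job(conn, 1)  # the second worker sees 0 rows affected
print(first_attempt, second_attempt)  # True False
```

A worker whose claim returns `False` would simply move on to the next queued id. Because the check and the write happen in one statement, no separate read lock is needed for the claim itself.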

  • @hbgl8889
    @hbgl8889 3 years ago +11

    This is more like an interview about implementing a scalable queue.

  • @altereago1
    @altereago1 4 years ago +26

    Clement, I am curious why you didn't use a 'message queue' instead of the SQL table for task queueing?

    • @karimhasebou4252
      @karimhasebou4252 4 years ago +1

      I have the same question, why not something similar to Kafka.

    • @knightwfu
      @knightwfu 4 years ago +2

      I don't know what the answer is, but I would imagine that a message queue works for maintaining which requests are available, but by itself it's not enough. For example, what happens when the request fails or times out? How do you make sure that each request is processed by exactly one worker? How can you represent the status of all jobs, past and present?
      My first reaction is to implement this using AWS SQS; by itself, SQS FIFO might handle exactly-once processing and some basic request error handling, but based on the read/write volume and the use cases, I think a SQL table might be a great first pass.

    • @CodeCampaign
      @CodeCampaign 4 years ago +12

      Using a database for queuing is a famous anti-pattern.

    • @skiller1339
      @skiller1339 4 years ago +1

      @@knightwfu I think a message queue would actually be a better option. You could set the prefetch to 1 to make sure the worker only consumes one message at a time, and when a worker starts building you could dispatch an event. You would only acknowledge the message after the build is done; that way, if the worker stops and a new one starts up, it will just try the message again. Whenever something happens you could dispatch an event (for example BuildStarted, BuildFailed, BuildDone and BuildCancelled). These events would be consumed by another service which builds a representation of the current status. Also, when you want to cancel the current build, you could make sure that all messages for a certain job end up at the same consumer (for example with a consistent hash exchange in RabbitMQ).
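
To make the prefetch-and-acknowledge idea in the reply above concrete, here is a toy in-memory model in Python. It is not the RabbitMQ API; the `AckQueue` class and its method names are invented for this sketch. It only shows the behaviour being described: one unacknowledged message per consumer, and redelivery when a consumer disappears before acknowledging.

```python
from collections import deque


class AckQueue:
    """Toy queue: a message stays in flight until its consumer acknowledges it."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._in_flight = {}  # consumer name -> unacknowledged message

    def receive(self, consumer):
        """Deliver at most one unacknowledged message per consumer (prefetch of 1)."""
        if consumer in self._in_flight or not self._ready:
            return None
        message = self._ready.popleft()
        self._in_flight[consumer] = message
        return message

    def ack(self, consumer):
        """The build finished, so drop the in-flight message for good."""
        self._in_flight.pop(consumer, None)

    def consumer_died(self, consumer):
        """Put the unacknowledged message back at the front for another consumer."""
        message = self._in_flight.pop(consumer, None)
        if message is not None:
            self._ready.appendleft(message)


queue = AckQueue(["build commit a1", "build commit b2"])
first = queue.receive("worker-1")     # "build commit a1"
blocked = queue.receive("worker-1")   # None: worker-1 has not acked yet
queue.consumer_died("worker-1")       # worker-1 crashes mid-build
retried = queue.receive("worker-2")   # "build commit a1" is redelivered
queue.ack("worker-2")
next_message = queue.receive("worker-2")  # "build commit b2"
```

Note that redelivery means a build can start twice if the first worker crashed late, so the build step itself would need to tolerate being repeated.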

    • @ronquan3730
      @ronquan3730 4 years ago +9

      I think the thread above is great. A point I want to call out is that messaging systems like Kafka are durable message queues, in the sense that messages are persisted to disk. This actually could have been expanded on rather than going with the SQL table approach. If this system is used by the entire company for every commit of every repository, it won't scale well with a SQL table, as you'd have more writes than reads due to status updates. A messaging system would be more scalable, not just in durability but also in performance, as you can partition on the hash of the repository and preserve the order of the commits.
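
Partitioning on the hash of the repository, as suggested in the reply above, can be sketched as follows. This is an illustration in plain Python, not any particular broker's partitioner. The `partition_for` helper and the repository names are hypothetical.

```python
import hashlib


def partition_for(repo: str, num_partitions: int) -> int:
    """Map a repository name to a stable partition index."""
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across processes.
    return int.from_bytes(digest[:8], "big") % num_partitions


commits = [("payments", "c1"), ("search", "c2"), ("payments", "c3")]
partitions = {}
for repo, commit in commits:
    partitions.setdefault(partition_for(repo, 8), []).append((repo, commit))
```

Because every commit of one repository lands in the same partition, their relative order is preserved there, while different repositories can be consumed in parallel. Changing `num_partitions` later would remap repositories, which is one reason real systems fix the partition count up front or use consistent hashing.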

  • @Tourach74
    @Tourach74 3 years ago +7

    I would also have asked to which platform/OS we deploy: linux, windows, ... etc. Thoughts?

    • @nobytes2
      @nobytes2 3 years ago +2

      It's probably one OS, most likely Linux, unless it's some .NET or IIS product.

    • @shabyparveen1649
      @shabyparveen1649 3 years ago

      @@muthukumarankothandaraman2371 when you say global, the first thing I would ask is what you mean by global. This question was asked by the candidate in this mock interview. Then: what kinds of devices or machines are targeted, along with their geo-locations? Also, how are the distributed devices or machines supposed to receive or trigger the deployment build, etc.? Finally, there has to be some very generic common format among systems, like XML, HTML and scripts; basically, the last part of the design will define the technology stack. Very realistic interview. Love it. Btw, when I try to get into doing a design it also turns out to be a rabbit hole, but if one has clarity of thought then the entry and exit points will always be clear. Having a clear thought process is the challenge. 😔

  • @samehghanmi4487
    @samehghanmi4487 4 years ago +22

    Thank you Clément for the great video. I learned so much about System Design in less than an hour. Keep it up!

  • @CodeCampaign
    @CodeCampaign 4 years ago +1

    Using transactions does not solve the concurrency problem. You will have to use locking, otherwise multiple servers will be running the same build. One more thing: using a database as a queue is a famous anti-pattern, and this design will never scale to millions of builds per day. I really doubt anyone can crack design rounds at Google or Facebook by giving these answers, or maybe the bar has come down a lot in recent times.
    Read this article to know more: www.cloudamqp.com/blog/2015-11-23-why-is-a-database-not-the-right-tool-for-a-queue-based-system.html

    • @heraldo623
      @heraldo623 4 years ago

      The video is an ad of their courses... but that showed bad quality, maybe we are not the target for that. Newbies liked that.

  • @kevinmsft
    @kevinmsft 2 years ago +2

    This interview is clearly no hire.

  • @saadakhtar2000
    @saadakhtar2000 4 years ago +42

    You're great. Keep creating quality content for us!

    • @clem
      @clem  4 years ago +6

      I'll do my best!

  • @mikedqin
    @mikedqin 2 years ago +55

    Great interview video: what I liked the most is the interaction between the interviewer and interviewee. Some thoughts on the design: 1) the SQL database could be replaced by a queue system such as Kafka or Pub/Sub from GCP, which handles the scale very well with partitions. Given the large scale, the frequent queries and updates to the SQL database will harm performance, especially updating the workers' heartbeats every second in the table. So a message broker would be the best solution. Worker status / heartbeat should be monitored and reported to a monitoring system such as Prometheus. Workers can push their health check status to Prometheus. 2) The interviewer tried to ask for system design in a non-abstract way, meaning how many servers are needed, an analysis of bandwidth, timing, and storage in a datacenter, non-uniform load (x1.25), and then a 3-datacenter deployment in a region, to derive the footprint per datacenter and the global footprint. I hope the interviewee can elaborate on the non-abstract design more. Overall, a great system design interview. Thank you very much.

    • @artemtraer7615
      @artemtraer7615 2 years ago +1

      I guess the problem with Kafka is that it stores the messages in a topic only for some period of time. Also, how would you update a record inside the topic? But there could be a SQL-to-Kafka replica, though.

    • @mputcha
      @mputcha 2 years ago +3

      I had imagined a distributed queue and a separate process to store the jobs for retaining history if needed. DB for queue will be expensive. Maybe even a DB+ Redis Cache where the server pulled off from a queue structure if access to cloud services is a constraint

    • @dorcohen3522
      @dorcohen3522 2 years ago +2

      100% right. I can add two more things on top of this.
      1. Why is building SQL queries important for a system design interview? We should deal with a more abstract design, and maybe later it's possible to talk about SQL in depth, but they wasted so much time on it.
      2. The presented SQL query will not work properly under load. This is why a queue system is by far superior; I couldn't understand why he wouldn't just use that. We can retain the data for as long as we want and slap extra metadata on top.

    • @Core433
      @Core433 2 years ago +12

      I think SQL is indeed a better choice than message queue (i.e., Kafka) here. MQ for stream processing might be appropriate for super high QPS workflows, but assuming a giant tech company with 40k devs each with 2 commits per day, that's only 0.9 QPS builds coming into this build system, which is well within the abilities of a SQL DB and complete overkill for MQ. Additional reasons why MQ is not an ideal option:
      1) MQ tends to have low retention, you'll need to end up storing your jobs in some persistent store like SQL or NoSQL DB anyway to preserve full history
      2) It's difficult to parallelize MQ unless you have multiple partitions / topics which match the number of workers -- this is because coordinating multiple workers consuming the same MQ partition is difficult. Yet the low QPS of input to this system does not warrant partitioning to me.
      3) As the presenter explained, SQL DB has built-in consistency where build workers need to acquire a read/write lock to obtain the rights to build a job. MQ does not provide this out of the box, so it's up to client to coordinate worker pool. See point 2 above on why this is difficult.
      4) MQ does not preserve order of jobs between partitions, and it was a requirement that jobs be roughly started in the same order they come in.
      I think jumping to the conclusion that we need MQ for *any* kind of queue can be a detrimental mindset in system design, and could hurt a candidate in a system design interview. It's worth understanding the tradeoffs and explaining this to the interviewer.
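
The back-of-the-envelope figure in the reply above (40k developers, 2 commits each per day, about 0.9 builds per second) is easy to check. The helper name `average_qps` is made up for this sketch, and it computes a daily average only. Real traffic would peak during working hours, so the instantaneous rate can be several times higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(developers: int, commits_per_dev_per_day: float) -> float:
    """Average builds per second if every commit triggers one build."""
    return developers * commits_per_dev_per_day / SECONDS_PER_DAY


qps = average_qps(40_000, 2)
print(round(qps, 2))  # 0.93
```

Even with a generous multiplier for peak hours, this stays at a handful of writes per second, which is the basis for the argument that a single SQL table can absorb the load.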

    • @xtzyshuadog
      @xtzyshuadog 1 year ago

      @@Core433 How does one enter the career path of system design? What prerequisite knowledge, certifications, skills would you think they would need? Are there particular programming languages or tools for home situations one can practice in by designing their own simple small systems using free tools?

  • @thefunkyhermit
    @thefunkyhermit 2 years ago +9

    Great video Clement. Thank you for sharing an insight into how these interviews are conducted.
    2 things I would have prodded more for as the interviewer are:
    1) How do you deal with downtime while switching between the versions?
    2) How do you manage DB changes as well as the code changes discussed here.

  • @ashrafkotbofficial
    @ashrafkotbofficial 4 years ago +19

    This is really beneficial, thank you for taking the time to go through a full mock interview.

  • @scubasteve117
    @scubasteve117 3 years ago +2

    So I felt like you had an overly complicated solution for monitoring the workers. The problem parameters said that the deployment needs to be completed in 30 minutes. Figure out how long it takes to move built code to servers, let's say 10 minutes, and allow 20 minutes for the build. Record the last time the status changed, and when rows are RUNNING and the status was set more than 20 minutes ago, update the status to FAILED:
    UPDATE StatusTable SET Status = 'FAILED' WHERE Status = 'RUNNING' AND LastStatusUpdate < NOW() - INTERVAL 20 MINUTE;
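
A runnable version of the timeout sweep proposed above, sketched with Python's built-in `sqlite3`. The `jobs` table, the `last_status_update` column (stored as epoch seconds to keep the example simple), and the `fail_stale_jobs` helper are assumptions for illustration. The 20-minute limit is the figure from the comment.

```python
import sqlite3

BUILD_TIMEOUT_SECONDS = 20 * 60  # the 20-minute build budget from the comment


def fail_stale_jobs(conn, now, timeout_seconds=BUILD_TIMEOUT_SECONDS):
    """Mark RUNNING jobs whose status has not changed within the timeout as FAILED."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'FAILED', last_status_update = ? "
        "WHERE status = 'RUNNING' AND last_status_update < ?",
        (now, now - timeout_seconds),
    )
    conn.commit()
    return cur.rowcount  # number of jobs that were timed out


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "last_status_update INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO jobs (id, status, last_status_update) VALUES (?, ?, ?)",
    [(1, "RUNNING", 0), (2, "RUNNING", 1000), (3, "QUEUED", 0)],
)
conn.commit()

timed_out = fail_stale_jobs(conn, now=1500)  # only job 1 has been running too long
```

The trade-off against heartbeats is detection latency: a worker that dies one minute into a build is only noticed after the full timeout has elapsed.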

  • @iricrylale6589
    @iricrylale6589 3 years ago +1

    Do you have to design everything from scratch, like the queue? Why not pick something off the shelf directly?
    For the regional key-value store discussion, you did pick etcd or ZooKeeper, right?

  • @ikucuk
    @ikucuk 3 years ago +9

    Tip: Don't reinvent the wheel, just use a queue service or job scheduler service. Experienced senior engineers should know when to implement something and when not to. I wouldn't hire this candidate if this were a senior engineer interview.

    • @HallaBhalla
      @HallaBhalla 3 years ago

      Why do interviewers ask you to write merge sort when sort() is available in every language? Because the interview is all about finding out whether the candidate knows the basics well and can INVENT THE WHEEL if need be. Your senior engineer would probably get stuck if GCP / AWS stopped their services for some godforsaken reason.

    • @ikucuk
      @ikucuk 3 years ago

      @@HallaBhalla If the question is to design a queue service, then you implement it using the basics. If the question involves using a queue service, then you don't reimplement the queue service. I said a senior engineer should know when to implement something and when not to.

  • @notangryatanyone
    @notangryatanyone 4 years ago +157

    "My SQL is a bit rusty..."
    proceeds to write perfectly valid SQL

    • @andre1sk
      @andre1sk 4 years ago +24

      The SQL is not valid for the intended purpose: multiple workers can grab the same row, so you need SELECT ... FOR UPDATE, or something like FOR UPDATE SKIP LOCKED if supported.

    • @siobhanahbois
      @siobhanahbois 3 years ago +2

      Should've said "COMMIT TRANSACTION;" but ok

    • @SAKronikle
      @SAKronikle 3 years ago

      Also SQL Server uses single quotes for strings, not double quotes.

    • @Sudo4Life
      @Sudo4Life 3 years ago +1

      Starting a transaction isn't sufficient when using a typical isolation level. The interviewer was trying to hint at this, and that's why he asked for the SQL to be written. A correct statement would have been UPDATE ... WHERE id=? AND status="queued". When there are competing worker nodes, only one of them would get a result of 1 row updated, whereas the other node would get 0 rows updated. Otherwise both nodes would attempt to build and deploy over each other.
      Listen to your interviewer! When they ask "Do you think your current solution is good given the concurrency level you have?", you should probably stop and give it a thought.
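
The conditional UPDATE described here is a compare-and-set on the status column. The sketch below shows the idea with Python's built-in `sqlite3`; the `jobs` table, the `worker` column, and the worker names are hypothetical, and a single connection stands in for two competing workers.

```python
import sqlite3


def claim_job(conn, job_id, worker_id):
    """Try to claim a job; return True only if this worker changed QUEUED to RUNNING."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING', worker = ? "
        "WHERE id = ? AND status = 'QUEUED'",
        (worker_id, job_id),
    )
    conn.commit()
    # Exactly one row is updated for the winner; a late worker matches zero rows.
    return cur.rowcount == 1


def demo_claim():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)")
    conn.execute("INSERT INTO jobs VALUES (1, 'QUEUED', NULL)")
    first = claim_job(conn, 1, "worker-a")   # wins the job
    second = claim_job(conn, 1, "worker-b")  # status is no longer QUEUED
    owner = conn.execute("SELECT worker FROM jobs WHERE id = 1").fetchone()[0]
    return first, second, owner
```

A worker that gets `False` back simply selects the next queued job and tries again, so no explicit row lock is needed for correctness.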

    • @北斗星-y4x
      @北斗星-y4x 3 years ago

      I think ordering everything and then taking LIMIT 1 performs badly. It should be: select * from jobs where created_at = (select min(created_at) from jobs);
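
Both query shapes can be compared side by side. The sketch below is an illustration with Python's built-in `sqlite3`; the table, the index, and the sample rows are assumptions. Note that the subquery needs the same status filter as the outer query, otherwise it can match a job that has already finished. With an index on (status, created_at), a typical planner can answer either form from the index instead of sorting the whole table, although this depends on the database.

```python
import sqlite3

# Form used in the video: order by age and take the first row.
ORDER_LIMIT = (
    "SELECT id FROM jobs WHERE status = 'QUEUED' "
    "ORDER BY created_at ASC LIMIT 1"
)
# Form suggested in the comment, with the status filter added to the subquery.
MIN_SUBQUERY = (
    "SELECT id FROM jobs WHERE status = 'QUEUED' AND created_at = "
    "(SELECT MIN(created_at) FROM jobs WHERE status = 'QUEUED')"
)


def demo_oldest_queued():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, created_at INTEGER)"
    )
    conn.execute("CREATE INDEX jobs_status_created ON jobs (status, created_at)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [(1, "SUCCEEDED", 10), (2, "QUEUED", 30), (3, "QUEUED", 20), (4, "RUNNING", 15)],
    )
    by_order = conn.execute(ORDER_LIMIT).fetchone()[0]
    by_min = conn.execute(MIN_SUBQUERY).fetchone()[0]
    return by_order, by_min
```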

  • @ulanbayaliev6831
    @ulanbayaliev6831 4 years ago +3

    Please explain which system can have 1000 deployments/day? Keeping in mind that one complete deployment is 30 mins.
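
One way to reason about these numbers is Little's law: the average number of jobs in flight equals the arrival rate multiplied by the time each job takes. The sketch below uses the figures from the comment (1000 deployments per day, 30 minutes each) and assumes arrivals are spread evenly over the day, which real traffic is not.

```python
import math


def workers_needed(jobs_per_day, minutes_per_job):
    """Average number of jobs in flight (Little's law: arrival rate x duration)."""
    minutes_per_day = 24 * 60
    return jobs_per_day * minutes_per_job / minutes_per_day


average_in_flight = workers_needed(1000, 30)   # about 20.8 jobs at any moment
minimum_workers = math.ceil(average_in_flight)  # 21 workers before any headroom
```

So the load is sustainable as long as roughly 21 jobs can run in parallel, plus headroom for peaks.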

    • @heraldo623
      @heraldo623 4 years ago

      We don't know if that solves the problem because the problem itself is not well defined. The time constraint depends more on the builder than on the deploy system lol

  • @billb.8980
    @billb.8980 4 years ago +15

    Thank you for sharing insight into the process of system design. I think an important question relating to the regional swarms sharing new builds may have gone unanswered. Using this design, how do you determine which node(s) download the builds initially from GCS in order for them to be shared via the p2p network, while avoiding hundreds or even thousands of nodes simultaneously attempting that initial download?

    • @Tourach74
      @Tourach74 3 years ago +3

      I had the same thoughts. I think you could implement an election process to select 1 or more leaders who would be allowed to download from GCS. If the leader dies, the P2P network would then elect a new one.
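
A very small sketch of the election idea: if every peer sees the same set of live hosts, each one can pick the leaders deterministically by taking the lowest host IDs. The host names are hypothetical, and the shared membership view is the big assumption here; in practice it would come from a coordination service or a consensus protocol.

```python
def elect_downloaders(alive_peers, count=1):
    """Pick the peers allowed to download from blob storage: the lowest IDs win."""
    return sorted(alive_peers)[:count]


def demo_election():
    peers = {"host-07", "host-02", "host-11"}
    first = elect_downloaders(peers)   # host-02 is elected
    peers.discard(first[0])            # the leader dies
    second = elect_downloaders(peers)  # the survivors elect host-07
    return first, second
```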

  • @bhanumittal3038
    @bhanumittal3038 3 years ago +1

    Instead of a heartbeat, why can't we store the worker_id and IP of the worker in the database?

  • @uskro
    @uskro 3 years ago +1

    I don't get why you'd set a heartbeat to write to the db that frequently. You could just set a timestamp of assignment and your health check could just check the timestamp delta. You could have a read query executed every 5 minutes, to find the running operations updated over 15 minutes ago to find the problem operations. But you could also benefit from a worker identifier in the table, and your health check could also be a queue solution to check both with your table and the workers, maybe the operation is just taking longer than expected so you also wouldn't want to reassign the operation to another worker and have the code deployed by two different workers.

  • @iamworstgamer
    @iamworstgamer 4 years ago +21

    man they are so calm and composed. Indian interviewers get irritated when I ask them questions to understand their questions

    • @CesarOzuna
      @CesarOzuna 4 years ago +1

      Thank you for showing your prejudice for all to see. Btw this is a mock interview, not a real one. The Indians you speak so “highly” of are under pressure of a real interview.

  • @romulino
    @romulino 3 years ago +2

    Painful to watch him going down the SQL rabbit hole.
    You need something like Kafka for that; it was designed for that specific purpose.
    Any SQL database would just get on its knees and cry if you throw something like this at it.

    • @nands4410
      @nands4410 3 years ago

      Kafka would suit as a queue to retain messages, but wouldn't you still need to store the status of each job in some DB, to query all failed jobs etc.?

    • @nands4410
      @nands4410 3 years ago

      OK, I watched more of the video now and realized SQL was being used to consume messages for the worker.

  • @canadalove1506
    @canadalove1506 1 year ago +1

    Question: why didn’t you try using a queue service or Kafka service in the cloud instead of writing the jobs to a SQL database and then having workers poll the DB to start the build? Wouldn’t it be more efficient and simpler to use AWS SQS or an equivalent in GCP?

  • @dfeprado
    @dfeprado 4 years ago +9

    Amazing video! What's the software you're using to write (this "blackboard")?

  • @ClayShentrup
    @ClayShentrup 2 years ago +1

    this style of super open ended question is just so bizarrely unrealistic. i've worked in big complex systems doing things like video advertising analytics 12+ years. i was the 11th engineer at zendesk (shoulda stayed longer, DOH!). i rewrote a core backend system at a database graphing company similar to looker and tableau, etc. etc. this interview style bears no resemblance to what the real world is like. in real life, you have clearly defined problems and you don't have to go on a fishing expedition to understand your requirements. this is more of the industry's broken interview process.

  • @DA-vz4gs
    @DA-vz4gs 4 years ago +5

    Hi! Just to clarify, did he forget to add FOR UPDATE to SELECT and lock the job row before updating a status of it, so that some other worker that does the query at the same time will not take it as well?

    • @luisfguadagnin
      @luisfguadagnin 4 years ago +2

      That's what BEGIN TRANSACTION/COMMIT do. They make the code inside that block run as a unit, so nothing can happen between the SELECT and UPDATE queries.

    • @bossikom
      @bossikom 4 years ago

      @@luisfguadagnin Yes, but the next worker can still read this row, and during the update we can get an error, or the row can be updated by another worker anyway (it depends on the transaction isolation level and the database used). Writing FOR UPDATE will lock this row and make it unavailable to other transactions.

  • @Adam-ro2zv
    @Adam-ro2zv 3 years ago +7

    Hi Clement, you're really doing great work for people. I like the way you share this insightful content.

  • @deathbombs
    @deathbombs 4 years ago +6

    I love your alg interviews but found the sys interview unsatisfying in terms of high-level architecture and the details of the approaches used.
    Great scoping of the question tho!

  • @anamigator
    @anamigator 3 years ago +11

    In what universe does a 10 GB binary build within 30 minutes, let alone deploy across multiple locations?

  • @karnizland-russia
    @karnizland-russia 3 years ago +1

    You failed. The build job queue is a truly awful solution. It would be better to keep a list of available/free servers that can build the binary and, as soon as a commit appears, send it to an available server. If there are no free servers, don't build anything. That way the delay between the last build and the commit is minimal. The guy would already have failed the interview.

  • @akkshayganesh1136
    @akkshayganesh1136 4 years ago +37

    Brilliant setup. Great cross questioning and really makes the viewer think. Good going Clement!

  • @Funsiestype
    @Funsiestype 3 years ago +1

    We all wish we have interviewers like this guy who is nice and giving pointers to help the candidates as they design. Almost all the fuckers are busy typing and taking notes and trying to fuck you up.

  • @TheVigilancer
    @TheVigilancer 3 years ago +16

    Feels like you completely ignored “under 30 minutes” requirement. Am I missing something?

  • @ThatCrazzyRussian
    @ThatCrazzyRussian 3 years ago +1

    It is odd that Clément worked at Google because the deployment system he created has nothing to do with what is used at Google. (Contrary to what the interviewer suggested, I worked there) In fact, it would not be used anywhere. Simply put, it is very rarely acceptable to deploy to all of the machines at the same time. Any non-trivial application would have a non-zero startup time (often measured in minutes) and you don't want your service to go dark while the fleet is ramping up. You would use a gradual rollout, often with multiple stages, and have a dedicated system which manages the whole process.
    I was going to buy SystemsExpert, but this and a couple of smaller misses gave me pause. The video does give me some interesting ideas on how to communicate in a system design interview, though. Start with the bird's-eye view, identify the blocks and then address them one by one. Be very methodical, very pedantic, very thorough. Go into an obscene amount of detail. Check back with the interviewer regularly. Clarify the constraints in the beginning, even the ones that seem self-evident. I tend to jump quickly to the trickiest areas and bottlenecks, so this is quite helpful for me.
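
The gradual rollout mentioned in this comment can be sketched as splitting the fleet into cumulative stages. The percentages below (1%, 10%, 50%, 100%) are an assumption chosen for illustration, not figures from the video or from any particular deployment system.

```python
def rollout_stages(hosts, percents=(1, 10, 50, 100)):
    """Split hosts into stages so that each stage raises the cumulative share of the fleet."""
    stages, done = [], 0
    for pct in percents:
        # Cumulative target for this stage, with at least one new host per stage.
        target = min(len(hosts), max(done + 1, len(hosts) * pct // 100))
        if target > done:
            stages.append(hosts[done:target])
            done = target
    return stages


fleet = [f"host-{i:03d}" for i in range(200)]  # hypothetical fleet of 200 machines
stage_sizes = [len(stage) for stage in rollout_stages(fleet)]
```

A rollout controller would deploy one stage, check health signals, and only then move to the next, which is what keeps a bad build from taking the whole service down at once.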

  • @timurkhadimullin8211
    @timurkhadimullin8211 3 years ago +4

    I am not entirely convinced that SQL scales horizontally, especially when you realise how many records one needs to update in this single table. What I think it does well, though, is illustrate the point and allow one to put some concrete details into the solution.

    • @Mocha.
      @Mocha. 2 years ago +1

      Realistically you would move the completed transactions into a permanent storage log through a separate service to maintain SoC. Definitely do not want to keep them in a working table. IMO it's good that Clement makes oversights because that is what makes the interview authentic.

  • @ViktorFerenczi
    @ViktorFerenczi 4 years ago +4

    I'm surprised that nobody commented that SQL may not be the best choice to implement such a system. Think Apache Kafka...

  • @iorekby
    @iorekby 4 years ago +6

    Great to see this sort of thing. So many people trying to get in to tech are obsessed just with software engineering. There's a whole gamut of careers and skill sets in tech, and hopefully this will open people's eyes to the opportunities aside from software engineering (nothing wrong with software engineering, btw).

    • @danielrocks234
      @danielrocks234 2 years ago

      This kind of interview *is* for software engineering. This is what senior SWEs at major tech companies go through in the interview process.

  • @JayChaimArbiv
    @JayChaimArbiv 4 years ago +3

    What application are you using for the drawing?

  • @chainer22
    @chainer22 4 years ago +8

    Amazing content! Thank you for making this. I have a question regarding the peer-to-peer deployment solution proposed. How does that work? Are there any existing tools or services that provide this capability?

  • @axelriet
    @axelriet 2 years ago +2

    So it’s a deployment system, and out of 59 minutes it took the first 52 minutes to talk about things that were NOT deployment-related. What was (would have been) the result of this interview if it was real? Hire/no hire? If hire, how would the candidate be leveled based on that particular presentation? L3, L4, above? There was nothing really deep or complex here. Just curious.

  • @Nainara32
    @Nainara32 4 years ago +1

    Is the queue + polling approach necessary for a company with infinite compute resources like google? Couldn't you just monitor utilization of the worker pool and dynamically scale it up to meet surge demand? This way, your "queue" microservice could just log the job state and propagate the build call on to the next available worker node. Would your score get docked for answering this way?

  • @sonicjetson6253
    @sonicjetson6253 2 years ago +1

    Dude cut down on the exaggerated acting blowing simple things out of proportion

  • @sakex
    @sakex 3 years ago +1

    For the queue idea, you should really go with something like RabbitMQ.

  • @XYZ-nz5gm
    @XYZ-nz5gm 2 years ago +1

    If I get that question during interview I’d say I’m not a DevOps r u high?

  • @MyGroo
    @MyGroo 4 years ago +1

    For the queue of jobs, isn't the SQL database of jobs the single point of failure here? Perhaps I missed something in this part, but this is the part where it seems like there would be a solution like a messaging queue with clustering or whatever way of replication.

  • @rajasubasubramanian9365
    @rajasubasubramanian9365 4 years ago +2

    @Clément Mihailescu: How do we ensure the goal state is set correctly? Say commit1 generates build B1, which we would like to deploy, and then another engineer makes commit2, which generates build B2, which we would also like to deploy. B2 includes B1's changes, so when the builds are deployed in the order B1 -> B2 there is no issue. But what happens when the builds are deployed in the order B2 -> B1? How does the system handle that? Or am I missing something here?
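
One possible guard against the B2 -> B1 ordering is to attach a monotonically increasing sequence number to each build and only ever move the goal state forward. The sketch below assumes such a number exists (for example, the commit's position on the main branch); the video does not specify this, so treat it as a hypothetical design.

```python
def advance_goal_state(current, candidate):
    """Return the build that should be the goal state.

    Each build is a (commit_sequence, build_id) tuple; older builds never
    replace a newer goal state.
    """
    if current is None or candidate[0] > current[0]:
        return candidate
    return current


def demo_out_of_order():
    goal = None
    # B2 finishes building first, then B1 arrives late.
    for build in [(2, "B2"), (1, "B1")]:
        goal = advance_goal_state(goal, build)
    return goal
```

With this rule the late B1 is still built and stored, but it never becomes the version the fleet converges on.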

  • @alooydaboss9262
    @alooydaboss9262 3 years ago +1

    I'm tired of seeing your gf in every video I click lol
    Love your videos, keep it up

  • @dilp79
    @dilp79 2 years ago +1

    Which SW engineer level could a person aim for after passing a similar interview in a similar way?

  • @zelimkhandurdishev7225
    @zelimkhandurdishev7225 4 years ago +4

    Hi Clement.
    For which level of programmer is this kind of system design interview intended?
    For example:
    L3
    L4
    L5
    (as at google)

    • @elliemay1748
      @elliemay1748 4 years ago

      I don’t know about Google levels, but it’s common at a senior level.

  • @Mr5nan
    @Mr5nan 2 years ago +1

    0:15 actually we did see that one coming from the video title 😂😂

  • @housseinimaiga6913
    @housseinimaiga6913 3 years ago +4

    Hi Clément, great video. I have one question regarding your design, especially the part where you are fetching "QUEUED" jobs. What will happen if two "build servers" select the same record at the same time, given that transactions only prevent DML operations? From my understanding, without locking or skipping rows during selection, both could update the record at some point, no? What do you think?

    • @vladimirbrable
      @vladimirbrable 3 years ago

      Race condition happens. They both get the same job Id.

    • @nimrodshneor6039
      @nimrodshneor6039 3 years ago

      You would likely need to use SELECT .. FOR UPDATE to lock the row.

  • @BiancaAguglia
    @BiancaAguglia 4 years ago +17

    This was very useful. Not only did it help me get a better understanding of systems design from a software development perspective, it also gave me some ideas for better designing data science specific projects.
    Thanks to both of you. Keep up the great work. 😊

    • @stepanseliuk8042
      @stepanseliuk8042 4 years ago +2

      I can recommend “grokking the system design interview” on educative.io and github.com/donnemartin/system-design-primer#index-of-system-design-topics

    • @PoulJulle-wb9iu
      @PoulJulle-wb9iu 4 years ago

      @@stepanseliuk8042 bra

  • @ShubhamSingh-ku2ow
    @ShubhamSingh-ku2ow 3 years ago +11

    Such an AMAZING video Clement. Before this video, I went through a few others, but this one literally beat them all. The quality of the questions asked and the way you responded gave a Google feel. Thanks A LOT for bringing such high-quality content. Absolutely terrific.

  • @cliffrosen3605
    @cliffrosen3605 3 years ago +1

    There is a pretty glaring issue with this discussion. The discussion/solution proceeded as though there was only a single application, which was deployed to every server. Maybe that’s not absurd in its own right, but the problem states that there were around 5,000 new builds per day. That would mean this one application was updated every 15-20 seconds!
    Clearly, a system that can deploy 5,000 different builds per day only makes sense if the system is supporting multiple different applications. And, providing such support would require a much more involved design than what was discussed.
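
The "every 15-20 seconds" figure follows directly from the build count. A quick check, assuming builds arrive evenly around the clock:

```python
def seconds_between_builds(builds_per_day):
    """Average gap between builds if they are spread evenly over 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return seconds_per_day / builds_per_day


gap = seconds_between_builds(5000)  # 86400 / 5000 = 17.28 seconds
```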

  • @kalmurza
    @kalmurza 4 years ago +12

    Great video. I totally recommend pausing the video after each feedback from interviewer and try to continue the design based on his comments. And great job, Clement. You basically recaptured Gitlab CI’s structure :)

  • @cloud5887
    @cloud5887 4 years ago +2

    Question: Clement says at 41:40 that this design is super horizontally scalable. How is it so scalable when we use a SQL database? I thought SQL databases scale badly horizontally.

    • @mayuragrawal4352
      @mayuragrawal4352 3 years ago

      SQL databases are not horizontally scalable in the true sense. However, you are not really talking about a large volume of data here. A simple SQL DB can easily meet his requirements, as the data volume is pretty low.

  • @sangramjitchakraborty7845
    @sangramjitchakraborty7845 4 years ago +1

    You couldn't make the video 2 secs longer and make it a full hour? Glanced at the length and ocd just punched me in the gut.

  • @samuelchien821
    @samuelchien821 4 years ago +1

    Instead of using a SQL database to back the queue and store worker server health, couldn’t we use an off-the-shelf solution like SQS, which is durable and handles the queue dying? Then use Lambda to consume the queue, which would handle servers dying as well as the concurrency problem.

  • @mullergyula4174
    @mullergyula4174 2 years ago +1

    This is more about how to communicate than anything else. It is hard to take such thing seriously, real life tasks are usually more complex and you don't start from scratch, still seems useful.

  • @aditnegi2577
    @aditnegi2577 3 years ago +2

    A message queue and a log file: doesn't that make more sense? Log files are cheaper and provide fast writes.

  • @bryanurizar
    @bryanurizar 3 years ago +3

    This was extremely interesting! I have no background in system design, but really thought this was cool!

  • @animalrocket4809
    @animalrocket4809 2 years ago +2

    This was really good. I often paused the video to offer my 2 cents and thought out loud. Really liked the collaboration between Clement and the interviewer. Also, one subtle thing is Clement’s fearlessness in asking even the simplest questions and repeating things in his own words. This is what allowed him to really build an understanding of the problem, as the interviewer’s words may not click the first time and they are meant to be ambiguous so that they foster clarifying questions.
    His interviewing skills really showed off here. So much to learn!

  • @rajanparmar1269
    @rajanparmar1269 4 years ago +11

    Hey Clement, can you please make one on an ad system like Google Ads and how targeting works? I was completely stuck in my Facebook interview.

    • @humann5682
      @humann5682 4 years ago

      Just to be clear: do you mean from an engineering perspective or a marketing perspective?

  • @notoriouscrt8684
    @notoriouscrt8684 4 years ago +3

    What hardware do you use to draw like that? Is it just a Wacom tablet?

  • @phillipsmyth01
    @phillipsmyth01 3 years ago +1

    Excellent video, I don't agree with all of your implementation choices, but wonderful content, thanks!

  • @NitinSingh-xg9rw
    @NitinSingh-xg9rw 3 years ago +2

    Good video, but 1 hour to design all this? Usually you have 30 min. I think you were saying a lot of unnecessary stuff while designing.

    • @steven7846
      @steven7846 3 years ago

      I agree, he used a lot of words to say more or less the same thing. That could have been a 30-minute video.

  • @art9soft
    @art9soft 4 years ago +2

    Btw, you’ve run out of time; it took more than 45 minutes. And I guess in reality this interview would probably get negative or mixed feedback because of the lack of a deep dive into any component :)

    • @heraldo623
      @heraldo623 4 years ago

      This video is an ad, I think the complete content is in their new platform. He did not solve the given problem.

  • @danielgerlag2369
    @danielgerlag2369 4 years ago +1

    There is a race condition in this design... simply selecting the next queued job does not give you a write lock on that row, so 2 workers could select it and try to update it, and they will both succeed, resulting in a duplicate parallel build on a second worker. Could add an 'AND status = "queued"' to the update clause to avoid this.

  • @zeno_aratus
    @zeno_aratus 3 years ago +7

    I've always found estimating workload to be a challenging part of design, especially when you do not have as much information as you'd like to go on. So in one instance I have gone for determining the nightmare scenario and developing system specs based on that.

  • @koga7349
    @koga7349 3 years ago +2

    I think the overall solution is good although I don't love the polling all over the place. I think using webhooks or as another comment mentioned a message queue would be a great improvement.

  • @Equ1n0x88
    @Equ1n0x88 4 years ago +1

    Sorry to say, I found this not very relevant. You went way over the 45 minutes which most people are given and he went directly into the question. Usually, you have 40 minutes or so because the interviewer starts with a general question about your resume. Then, you didn't specify how many VMs are there gonna be, where exactly they are gonna be deployed, what the size of your database is etc. In my experience, Google ALWAYS makes a point about asking almost exact numbers, such as how many machines will this service be on, what kind of machines etc.
    I don't mean to be mean, but I can easily see the interviewer failing you, if he so desired.

  • @Andrew-ez9ft
    @Andrew-ez9ft 4 years ago +2

    I am first. I want to ask something.
    1. Can mentally challenged become good programmers???
    2. What is the best way for a self taught person to become a software engineer in 3 to 6 months???
    Clement, you have awesome videos!!! Please make a video about a good curriculum for people being self taught into fullstack engineering. What is the best curriculum???

    • @stepanseliuk8042
      @stepanseliuk8042 4 years ago

      Jew, 10 years ago be a full stack engineer was a feasible mission, but not now ;-) or you will be late everywhere.
      To be a good software engineer, first understand what real-world problems you want to solve. After that you can choose an appropriate stack of technologies and start learning it. Beyond that there is general CS: data structures, algorithms, logic and some math. If you don’t know why you want to be in IT, then you can choose a popular language like JavaScript, start learning, and build pet projects as checkpoints to retain your knowledge. Solve LeetCode problems in the background (this way you get exposed to different aspects of CS).

  • @michalfuka5322
    @michalfuka5322 4 years ago +6

    I think "begin transaction" at ~27:39 does not really mean that the select is concurrency-safe. I would rather use something like a "select ... for update" statement (depends on the db). I know he assumes a "serial sql job service", but it's not really prepared for pooling or horizontal scaling from the SQL point of view.

    • @ankurlucknowi
      @ankurlucknowi 4 years ago

      Yes, you need some kind of locking mechanism.