Apache Flink - A Must-Have For Your Streams | Systems Design Interview 0 to 1 With Ex-Google SWE

  • Published: 26 Aug 2024

Comments • 43

  • @popricereceipts4279 • 15 days ago +1

    This video really helps with understanding the need for stream processing frameworks.

  • @harris1801 • 24 days ago +2

    just commenting to make sure the algo knows the streets heck with Jordan

  • @maxvettel7337 • 1 year ago +4

    By the way, Flink was in the video about the Web Crawler.
    It's sad that I can't use streams at work

  • @AntonioMac3301 • 1 year ago +3

    Yo, you should make some playlists and group the videos by topic - I think all the stuff on streams is pretty fire, so it'd be cool to have it all in one place, same with batch processing

    • @jordanhasnolife5163 • 1 year ago

      I may do this, but for now they're all under one systems design playlist, in order, so all the batch/stream material is near one another

  • @grim.reaper • 1 year ago +6

    I rewatch your videos all the time because your explanation is really helpful!!
    Do you recommend any resources for reading? And trying out these in code?

    • @jordanhasnolife5163 • 1 year ago +2

      Honestly, for anything that doesn't make sense, I'd just look up the official docs.
      As for coding - good question. It's all open source, so you're always welcome to check out the code

  • @tarunnurat • 6 months ago +4

    Hey Jordan, I'm having difficulty understanding the "exactly once" aspect here. Say some messages were processed by a consumer right after a checkpoint, and then the consumer goes down. When the consumer comes back up, it would be initialized from the snapshot at the last checkpoint, and it would reprocess the messages that come right after the checkpoint, right?

    • @jordanhasnolife5163 • 6 months ago +4

      Yes, but it would be reinitialized with the state that it had at the checkpoint. So reprocessing these messages would lead to identical state as it would have had before.
      Basically, exactly once processing doesn't mean that each message is only processed exactly once. It just means that the state that is generated in Flink will be as if each message had only been processed once.
      If for whatever reason Flink is going to do something that isn't idempotent (like sending an email), that can happen multiple times.
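The restore-and-replay behavior described above can be sketched as a toy Python simulation (illustrative names only, not actual Flink code):

```python
# Toy simulation of checkpoint-and-replay "exactly once" state, not real Flink code.
# State is a running count per key; a checkpoint saves (state, offset in the log).

def process(state, message):
    key = message["key"]
    state[key] = state.get(key, 0) + 1
    return state

log = [{"key": k} for k in ["a", "b", "a", "c", "b", "a"]]

# Run up to a checkpoint barrier after offset 3, snapshotting state + offset.
state = {}
for offset in range(3):
    process(state, log[offset])
checkpoint = (dict(state), 3)  # snapshot taken at the barrier

# Keep processing, then "crash" before the next checkpoint.
for offset in range(3, 5):
    process(state, log[offset])
# (crash: any in-memory state built after the barrier is lost)

# Recovery: restore the snapshot and replay everything after the barrier.
restored_state, restart_offset = dict(checkpoint[0]), checkpoint[1]
for offset in range(restart_offset, len(log)):
    process(restored_state, log[offset])

# The rebuilt state is as if every message had been processed exactly once.
expected = {}
for message in log:
    process(expected, message)
assert restored_state == expected
print(restored_state)  # -> {'a': 3, 'b': 2, 'c': 1}
```

Messages between the barrier and the crash do get handled twice, but because the state is rolled back first, the end result is identical to single processing.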

    • @yaswanth7931 • 3 months ago +1

      @@jordanhasnolife5163 Suppose the above case happens to consumer 2 in the example from the video. Then the messages will be duplicated to consumer 3, right? Consumer 2 will process those messages again and put them in the queue again for consumer 3. So how can we guarantee that each message affects the state only once? Am I missing something here? Please explain.
      Is it that if one consumer goes down, then all the consumers go back to their previous checkpoint?

    • @jordanhasnolife5163 • 3 months ago

      @@yaswanth7931 The state that gets restored to each consumer is what was present when the snapshot barrier reached that consumer. Hence, we can ensure that all messages before the snapshot barrier have been completely processed, and we can restart by processing all messages after the snapshot barrier.

    • @gtdytdcthutq933 • 2 months ago +1

      So even when just one node crashes, we restore ALL consumers from the last snapshot?

    • @jordanhasnolife5163 • 2 months ago +2

      @@gtdytdcthutq933 Double check to confirm, but I believe if we want the whole "exactly once" processing, then yes

  • @jykimmm • 17 days ago +1

    Hi Jordan, thanks for the video - I feel like we haven't touched on how the producer-to-broker portion of streaming is fault tolerant. What if the producer goes down before receiving an acknowledgement from the broker, or if the broker goes down? In any failure scenario, how can we ensure the message was processed?

    • @jordanhasnolife5163 • 16 days ago

      Flink isn't really responsible for making producers fault tolerant. It just ensures that once the message hits Kafka, it will be processed.
      If you want to make a producer fault tolerant, you can do the same things we normally look to do (replicate it, have a mechanism for failover).
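For the producer-to-Kafka hop specifically, the usual knobs are the standard Kafka producer settings; a config fragment (values illustrative) might look like:

```properties
# Producer-side settings that make the producer-to-broker hop safer.
acks=all                  # wait for all in-sync replicas to acknowledge a write
enable.idempotence=true   # broker deduplicates retried sends
retries=2147483647        # retry transient failures instead of dropping
delivery.timeout.ms=120000
```

With idempotence enabled, a retry after a lost acknowledgement does not duplicate the record in the log; a producer that dies entirely still needs failover as described above.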

  • @dibll • 8 months ago +3

    Jordan, from 5:30 onwards I wasn't able to understand how Flink provides exactly-once processing of messages. You used the terms checkpoints and snapshots - are they the same thing? If not, what's the difference, and when do we take one over the other? Also, when we save the state of the consumers, are we saving all the events that are in memory of that consumer? And why do we need to replay the queue once we restore the state from S3 - if we replay, wouldn't it process all the messages one more time? I think my confusion is with the word "replay": does it mean sending all the messages which have not yet been acknowledged by the consumer, or something else? Could you please explain in layman's terms once more? Thanks in advance!

    • @jordanhasnolife5163 • 8 months ago +1

      Hey! Basically I use the terms checkpoint and snapshot interchangeably.
      The idea is that our checkpoints are occasional and so it only captures a certain amount of state, but some events may have been processed after one checkpoint but before the next. When we restart, we need to replay the events after our checkpoint barrier (even if they had already been handled before the failure) so that we can rebuild the state in our consumers that had not been saved in our checkpoint.

    • @mocksahil • 7 months ago +1

      @@jordanhasnolife5163 Does this mean that if we're running a counter at the end we risk extra counts for those items post barrier that may already have been processed once? Does this mean we need to incorporate a UUID or something to double check in the consumer?

    • @jordanhasnolife5163 • 7 months ago

      @@mocksahil If the counter is within flink, then no, these will be processed exactly once! However, if your counter is in an external database, and you're incrementing it, you may hit that multiple times!
      In which case you've hit the nail on the head, using a UUID as some method of maintaining idempotence can stop us from double counting.
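The UUID/idempotence idea can be sketched as a toy in-memory counter (illustrative, not a real database sink):

```python
# Toy idempotent sink: increments to an external counter are deduplicated by
# message UUID, so a replayed message cannot double-count. Names illustrative;
# a real system would keep the dedup set in the database, alongside the counter.
import uuid

class IdempotentCounter:
    def __init__(self):
        self.count = 0
        self.seen_ids = set()

    def increment(self, message_id):
        if message_id in self.seen_ids:
            return  # replayed message: already applied, skip
        self.seen_ids.add(message_id)
        self.count += 1

counter = IdempotentCounter()
msg_id = str(uuid.uuid4())
counter.increment(msg_id)
counter.increment(msg_id)  # replay after a crash: no double count
print(counter.count)  # -> 1
```

The check-and-apply step would need to be atomic in a real store (e.g. a unique-key constraint on the message ID), otherwise a crash between the two operations reopens the window.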

    • @Aizen4661 • 1 month ago

      Hey, how does Flink handle acknowledgement failures? If a message is processed but fails to be acknowledged, will the message be replayed?

  • @shibhamalik1274 • 5 months ago +2

    Hi @jordan, does Spark cache data, or does it do a fresh fetch of data on every run of its pipeline? I know there is a cache API we can use, but if we don't use it, does Spark fetch the same data from the DB (Spark SQL) every time we run the pipeline? If not, how is it so efficient and fast without a cache?

  • @kaqqao • 4 months ago +1

    I was really expecting you'd mention Kafka Streams and contrast that with Flink 😶
    I'm struggling to understand when I'd need which. Maybe a topic for a future video?

    • @jordanhasnolife5163 • 4 months ago +1

      Perhaps so! As far as I know they're basically the same, but if I can find a stark difference I'd be happy to make a video!

    • @kaqqao • 3 months ago +2

      You're such a legend for replying to comments! Thank you! 🙏

  • @nithinsastrytellapuri291 • 5 months ago +1

    Hi Jordan, since this is a distributed system and each consumer is passing messages to the other, does Apache Flink use the Chandy-Lamport algorithm to take distributed snapshots?

  • @levyshi • 3 months ago +1

    In the video you mentioned that Flink is for the consumers - does that mean a Flink node would sit between a broker and the consumer?

  • @grim.reaper • 1 year ago +2

    Was waiting for this after watching the last video 🥹

  • @michaelencasa • 6 months ago +1

    Hey, great content. I'm new to these scalable systems, so maybe I'm misunderstanding: when messages need to be replayed after a Flink consumer crashes, do those messages come from Kafka's log of already-processed messages? If yes, can the process of Flink coming back online, reading its snapshot, and replaying all messages after the most recent set of barrier nodes be automated? Or will that likely be a manual process? And I'm still waiting on my foot pics. 😢❤

  • @tunepa4418 • 1 year ago +7

    not really a big deal but I have noticed that your mouth is always out of sync with your voice in most of your videos

  • @chrishandel2006 • 2 months ago +1

    0:38

  • @ganesansanthanam-5896 • 2 months ago +1

    Please accept me as your mentee

    • @jordanhasnolife5163 • 2 months ago

      I'm sorry man I'm pretty pressed for time these days, perhaps you could find one amongst my other gigachad viewers or go asking on blind/linkedin