Martin Kleppmann | Kafka Summit SF 2018 Keynote (Is Kafka a Database?)

  • Published: 15 Sep 2024

Comments • 48

  • @markodimic
    @markodimic 5 years ago +66

    This video should be titled - The best ACID explanation on the whole f-in Internet

    • @homerg
      @homerg 4 years ago

      LMAO

    • @traviscramer
      @traviscramer 4 years ago

      was about to comment with something similar, but you said it perfectly

  • @_sudipidus_
    @_sudipidus_ 5 years ago +31

    For those who liked this talk, I strongly recommend reading his book, Designing Data-Intensive Applications

    • @MrJoao6697
      @MrJoao6697 4 years ago +5

      This guy is a fucking guru, the book is amazing.

    • @_sudipidus_
      @_sudipidus_ 7 months ago

      Here, 4 years after writing this comment, to revisit this topic haha

  • @YoyoMoneyRing
    @YoyoMoneyRing 4 years ago +8

    I was reading his Designing Data-Intensive Applications book and could relate everything mentioned in the Transactions chapter to this video. I would highly recommend reading the Transactions chapter and then watching this video

  • @rpisal1976
    @rpisal1976 5 years ago +15

    Thanks for a very detailed explanation around this topic, Martin. When you started explaining consistency, one question popped into my mind in relation to the account debit/credit example. In a case where I have split the debit and credit transactions into two separate stream processors, they have in turn lost knowledge of the overall 'account transfer' action, haven't they? You mentioned consistency is a property which needs to be enforced by the application and not really in the DB. Does it mean both these stream processors are returning some response (status of the transaction) back to the application? Consider the following scenario.
    1. Event was put on first (parent topic) which is processed by first streams processor and two events (one debit and one credit) get put into two separate topics.
    2. When the debit stream processor tries to debit account 12345, if the debit would make the account balance go negative, the transaction will fail.
    3. The credit stream processor will succeed in crediting account 54321, which kind of makes the whole transaction not succeed, as it did not do what it was supposed to.
    How will the above scenario be handled? Only if the application is notified of both of these and then initiates a compensating (reversing) event, right?

    • @josemose2000
      @josemose2000 5 years ago

      Maybe there could be a process before debiting/crediting that checks that the transaction will succeed first.

    • @anandsakthivel9425
      @anandsakthivel9425 5 years ago

      Hi Rahul,
      Great question. When I thought about this design with your question in mind, I think they would validate the balance of the transaction owner and then put the transfer event into the parent topic.

    • @MahmoudTantawy
      @MahmoudTantawy 3 years ago +1

      Validating the balance of the transaction owner does not really ensure anything, because you might have debit events that were not applied yet. In an async system you can't have a reliable synchronous check/validation step; you have no idea how many async tasks/events have not been applied yet.
      Example:
      1. User A withdraws money
      2. Validation happens and there is money in the balance
      3. The system that applies the debit event is slow or down or whatever
      4. User A immediately afterwards transfers an amount of money to User B
      5. Validation happens and still sees there is money in the balance (remember, in an "eventually-consistent" system, there is no guarantee of "when")
      6. The transfer is OK'ed and proceeds
      Now you have a negative balance.
      A similar concern is the idea of different systems seeing different balances: yes, they all converge to the same balance, but "when"!

    • @alvin3832
      @alvin3832 2 years ago

      @@MahmoudTantawy I think the validation process can be included in the same transaction and done synchronously before the money transfer.
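
The compensating (reversing) event this thread converges on can be sketched roughly like this. A minimal sketch only: the account numbers come from the question above, and every helper name is hypothetical, not from the talk.

```python
# Sketch of the compensating-event idea discussed above.
# A transfer is split into a credit and a debit event; if the debit is
# rejected (it would make the balance negative), the application emits
# a reversing credit. All helpers here are hypothetical.

balances = {"12345": 50, "54321": 0}

def apply_debit(account, amount):
    if balances[account] - amount < 0:
        return False                    # debit rejected: would go negative
    balances[account] -= amount
    return True

def apply_credit(account, amount):
    balances[account] += amount

def transfer(src, dst, amount):
    apply_credit(dst, amount)           # credit processor applies its half
    if not apply_debit(src, amount):    # debit processor rejects its half
        apply_credit(dst, -amount)      # compensating (reversing) event
        return "compensated"
    return "committed"

print(transfer("12345", "54321", 100))  # prints "compensated": only 50 available
```

Note the window between the credit and the reversal: another reader can still observe the intermediate state, which is exactly the eventual-consistency caveat raised in the replies.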

  • @adamosloizou538
    @adamosloizou538 3 years ago +8

    While I really enjoyed Martin's beautiful talk on Transactions (ruclips.net/video/5ZjhNTM8XU8/видео.html) I find this one at odds with it.
    Martin's take at Atomicity with Kafka is misleading to me.
    Perhaps Kafka's log could be used as an ingredient to manually, painfully build a distributed database (a very hard problem) but it sure doesn't give you enough out-of-the-box.
    From wikipedia's definition (en.wikipedia.org/wiki/Atomicity_(database_systems)) I'd like to bring to your attention this:
    "A guarantee of atomicity prevents updates to the database occurring only *partially*..."
    and
    "As a consequence, the transaction *cannot be observed to be in progress by another database client*. At one moment in time, it has not yet happened, and at the next it has already occurred in whole (or nothing happened if the transaction was cancelled in progress)."
    Now, consider Martin's example of transferring an amount between 2 bank accounts.
    Important: In practice, the observable behaviour of the total distributed system will be at the end storage DB(s). Not at the Kafka level. This is a crucial detail left out. This is the point where the last stream processors will write those 2 account-based events, produced by the original event.
    At the point of splitting the original transaction to 2 isolated events we effectively lose the atomicity.
    Why?
    Because the 2 stream apps are not aware of each other and don't coordinate.
    So, in practice you'll definitely be getting partial state at the end DBs!
    Best case, a temporary one (i.e. the time it takes between the 2 apps to perform both their operations - eventual consistency)
    Worst case, when 1 of them goes down, we will stay with a partial result at the DB level (remember, this is the observable side of our distributed system!) until someone has realised the app is down and brought it back up. No rollback, no auto-detection and recovery.
    Then 1 of the 2 accounts will have updated their balance at the DB and the other one won't.
    And, more crucially, the whole system will have no idea that that is the case.
    So, in a system where money can stay in limbo, or be created out of thin air, I'd say we still have a long way to go.
    Also, let's bear in mind that atomicity is not independent of the rest of the ACID principles (see Orthogonality in the Wikipedia article). So atomicity without consistency is not atomicity.
    I hope this helps bring things down to earth a little.
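
The partial-state window described above can be made concrete with a small sketch: two independent stream processors apply the two halves of one transfer to the end DB, and a reader can observe the in-between state. Account names are hypothetical.

```python
# Sketch of the partial observable state described above. The debit and
# credit processors do not coordinate, so between their writes the end DB
# exposes a state where money has left one account but not arrived in the
# other. Names are illustrative.

db = {"acct_a": 100, "acct_b": 0}
event = {"debit": ("acct_a", 100), "credit": ("acct_b", 100)}

def debit_processor(ev):
    acct, amount = ev["debit"]
    db[acct] -= amount

def credit_processor(ev):
    acct, amount = ev["credit"]
    db[acct] += amount

debit_processor(event)
# The credit processor is down or slow here: any client reading the DB
# now observes a partial result of the "atomic" transfer.
partial = dict(db)            # snapshot of the in-between state
credit_processor(event)       # the result completes only once it recovers
print(partial, db)
```

If the credit processor stays down, the snapshot above is what every client sees, with no rollback and nothing in the system flagging it, which is the comment's point.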

  • @balaramganesh2970
    @balaramganesh2970 3 years ago +1

    Have a question on how he explained atomicity is expressed in Kafka. When an event that needs to be consumed by a number of consumers is added to a partition, yes, it means that eventually each consumer will consume this event and get the job done. However, the issue that when one of these consumers goes down, there is a time interval when all other consumers have consumed this event but this one hasn't, still persists and is a violation of atomicity. Not saying this issue isn't there in RDBMS. In fact, all I am saying is that the same issue is there in RDBMS, where until the crashed system/consumer is back up, we'll still be violating the all-in/all-out policy. Open to views from everyone!

  • @lgylym
    @lgylym 5 years ago +4

    Seems like many of the properties are achieved by turning synchronous processing into asynchronous processing

  • @hl5768
    @hl5768 2 years ago

    The traditional 2PC is CA from a CAP point of view, but the Kafka-based implementation is a kind of eventual consistency

  • @sshawnuff
    @sshawnuff 5 years ago +4

    13:19: "In financial systems you only want to preserve money, only move it around..." Wow.

  • @olegyablokov4334
    @olegyablokov4334 5 years ago +2

    So you make your system serializable by using Kafka as a single point of entry for writes. Actually, you can do the same by using single-leader replication, and you don't necessarily need Kafka or other stream-processing software to achieve that. I understand that this is not the case with several systems (DB, cache, index); you already have to somehow synchronize them. But in the money transfer example, using just one database can be enough. So why use Kafka in the money transfer example?
    P.S. And of course thank you for a great talk! :-)

  • @Oswee
    @Oswee 5 years ago +2

    Berglund + Kleppmann = Power! :) Love their stuff.

  • @slimbouguerra7111
    @slimbouguerra7111 5 years ago +8

    Another great design at min 17:00: instead of using traditional transactional updates, let's use events and make each update (here, 2 updates) idempotent. Well, the question: if one of your stream processors fails or is slow, do you consider the state of your DBs correct? Is this really atomicity?

    • @songs4enjoy
      @songs4enjoy 5 years ago +5

      His diagram says the RDBMS can consume in parallel with the cache & search index. I assume this must be a mistake, as the consistency has to be guaranteed by some snapshot-aware store (RDBMS) before the cache/search index can be updated.
      Again, this model has its uses & the gaping hole of additional complexity to achieve this operation. But this whole model should only be looked at when an RDBMS is not able to meet the throughput & latency requirements of the business. Else, just stick with the RDBMS, which is quite a safe place

  • @slimbouguerra7111
    @slimbouguerra7111 5 years ago +6

    What a great idea (minute 12:00): to get atomicity, let's serialize all the writes using a Kafka queue instead of two-phase commit, because 2PC can be slow!

    • @songs4enjoy
      @songs4enjoy 5 years ago +4

      For developers who are dumb like me: this comment is sarcasm. Don't take it literally :D
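
Sarcasm aside, the serialize-writes-through-a-log idea being discussed can be sketched with a plain in-process queue standing in for a one-partition Kafka topic. All names are illustrative, not from the talk.

```python
# Sketch of serializing all writes through a single ordered log instead of
# 2PC: queue.Queue stands in for a one-partition topic, and one consumer
# thread applies writes strictly in log order (no row locks, no coordinator).
import queue
import threading

log = queue.Queue()
balances = {"a": 100, "b": 0}

def writer():
    # The single consumer gives a total order over all writes.
    while True:
        op = log.get()
        if op is None:          # sentinel to end the sketch
            break
        account, delta = op
        balances[account] += delta

consumer = threading.Thread(target=writer)
consumer.start()
log.put(("a", -100))            # debit half of the transfer
log.put(("b", +100))            # credit half of the transfer
log.put(None)
consumer.join()
print(balances)                 # {'a': 0, 'b': 100}
```

The trade-off the two comments above are circling: the total order is cheap, but between the two `put`s a reader of `balances` can still see only the debit applied, so you trade 2PC's blocking for an eventual-consistency window.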

  • @pankajsinghv
    @pankajsinghv 5 years ago +4

    An amazing talk... Very informative... I'm thinking of making use of it in my projects...

  • @abhiitechie
    @abhiitechie 6 years ago +3

    Hey Martin, what an amazing talk. I had one question though, about the eventual consistency model for this event-driven architecture: the end users, maybe for a period of time, would see different versions of the data at the different data stores, i.e. the database, the cache and the search index. How do we solve this problem?

    • @karthikrangaraju9421
      @karthikrangaraju9421 5 years ago +1

      abhishek gupta You'd need distributed transactions across the different DBs.
      But that's not possible/available in all DBs.
      It's still a research area.

    • @aljjxw
      @aljjxw 5 years ago

      In some other video, I can't remember which, someone was saying this is unavoidable, though when everything works correctly it will not happen. Also, they suggested always using the same source for showing data to the end user. Say, always use the cache... I know it is not a satisfying answer.

    • @sachintiwari1579
      @sachintiwari1579 5 years ago +1

      Hi, it seems the Confluent guys have picked up your question and tried to answer it. Not sure if you are aware of it, thus sharing it here.
      m.ruclips.net/video/CeDivZQvdcs/видео.html
      IMO they have not understood your question, and thus their answer is not going to address your concern.
      Please check it yourself and share your thoughts.

  • @rydmerlin
    @rydmerlin 2 years ago

    How can you say that something that isn't performed in the same time quantum as something else is atomic? If it does it later on, then that isn't a guarantee it's atomic, is it? A time period has to apply for atomicity, doesn't it?

  • @adamdude
    @adamdude 2 years ago

    For the username claims example, how can you scale this to multiple partitions if there needs to be some central store of whether a username was taken? There's a process that checks if the claim can be accepted, which must be checking some database to see if the username already exists, and this database must have all of the usernames, right? How is that scalable then?
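
One answer, hinted at in a reply further down ("use the username as the partition key"), is that no single central store is needed: partition by a hash of the username, so every claim for the same name lands in the same partition and is decided by that partition's single-threaded consumer. A minimal sketch; the hashing scheme and state layout are assumptions, not from the talk.

```python
# Sketch of scaling username claims past one partition. All claims for the
# same name hash to the same partition, so each partition's consumer can
# decide first-come-first-served locally, against per-partition state only.
import hashlib

NUM_PARTITIONS = 4
taken = [set() for _ in range(NUM_PARTITIONS)]   # per-partition state, no central store

def partition_for(username):
    # Deterministic hash: the same username always maps to the same partition.
    digest = hashlib.sha256(username.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def claim(username):
    p = partition_for(username)
    if username in taken[p]:
        return "rejected"
    taken[p].add(username)
    return "granted"

print(claim("jane"))   # granted (first come)
print(claim("jane"))   # rejected (first served)
```

Each partition's username set grows only with the names routed to it, so the state shards along with the topic instead of living in one global database.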

  • @patricknazar
    @patricknazar 5 years ago +1

    I love it, thank you.

  • @arariks
    @arariks 4 years ago +2

    Worked on relational databases for 10 years; shaking my head in disbelief at your serialisation example. You know you are ultimately relying on the database for checking uniqueness, right? ..... The account transfer would take an unacceptable amount of work to make happen, and the Jane example is downright misleading.
    Kafka is good tech. Use it where it is needed: click streams, user-action tracking and logging, distributing event information across the network. Not to replace RDBMS processes.

  • @super-ulitka
    @super-ulitka 2 years ago

    Don't like it, for its free-form ACID interpretation. Consistency also means that a read after a write returns the same value; this is not guaranteed in the Kafka example, as there is a time gap in between. Same for atomicity: what if the cache got updated first, so the user makes an order looking at the new value, while the DB has not yet received it and the app charges from the previous one? Would love to see how you'd like using your Twitter feed with eventual consistency ;) Kafka is great software, but this keynote is misleading.

  • @TyzFix
    @TyzFix 2 years ago

    While achieving consistency using Kafka, do we need to estimate the latency introduced by the system?

  • @王帅-j4j
    @王帅-j4j 2 years ago

    Great! Thanks

  • @hydrogeneration7353
    @hydrogeneration7353 5 years ago +1

    Broadens the database architecture perspective

  • @exocom2
    @exocom2 2 years ago

    Why is it easier to have 3 topics and 3 consumers that eventually write to a DB, which additionally have to check whether a transaction id has already been processed - compared to two pretty standard updates in the already existing DB? Maybe someone can enlighten me here, I would be very much interested in figuring this out... (I am talking about the -100/+100 debit example)

    • @micahhutchens7942
      @micahhutchens7942 2 years ago

      Because you no longer need to support transactions and there is no need for row locks etc. It increases throughput, and is not a one size fits all solution. It's mostly for scalability purposes. Imagine if you had a hot row in a database and it is constantly locking that row (also if the transaction spans multiple rows then it needs to hold locks on both those rows and this only gets worse if the rows are on different partitions). That is a lot of overhead that can be avoided by just making it asynchronous using a single threaded stream processor. Not to mention the other 2 consumers are able to continue working normally even if the third goes down.
      In the debit example, it is much more efficient to just serialize the financial transaction via a single threaded application than it is to support distributed transactions over a whole database. Also it's pretty trivial to just reject a message with an id that is lower than the current id (offsets in Kafka) in order to support deduplication.

  • @ArashBizhanzadeh
    @ArashBizhanzadeh 2 years ago

    The second example of atomicity didn't make any sense. Debit and credit should be simultaneous; otherwise people can double-spend

  • @_SoundByte_
    @_SoundByte_ 3 years ago

    A totally ordered log is not possible with multiple partitions, right?
    It's not possible to have the number of partitions equal to the number of customers, is it?
    Or do banks really have lakhs of partitions?

    • @john-taylorsmith3100
      @john-taylorsmith3100 3 years ago +2

      You can have any number of different keys in a partition, and the guarantee is that each key in the partition will be in order.

  • @johnboy14
    @johnboy14 3 years ago

    It can be used to build a database, but it is not a database. That's a lot of code we all need to write.

  • @stoneshou
    @stoneshou 4 years ago

    Does Kafka based database exist yet?

    • @adnxn
      @adnxn 4 years ago

      kinda, www.confluent.io/blog/intro-to-ksqldb-sql-database-streaming/

  • @karthikrangaraju9421
    @karthikrangaraju9421 5 years ago

    If those two events for different users creating a username end up in different partitions, different threads will consume them concurrently,
    and we are back to square one.
    If we only had one partition, then it would work, because there would only be one consumer writing to the DB.
    Which is the same idea as serializable isolation: it literally forces writes to happen in a single thread.

    • @aljjxw
      @aljjxw 5 years ago +8

      But what Martin is saying is to use the username as the partition key.

  • @davidbenyaminovch6761
    @davidbenyaminovch6761 3 years ago +1

    What a bunch of marketing bullshit. It should have been titled "Is Message Bus an Application Framework?", or how fast can the notion of "smart endpoints / dumb pipes" go down the drain.