Great presentation and overview of Debezium!! The opening use case of using CDC for dual writes is more of an anti-pattern. If you are already using Kafka (since you are using Debezium), it would be better to emit Kafka events that are modelled after your domain (as described in the Outbox pattern section) instead of raw Debezium events, which are essentially just CDC logs. The advantage there is that you are building a good Event-Driven Architecture, which will pay off in the long term. A potential downside is scalability, as each database is handled by only one Debezium connector. If you are generating events in the outbox at a rate higher than what a single Debezium connector can process, you cannot really scale out to achieve higher throughput. That's just the theory though :-) Debezium is an excellent choice for ETL implementations - pair it with Kafka Streams and you get real-time ETL.
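A minimal sketch of what that outbox write could look like (table names, columns, and event names are all made up for illustration; sqlite3 stands in for the real database):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_type TEXT,
                     aggregate_id TEXT, event_type TEXT, payload TEXT);
""")

def place_order(order_id):
    # The business write and the domain event land in ONE transaction,
    # so a CDC reader tailing the log can never see one without the other.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PLACED"))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "Order", order_id, "OrderPlaced",
             json.dumps({"orderId": order_id, "status": "PLACED"})),
        )

place_order("o-1")
print(conn.execute("SELECT event_type FROM outbox").fetchone()[0])  # OrderPlaced
```

The payload is the domain event consumers see on Kafka; the row-level CDC noise stays hidden behind the outbox table.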
Since Debezium tails transaction logs (for databases that support replication via logs), I think it would be difficult to overload Debezium, since it reacts as soon as transactions are committed. However, if you did run into that issue, you could actually remove Debezium and use a simple polling mechanism to publish to Kafka, which would act almost as a (heavy-handed and inefficient) back pressure method, since the extra reads would slow down your write throughput. This is just a theory; I have no benchmarking to prove how frequently you'd have to poll to see this occur. You make a great point about scalability though, as each microservice would need its own Debezium cluster (assuming they all have dedicated outbox tables).
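The polling fallback I have in mind is roughly this (a toy sketch: sqlite3 in place of the real database, and a plain callback in place of a Kafka producer; the `published` flag and batch size are my own choices):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " payload TEXT, published INTEGER DEFAULT 0)")
conn.execute("INSERT INTO outbox (payload) VALUES ('evt-1'), ('evt-2')")

def poll_once(publish):
    # Read a batch of unpublished rows, emit them, then mark them done.
    # These extra reads/updates on the source DB are exactly the crude
    # "back pressure" mentioned above: they compete with your writes.
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM outbox WHERE published = 0"
            " ORDER BY id LIMIT 100").fetchall()
        for row_id, payload in rows:
            publish(payload)  # stand-in for producer.send(topic, payload)
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                         (row_id,))
    return len(rows)

sent = []
poll_once(sent.append)
print(sent)  # ['evt-1', 'evt-2']
```

Run this in a loop with a sleep and you have the polling publisher; the poll interval is the knob you'd have to benchmark.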
36:47 I think running two CDC pipelines is not required for achieving high availability for connectors. You could run the connector in distributed mode, which allows you to have multiple workers in a cluster; only one will stream the events at a time while the others remain on standby. If the worker streaming the events goes down, another node in the cluster will kick in and resume from the last LSN (Log Sequence Number). This topology obviates the need for the duplicator.
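To make that concrete, here is a sketch of the registration payload you'd POST to any distributed-mode Kafka Connect worker's REST endpoint (connector name, hostnames, and credentials are hypothetical; the cluster then assigns the single streaming task to one worker and fails it over on crash):

```python
import json

# Hypothetical Debezium Postgres connector registration. In distributed
# mode this JSON is POSTed to any worker, e.g.
#   http://connect-worker:8083/connectors
# and the cluster decides which worker runs the task; on failure another
# worker resumes from the last committed offset (the LSN).
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class":
            "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",  # log-based CDC streams from one task at a time
        "database.hostname": "postgres",   # hypothetical host
        "database.port": "5432",
        "database.user": "debezium",       # hypothetical credentials
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",
    },
}
print(json.dumps(connector, indent=2))
```

The key point is `tasks.max: 1` plus multiple workers: high availability comes from the Connect cluster, not from duplicating the pipeline.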
Amazing. Really informative and to the point.
Amazing presentation!! Thanks so much for sharing this!
Very informative, great presentation.