Imagine if every topic in Kafka had infinite retention. Instant data lake!
That's the dream!
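You can actually get pretty close today by setting retention.ms to -1 on a topic, which tells Kafka to keep everything indefinitely. A rough sketch with confluent-kafka's AdminClient, where the broker address and topic name are just placeholders (and note alter_configs is non-incremental, so it replaces the topic's other overrides too):

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholder broker address and topic name -- swap in your own.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# retention.ms = -1 means "never delete on age" for this topic.
# Caveat: alter_configs replaces the topic's existing config overrides wholesale.
resource = ConfigResource("topic", "events", set_config={"retention.ms": "-1"})

futures = admin.alter_configs([resource])
for res, fut in futures.items():
    fut.result()  # raises if the broker rejected the change
    print(f"updated {res}")
```

Whether your disks (or your bill) survive infinite retention is another question, which is exactly why tiered storage comes up next.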
This concept kind of exists in Pulsar. You can move old data into tiered storage: old log segments get offloaded to S3 (or something similar, or even just cheaper disks). It's not perfect, because there isn't really a good backup and recovery mechanism, but the data offloaded to S3 stays live and available through Pulsar, just at higher latency. We basically use this approach for important telemetry data. Pulsar is functionally very similar to Kafka; it even has a drop-in Kafka client replacement and a Kafka protocol extension, so you can use it in place of Kafka. Apparently there's also work being done to store all Pulsar state in S3 for exactly this purpose: creating a data lake directly off Pulsar with minimal added latency. I do agree this concept is super interesting, because we use Pulsar and Kafka as messaging systems but also get a lot of utility out of using them for event storage and replay. We have to combine data from various live sources into materialized views, and being able to play data back has been very useful.
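For anyone curious what the replay side looks like, here's a minimal sketch using the pulsar-client Python library to read a topic back from the earliest retained message. The broker URL and topic name are placeholders; segments that were offloaded to S3 get read back transparently through the same API, just more slowly:

```python
import pulsar

# Placeholder broker URL and topic -- point these at your own cluster.
client = pulsar.Client("pulsar://localhost:6650")

# A Reader started at MessageId.earliest replays the whole retained log,
# including segments that have been offloaded to tiered storage.
reader = client.create_reader(
    "persistent://public/default/telemetry",
    pulsar.MessageId.earliest,
)

try:
    while True:
        msg = reader.read_next(timeout_millis=5000)  # raises on timeout
        print(msg.message_id(), msg.data())
except Exception:
    # Timed out: we've caught up with the end of the topic.
    pass
finally:
    client.close()
```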
Love the video. Have a question though: do you think a broad understanding of Java is needed to work with Apache Kafka in production? Thanks
I would say not really, you can definitely use Python instead if you prefer
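For example, something like this with the confluent-kafka Python client covers a lot of everyday Kafka work, no Java involved (the broker address, group id, and topic name are just placeholders):

```python
from confluent_kafka import Consumer

# Placeholder connection settings -- point these at your own cluster.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1s for a message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(msg.topic(), msg.partition(), msg.offset(), msg.value())
finally:
    consumer.close()
```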
@thedataguygeorge Thanks for the response. Can you please share any Kafka or Flink tutorials you know of? All the Kafka or Flink courses I have seen require knowing Java.
Add Redpanda to the comparison :D
I'll get it in the next one!
Redpanda is functionally the same as Kafka; its primary design goal is to be a drop-in Kafka replacement.
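That's basically what "drop-in" means in practice: your existing Kafka clients just point at a Redpanda broker and keep working. A quick sketch with the same confluent-kafka Python library (the hostname is a placeholder; 9092 is Redpanda's default Kafka API port):

```python
from confluent_kafka import Producer

# Same Kafka client library, just pointed at a Redpanda broker.
# Hostname is a placeholder; Redpanda speaks the Kafka protocol on 9092 by default.
producer = Producer({"bootstrap.servers": "redpanda-host:9092"})

def on_delivery(err, msg):
    # Called once the broker confirms (or rejects) the write.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] @ {msg.offset()}")

producer.produce("events", key=b"sensor-1", value=b'{"temp": 21.5}', on_delivery=on_delivery)
producer.flush()
```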