The future of Delta Lake and Apache Iceberg with Tathagata Das
- Published: 5 Feb 2025
- Software Engineer: Tathagata Das is a Staff Software Engineer at Databricks, an Apache Spark committer, and a member of the Project Management Committee (PMC) for Apache Spark.
- Apache Spark: He is the lead developer behind Spark Streaming and has contributed significantly to the development of Structured Streaming.
- Delta Lake: Das is a core developer of Delta Lake and a committer to the project.
- Research: He has conducted research on data-center processing frameworks and networks at the University of California, Berkeley, and has published several papers on these topics.
- Author: Das is a co-author of "Learning Spark: Lightning-Fast Data Analytics" (2nd edition).
NextGenLakehouse Newsletter
#Databricks #DeltaLake #Delta #UnityCatalog #ETL #DataEngineering
Really insightful discussion. Thank you for that. Honestly, I've always wondered whether lakehouses built on open table formats can guarantee the same performance as MPP warehouses. The biggest reason for that concern is that in Delta every operation (insert, update, or delete) is essentially an insert of new files under the hood. Then there are other considerations like the small-file problem and optimized writes. I've always felt there was significant development/operational overhead in running OPTIMIZE and Z-ORDER, and now enabling DELETION VECTORS, just to keep tables performant as they grow. Does LIQUID CLUSTERING take that overhead away from customers and make their lives easier? I know Databricks promises intelligent optimization and automatic clustering for managed tables, but what about external tables? Most companies have external tables where the underlying files remain under their own control.
Yes, Liquid Clustering is a good starting point and moves things in the right direction in terms of user/developer experience. It may not solve every problem, but it already helps with the small-file part of your concern.
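For readers who haven't run these commands, here is a minimal sketch (not from the talk) contrasting the two approaches the thread describes: the manual OPTIMIZE / Z-ORDER / deletion-vector routine versus declaring clustering keys once with Liquid Clustering. The table and column names are invented; the syntax follows the public Delta Lake / Databricks documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

# --- Manual maintenance path: enable deletion vectors, then periodically
# --- compact and Z-ORDER the table (hypothetical table sales.events).
spark.sql("""
    ALTER TABLE sales.events
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
spark.sql("OPTIMIZE sales.events ZORDER BY (event_date, customer_id)")

# --- Liquid Clustering path: declare clustering keys on the table itself,
# --- instead of re-specifying Z-ORDER columns on every maintenance run.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.events_lc (
        event_date  DATE,
        customer_id BIGINT,
        payload     STRING
    )
    USING DELTA
    CLUSTER BY (event_date, customer_id)
""")
# OPTIMIZE still compacts small files and applies the clustering, but the
# keys can later be changed with ALTER TABLE ... CLUSTER BY.
spark.sql("OPTIMIZE sales.events_lc")
```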
Why did Apple switch to Apache Iceberg?