Links to the Data Lake videos (on-premises and AWS):
ruclips.net/video/DLRiUs1EvhM/видео.html&t
ruclips.net/video/KvtxdF7b_l8/видео.html
Can you try another project with Delta Lake and the Hive Metastore?
Great video 👍 Thanks for sharing the video 💝
3:18 Apache Iceberg has ACID transactions out of the box; it's not Nessie that brings ACID transactions to Iceberg. In the Iceberg specification, the catalog only knows about the list of snapshots; it doesn't track the individual files that make up a commit or snapshot.
great video, congrats.
If possible, bring an end-to-end architecture with streaming data ingested directly into the lakehouse.
Also something related to the integration of the data lake and the data lakehouse.
That’s a great idea 💡. I will put something together that combines data streaming and the data lake. This will give an end-to-end implementation.
Today I use Apache NiFi to retrieve data from APIs and databases, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending Parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?
Hi there, Dremio is a SQL query engine like Trino and Presto; you do not ingest data into Dremio directly. The S3 layer is where you store your data, and Apache Iceberg provides the lakehouse management services (upsert/merge) for the objects in the catalog. I'd advise handling upserts/merges at the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle an upsert using SQL:
medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
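As a minimal sketch of the pattern that article describes, assuming a Spark session already configured with a Nessie-backed Iceberg catalog named "nessie" (the table, column, and staging path below are hypothetical):

from pyspark.sql import SparkSession

# Assumes the Iceberg Spark extensions and the "nessie" catalog are
# already configured; see the Iceberg/Nessie Spark docs.
spark = SparkSession.builder.appName("iceberg-upsert").getOrCreate()

# Stage the incoming batch (e.g. the Parquet files NiFi drops into MinIO)
# as a temporary view so MERGE can reference it.
updates = spark.read.parquet("s3a://staging/orders/")  # hypothetical path
updates.createOrReplaceTempView("orders_updates")

# MERGE INTO gives atomic upsert semantics on the Iceberg table:
# matched rows are updated, new rows are inserted, in a single commit.
spark.sql("""
    MERGE INTO nessie.sales.orders AS t
    USING orders_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")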
Great Video!
Although I have set this up on a server, how would I be able to read data from the tables and write to them using pyiceberg or any other library? I'm trying to fetch data from Iceberg using an API. I have tried a lot of methods but they are not working. Please help, thanks.
Thanks. You can use the Dremio client to query/write the data stored in the tables. If you want to use a Python library, then I will cover that in a future video.
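One concrete option for the Python route is Dremio's Arrow Flight endpoint via pyarrow. A rough sketch, assuming the default docker-compose ports; the credentials and table path are placeholders:

from pyarrow import flight

# Dremio exposes Arrow Flight on port 32010 by default.
client = flight.FlightClient("grpc+tcp://localhost:32010")

# Basic auth returns a bearer token to attach to subsequent calls.
token = client.authenticate_basic_token("dremio_user", "dremio_password")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT * FROM s3.ctas.iceberg_blog LIMIT 10"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Stream the result set back and materialize it as a pandas DataFrame.
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_pandas())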
Great video, thank you!
I want to create an Iceberg table with a REST catalog using pyiceberg. Does this setup work for it?
Hi there, you can create an Iceberg table using the Python library or the SQL console. This setup uses Nessie's catalog. If you mean Tabular's "REST catalog" then that's not used in this tutorial.
@@BiInsightsInc I am trying to set up a REST catalog for Nessie using the pyiceberg library. I am trying to access the following URI: "uri": "localhost:19120/api/v1" but it is not reachable.
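(For reference: pyiceberg's REST catalog speaks the Iceberg REST spec, while localhost:19120/api/v1 is Nessie's own native API, which is likely why the connection fails. Recent Nessie releases also expose an Iceberg REST endpoint under /iceberg; the sketch below assumes such a version is running, and the exact endpoint path and branch name may differ per release.)

from pyiceberg.catalog import load_catalog

# "main" at the end of the URI is the Nessie branch to work against.
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg/main",
    },
)

# List namespaces to verify the connection works.
print(catalog.list_namespaces())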
Is Nessie storing the data in a different file? Or will it refer to and update the original 'sales_data.csv' file?
Hey there, the data is managed by Iceberg and yes, it's stored in the Parquet format.
@@BiInsightsInc So the original csv file stays as it is, and Nessie/Iceberg will create a Parquet file which contains the actual, most up-to-date data.
Is my understanding correct?
This is so insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering if this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control scales with bigger datasets and a higher change frequency, this would be a crazy good solution to implement.
Yes, it is possible to query data using a specific snapshot id. We can time travel using an available snapshot id to view our Iceberg data from a different point in time; see Time Travel Queries. The processing of large datasets depends on your setup. If you have multiple nodes with enough RAM/compute power, then you can process large data, or leverage a cloud cluster that you can scale up or down depending on your needs.
SELECT COUNT(*)
FROM s3.ctas.iceberg_blog
AT SNAPSHOT '4132119532727284872';
Is it possible to replace Dremio with Trino?
Yes, it’s possible to use Trino with Nessie’s catalog. Here is the link to their docs: projectnessie.org/iceberg/trino/
It is giving me an "A fatal error has been detected by the Java Runtime Environment" error. It was working fine 3-4 months back but is failing now. I am using a Mac and there is no OpenJDK installed:
"cd /opt/java/openjdk
cd: no such file or directory: /opt/java/openjdk"
Appreciate this video and your help here.
Try installing OpenJDK and retry. This error can be caused by a missing installation or corrupted files.
very good!
Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?
Thanks. Yes, the Dremio engine brings various services together to offer data lakehouse functionality. I will be going over Iceberg and project Nessie in the future.
Amazing!