Create an On-Premises Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse

  • Published: 2 Dec 2024
  • Science

Comments • 27

  • @BiInsightsInc
    @BiInsightsInc  1 year ago

    Links to the Data Lake videos (on-premises and AWS):
    ruclips.net/video/DLRiUs1EvhM/видео.html&t
    ruclips.net/video/KvtxdF7b_l8/видео.html

    • @hungnguyenthanh4101
      @hungnguyenthanh4101 1 year ago

      Could you try another project with Delta Lake and Hive Metastore?

  • @SonuKumar-fn1gn
    @SonuKumar-fn1gn 2 months ago

    Great video 👍 Thanks for sharing the video 💝

  • @jeanchindeko5477
    @jeanchindeko5477 4 months ago +1

    3:18 Apache Iceberg has ACID transactions out of the box; it's not Nessie that brings ACID transactions to Iceberg. In the Iceberg specification, the catalog only has knowledge of the list of snapshots; it doesn't track the individual files that are part of a commit or snapshot.

  • @orafaelgf
    @orafaelgf 6 months ago +2

    Great video, congrats.
    If possible, show an end-to-end architecture with streaming data ingested directly into the lakehouse,
    and also something on integrating a data lake with a data lakehouse.

    • @BiInsightsInc
      @BiInsightsInc  6 months ago +1

      That's a great idea 💡. I will put something together that combines data streaming and the data lake; this will give an end-to-end implementation.

  •  8 months ago

    Today I use Apache NiFi to retrieve data from APIs and databases, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose, and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?

    • @BiInsightsInc
      @BiInsightsInc  7 months ago

      Hi there, Dremio is a SQL query engine like Trino and Presto; you do not insert/ingest data into Dremio directly. The S3 layer is where you store your data, and Apache Iceberg provides the lakehouse management service (upsert/merge) for the objects in the catalog. I'd advise handling upsert/merge in the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle upserts using SQL:
      medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
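
      For illustration, here is a minimal PySpark sketch of such an upsert, along the lines of that article. The catalog name ("nessie"), table, and staging path below are placeholders rather than values from the video, and it assumes a Spark session already configured with the Iceberg runtime and the Nessie catalog.

      from pyspark.sql import SparkSession

      # Assumes spark.sql.catalog.nessie is already configured for Iceberg + Nessie.
      spark = SparkSession.builder.appName("iceberg-upsert").getOrCreate()

      # Stage the incoming batch (e.g., parquet files landed in MinIO).
      updates = spark.read.parquet("s3a://landing/sales/")  # hypothetical path
      updates.createOrReplaceTempView("sales_updates")

      # MERGE INTO performs the upsert atomically on the Iceberg table.
      spark.sql("""
          MERGE INTO nessie.db.sales AS t
          USING sales_updates AS s
          ON t.id = s.id
          WHEN MATCHED THEN UPDATE SET *
          WHEN NOT MATCHED THEN INSERT *
      """)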

  • @jestinsunny575
    @jestinsunny575 1 month ago

    Great video!
    I have set this up on a server, but how can I read data from the tables and write to them using pyiceberg or any other library? I'm trying to fetch data from Iceberg via an API. I have tried a lot of methods and they are not working. Please help, thanks.

    • @BiInsightsInc
      @BiInsightsInc  27 days ago

      Thanks. You can use the Dremio client to query/write data stored in the tables. If you want to use a Python library, I will cover that in a future video.
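
      For example, Dremio exposes an Arrow Flight endpoint (port 32010 by default) that you can query from Python with pyarrow. A minimal sketch, assuming the default host/port; the credentials and table path are placeholders:

      from pyarrow import flight

      # Connect to Dremio's Arrow Flight endpoint and authenticate.
      client = flight.FlightClient("grpc+tcp://localhost:32010")
      token = client.authenticate_basic_token("dremio_user", "dremio_password")
      options = flight.FlightCallOptions(headers=[token])

      # Submit a query and read the result back as a pandas DataFrame.
      descriptor = flight.FlightDescriptor.for_command(
          "SELECT * FROM nessie.sales LIMIT 10"  # hypothetical table path
      )
      info = client.get_flight_info(descriptor, options)
      reader = client.do_get(info.endpoints[0].ticket, options)
      df = reader.read_pandas()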

  • @andriifadieiev9757
    @andriifadieiev9757 1 year ago

    Great video, thank you!

  • @Muno_edits06
    @Muno_edits06 2 months ago

    I want to create an Iceberg table with a REST catalog using pyiceberg. Does this setup work for that?

    • @BiInsightsInc
      @BiInsightsInc  2 months ago

      Hi there, you can create an Iceberg table using the Python library or the SQL console. This setup uses Nessie's catalog; if you mean Tabular's "REST catalog", that's not used in this tutorial.

    • @Muno_edits06
      @Muno_edits06 2 months ago

      @@BiInsightsInc I am trying to set up a REST catalog for Nessie using the pyiceberg library. I am trying to access the following URI: "uri": "localhost:19120/api/v1", but it cannot reach it.
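
      Worth noting: Nessie's /api/v1 path is its native REST API, while pyiceberg's REST catalog speaks the Iceberg REST specification, which recent Nessie versions serve under a separate endpoint (commonly /iceberg; check the docs for your version). The URI also needs an http:// scheme. A minimal sketch, with the endpoint path and MinIO credentials as assumptions:

      from pyiceberg.catalog import load_catalog
      from pyiceberg.schema import Schema
      from pyiceberg.types import IntegerType, NestedField, StringType

      # Point pyiceberg at Nessie's Iceberg REST endpoint, not /api/v1.
      catalog = load_catalog(
          "nessie",
          **{
              "type": "rest",
              "uri": "http://localhost:19120/iceberg",  # path varies by Nessie version
              "s3.endpoint": "http://localhost:9000",   # assumed MinIO endpoint
              "s3.access-key-id": "minioadmin",         # placeholder credentials
              "s3.secret-access-key": "minioadmin",
          },
      )

      # Create a namespace and a simple table through the catalog.
      catalog.create_namespace("demo")
      schema = Schema(
          NestedField(1, "id", IntegerType(), required=True),
          NestedField(2, "region", StringType(), required=False),
      )
      table = catalog.create_table("demo.sales", schema=schema)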

  • @KlinciImut
    @KlinciImut 2 months ago

    Does Nessie store the data in a different file, or does it reference and update the original 'sales_data.csv' file?

    • @BiInsightsInc
      @BiInsightsInc  2 months ago +1

      Hey there, the data is managed by Iceberg, and yes, it's stored in the Parquet format.

    • @KlinciImut
      @KlinciImut 2 months ago

      @@BiInsightsInc So the original CSV file stays as it is, and Nessie/Iceberg creates a Parquet file which contains the actual, most up-to-date data.
      Is my understanding correct?

  •  7 months ago

    This is insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering if this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control scales with bigger datasets and higher change frequencies, this would be a crazy good solution to implement.

    • @BiInsightsInc
      @BiInsightsInc  7 months ago +1

      Yes, it is possible to query data using a specific snapshot ID. We can time travel using an available snapshot ID to view our Iceberg data from a different point in time; see Time Travel Queries. Processing large datasets depends on your setup: if you have multiple nodes with enough RAM/compute power, you can process large data, or you can leverage a cloud cluster that scales up or down depending on your needs.
      SELECT count(*)
      FROM s3.ctas.iceberg_blog
      AT SNAPSHOT '4132119532727284872';

  • @luisurena1770
    @luisurena1770 19 days ago

    Is it possible to replace Dremio with Trino?

    • @BiInsightsInc
      @BiInsightsInc  14 days ago

      Yes, it’s possible to use Trino with Nessie’s catalog. Here is the link to their docs: projectnessie.org/iceberg/trino/
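
      For reference, a minimal sketch of a Trino catalog file (e.g., etc/catalog/nessie.properties) following those docs; the URI, branch, and warehouse location here are assumptions for this stack, not tested values:

      connector.name=iceberg
      iceberg.catalog.type=nessie
      iceberg.nessie-catalog.uri=http://localhost:19120/api/v2
      iceberg.nessie-catalog.ref=main
      iceberg.nessie-catalog.default-warehouse-dir=s3://warehouse/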

  • @dipakchandnani4310
    @dipakchandnani4310 14 days ago

    It is giving me an "A fatal error has been detected by the Java Runtime Environment" error. It was working fine 3-4 months back but is failing now. I am using a Mac and there is no OpenJDK installed in:
    "cd /opt/java/openjdk
    cd: no such file or directory: /opt/java/openjdk"
    I appreciate this video and your help here.

    • @BiInsightsInc
      @BiInsightsInc  13 days ago

      Try installing OpenJDK and retry; the error can be caused by a missing installation or corrupted files.

  • @hungnguyenthanh4101
    @hungnguyenthanh4101 1 year ago

    very good!

  • @nicky_rads
    @nicky_rads 1 year ago

    Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago

      Thanks. Yes, Dremio's engine brings various services together to offer data lakehouse functionality. I will be going over Iceberg and Project Nessie in the future.

  •  8 months ago

    Amazing!