- Videos: 566
- Views: 9,361,474
Dremio
USA
Joined 11 Feb 2017
The Unified Lakehouse Platform for Self-Service Analytics
Bring users closer to the data with lakehouse flexibility, scalability, and performance at a fraction of the cost. Dremio's intuitive Unified Analytics, high-performance SQL Query Engine, and Lakehouse Management service for next-gen DataOps let you shift left for the fastest time to insight.
Whoop's Carlos Peralta on Building a Data-Driven Culture at Whoop and Moderna
Data Disruptors - A Podcast for Data Leaders. Listen to Carlos Peralta, a Whoop data leader, on the disruptive technologies giving the company access to cutting-edge insights, in "Building a Data-Driven Culture at Whoop and Moderna: Collaboration and Alignment."
In this episode, Tomer Shiran interviews Carlos Peralta, MLOps & Data Engineering Global Director at WHOOP, who discusses his experience building data platforms and driving a data-driven culture. He emphasizes the importance of collaboration between technical teams and business stakeholders, as well as the need for data quality and integrity. Carlos also highlights the challenges of model interpretability and fairness, as well as automated feature engineering...
Views: 52
Videos
EP57 - From Hadoop & Hive to Minio & Dremio: Moving Towards a Next Gen Data Architecture
Views: 119 · 12 hours ago
Legacy data platforms often fall short of the performance, processing and scaling requirements for robust AI/ML initiatives. This is especially true in complex multi-cloud (public, private, edge, airgapped) environments. The combined power of MinIO and Dremio creates a data lakehouse platform that overcomes these challenges, delivering scalability, performance and efficiency to ensure successfu...
Hands-on with Dremio #5 - Dremio with Python and BI Tools
Views: 112 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #4 - Apache Iceberg and Git-for-Data
Views: 77 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #3 - Data Reflections
Views: 59 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #2 - Preparing Data Across Sources (Joins, Type Conversions, Drop Columns, etc)
Views: 46 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #1 - Setup and Source Connections
Views: 70 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #8 - BI Dashboards with Dremio
Views: 52 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-on Demo #7 - Using Dremio in Python
Views: 54 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #6 - Apache Iceberg & Git-for-Data
Views: 36 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #5 - Reflections
Views: 68 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #4 - Preparing Data (Change Data Types, Drop Columns, Joins)
Views: 47 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #3 - Connecting Sources
Views: 36 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #2 - Running SQL & Creating Views
Views: 53 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-on Demo #1 - Environment Setup (Docker Compose)
Views: 72 · 19 hours ago
Special Edition 1 - Apache Iceberg Q&A
Views: 237 · 14 days ago
Hands-on with Apache Iceberg on Your Laptop: Deep Dive with Apache Spark, Nessie, Minio, Dremio...
Views: 412 · 14 days ago
EP56 - What’s New in Dremio: Improved Automation, Performance + Catalog for Iceberg Lakehouses
Views: 133 · 14 days ago
Dremio Demo - Federated Queries Joining Mongo and Postgres Data (Breaking Data Silos)
Views: 119 · 21 days ago
Real-Time Analytics Across Data Sources Using Dremio
Views: 295 · 21 days ago
The Iceberg REST Catalog - Meetup - Tampa Bay Data Engineers Group
Views: 202 · 1 month ago
End-to-End Data Engineering from CSV/JSON/Parquet to Apache Iceberg to Apache Superset Dashboard
Views: 441 · 1 month ago
Cyber Lakehouse for the AI Era, ZTA and Beyond
Views: 100 · 1 month ago
EP55 - Unite Data Across Dremio, Snowflake, Iceberg, and Beyond
Views: 178 · 1 month ago
EP54 - Mastering Semantic Layers: The Key to Data-Driven Innovation
Views: 415 · 1 month ago
Tampa Bay Data Engineers Group - July 2024 - Dremio's Reflections
Views: 204 · 2 months ago
Unifying On-Prem and Cloud Data with Dremio: Cloud Data on Snowflake + On-Prem Data with Minio
Views: 201 · 2 months ago
Apache Iceberg Lakehouse crash course
Views: 739 · 2 months ago
EP53 - Build the next-generation Iceberg lakehouse with Dremio and NetApp
Views: 374 · 2 months ago
Is that Hive table sub-partitioned on a date string, and then further bucketed? That would drastically reduce those 3 million files... Without more info on that structure, I can't agree with the performance issues in this example.
The audio is very, very soft, even at 100%.
Can you do more of an intro, telling people what your tool does? The tool looks amazing, but it was only at the end that I found out what it did.
Audio is too low; I can't hear much even at max volume, and it's crackly.
May have to re-record this; I'm not sure which of my microphones it was using that overdrove the audio. I'll double-check and re-record another iteration of this series, probably as one longer video.
Great content as always! Can you do a demo covering the engines inside the Dremio config?
Thanks Alex! I have been watching most of your video and blog tutorials. You have really improved my knowledge of the data lakehouse, Apache Iceberg, and Dremio. However, I can't find any detailed enterprise BI cloud solution for Dremio Sonar and Arctic. Can you please take us through an enterprise cloud BI solution using Azure Data Lake Gen2 as the storage layer? Otherwise, could you point me to an existing blog on the Dremio website? Regarding this tutorial, do I really need Spark, since I can accomplish all my tasks in Dremio? Cheers!
Nope, I could do everything in Dremio or other tools; I was just demonstrating how different tools can work with a single copy of the data in Iceberg. Regarding BI tools, what tool are you looking to use? When using Dremio, connecting to a BI tool is exactly the same regardless of the storage layer. Essentially, you connect your sources to Dremio and then Dremio to your BI tool, and Dremio can serve all your data to that BI tool over a single connection.
@@Dremio Well noted and thank you! My BI tool is PowerBI.
So to understand/use Iceberg you also need 20 other components on top of running Hadoop and everything that requires? Seems like an... "improvement"? Hadoop should never have been turned into a data warehouse to begin with.
Hadoop wasn't used in this video, and you don't need all the tools shown. A variety of tools were used in the demo to demonstrate the portability of the data.
Very poor sound quality!
First comment
Very nice and concise 😂
Great video ❤
Very nice tutorial, am beginning to see how branching is useful
This was super helpful!
Amazing ---
The instructor talks too fast, which makes it difficult to keep up, especially for English-as-a-second-language speakers. I advise the instructor to slow down to convey his message clearly. After all that, thank you for the architectural lecture about the data lakehouse, as it is a confusing term for many of us.
Pretty awesome talk! I would love to see geospatial formats be supported in Apache Iceberg, but I'm glad that this fork exists!
There are so many tools used here my head hurts. For getting data into a DWH, how can this possibly be seen as better than classic Parquet files in Hive-partitioned storage with an external table on top? In GCP this is 5 lines of SQL code; in Redshift you might have to add Glue to the mix. And then any SQL-running tool can do the rest inside the database/DWH. In 99% of companies there are at best 3-4 data engineers; managing all this plus data modelling would mean nothing ever gets done.
Alex is the man - explains these things and the context so well.
Excellent overview - thanks! My question is about how best to get that first "raw copy" into Iceberg. Using the SQL connector I can hook Dremio up to MS-SQL data sources, and they are then available for reflections - but I don't see any clear way to copy the data into an Iceberg table that would persist (in case the source data became unavailable). Am I missing something?
If you connect an Iceberg catalog source (Hive, Nessie), you can write data to it, so you can use syntax like CTAS to write the data, and it will be persisted in the storage configured when you connected the catalog. You'd have to orchestrate workloads so it updates the new Iceberg table from the source system, or you can use an external tool to land the data as Iceberg in your lake, like Spark, Flink, Upsolver, Airbyte, etc.
@@Dremio Using Airbyte to do the ingestion, will the first Iceberg table, created using Nessie, automatically update itself?
@@guilhermeranieri8445 Depends on how you are doing it. How are you ingesting with Airbyte, directly into Iceberg?
@@Dremio Using MinIO. I send data to MinIO and manipulate the data in Spark, but I have already created views in Dremio for testing.
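A minimal sketch of the CTAS approach described in this thread, assuming a Nessie catalog source named `nessie` and an MS-SQL source named `mssql` with a `dbo.customers` table (all of these names are hypothetical):

```sql
-- Persist a raw copy of a relational source table as an Iceberg table;
-- the files land in the storage configured on the Nessie catalog source.
CREATE TABLE nessie.raw.customers AS
SELECT * FROM mssql.dbo.customers;

-- An orchestrated refresh could then append newer rows, e.g.
-- (the updated_at column and cutoff are illustrative):
INSERT INTO nessie.raw.customers
SELECT * FROM mssql.dbo.customers
WHERE updated_at > TIMESTAMP '2024-09-01 00:00:00';
```

Because the copy is a real Iceberg table in the catalog's storage, it remains queryable even if the MS-SQL source becomes unavailable.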
How does Dremio support the Iceberg REST catalog? I cannot seem to find a way, other than Nessie, to create an S3-resident Iceberg table in Dremio.
You could also use AWS Glue, but the REST catalog connector is coming soon in Dremio.
Why is the spark configuration with all of the lakehouse services hardcoded in a notebook? Shouldn’t these configurations be incorporated into the docker image you’re using for Spark?
I do that primarily for educational purposes, to help people learn the Spark configs so they can apply the learning to their environment. Many tutorials abstract away the configs, and then when people try to apply what they learned, they don't know what the configs are or where they come from. - Alex
This is so nice. Now I don't have to pay for Databricks in order to learn Spark!
Why did they create the manifest list and manifest files as two separate layers? They could have just created a metadata file and a manifest file for each snapshot. What challenges could we have if we don't have the manifest list in the Iceberg design?
The reason is that, as tables get larger, a single manifest that listed all files would get larger and take longer to traverse. By breaking it up into manifests, you can more efficiently scan only the portions of the file listing you need for the query. Using partition pruning, I may have 100 manifests in a snapshot but only need to scan the 10 relevant to the query, resulting in much faster scan planning. Also, these manifests can be reused, so you use up less storage space by not having to rewrite lists of files that have already been listed in a pre-existing manifest.
❤
Thanks for the tutorial. If I use CDC-based ingestion as the data source, where does the Spark job for writing the Iceberg table (steamWrite) run? Is it inside Airbyte?
Thanks!!
This was really helpful, thanks!
When we expire a snapshot, if our table was created as copy-on-write or merge-on-read, what happens in that case?
Same thing: if a file is associated with a valid snapshot, it will not be deleted.
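For reference, snapshot expiration in Iceberg is typically run as a Spark procedure; a hedged sketch (the catalog name `nessie`, table path, cutoff, and retention count are all assumptions):

```sql
-- Expire snapshots older than the cutoff while keeping the last 5.
-- Data and delete files (COW or MOR alike) are removed only once no
-- remaining valid snapshot references them.
CALL nessie.system.expire_snapshots(
  table       => 'db.names',
  older_than  => TIMESTAMP '2024-09-01 00:00:00',
  retain_last => 5
);
```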
The instructor is knowledgeable and able to address the new technology at a detailed pace. The challenge I see is that the instructor's pronunciation is not clear at some points. I advise the instructor to articulate ideas with clear pronunciation, as sometimes he is too fast and sometimes too slow.
Great presentation! Just one question: in Dremio, what would be the scenario where a view WON'T need reflections activated (and the view has many columns and high volume)?
I would actually default to not activating reflections; the Dremio engine is very performant on all sources directly. Dremio has a reflection recommender feature that will analyze query history, identify reflections you should activate, and even give you the SQL to create them.
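When you do decide a view needs one, a reflection can be declared in SQL. A rough sketch, assuming a view at the hypothetical path `myspace.sales_view` with the columns shown (exact syntax can vary between Dremio versions):

```sql
-- Raw reflection: materialize selected columns of the view
ALTER DATASET myspace.sales_view
CREATE RAW REFLECTION sales_raw
USING DISPLAY (order_id, region, amount);

-- Aggregation reflection: accelerate group-by queries on the view
ALTER DATASET myspace.sales_view
CREATE AGGREGATE REFLECTION sales_agg
USING DIMENSIONS (region) MEASURES (amount);
```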
Question: do you need to set up a connection from Dremio to Docker? I get the error "Invalid staging location provided" when loading files. Please help!
Do you have more details about what you are trying to connect to from Dremio? If you are just trying to evaluate it with Docker, follow the instructions in this blog: www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/
How do you persist the data using volumes? When the container is deleted, I don't want to recreate everything. Please send me a link to a solution if there is one.
github.com/developer-advocacy-dremio/dremio-compose - look through the different versions of the docker-compose files here; you may find some examples that help.
nice overview
🐬 fan already, next please show their corporate headquarters
🔥
Your voice is like an angel to fall asleep😇
Hi, thanks - this is really great information to start with Apache Iceberg. But I have a question: when modern databases already prune and scan data with such advanced technology, why would we need to store the data in file formats instead of directly loading it into a table?
When you start talking about 10TB+ datasets, you run into issues with whether a database can hold the dataset performantly. Also, different purposes need different tools, so you need your data in a form that can be used by different teams with different tools.
Also, with data lakehouse tables there doesn't have to be any running database server when no one is querying the dataset, since they are just files in storage, while traditional database tables need a persistently running environment.
@@Dremio wow! Now I have got full clarity. Thank you so much for your response.
@@Dremio cost saving. Thanks for the tip 😀.
Thanks for the great video. Question: when we first run the DELETE command in the lesson2 branch, does the data also appear in MinIO? That is, does MinIO object storage show both the lesson2 branch and the main branch separately? I am curious because in MinIO there are only data and metadata prefixes, and there is no directory for main vs. lesson2.
I think I got it now. The storage layer does not have the concept of branches, so the warehouse/data/ directory stores Parquet files for both the lesson2 and main branches. I can tell because there are files with different timestamps associated with my SQL operations in each branch.
Thank you so much! I have a question: I'm wondering if there might be a way to run these procedures automatically in Iceberg. Do I have to do these things manually every time?
Dremio Cloud has the ability to automate these types of operations
it's really helpful for me!! Thank you so much
Hi Alex, really thankful to you for such a nice explanation and hands-on. I got stuck at 'CREATE BRANCH IF NOT EXISTS lesson2 IN nessie'. This keeps failing with the error message "syntax error at or near 'BRANCH'". Am I missing something? Kindly assist.
If you want, PM me (Alex Merced) your Spark configs. Usually it's a typo or an update that needs to be made to the Spark configs. Spark can be very touchy on the config side, which is one reason using Dremio for a lot of Iceberg operations is so nice (much easier).
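For context, the `CREATE BRANCH` syntax is added by the Nessie Spark SQL extensions, so the Spark session must be started with them enabled; a sketch (config keys shown as comments, since exact values depend on your environment and versions):

```sql
-- These statements only parse when Spark was launched with the Nessie
-- extensions configured, e.g. via spark-defaults.conf or --conf flags:
--   spark.sql.extensions=org.projectnessie.spark.extensions.NessieSparkSessionExtensions
--   spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
-- Without them, Spark reports "syntax error at or near 'BRANCH'".

CREATE BRANCH IF NOT EXISTS lesson2 IN nessie;
USE REFERENCE lesson2 IN nessie;
```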
Awesome video!! At 3:18, when explaining the different delete formats, I have a question regarding the implementation: since the delete mode only accepts MOR or COW, how exactly do I specify whether the delete operation uses an equality delete or a positional delete?
It's mainly based on the engine; most engines will use position deletes, but streaming platforms like Flink will use equality deletes to keep write latency to a minimum.
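The write mode itself is set as an Iceberg table property; which delete file format (position vs. equality) gets written is then up to the engine, as noted above. A sketch, with a hypothetical table path:

```sql
-- Choose merge-on-read for row-level operations on this table;
-- the writing engine decides whether it emits position or equality deletes.
ALTER TABLE nessie.db.events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```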
very well explained...great job Dipankar
Great explanation and details
Great article, Alex. Slight issue creating a view in Dremio: I get the following exception, "Validation of view sql failed. Version context for table nessie.names must be specified using AT SQL syntax". Nothing obvious in the console output; any ideas?
That means the table is in Nessie and it needs to know which branch you're using, so it would be AT BRANCH main.
@@AlexMercedCoder Thanks Alex. This seems to be a limitation of the 'Save as View' dialogue, as it doesn't allow me to do this, and it doesn't default to the branch whose context you're currently in.
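A workaround consistent with the replies above is to skip the dialogue and create the view in the SQL editor with an explicit version context (the view path `myspace.names_view` is illustrative; `nessie.names` is the table from the thread):

```sql
-- Pin the table reference to a branch so view validation succeeds.
CREATE VIEW myspace.names_view AS
SELECT * FROM nessie.names AT BRANCH main;
```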