- Videos: 566
- Views: 9,361,474
Dremio
USA
Joined 11 Feb 2017
The Unified Lakehouse Platform for Self-Service Analytics
Bring users closer to the data with lakehouse flexibility, scalability, and performance at a fraction of the cost. Dremio's intuitive Unified Analytics, high-performance SQL Query Engine, and Lakehouse Management service for next-gen DataOps let you shift left for the fastest time to insight.
Whoop's Carlos Peralta on Building a Data-Driven Culture at Whoop and Moderna
Data Disruptors - A Podcast for Data Leaders. Listen to Carlos Peralta, a Whoop data leader, on the disruptive technologies giving the company access to cutting-edge insights, in "Building a Data-Driven Culture at Whoop and Moderna: Collaboration and Alignment."
In this episode, Tomer Shiran interviews Carlos Peralta, MLOps & Data Engineering Global Director at WHOOP, who discusses his experience building data platforms and driving a data-driven culture. He emphasizes the importance of collaboration between technical teams and business stakeholders, as well as the need for data quality and integrity. Carlos also highlights the challenges of model interpretability and fairness, as well as automated feature engineering...
Views: 52
Videos
EP57 - From Hadoop & Hive to Minio & Dremio: Moving Towards a Next Gen Data Architecture
Views: 119 · 12 hours ago
Legacy data platforms often fall short of the performance, processing and scaling requirements for robust AI/ML initiatives. This is especially true in complex multi-cloud (public, private, edge, airgapped) environments. The combined power of MinIO and Dremio creates a data lakehouse platform that overcomes these challenges, delivering scalability, performance and efficiency to ensure successfu...
Hands-on with Dremio #5 - Dremio with Python and BI Tools
Views: 112 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #4 - Apache Iceberg and Git-for-Data
Views: 77 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #3 - Data Reflections
Views: 59 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #2 - Preparing Data Across Sources (Joins, Type Conversions, Drop Columns, etc)
Views: 46 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Hands-on with Dremio #1 - Setup and Source Connections
Views: 70 · 16 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #8 - BI Dashboards with Dremio
Views: 52 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-on Demo #7 - Using Dremio in Python
Views: 54 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #6 - Apache Iceberg & Git-for-Data
Views: 36 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #5 - Reflections
Views: 68 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #4 - Preparing Data (Change Data Types, Drop Columns, Joins)
Views: 47 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #3 - Connecting Sources
Views: 36 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-On Demo #2 - Running SQL & Creating Views
Views: 53 · 19 hours ago
In this video series, Dremio's Alex Merced tours the basics of working with the Dremio Lakehouse Platform. Repo with Environment: github.com/developer-advocacy-dremio/dremio-demo-env-092024 Get Started with Dremio: www.dremio.com/get-started?
Dremio Hands-on Demo #1 - Environment Setup (Docker Compose)
Views: 72 · 19 hours ago
Special Edition 1 - Apache Iceberg Q&A
Views: 237 · 14 days ago
Hands-on with Apache Iceberg on Your Laptop: Deep Dive with Apache Spark, Nessie, Minio, Dremio...
Views: 412 · 14 days ago
EP56 - What’s New in Dremio: Improved Automation, Performance + Catalog for Iceberg Lakehouses
Views: 133 · 14 days ago
Dremio Demo - Federated Queries Joining Mongo and Postgres Data (Breaking Data Silos)
Views: 119 · 21 days ago
Real-Time Analytics Across Data Sources Using Dremio
Views: 295 · 21 days ago
The Iceberg REST Catalog - Meetup - Tampa Bay Data Engineers Group
Views: 202 · 1 month ago
End-to-End Data Engineering from CSV/JSON/Parquet to Apache Iceberg to Apache Superset Dashboard
Views: 441 · 1 month ago
Cyber Lakehouse for the AI Era, ZTA and Beyond
Views: 100 · 1 month ago
EP55 - Unite Data Across Dremio, Snowflake, Iceberg, and Beyond
Views: 178 · 1 month ago
EP54 - Mastering Semantic Layers: The Key to Data-Driven Innovation
Views: 415 · 1 month ago
Tampa Bay Data Engineers Group - July 2024 - Dremio's Reflections
Views: 204 · 2 months ago
Unifying On-Prem and Cloud Data with Dremio: Cloud Data on Snowflake + On-Prem Data with Minio
Views: 201 · 2 months ago
Apache Iceberg Lakehouse crash course
Views: 739 · 2 months ago
EP53 - Build the next-generation Iceberg lakehouse with Dremio and NetApp
Views: 374 · 2 months ago
Is that Hive table sub-partitioned on a date string, and then further bucketed? That would drastically reduce those 3 million files... Without more info on that structure, I can't agree with the performance issues in this example.
The audio is very, very soft, even at 100%.
Can you do more of an intro, telling people what your tool does? The tool looks amazing, but it was only at the end that I found out what it did.
Audio is too low; I can't hear much even at max volume, and it's crackly.
May have to re-record this; I'm not sure which of my microphones it was using that overdrove the audio. I'll double-check and re-record another iteration of this series, probably as one longer video.
Great content as always! Can you do a demo covering the engines inside the Dremio config?
Thanks Alex! I have been watching most of your video and blog tutorials. You have really improved my knowledge of the data lakehouse, Apache Iceberg, and Dremio. However, I can't find any detailed enterprise BI cloud solution for Dremio Sonar and Arctic. Can you please take us through an enterprise cloud BI solution using Azure Data Lake Gen2 as the storage layer? Otherwise, could you point me to an existing blog on the Dremio website? Regarding this tutorial, do I really need Spark, since I can accomplish all my tasks in Dremio? Cheers!
Nope, I could do everything in Dremio or other tools; I was just demonstrating how different tools can work with a single copy of the data in Iceberg. Regarding BI tools, what tool are you looking to use? When using Dremio, connecting to a BI tool is exactly the same regardless of the storage layer. Essentially, you connect your sources to Dremio and then Dremio to your BI tool, and Dremio can serve all your data to that BI tool over a single connection.
@@Dremio Well noted and thank you! My BI tool is PowerBI.
So to understand/use Iceberg you also need 20 other components on top of running Hadoop and everything that requires? Seems like an... "improvement"? Hadoop should never have been turned into a data warehouse to begin with.
Hadoop wasn't used in this video, and you don't need all the tools shown. A variety of tools were used in the demo to demonstrate the portability of the data.
Very poor sound quality!
First comment
Very nice and concise 😂
Great video ❤
Very nice tutorial, am beginning to see how branching is useful
This was super helpful!
Amazing ---
The instructor talks too fast, which makes it difficult to keep up, especially for English-as-a-second-language speakers. I advise the instructor to slow down to convey his message clearly. After all that, thank you for the architectural lecture about the data lakehouse, as it is a confusing term for many of us.
Pretty awesome talk! I would love to see geospatial formats be supported in Apache Iceberg, but I'm glad that this fork exists!
There are so many tools used here my head hurts. For getting data into a DWH, how can this possibly be seen as better than classic Parquet files in Hive-partitioned storage with an external table on top? In GCP this is 5 lines of SQL code; in Redshift you might have to add Glue to the mix. And then any SQL-running tool can do the rest inside the database/DWH. In 99% of companies there are at best 3-4 data engineers; managing all this plus data modelling would mean nothing ever gets done.
Alex is the man - explains these things and the context so well.
Excellent overview - thanks! My question is about how best to get that first "raw copy" into Iceberg. Using the SQL connector I can hook Dremio up to MS-SQL data sources, and they are then available for reflections - but I don't see any clear way to copy the data into an Iceberg table that would persist (in case the source data became unavailable). Am I missing something?
If you connect an Iceberg catalog source (Hive, Nessie), you can write data to it, so you can use syntax like CTAS to write the data, and it will be persisted in the storage configured when you connected the catalog. You'd have to orchestrate workloads so it updates the new Iceberg table from the source system, or you can use an external tool to land the data as Iceberg in your lake, like Spark, Flink, Upsolver, Airbyte, etc.
@@Dremio Using Airbyte to do the ingestion, will the first Iceberg table, created using Nessie, automatically update itself?
@@guilhermeranieri8445 Depends on how you are doing it. How are you ingesting with Airbyte, directly into Iceberg?
@@Dremio Using MinIO. I send data to MinIO and manipulate the data in Spark, but I have already created views in Dremio for testing.
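A minimal sketch of the CTAS approach described in this thread, assuming a Nessie catalog source named `nessie` and an MS-SQL source named `mssql` with a `dbo.customers` table (all of these names are hypothetical):

```sql
-- Persist a raw copy of a relational source table as an Iceberg table;
-- the files land in the storage configured on the Nessie catalog source.
CREATE TABLE nessie.raw.customers AS
SELECT * FROM mssql.dbo.customers;

-- An orchestrated refresh could then append newer rows, e.g.
-- (the updated_at column and cutoff are illustrative):
INSERT INTO nessie.raw.customers
SELECT * FROM mssql.dbo.customers
WHERE updated_at > TIMESTAMP '2024-09-01 00:00:00';
```

Because the copy is a real Iceberg table in the catalog's storage, it remains queryable even if the MS-SQL source becomes unavailable.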
How does Dremio support the Iceberg REST catalog? I cannot seem to find a way, other than Nessie, to create an S3-resident Iceberg table in Dremio.
You could also use AWS Glue, but the REST catalog connector is coming soon in Dremio.
Why is the spark configuration with all of the lakehouse services hardcoded in a notebook? Shouldn’t these configurations be incorporated into the docker image you’re using for Spark?
I do that primarily for educational purposes, to help people learn the Spark configs so they can apply the learning to their environment. Many tutorials abstract away the configs, and then when people try to apply what they learned, they don't know what the configs are or where they come from. - Alex
This is so nice. Now I don't have to pay for Databricks in order to learn Spark!
Why did they create the manifest list and manifest files as two separate layers? They could have just created a metadata file and a manifest file for each snapshot. What challenges could we have if we don't have the manifest list in the Iceberg design?
The reason is that, as tables get larger, a single manifest that listed all files would get larger and take longer to traverse. By breaking it up into manifests, you can more efficiently scan only the portions of the file listing you need for the query. Using partition pruning, I may have 100 manifests in a snapshot but only need to scan the 10 relevant to the query, resulting in much faster scan planning. Also, these manifests can be reused, so you use up less storage space by not having to rewrite lists of files that have already been listed in a pre-existing manifest.
❤
Thanks for the tutorial. If I use CDC-based ingestion as the data source, where does the Spark job for writing the Iceberg table (steamWrite) run? Is it inside Airbyte?
Thanks!!
This was really helpful, thanks!
When we expire a snapshot, if our table was created as copy-on-write or merge-on-read, what happens in that case?
Same thing: if a file is associated with a valid snapshot, it will not be deleted.
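For reference, snapshot expiration in Iceberg is typically run as a Spark procedure; a hedged sketch (the catalog name `nessie`, table path, cutoff, and retention count are all assumptions):

```sql
-- Expire snapshots older than the cutoff while keeping the last 5.
-- Data and delete files (COW or MOR alike) are removed only once no
-- remaining valid snapshot references them.
CALL nessie.system.expire_snapshots(
  table       => 'db.names',
  older_than  => TIMESTAMP '2024-09-01 00:00:00',
  retain_last => 5
);
```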
The instructor is knowledgeable and able to address the new technology at a detailed pace. The challenge I see is that the instructor's pronunciation is not clear at some points. I advise the instructor to articulate ideas with clear pronunciation, as sometimes he is too fast and sometimes too slow.
Great presentation! Just one question: in Dremio, what would be the scenario where a view WON'T need reflections activated (and the view has many columns and high volume)?
I would actually default to not activating reflections; the Dremio engine is very performant on all sources directly. Dremio has a reflection recommender feature that will analyze query history, identify reflections you should activate, and even give you the SQL to create them.
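When you do decide a view needs one, a reflection can be declared in SQL. A rough sketch, assuming a view at the hypothetical path `myspace.sales_view` with the columns shown (exact syntax can vary between Dremio versions):

```sql
-- Raw reflection: materialize selected columns of the view
ALTER DATASET myspace.sales_view
CREATE RAW REFLECTION sales_raw
USING DISPLAY (order_id, region, amount);

-- Aggregation reflection: accelerate group-by queries on the view
ALTER DATASET myspace.sales_view
CREATE AGGREGATE REFLECTION sales_agg
USING DIMENSIONS (region) MEASURES (amount);
```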
Question: do you need to set up a connection from Dremio to Docker? I get the error "Invalid staging location provided" when loading files. Please help!
Do you have more details about what you are trying to connect to from Dremio? If you are just trying to evaluate it with Docker, follow the instructions in this blog: www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/
How do you persist the data using volumes? When the container is deleted, I don't want to recreate everything. Please send me a link to a solution if there is one.
github.com/developer-advocacy-dremio/dremio-compose - look through the different versions of the docker-compose files here; you may find some examples that help.
nice overview
🐬 fan already, next please show their corporate headquarters
🔥
Your voice is like an angel to fall asleep😇
Hi, thanks - this is really great information to start with Apache Iceberg. But I have a question: when modern databases already prune and scan data with such advanced technology, why would we need to store the data in file formats instead of directly loading it into a table?
When you start talking about 10TB+ datasets, you run into issues with whether a database can hold the dataset performantly. Also, different purposes need different tools, so you need your data in a form that can be used by different teams with different tools.
Also, with data lakehouse tables there doesn't have to be any running database server when no one is querying the dataset, since they are just files in storage, while traditional database tables need a persistently running environment.
@@Dremio wow! Now I have got full clarity. Thank you so much for your response.
@@Dremio cost saving. Thanks for the tip 😀.
Thanks for the great video. Question: when we first run the DELETE command in the lesson2 branch, does the data also appear in MinIO? That is, does MinIO object storage show both the lesson2 branch and the main branch separately? I am curious because in MinIO there are only data and metadata prefixes, and there is no directory for main vs. lesson2.
I think I got it now. The storage layer does not have the concept of branches, so the warehouse/data/ directory stores Parquet files for both the lesson2 and main branches. I can tell because there are files with different timestamps associated with my SQL operations in each branch.
Thank you so much! I have a question: I'm wondering if there might be a way to run these procedures automatically in Iceberg. Do I have to do these things manually every time?
Dremio Cloud has the ability to automate these types of operations
it's really helpful for me!! Thank you so much
Hi Alex, really thankful to you for such a nice explanation and hands-on. I got stuck at 'CREATE BRANCH IF NOT EXISTS lesson2 IN nessie'. This keeps failing with the error message "syntax error at or near 'BRANCH'". Am I missing something? Kindly assist.
If you want, PM me (Alex Merced) your Spark configs. Usually it's a typo or an update that needs to be made to the Spark configs. Spark can be very touchy on the config side, which is one reason using Dremio for a lot of Iceberg operations is so nice (much easier).
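For context, the `CREATE BRANCH` syntax is added by the Nessie Spark SQL extensions, so the Spark session must be started with them enabled; a sketch (config keys shown as comments, since exact values depend on your environment and versions):

```sql
-- These statements only parse when Spark was launched with the Nessie
-- extensions configured, e.g. via spark-defaults.conf or --conf flags:
--   spark.sql.extensions=org.projectnessie.spark.extensions.NessieSparkSessionExtensions
--   spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
-- Without them, Spark reports "syntax error at or near 'BRANCH'".

CREATE BRANCH IF NOT EXISTS lesson2 IN nessie;
USE REFERENCE lesson2 IN nessie;
```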
Awesome video!! At 3:18, when explaining the different delete formats, I have a question regarding the implementation: since the delete mode only accepts MOR or COW, how exactly do I specify whether the delete operation uses an equality delete or a positional delete?
It's mainly based on the engine; most engines will use position deletes, but streaming platforms like Flink will use equality deletes to keep write latency to a minimum.
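The write mode itself is set as an Iceberg table property; which delete file format (position vs. equality) gets written is then up to the engine, as noted above. A sketch, with a hypothetical table path:

```sql
-- Choose merge-on-read for row-level operations on this table;
-- the writing engine decides whether it emits position or equality deletes.
ALTER TABLE nessie.db.events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```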
very well explained...great job Dipankar
Great explanation and details
Great article, Alex. Slight issue creating a view in Dremio: I get the following exception, "Validation of view sql failed. Version context for table nessie.names must be specified using AT SQL syntax". Nothing obvious in the console output; any ideas?
That means the table is in Nessie and it needs to know which branch you're using, so it would be AT BRANCH main.
@@AlexMercedCoder Thanks Alex. This seems to be a limitation of the 'Save as View' dialogue, as it doesn't allow me to do this, and it doesn't default to the branch whose context you're currently in.
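A workaround consistent with the replies above is to skip the dialogue and create the view in the SQL editor with an explicit version context (the view path `myspace.names_view` is illustrative; `nessie.names` is the table from the thread):

```sql
-- Pin the table reference to a branch so view validation succeeds.
CREATE VIEW myspace.names_view AS
SELECT * FROM nessie.names AT BRANCH main;
```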