Let us know if you are already using such a working pattern and if you are missing any features!
Looking forward to the Unity Catalog integration and to users being able to query Unity Catalog tables with DuckDB from their laptop, using permissions set on Unity Catalog. Very exciting!
Thank you for your work! I still have a few questions (that will most likely sound a little stupid).
Definition of catalogs
- From your experience, is there a way to automatically keep the content of the data governance catalog (the union of the objects' metadata) in sync with the content of the database object catalog?
Why are catalogs important
- When you say that the lakehouse adds a table format enabling ACID transactions and schema management, what does that mean in practice? From what I understand, these table formats are only used to catalog the content of the object storage. Could we also use them to rewrite the object storage data into the table format?
- Additionally, how do you originally populate your tables? In your example of the Delta Lake folder, how was the data (the Parquet file) documented?
- With advanced table formats, the raw data is not necessarily stored in its latest and correct version, but the JSON metadata file will inform you about that and track the history of the raw data, right? In that case, where is the "wrong raw data" mentioned in the table format stored (the version that is not up to date in your example)?
DuckDB as a portable catalog
- From what I understand, you suggest that DuckDB has its own file format that contains tables for data and metadata. Then you mention the possibility of having a file containing all the metadata, without the data itself (a super light file). So it's unclear to me: does the DuckDB file contain the data or not?
- From your extreme example, it seems that your DuckDB DB was in fact just a catalog of 4 DBs: air_quality, customer, duck, lineitem?
- If I don't want to move my DuckDB file to the cloud, how can I share it with other users?
- Can I modify the content of the DBs 'linked' in the DuckDB file anywhere other than on MotherDuck?
Sorry for the numerous questions and thanks again for your work!!
Hey there! I'll do my best to answer here, but please do not hesitate to join slack.motherduck.com to continue the conversation over there!
Definition of catalogs
- From your experience, is there a way to automatically keep the content of the data governance catalog (the union of the objects' metadata) in sync with the content of the database object catalog?
--> It depends on the existing stack. Some data catalogs will cover both; for others, you'll have to tie things together through APIs.
Why are catalogs important
- When you say that the lakehouse adds a table format enabling ACID transactions and schema management, what does that mean in practice? From what I understand, these table formats are only used to catalog the content of the object storage. Could we also use them to rewrite the object storage data into the table format?
--> Table formats have other capabilities, like handling the many-small-files problem, for instance. There, they rewrite multiple small files into one behind the scenes to improve query performance. So in that case, you are not only cataloging, you are actively transforming the files in the object storage.
- Additionally, how do you originally populate your tables? In your example of the Delta Lake folder, how was the data (the Parquet file) documented?
--> There are different clients that support writes at the moment; that would be SQL engines (BigQuery/Snowflake) or libraries (e.g. delta-rs).
DuckDB as a portable catalog
- From what I understand, you suggest that DuckDB has its own file format that contains tables for data and metadata. Then you mention the possibility of having a file containing all the metadata, without the data itself (a super light file). So it's unclear to me: does the DuckDB file contain the data or not?
--> It can contain both. If you load only VIEW definitions, that would be metadata only. If you store actual tables, that would be data + metadata.
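For illustration, a minimal DuckDB SQL sketch of the two cases (the S3 bucket and paths here are made up):

```sql
INSTALL httpfs;
LOAD httpfs;  -- needed to read from S3; credentials setup omitted

-- Metadata only: the .duckdb file stores just the view definition,
-- while the Parquet data stays in object storage.
CREATE VIEW air_quality AS
SELECT * FROM read_parquet('s3://my_bucket/air_quality/*.parquet');

-- Data + metadata: the table is materialized inside the .duckdb file.
CREATE TABLE customer AS
SELECT * FROM read_parquet('s3://my_bucket/customer/*.parquet');
```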
- From your extreme example, it seems that your DuckDB DB was in fact just a catalog of 4 DBs: air_quality, customer, duck, lineitem?
--> It's one DB (1 file) that contains these 4 tables.
- If I don't want to move my DuckDB file to the cloud, how can I share it with other users?
--> If by 'cloud' you mean MotherDuck, you can use any object storage instead.
- Can I modify the content of the DBs 'linked' in the DuckDB file anywhere other than on MotherDuck?
--> Yes, any DuckDB file can be opened and shared in any way you can share any file.
All this is very nice, but since the "catalog" contains only view definitions, any query against the S3-Parquet-backed views will require downloading those Parquet files, right? Unless DuckDB caches the Parquet files locally (and it seemed it did not the last time I checked), a latency tax will be imposed again and again, won't it?
That's right. You have different options to speed things up:
- Leverage the cloud network: have DuckDB hosted, or use MotherDuck
- Use adequate partitioning to filter down to the data you need
- Load the data into an internal DuckDB table to act as a cache (see the sketch below)
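A rough sketch of that third option, assuming the data lives under a made-up S3 path:

```sql
-- One-off: materialize the remote Parquet data into a local DuckDB table.
CREATE TABLE lineitem_cache AS
SELECT * FROM read_parquet('s3://my_bucket/lineitem/*.parquet');

-- Subsequent queries hit the local table, with no S3 round trip.
SELECT count(*) FROM lineitem_cache;
```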
Thank you.....
It is not fully clear how it recognizes the data catalog... Does the "attach" command work like a crawler? Or is it the initial definition?
Also, it would be nice to have a video showing how it would work for data wrangling... As of now, you would set up a data catalog, crawl it to set up the open tables, and then use Glue to do whatever (in AWS). This is a nice concept, but it brings up a few questions, like... does the definition remain inside S3, or locally? When the file contents are read (select * from a table in the catalog), does it charge me for downloading the file contents (meaning it would become expensive with multiple files, as DuckDB would download them first and then read)? Would it be best if we set up a serverless function to act as glue while using DuckDB as a portable catalog?
Sorry for all the questions haha. That's a pretty interesting concept to me (portable catalog), and it might work well for small/medium companies (not sure about streaming and big data) 😊
| Does the "attach" command work like a crawler? Or is it the initial definition?
Neither of these, actually. As these are VIEWS, it depends on how you define them. For instance, if you define a view with `s3://my_bucket/my_table/file1.parquet` and you have other Parquet files, this VIEW will not be up to date. If you define it with `s3://my_bucket/my_table/*.parquet`, then it will crawl all Parquet files when you run a query.
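Roughly, in DuckDB SQL, the two cases could look like this (the view names are placeholders; the paths are the ones from the example above):

```sql
-- Pinned to a single file: Parquet files added later are ignored.
CREATE VIEW my_table_static AS
SELECT * FROM read_parquet('s3://my_bucket/my_table/file1.parquet');

-- Glob pattern: every matching Parquet file is picked up at query time.
CREATE VIEW my_table_live AS
SELECT * FROM read_parquet('s3://my_bucket/my_table/*.parquet');
```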
| does the definition remain inside the S3, or locally?
The ATTACH command just attaches a local or remote database. So if you update the DuckDB file with the new definitions and store it remotely, the definitions would live on S3.
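A minimal sketch of what attaching such a remote catalog file could look like, assuming a recent DuckDB with the httpfs extension and a hypothetical S3 path:

```sql
INSTALL httpfs;
LOAD httpfs;  -- S3 credentials setup omitted

-- Attach the shared catalog file straight from S3 (read-only),
-- then query a view it defines.
ATTACH 's3://my_bucket/shared_catalog.duckdb' AS shared_catalog (READ_ONLY);
SELECT * FROM shared_catalog.air_quality LIMIT 10;
```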
| When the file contents are read (select * from table in catalog), does it charge me as a download for the file contents (meaning, it would become expensive with multiple files, as DuckDB would download it first and then read)? Would it be best if we set up a serverless function to act as a glue while using DuckDB as portable catalog?
Yes, and that's also where MotherDuck can help: you leverage not only cloud compute but also the cloud network.
Hope this clarifies things a bit. Please don't hesitate to join our Slack at slack.motherduck.com/ if you want to continue the conversation over there :)
Can you explain this much more simply? :-) I work with Power BI and Python (Streamlit) and query SQL Server. In what cases can this help me? :-)
I don't clearly see the need in your case, as your only source would be SQL Server.
If you have many different sources but you don't want to share the actual data (e.g. CSV, Parquet, whatever you have), you could declare them as VIEWS in DuckDB and just share the metadata file, which would be lighter and easier to manage (one file) than all the data sources.
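As a rough sketch of that pattern (all paths here are placeholders), a single metadata-only catalog file pointing at several sources without copying their data:

```sql
-- Run against a file-backed database, e.g. `duckdb shared_catalog.duckdb`.
-- Only the view definitions end up in the .duckdb file; the data stays put.
CREATE VIEW sales AS
SELECT * FROM read_csv_auto('s3://my_bucket/exports/sales.csv');

CREATE VIEW events AS
SELECT * FROM read_parquet('s3://my_bucket/events/*.parquet');

-- Share the resulting .duckdb file; recipients ATTACH it and query the views.
```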