I finally understand what the parquet file format is, thanks to your video. Great job!
No one has explained it better on YouTube than you, Riz.
Thank you for making such a great video.
Wow, thanks
Thanks from India. Love the way you explain. Very simple and concise information
Thanks for the amazing explanation of the parquet file format. Coming from a wood business, parquet as flooring is not new to me; I have done many parquet installation projects. Interesting to see it coming back in Big Data and Data Engineering.
Thank you Riz. Very helpful video to get a high level understanding of the Parquet files!
Glad to hear that!
Great video. Loved the fact that you used Physical Graffiti - one of my fave albums of all time.
thanks!!
Really well explained and really nice. Keep going, Riz!!!
Thanks, will do!
Excellent and crisp explanation
Glad you liked it
Thanks for the clear explanation! It helps a lot!
Thank you Riz for the wonderful explanation!
My pleasure!
thanks for the explanation. very nicely done.
Thanks mate. A very good and quick explanation. Really good work.
Glad you liked it!
Love this video! Less than 10 minutes and in depth on the topic. Thank you!
Glad it was helpful!
I can definitely see your channel exploding in a few months. Good-quality content on difficult topics that are often covered in other videos lasting an hour, with poor sound quality and no logical flow. You are going places, my dude.
Those are very kind words, Luis!
I'm still learning to be a better YouTuber myself :)
You make some excellent content my man!
Glad you think so!
Lovely explanation, Riz, and thank you for the video! I would recommend your channel to all my colleagues who do database-related jobs!
Thanks for sharing!
Thank you for this very useful video!
Glad it was helpful!
Really great explanation! thank you so much
Glad you enjoyed it!
You are a genius! Fantastic video! Thanks!
Glad it helped!
Encoding and compression are very well explained. So I have a question: delta versus dictionary encoding, how would one decide which to use, given that dictionary seems so much more efficient? But then I suppose it depends on repetition.
Thank you.. Very well explained.. Crystal clear :)
Glad it was helpful!
How do you retrieve the latest file into the destination folder? Can you please explain?
Great explanation!!!!
Very helpful, thanks!🙏🏽
thank you. wonderful explanation
Great!!! Greetings from Peru!
thanks!
Thanks for the video, Riz! I was curious what the practical use case for LZO is (00:05:14), because when comparing it with Snappy, and assuming we're dealing with hot data, the only advantage of LZO would be faster decompression. Is there anything I'm missing? Thanks in advance.
that's also my understanding :)
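If you want to compare codecs yourself, here's a rough pyarrow sketch. Note that LZO is typically not bundled with pyarrow builds, so the sketch compares snappy, gzip and zstd instead; adjust the list for your setup (file names are hypothetical).

```python
# Rough sketch: write the same table with several parquet compression codecs
# and compare the resulting file sizes on disk.
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(100_000)),
    "value": ["row-%d" % (i % 1000) for i in range(100_000)],
})

for codec in ["snappy", "gzip", "zstd"]:
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```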
Nice summary. Although it would help to explain why querying parquet files is more efficient than CSV, especially for select * queries (where a row-store format is usually much more efficient). Is it because of the type definitions and metadata features of parquet? Thanks.
I really like this video, very useful. Can't wait for the next video ;)
Thank you! 😃
Thanks for the Parquet video Riz.
What is the difference between Parquet and Avro?
There are a few main differences, Sri Ram. Notably, Parquet is column-based while Avro is row-based (like Excel), so Parquet is better if you're querying the data column by column (e.g. for analytics), whereas Avro would be better (compared to Parquet) if you want to scan/query whole records at a time.
Plus, Avro's schema is written in JSON (more human-readable), while Parquet comes with its own format and is not as readable.
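As a quick illustration of that column-by-column access, here's a minimal pyarrow sketch (hypothetical file and column names) that reads only two columns from a parquet file instead of scanning whole rows:

```python
# Minimal sketch (hypothetical names): a columnar format lets an analytic
# query read only the column chunks it needs, skipping the rest of the file.
import pyarrow.parquet as pq

table = pq.read_table("sales.parquet", columns=["customer_id", "amount"])
print(table.num_rows, table.column_names)
```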
Hi Riz,
Thanks for this info, one of the best explanations I have seen.
One doubt:
1) When you give a table to Parquet, does it
- first partition the data by rows, then convert each partition to columnar form and store it inside the parquet file,
OR
- store the data directly as columnar into parquet?
And could you please explain ORC and the differences between Parquet, ORC, and Avro, and when to use which.
That's a good question and I don't know the answer; please do let me know when you find out!
Currently I'm really full with "life" at the moment, but yeah, I already plan to create videos about ORC and Avro. Stay tuned!
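For anyone who wants to poke at this themselves, here's a minimal pyarrow sketch (hypothetical file name) that prints how a parquet file is physically laid out: rows are first split horizontally into row groups, and each row group then stores one column chunk per column.

```python
# Sketch (hypothetical file name): inspect the physical layout of a parquet
# file. Rows are split horizontally into row groups; within each row group,
# every column is stored as its own column chunk.
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
meta = pf.metadata
print("row groups:", meta.num_row_groups)

for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    print(f"row group {rg}: {group.num_rows} rows")
    for col in range(group.num_columns):
        chunk = group.column(col)
        print("  column chunk:", chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size, "bytes")
```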
Subtitles are covering the content. Please enable the option to switch off captions.
thanks for the feedback!
Wonderful overview, thank you!
Glad it was helpful!
Great video, it cleared everything up for me.
Thanks!
Good and to the point.
thanks!
I am currently capturing live data in CSV format. But for storage benefits, I want the live data to be saved directly in parquet format. Is that possible or not?
Very useful overview Riz. As a total noob to this format, I have a simple question: how do you convert data into the parquet format? Is that possible?
Thanks, Conaill! You can convert data into parquet with many tools on the market these days; some notable examples in the Azure world are Spark (via Databricks or Synapse) and Data Factory (as part of its integration capabilities).
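For a minimal local example outside Azure, here's a sketch with pandas and pyarrow (hypothetical file names):

```python
# Minimal sketch (hypothetical file names): convert a CSV file to parquet
# locally with pandas; to_parquet uses pyarrow under the hood when installed.
import pandas as pd

df = pd.read_csv("live_capture.csv")
df.to_parquet("live_capture.parquet", compression="snappy")
```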
nice overview. thank you.
Thanks for watching!
Riz, the presentation looks good. I use parquet files through Cognos Analytics datasets. Are parquet files column-based by default?
Yes they are, Karthikeyan.
Hi Riz, can you do a video on the use case for AVRO compared to Parquet?
Already in my backlog, I've just been too busy procrastinating!! :P
Nice video🎉
Great video! How do I combine multiple snappy.parquet files into a single file and load it into Snowflake?
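A rough pyarrow sketch for this (hypothetical paths; it assumes all the files share the same schema): read the small files, concatenate them, and rewrite a single parquet file before loading it into Snowflake.

```python
# Rough sketch (hypothetical paths; assumes all files share the same schema):
# combine several snappy parquet files into one file.
import glob
import pyarrow as pa
import pyarrow.parquet as pq

tables = [pq.read_table(p) for p in glob.glob("part-*.snappy.parquet")]
combined = pa.concat_tables(tables)
pq.write_table(combined, "combined.snappy.parquet", compression="snappy")
```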
Sir, thanks for this detailed content. I have a query below that I couldn't get clarified anywhere:
People say to use ORC for Hive and Parquet for Spark. I don't understand the deeper logic behind this. If ORC is more efficient, why can't we use ORC instead of Parquet?
Hi Riz, I am doing development moving from parquet to Delta Lake. On the parquet side we have change data capture, which only reads the data if it has changed since the previous load. How good is it? Do you recommend using it for our SCDs? Do you see value?
Hi brother, I have an issue sending a parquet file to Snowflake. The problem is that the .parquet file is loaded into the Snowflake table, but the date column shows one day less, i.e. if the date is 12-01-2022, then in Snowflake it shows as 11-01-2022. I'm looking for help. I appreciate your time reading this. Thanks in advance!
What tools are used to query files (csv, parquet) directly? I've never heard of doing this.
Assuming you're using the Azure cloud, you can use PolyBase to query CSV and parquet files directly (i.e. by creating external tables) from Azure Blob Storage (or Data Lake) within Azure SQL Database or Azure Synapse SQL :)
@RizAngD Ah, thank you! Never used Azure before. Understood.
DuckDB
@ularkadutdotnet Thank you!
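For reference, a minimal DuckDB sketch in Python (hypothetical file and column names) querying a parquet file in place, without loading it into a database first:

```python
# Minimal sketch (hypothetical names): DuckDB can query CSV and parquet files
# directly from disk with plain SQL.
import duckdb

con = duckdb.connect()
result = con.execute(
    "SELECT category, COUNT(*) AS n "
    "FROM read_parquet('events.parquet') "
    "GROUP BY category"
).fetchdf()
print(result)
```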
Hi! I would like to start a personal project of creating a data warehouse in Azure Synapse Analytics. Do you have any suggestions of how I can do so without having to pay hundreds of dollars a month minimum for provisioning a dedicated SQL pool in azure for my project (as per pricing I've seen) Thanks so much! I hope I simply misunderstood Azures DW pricing.
Hello, so I need a data federation tool which has a Python client. I need to be able to connect to and query data from a wide variety of data storage platforms. As of now I store data in ADLS and SQL Server on Azure. What would you recommend?
Can you help me understand what you mean by a data federation tool?
@RizAngD So, a platform which can connect to many data storage places (S3, ADLS, MySQL, MSSQL, etc.) so that regardless of where the data is stored, I have a central platform through which I can access all of it.
Can you please elaborate more on what repetition levels and definition levels are, with a simpler example? It would really help. Thanks in advance! 😊
I suggest referring to this blog; it has a very comprehensive explanation :)
www.waitingforcode.com/apache-parquet/nested-data-representation-parquet/read
Thanks boss
Welcome
Hey Riz, I want your help. Can you please provide one sample parquet file with LZO compression? I am stuck; I tried a lot of things in PySpark and pyarrow to convert, but I was unable to create a parquet file with LZO compression. Please provide one sample file if you can help me.
It's not butter, it's Parquet...
Not a very helpful video without practical examples.
Sorry to hear that. Tx