AWS Tutorials - AWS Glue Job Optimization Part-3
- Published: 12 Mar 2022
- AWS Glue Job and Lake Formation Crash Course - • AWS Tutorials Crash Co...
Building AWS Glue Job using PySpark - • Building AWS Glue Job ...
AWS Tutorials - AWS Glue Job Optimization Part-1 - • AWS Tutorials - AWS Gl...
Job Code - github.com/aws-dojo/analytics...
Data File- github.com/aws-dojo/analytics...
Optimizing an AWS Glue job is an interesting and frequently asked-about topic. There are many ways to optimize a Glue job, such as tuning memory or capacity. In this video, you learn how to control parallelism across workers and Spark tasks by grouping input files based on size.
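The file-grouping technique the video describes can be sketched as below. This is a minimal sketch based on the Glue S3 connection options `groupFiles` and `groupSize`; the bucket path, format, and group size are hypothetical examples, and the actual reader call only runs inside a Glue job, so it is shown as a comment.

```python
# Connection options that enable file grouping when reading from S3.
# "groupFiles": "inPartition" tells Glue to coalesce small files,
# and "groupSize" sets the target group size in bytes.
connection_options = {
    "paths": ["s3://my-bucket/input/"],  # hypothetical input location
    "recurse": True,
    "groupFiles": "inPartition",         # enable file grouping
    "groupSize": "134217728",            # target ~128 MB per group (bytes)
}

# Inside a Glue job, this dict would be passed to the reader, e.g.:
#   dyf = glueContext.create_dynamic_frame_from_options(
#       connection_type="s3",
#       connection_options=connection_options,
#       format="csv",
#   )

# Sanity check: the group size expressed in MB.
print(int(connection_options["groupSize"]) // (1024 * 1024))  # → 128
```

Fewer, larger groups mean fewer Spark tasks, which is how the job's parallelism gets controlled.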
Simply explained.
Thanks for the appreciation
I am forwarding your videos in my office telegram group as well
Many thanks for the appreciation. If you have any specific requirement, please let me know - I would love to cover it in a video if not already done.
Also, if I use groupSize, do I understand correctly that the output should have larger files and fewer of them, because the input is being grouped?
Can we connect on LinkedIn? Your content goes beyond most paid courses. Thanks to the workflow video, I was able to create a scalable system for my ETL.
How can I achieve this using the _from_catalog() method? It seems that if I add the groupFiles and groupSize settings, it does not work. Please advise.
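One commonly suggested approach for catalog-based reads is to pass the grouping settings through `additional_options`. This is a hedged sketch, not a verified answer to the comment: the database and table names are hypothetical, and grouping only applies to supported formats (not parquet, as noted in a later comment).

```python
# Grouping settings passed alongside a Data Catalog read.
additional_options = {
    "groupFiles": "inPartition",
    "groupSize": "67108864",  # target ~64 MB per group (bytes)
}

# Inside a Glue job, the dict would be supplied to the catalog reader:
#   dyf = glueContext.create_dynamic_frame.from_catalog(
#       database="my_db",        # hypothetical database name
#       table_name="my_table",   # hypothetical table name
#       additional_options=additional_options,
#   )
```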
This is a great video. I am also facing issues loading data while performing transformations with PySpark SQL. How can we improve performance in Glue for data with more than 60 million records? Currently using 10 DPUs.
Need more info, like source type, number of tables, and whether it is a full dump or incremental.
Wonderful videos and highly useful to learn concepts that are not widely discussed elsewhere. I need to create a table in Snowflake (dynamically) based on the schema definition from Glue catalog (that crawls a few parquet files). Is it possible?
Hi, sorry but I don't have much idea about snowflake
@@AWSTutorialsOnline Not a problem. Once again, thanks for your enlightening videos with valuable content.
Nice! But note: "The AWS Glue Parquet writer has historically been accessed through the glueparquet format type. This access pattern is no longer advocated." Use the classic "parquet" format instead.
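The switch the comment recommends can be sketched as below. This assumes the `useGlueParquetWriter` format option available in recent Glue versions, which keeps the optimized writer while using the plain "parquet" format name; the S3 output path is a hypothetical example, and the writer call is shown as a comment since it needs a Glue runtime.

```python
# Format options selecting the optimized Glue Parquet writer while
# using the plain "parquet" format instead of the legacy "glueparquet".
format_options = {"useGlueParquetWriter": True}

# Inside a Glue job:
#   glueContext.write_dynamic_frame.from_options(
#       frame=dyf,
#       connection_type="s3",
#       connection_options={"path": "s3://my-bucket/output/"},  # hypothetical
#       format="parquet",          # preferred over "glueparquet"
#       format_options=format_options,
#   )
```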
As per the AWS documentation, groupFiles is not supported for the parquet format.
Can we control parallelism if we are reading only one file with a huge amount of data, like a text file with 3 million records?
File grouping does not help parallelism for a single large file. I would recommend writing an ETL step to break large files into smaller ones.
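The pre-processing step suggested above could look like this minimal sketch: plain Python with no Glue dependency, splitting one large text file into smaller chunk files so a downstream reader can parallelize. The function names, chunk size, and paths are illustrative, not from the video.

```python
import os
import tempfile

def split_file(path: str, out_dir: str, lines_per_chunk: int = 1_000_000) -> list[str]:
    """Split `path` into chunk files of at most `lines_per_chunk` lines each."""
    os.makedirs(out_dir, exist_ok=True)
    chunks: list[str] = []
    buf: list[str] = []
    with open(path) as src:
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_chunk:
                chunks.append(_flush(buf, out_dir, len(chunks)))
                buf = []
    if buf:  # write any remaining partial chunk
        chunks.append(_flush(buf, out_dir, len(chunks)))
    return chunks

def _flush(buf: list[str], out_dir: str, idx: int) -> str:
    """Write buffered lines to a numbered part file and return its path."""
    out_path = os.path.join(out_dir, f"part-{idx:05d}.txt")
    with open(out_path, "w") as dst:
        dst.writelines(buf)
    return out_path

# Demo with a tiny file: 10 lines split into chunks of 4 -> 3 part files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "big.txt")
    with open(src, "w") as f:
        f.write("".join(f"record {i}\n" for i in range(10)))
    parts = split_file(src, os.path.join(tmp, "parts"), lines_per_chunk=4)
    print(len(parts))  # → 3
```

With the data split into many part files, Glue's file grouping (or plain Spark partitioning) can then distribute the reads across tasks.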