Building AWS Glue Job using PySpark
- Published: 30 Jul 2024
- The workshop URLs
Part1- aws-dojo.com/workshoplists/wo...
Part2- aws-dojo.com/workshoplists/wo...
AWS Glue jobs are used to build ETL jobs that extract data from sources, transform the data, and load it into targets. Jobs can be built using languages like Python and PySpark. PySpark is the Python API for Spark and is used for big data processing. It can perform transformations on large-scale data in a fast and efficient way. This workshop is covered in two parts.
Part-1: You learn about setting up a data lake, creating a development environment for PySpark, and finally building a Glue job using PySpark.
Part-2: You learn about PySpark for various types of transformations, especially when using an AWS Lake Formation and Glue based data lake. These transformations can be used in AWS Glue jobs.
Hi, I just commented here to say thank you for creating this awesome workshop. It helped me a lot while getting acquainted with the data lake concept. Once again: thank you so much and keep up the good work. :)
You're very welcome!
Thank you so much for such a wonderful workshop.
Thanks for appreciation.
This was helpful. I am a new Data Engineer and this helped with my job.
Glad it helped.
Thanks so much for all the learning!
Thank you for uploading this video, very helpful.
Glad it was helpful!
Great tutorial. Thanks for creating and sharing.
You are so welcome!
Thank you for such concise information. Please make more videos about AWS Glue jobs; I am totally new to Amazon Glue, so this video is very useful for me.
Thanks for the workshop document.
You are welcome!
Thanks a lot, it was a nice session. God bless you 😊
You're welcome 😊
Awesome tutorial!
Thank you!
Thanks for the videos, they're so detailed.
Glad you like them!
Love your videos
Thanks
It'd be a great step if you did this not only by showing images but also by doing it in the AWS console.
Can we use plain PySpark syntax to filter data from the DataFrame in Glue?
My question: can we use a local Jupyter notebook? SageMaker notebooks carry a high cost. And at the same time, how do we schedule these notebooks?
Hi! thank you for the great video. When I ran the Glue Job at the end of the part 1 tutorial, the job created over 80 JSON objects containing two rows each. I'd expect a single JSON file which was the result I got when running the PySpark script from SageMaker. Did I miss anything?
Not sure what went wrong without looking at the code. But you can force repartitioning by using the "repartition" method. Here is one link which will give you details - stackoverflow.com/questions/56294720/aws-glue-output-one-file-with-partitions
Hope it helps,
@@AWSTutorialsOnline thank you!
Great video. Could you please advice on some performance improvement measure for AWS Glue? Looking forward to your valuable input.
My first recommendation is using Glue version 2.0, which has improved the start-up time in a big way. Memory management is very critical for a Glue job. Please refer to this article for that - aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
Glue generally uses Athena queries for data. Optimizing Athena queries improves performance of the Glue job. Please refer to this link for Athena query optimization - aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
Hope it helps,
@@AWSTutorialsOnline Thanks a lot
Hello, thank you for the amazing workshop. I have a specific scenario: I am getting files from Oracle via a DMS migration task into an S3 raw bucket. The initial load brings all the columns as-is, but for the incremental CDC load I get an extra column, OP, with values U, I, and D, which tells me if a record was updated, inserted, or deleted. When I create a crawler, it detects the columns, but when I see the data in Athena it is completely disorganized. Any suggestion how to address this issue? Should I create 2 crawlers, then use a PySpark Glue job to merge based on the OP column?
Thanks for the mail. Let's isolate the problem first. Without CDC, when you catalog the data and query it in Athena, does it work? If yes, then probably the CDC column is messing things up. In that case, I would keep the CDC data separate and use a PySpark job to merge it back. Hope it helps,
@@AWSTutorialsOnline As indicated in the workshop, I created a Glue job with bookmarks enabled. It wrote the expected file to the target, but when I ran the job a second time, it created the file again. I was under the impression that turning on bookmarks would not create another file, as it would save the state from the previous run.
@@pk-wanderluster Hi, I think the problem is with the transformation context. In my example, I didn't use a transformation context. As per the AWS documentation: "For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter. If you don't pass in the transformation_ctx parameter, then job bookmarks are not enabled for a dynamic frame or a table used in the method." Please try with a transformation context and it should solve the problem. Please let me know,
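A sketch of what that looks like in a Glue job script; this only runs inside AWS Glue, and the database/table names are placeholders:

```python
# Glue job read with bookmarks enabled (runs only inside AWS Glue).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the key: without it, bookmarks silently do nothing
# for this read, and every run reprocesses (and rewrites) the same data.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",              # placeholder
    table_name="sales",               # placeholder
    transformation_ctx="read_sales",  # enables bookmark state for this source
)

job.commit()  # persists the bookmark state at the end of the run
```

The job bookmark option itself still has to be enabled in the job settings; the `transformation_ctx` argument just ties this read to the saved state.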
@@AWSTutorialsOnline Perfect! As you said, I added the parameter and it worked!
Thanks for the tutorial, it was really helpful.
However, I have a problem. When granting permissions on the customer and sales tables in AWS Lake Formation, I get this error message: "Resource does not exist or requester is not authorized to access requested permissions."
Can someone help me?
It seems you have missed some of the previous steps. Your logged-in account should be added as a Lake Formation administrator. You might want to go through the Lake Formation configuration steps again.
Were you able to solve this issue?
Your videos are super useful in building my knowledge of AWS. Thank you, sir. Could you provide part 2 of this session?
Both parts are covered together in this video. You can see the two labs in the description.
Hi, can't we use native PySpark code instead of GlueContext? If there is a way, can you explain that as well?
You can use native PySpark code, but for that you'd have to use a Spark DataFrame instead of AWS Glue's DynamicFrame. DynamicFrame is better optimized for Glue ETL, as it works performantly with GlueContext methods. There are downsides too: the Spark DataFrame offers a wider variety of methods (Glue's methods are fairly limited). So if you can process your source data just with GlueContext methods, use a DynamicFrame; otherwise convert to a Spark DataFrame and use all the native PySpark methods.
superb workshop
But whenever I try to query this S3 data lake data using Athena, I get an error: "COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns". So please help me out with this; thanks in advance.
Hi, the first time I used the root user; when running the crawler, tables were created but not found anywhere. Then I created an IAM user, logged in, and retried, and I cannot create tables with the crawler. Can you tell me why?
Hello Ryan. The crawler runs with a role's authorization. The role should have permission on the S3 bucket where the data is, and also have create/alter permission on the database in the data lake. This permission could be one reason. The other reason could be a wrong configuration of the crawler. Many times, people (including me) configure the crawler to point at the actual data file. But the crawler should be configured for the folder where the data file is. This is another reason why the crawler would not be able to create the table.
Try these options and let me know if they worked for you. Otherwise we will find some way to help you fix it.
@@AWSTutorialsOnline Thanks a lot. I finally removed everything and recreated it, and it works. I will continue following your tutorial.
Can you dive deeper into Glue and PySpark code? E.g., do we need GlueContext even though PySpark scripts also work? What are the most important functions and actions? Thanks!
Great suggestion! Yeah will plan for it.
Can you please let me know if this is a programmatic error or related to the underlying infra? The error is: "An error occurred while calling o137.pyWriteDynamicFrame. Exception thrown in awaitResult"
What are you trying to do when you get this error?
Adding new columns in Redshift and subsequently updating the Glue code with the same.
@@sakinafakhri1320 can you please share the code snippet?
Can I use this method for upsert purposes?
Good one. How do you set up a Spark UI history server for Glue jobs using Docker to view them locally?
Sorry for coming back late. Do you want local monitoring of the Glue jobs?
Yes, monitoring and seeing the generated graphs, as we see on-prem.
Thank you for such concise information! I am totally new to Amazon Glue, so I just wanted to know: can we convert a CSV file to a .txt file?
When we talk about conversion, we talk about format conversion, like CSV to TSV to JSON to Parquet. So what would the format in the .txt file be?
Can you please tell me how to read data directly from S3 through a job with a Spark script?
To read data directly from S3, you can create a Glue job with a Python shell. In the job, you can use the Python Boto3 SDK to read data directly from the S3 bucket.
Have a look at this lab provided by AWS. It has example of Python Shell script which you are looking for - lakeformation.aworkshop.io/40-beginner/402-etl/4022-python-container.html
Thank you
can you explain how to do the row partition using config file in s3 and through glue job
Hi, this link will help - docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
I have done the workshops for part 1 and part 2. Thank you for helping us learn to build ETL jobs. This dojo is really useful for getting hands-on experience and the basic steps for building production ETL. I have a suggestion: after creating a developer endpoint, there is currently no option in AWS to create a SageMaker notebook. Could you add a video on how to use Zeppelin? Otherwise I will go through the AWS Glue Studio video.
It could be because of SageMaker notebook support in a particular AWS Region. Which region are you using?
@@AWSTutorialsOnline Paris eu-west-3
I have a doubt. When we run the crawler, the date field shows as a string. Is there a way to show the date as a date? Do we need to convert 'dd-mm-yyyy' to a timestamp?
Could you help me in doing the workshop? It is urgent for me.
@@divyathanikonda3666 I am also looking; my project also uses AWS Glue.
Why did you emphasize creating the S3 bucket in the Frankfurt region and not any other?
any region will work :)
Hi bro, everything was working normally until I tried to create my notebook using SageMaker, but the error you can see below appeared and I cannot create my notebook. What can I do?
Platform identifier (notebook-al1-v1) is not supported in this region
I would recommend you contact AWS Support for this.
@@AWSTutorialsOnline I'm going to try it in another region first.
Can we use another bucket to upload the data to after the Glue job is done?
Sorry, I could not get the question. Can you please explain what you want to achieve?
The same dataset was used to run 3 jobs with different worker counts using Glue 2.0:
G2.X with 10 workers executed in 15 minutes and consumed 4.97 DPU hours
G2.X with 149 workers executed in 15 minutes and consumed 76.52 DPU hours
G2.X with 149 workers executed in 13 minutes and consumed 62.98 DPU hours
If it's consuming more resources, it should reduce execution time, right? May I know the reason why this is happening?
Thanks for sharing. This is interesting. A couple of things to investigate:
1) Compare the first-run vs. second-run duration. The first run includes start-up time as well, and it can get higher for a high number of worker nodes.
2) The code may not be able to leverage parallelism. We would need to see the code to identify improvements, but the following links can help -
aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/
aws.amazon.com/blogs/big-data/optimizing-spark-applications-with-workload-partitioning-in-aws-glue/
aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
Hope it helps,
@@AWSTutorialsOnline The start-up time was similar for the 3 jobs, i.e. 7 secs.
And I did not run the jobs in parallel; it was in sequential order.
I was testing the jobs with different worker counts and got that result.
Can you help me understand why that happened?
Can you please share some details of the transformation you are performing? Possibly share the code. I can then try to investigate.
Could you help me in doing this workshop? It would help me to crack an interview.
Hello. Could you please show the steps for loading the data into a table which is part of the Data Catalog? Before loading, the target table should be emptied (truncated). Thanks in advance!
By table, do you mean a relational database table? Or a data file in S3?
@@AWSTutorialsOnline It's a table created using a crawler whose source is a S3 file.
@@swethadharmappa Then you can use the Python boto3 library to delete the S3 files before the data load.
@@AWSTutorialsOnline Thanks a lot! Tried using boto3 and was able to drop the table from the catalog (a truncate option was not available).
Can you do one combining AWS Wrangler, Glue, Athena & MWAA using boto3?
Sure - let me see how I can fit it all together :)
Hi sir, great video. How do I read data from a database (PostgreSQL) into Glue and then to S3?
Hi, please check this lab - aws-dojo.com/workshoplists/workshoplist33/
Hope it helps,
Any suggestions on how I can list all the tables under a DB in the Glue Data Catalog so I can iterate over each table individually?
You can use Glue API for that. Please check this out - docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-GetTables
Thanks @@AWSTutorialsOnline, any better way to do that from a Glue job?
There could be a better way, but that depends upon your requirement. Please tell me your requirement - why do you need to iterate over the tables in the database?
@@AWSTutorialsOnline My ETL job needs to be generic, so that if you specify a Catalog database name as a job arg, it should load all the tables from it and perform the needed transformations on them. In short, the same .py file should be reusable in multiple jobs without any code change.
OK, but the transformation per table would be specific, right? Or would it be standard?
How can we load deep learning trained model files in .pt format?
Hi, I have limited knowledge about .pt files, but I understand .pt files are used to package PyTorch models that can be served inside C++ applications. It seems we need C++ libraries to access a .pt file. Since Glue jobs support PySpark/Python and Scala only, I assume it would not be possible to access .pt files from the job.
@@AWSTutorialsOnline Thanks for your response. I have one question: I have a deep learning model PyTorch codebase; how can I integrate it with PySpark? I checked many resources online but can't figure it out, as most work with a CSV file, not a model. Any help regarding this, like any videos or documentation on deploying a PyTorch model into PySpark?
@@1UniverseGames Again, I'm not an expert in this area, but I understand how ML works in AWS. I recommend deploying your PyTorch model in AWS SageMaker as a SageMaker endpoint. This SageMaker endpoint can be called using Python in the Glue job.
I am unable to load tables into Lake Formation.
I followed all the steps and tried several times.
Could you help me with this?
What part are you not able to do? What errors are you getting?
@@AWSTutorialsOnline I created the crawler and, once it was done, clicked "Run crawler". Once it was ready, I checked the tables in the data lake database, but the tables were not there.
Could I get your mail ID?
@@divyathanikonda3666 Did you give the crawler role permission on the database to create tables?
@@AWSTutorialsOnline Yes, I did.
Thanks for the very informative tutorial. Do you have any tutorials on how a job can write data to Snowflake?
Hi - sorry, I don't have one for Snowflake, but I found one article which might help you - www.snowflake.com/blog/how-to-use-aws-glue-with-snowflake/
@@AWSTutorialsOnline Thanks. Also, are the sample datasets used in the workshop available?
Hi, I used two sample datasets - sales and customer. They are available for download in part 1 of the workshop. I am not sure if that is what you were asking for.
@@AWSTutorialsOnline Yes, I could not find them in the Data section
They are at step 4, where you create the S3 bucket and upload the data.