One of the rare videos across YouTube on Glue.
Thank you!!! I got so much important information from this video.
Glad it was helpful!
Best-ever tutorial on AWS Glue. Thank you very much :)
Glad it was helpful!
@@CalceyTech Do you have any courses on coding or how to write clean code? I love your code.
@@Rapidmotivationnow
Thanks for the kind words! We are a software development services provider, so maintaining high coding standards is key for us.
Here's a good guide with some tips that will help.
This is one of the books we can recommend - www.amazon.com/dp/0132350882/ref=emc_b_5_t
The articles, books, and other resources listed here are also highly recommended for a better understanding of best practices and industry standards in enterprise software development - martinfowler.com/tags/clean%20code.html
Exactly what I was looking for. Thanks!
Glad I could help
Thank you very much for the nice explanation; loved it. I'd like to request one more detailed video on Glue ETL with SCD Type 1 (merge/upsert) logic from S3 to S3.
Check out the aws-data-wrangler library released by AWS - github.com/awslabs/aws-data-wrangler. It covers many use cases, including upserts.
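For a quick idea of what an S3-to-S3 SCD Type 1 merge can look like with that library, here is a minimal sketch; the bucket paths and the user_id key column are placeholders, and this pandas-based flow suits small-to-medium datasets:
---------------------------------
# Minimal SCD Type 1 (merge/upsert) sketch with aws-data-wrangler (import name: awswrangler).
# Paths and the "user_id" key column are placeholders - adjust for your dataset.
import awswrangler as wr
import pandas as pd

target_path = "s3://my-bucket/curated/users/"                  # hypothetical target dataset
new_df = wr.s3.read_parquet("s3://my-bucket/incoming/users/")  # new/changed rows

try:
    current_df = wr.s3.read_parquet(target_path)
except Exception:  # first run: no target dataset exists yet
    current_df = pd.DataFrame(columns=new_df.columns)

# SCD Type 1: on a key collision the new row wins, old values are overwritten
merged = pd.concat([current_df, new_df]).drop_duplicates(subset=["user_id"], keep="last")

wr.s3.to_parquet(merged, path=target_path, dataset=True, mode="overwrite")
---------------------------------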
Is it possible to debug the code you developed in the video in VS Code or the PyCharm IDE by setting AWS creds locally in the IDE, instead of running it in the console?
Thank you. Nice, detailed, step-by-step walkthrough; liked it. Please upload more videos on AWS Glue and PySpark coding for various transformations...
Will upload soon
Yes, I second this.
Hi Manuka - This is a good video because normally no one touches the scripting part of it. Even the AWS documentation is missing that, so kudos to you for covering the grey area.
Also, I have been trying to port certain ETL use cases from Informatica to Glue, and I'm wondering if it makes sense to first create a DynamicFrame from the Glue catalog --> convert it to a DataFrame --> do all the transformations (trimming, date format conversion, creating new columns with case-statement logic, etc.) --> and finally convert it back to a DynamicFrame and write it back to the catalog table. Does that sound reasonable?
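In code, the flow described above looks roughly like this (a sketch only; the database, table, and column names are placeholders):
---------------------------------
# Rough sketch of the catalog -> DataFrame -> transform -> catalog round trip.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# 1. DynamicFrame from the Glue catalog
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# 2. Convert to a Spark DataFrame and apply the transformations
df = (dyf.toDF()
      .withColumn("name", F.trim(F.col("name")))                  # trimming
      .withColumn("created", F.to_date("created", "yyyy-MM-dd"))  # date format conversion
      .withColumn("tier",                                         # case-statement logic
                  F.when(F.col("score") > 80, "gold").otherwise("standard")))

# 3. Convert back to a DynamicFrame and write to the catalog table
out_dyf = DynamicFrame.fromDF(df, glue_context, "out_dyf")
glue_context.write_dynamic_frame.from_catalog(frame=out_dyf, database="my_db", table_name="my_table_out")
---------------------------------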
Awesome, man. Keep sharing, keep learning!
Thanks a ton
Hello, I have a question: how can you run a Glue job locally to make sure it works? I saw some articles about running Glue jobs on Docker.
Great video
This was very helpful
Thank You!!!
Hi, how do you create Informatica-style workflows in Glue?
Excellent, excellent video. There are not many good videos that explain basic AWS Glue programs.
I am pretty new to AWS Glue and am trying to create an upsert job that inserts or updates data in a Redshift table, with a CSV file on S3 as the source. Can you please post a video explaining how upserts work in Glue? Thanks in advance!
Hi Mayur,
Glad you enjoyed the video. We will definitely take your request into consideration.
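In the meantime, here is a rough sketch of the common staging-table pattern for upserting into Redshift from a Glue job; the connection name, table names, key column, and S3 paths are placeholders, not the exact solution from the video:
---------------------------------
# Load the CSV data into a staging table, then merge into the target via post-actions.
pre_actions = "DROP TABLE IF EXISTS public.target_staging;"
post_actions = """
    DELETE FROM public.target USING public.target_staging
        WHERE target.id = target_staging.id;
    INSERT INTO public.target SELECT * FROM public.target_staging;
    DROP TABLE public.target_staging;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=csv_dyf,                       # DynamicFrame read from the CSV on S3
    catalog_connection="redshift-conn",  # your Glue connection name
    connection_options={
        "dbtable": "public.target_staging",
        "database": "dev",
        "preactions": pre_actions,
        "postactions": post_actions,
    },
    redshift_tmp_dir="s3://my-bucket/tmp/")
---------------------------------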
Very helpful!
Hi, thanks for the great video. I have learned that AWS recently released a full setup to execute Glue jobs locally. It would be of great help if you could do the same setup in Visual Studio Code and create a video of it. There's no documentation or setup steps anywhere.
@sushant Thanks for the information. We could definitely give it a try. Stay updated with us...
I'm also looking for guidance on how to configure open-source JARs with a Glue job. Could you please make a video on it?
Thank you for covering the detailed info. Can you please create a UDF in AWS Glue - showing how to create it, register it, and then call it to execute?
Thanks for the suggestion!
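Until then, here is a minimal sketch of the create/register/call steps; the mask_email function, column names, and table name are all hypothetical examples:
---------------------------------
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
df = spark.createDataFrame([("alice@example.com",)], ["email"])  # toy data

# 1. Create: a plain Python function
def mask_email(email):
    return email.split("@")[0][:2] + "***" if email else None

# 2. Register for the DataFrame API and call it
mask_email_udf = udf(mask_email, StringType())
df = df.withColumn("masked_email", mask_email_udf(col("email")))

# 3. Or register for Spark SQL and call it there
df.createOrReplaceTempView("users")
spark.udf.register("mask_email", mask_email, StringType())
spark.sql("SELECT mask_email(email) AS masked_email FROM users").show()
---------------------------------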
Good work
Thank you! Cheers!
Very helpful.. just one thing: I have four SQL tables and want to create four parquet files, one per table. Should I create four Python script jobs, or handle it in one script file using a loop? Please advise.
You can simply run it as one job with a single script instead of running multiple scripts. Complete each source-table read and its parquet write one after another, so that a job failure doesn't lose all the data read so far.
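Something like this sketch (the table names and output path are placeholders):
---------------------------------
# One script: each table's read and parquet write completes before the next starts.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

for table in ["table_a", "table_b", "table_c", "table_d"]:
    dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name=table)
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f"s3://my-bucket/parquet/{table}/"},
        format="parquet")
---------------------------------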
@@CalceyTech Thanks for the reply.
One of my columns in MySQL is of the JSON datatype. How do I flatten it along with my other columns' data?
First, convert the DynamicFrame to a Spark DataFrame:
---------------------------------
datasource0 = datasource0.toDF()
---------------------------------
Then add a new column to the Spark DataFrame. Use a Spark user-defined function to extract the value from the JSON object:
---------------------------------
from pyspark.sql.functions import udf, col
getNewValues = udf(lambda val: val + 1)  # extract the value from your JSON string here instead of val + 1
datasource0 = datasource0.withColumn('New_Col_Name', getNewValues(col('existing_col')))
---------------------------------
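Alternatively, Spark's built-in get_json_object avoids the Python UDF overhead; continuing from the DataFrame above (the column name and JSON path are assumed examples):
---------------------------------
from pyspark.sql.functions import get_json_object, col
# '$.address.city' is an example path - use your JSON column's actual structure
datasource0 = datasource0.withColumn("city", get_json_object(col("json_col"), "$.address.city"))
---------------------------------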
Let's say my data is residing in PostgreSQL ... how can you connect it to Glue and then to S3?
Really useful tutorial! Thank you! Just wondering what I should do if I need to import another Python script, e.g. `import script2` - how should I set that up in the job config? I've tried storing the script in an S3 bucket and adding the location under 'Security configuration, script libraries, and job parameters (optional)' -> 'Python library path', but it gave me the error `ModuleNotFoundError: No module named 'script2'`. Does anyone know how to fix this? Thanks.
If your Python script is just a single Python file (ibb.co/HB6grhL), upload it to the S3 bucket and add the S3 path as the Glue job's Python library path (ibb.co/PgPrLCx).
First, make sure your Glue job's IAM role has access to S3. Then add import statements at the top of your job script to use the definitions from the external script file (ibb.co/23Kx0vp).
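After that, the job script can import the uploaded file by its module name; a sketch, where script2.py and some_helper are hypothetical names:
---------------------------------
import script2                 # the uploaded script2.py, imported without the .py extension

value = script2.some_helper()  # hypothetical function defined inside script2.py
---------------------------------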
@@CalceyTech Thank you very much
How would this be different if I created the job with the type set to Python shell? Can you demonstrate that as well?
You can use the Python shell type for general-purpose Python scripts. These jobs let you schedule and run tasks that don't require an Apache Spark environment.
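A Python shell job script is just plain Python, typically with boto3; a minimal sketch (the bucket and key names are placeholders):
---------------------------------
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="incoming/report.csv")
line_count = obj["Body"].read().decode("utf-8").count("\n")
print(f"report.csv has {line_count} lines")
---------------------------------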
I've been writing my ETL scripts in the AWS web console. How can I do this in the VS Code IDE like you? I'm pretty new to programming.
Hey Marlon, there's no direct way to deploy your Python script from an IDE on your local machine. What I've done is create a workspace in VS Code with the AWS Glue Python library files, which gives us the advantage of IntelliSense. Then, once the implementation is done, I just copy and paste the script into the AWS Glue console.
@@CalceyTech Hi, thanks for the reply! That's what I meant to ask, sorry. My actual question now is: how can I get the awsglue library in VS Code in order to get the IntelliSense advantage? I tried `pip install awsglue`...
@@marlonholland955 First, clone the AWS Glue Python library repository - github.com/awslabs/aws-glue-libs. Then copy the awsglue folder from the cloned content into your workspace. Finally, create a new Python file (custom-script.py) within the workspace. After that, you'll be able to use Python imports from AWS Glue within your custom script files in the workspace.
Do you have a tutorial on how to set up the local environment? Getting the awsglue package, etc. Thanks!
Do you have an answer to this?
@@95SUJITH I'm trying to install the environment on Win10, but no luck... (github.com/awslabs/aws-glue-libs/issues/82)
@@marcin2x4 I think you can only run it on Linux.
Not yet! Coming soon!
It seems that once aws-glue-libs is installed, Glue scripts need to be placed there as well. For me, this fixed the ModuleNotFoundError even though everything was installed.
Hi Manuka,
I need to copy JSON file data from AWS Elasticsearch to an S3 bucket using Glue. Can you please help me with that?
You should be able to achieve it using the same flow as in this tutorial, with some changes to the data extraction step. Use a JDBC driver for Elasticsearch instead.
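As a very rough sketch of what that read could look like: the JDBC URL format and driver class name depend on the Elasticsearch JDBC driver version you attach via the job's dependent JARs path, so both are assumptions here.
---------------------------------
# Generic JDBC read sketch; URL and driver class are assumptions for your driver version.
# spark is the Glue job's glueContext.spark_session.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:es://https://my-es-host:9200")              # assumed URL format
      .option("driver", "org.elasticsearch.xpack.sql.jdbc.EsDriver")   # assumed driver class
      .option("dbtable", "my_index")
      .load())

df.write.json("s3://my-bucket/es-export/")  # write out as JSON files on S3
---------------------------------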
Hi Manuka, can you make a video on moving Glue code from dev to QA and prod?
Hi there,
We will tackle this in one of our upcoming videos, so please make sure to follow us on YouTube.
Hi Friend, great work. I have one question for you.
I have to get data from 2 tables with a join, e.g. users_table and users_add_table (which have a one-to-one mapping), joined on user_id.
Which of the following is the best way?
1. Get users_table_df and users_add_table_df, then Join.apply on user_id to get the final DataFrame
2. glueContext.read.format("jdbc")
   .option("driver", jdbc_driver_name)..........
   .option("dbtable", YOUR_QUERY)
In the 2nd approach, I have written the SQL join in YOUR_QUERY.
I think the 2nd option is better performance-wise. The DataFrame approach runs two separate DB queries, pulls both tables into Spark, and computes the join there with a lot of intermediate objects.
In the SQL approach, the join is pushed down and executed inside the database, so only the final joined result is transferred.
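A sketch of the 2nd option, pushing the join down by passing an aliased subquery as "dbtable" (the URL, credentials, and driver are placeholders):
---------------------------------
query = """(SELECT u.*, a.city
              FROM users_table u
              JOIN users_add_table a ON u.user_id = a.user_id) AS joined"""

df = (spark.read.format("jdbc")                        # spark = glueContext.spark_session
      .option("url", "jdbc:mysql://my-host:3306/my_db")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("user", "my_user")
      .option("password", "***")
      .option("dbtable", query)
      .load())
---------------------------------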
Hi, I am trying to perform ETL in Glue and I am using the snowflake-connector-python module. It shows a module error because it cannot import the module. Can you please tell me how I can use custom Python libs in Glue?
Thanks
Hi Vanden, the Snowflake community blog provides several examples of how to use their Python connector and JDBC connector in an AWS Glue job. You'll find the proper ways to do it and discussions of issues with their modules:
community.snowflake.com/s/article/AWS-Glue-Job-in-Python-Shell-using-Wheel-and-Egg-files
community.snowflake.com/s/article/How-To-Use-AWS-Glue-With-Snowflake
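Once the connector wheel/egg is attached per the first article, basic usage in a Python shell job looks roughly like this (the account and credentials are placeholders):
---------------------------------
import snowflake.connector

conn = snowflake.connector.connect(
    user="MY_USER", password="***", account="my_account.eu-west-1")
cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
conn.close()
---------------------------------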
I am new to AWS Glue. Can you please create a video on how to get the AWS Glue lib into the local VS Code IDE?
First, clone github.com/awslabs/aws-glue-libs. Open an empty workspace in VS Code. Then copy the "awsglue" folder from the cloned repository into the VS Code workspace, as I did in the video.
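With the awsglue folder sitting in the workspace root, imports like these should resolve in the editor (this gives IntelliSense only; actually running the script still needs a Spark setup or the Glue console):
---------------------------------
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping
from awsglue.job import Job
---------------------------------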
@@CalceyTech Thank you for your reply. Yes, I have done that and am getting an error: "ModuleNotFoundError: No module named 'dynamicframe'". Do I need to install the Spark distribution locally? I have already installed the PySpark client.
Thanks for the video, dude. Where can I download the JAR file? Could you please comment the link?
You can download it from here: search.maven.org/artifact/mysql/mysql-connector-java/8.0.15/jar
@@CalceyTech Thank you. I am still getting an error: "An error occurred while calling o70.load. Communication link failure". Any suggestions?
Hello bro, I have created the job and updated the script as per your tutorial, but I am getting an error saying "connection timed out". Please see the full error message: "com.amazon.support.exceptions.GeneralException: [Amazon](500150) Error setting/closing connection: Connection timed out." ..... Please advise what else I missed. Thanks.
I am also getting the same error. Can anyone tell me what to do?
Actually, this issue is not related to your Glue script. It seems the AWS environment running your Glue job cannot connect to your external or internal database with the DB information you provided in the script - typically a network access problem (security groups, VPC configuration, or firewall). Because of that, the connection gets cut off with a timeout error.
I want the data to be picked up from my on-premises DB and then put into an on-premises DB.
AWS Glue ETL jobs can interact with a variety of data sources inside and outside of the AWS environment. For optimal operation in a hybrid environment, AWS Glue might require an additional network, firewall, or DNS configuration. Have a look: aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
Can you offer training?
Hi @reddy, since we are a customer-centric software development company, we can't really focus on that right now, but don't hesitate to ask if you need anything clarified.
Hi, can you please share the site URL to get the Python script?
I used AWS Glue for the demo, and the code was written in AWS Glue's script editor. Here are the references to follow:
AWS Glue - aws.amazon.com/glue/
AWS Glue Labs Git - github.com/awslabs/aws-glue-libs
AWS Glue PySpark Extension - docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-extensions.html
Apache Spark - JDBC Data Sources - spark.apache.org/docs/latest/sql-data-sources-jdbc.html
This code is not visible, please share the code.
Hi Surendra, you can find the code in the attached link:
gist.github.com/manukaprabath95/72816c32b3f0fcadc5260180f39889d0