AWS Tutorials - Using Concurrent AWS Glue Jobs
- Published: 2 Jul 2024
- Script Example - github.com/aws-dojo/analytics...
Using concurrent Glue job runs to ingest data at scale is a very scalable and maintainable approach. Learn how to configure and run Glue jobs for concurrent execution.
Thanks a lot, everything is presented in as simple a format as possible for us to understand.
Pleased to return again, this time to point out an additional limitation to take into account: the IP addresses available in the VPC. The Glue job occupies IPs on managed EC2 instances, and if there are not enough free IPs the job will fail, so it is important to verify the available IPs before parallelizing.
I agree. Glue will occupy IPs only if you are working with VPC-based resources.
Excellent explanation. I'm working on a similar use case, but I need to run the same job multiple times for the same table (writing to different partitions). The problem I'm facing is that the moment one of the parallel runs finishes, it wipes the temporary directory Spark created in the table directory, deleting the temp data of the other runs still writing to the same table. That results in data loss, because those other runs are still in progress when the first run to complete deletes the Spark temp data. Do you have a solution to that problem?
Nice tutorial, thanks.
Can it run in sequence? I want to run the jobs with different parameters, but I want the second job to run only after the first one finishes, like a queue. Or must we set max concurrency to 1 and handle the retry ourselves if a max-concurrency error occurs?
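A minimal sketch of that queue-like pattern, assuming boto3 and a hypothetical job name `my-etl-job`: start one run, poll until it reaches a terminal state, then start the next. This avoids relying on max-concurrency errors and retries.

```python
import time

# States in which a Glue job run has finished (successfully or otherwise)
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def is_terminal(state: str) -> bool:
    """True once a Glue job run has finished, whatever the outcome."""
    return state in TERMINAL_STATES

def run_jobs_sequentially(job_name, param_sets, poll_seconds=30):
    """Start one run per parameter set, waiting for each run to finish
    before starting the next (a simple queue)."""
    import boto3  # imported lazily so the helper above works without AWS deps
    glue = boto3.client("glue")
    for args in param_sets:
        run_id = glue.start_job_run(JobName=job_name, Arguments=args)["JobRunId"]
        while True:
            run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
            if is_terminal(run["JobRunState"]):
                break
            time.sleep(poll_seconds)

# Usage (would start real AWS job runs, so shown commented out):
# run_jobs_sequentially(
#     "my-etl-job",  # hypothetical job name
#     [{"--source_table": "orders"}, {"--source_table": "customers"}],
# )
```

With this pattern you can leave max concurrency at 1, since runs never overlap by construction.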
Hi, I have a Lambda function where I pass a list of tables, and the Lambda triggers a Glue job.
The Glue job is configured with 2 workers and max concurrency = 1.
Later I saw that only one element (one table in the list passed to the Lambda) gets executed.
What is the reason for that?
Will it cost more if I increase concurrency?
In this case, is it important to keep max concurrency equal to the length of the list (the number of elements in the Python list)? If not, what is the best approach so that the Glue job processes all the table elements in the list passed from Lambda?
FYI, I am storing results in an S3 bucket.
Please do reply.
Thanks in advance :)
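One way to make that Lambda fan out is to start one Glue job run per table instead of passing the whole list to a single run; a sketch, assuming boto3, a hypothetical job name `my-ingest-job`, and hypothetical job parameters `--source_table` and `--target_path`. Note that with max concurrency = 1, only the first `start_job_run` succeeds and the rest raise `ConcurrentRunsExceededException`, so max concurrency must be at least the number of runs you expect to have in flight at once.

```python
def build_run_arguments(table: str, bucket: str) -> dict:
    """Glue job arguments for one table (parameter names are hypothetical)."""
    return {
        "--source_table": table,
        "--target_path": f"s3://{bucket}/cleansed/{table}/",
    }

def lambda_handler(event, context):
    """Start one Glue job run per table in event["tables"]."""
    import boto3  # imported lazily so the helper above works without AWS deps
    glue = boto3.client("glue")
    run_ids = []
    for table in event["tables"]:
        resp = glue.start_job_run(
            JobName="my-ingest-job",  # hypothetical job name
            Arguments=build_run_arguments(table, event["bucket"]),
        )
        run_ids.append(resp["JobRunId"])
    return {"run_ids": run_ids}
```

On cost: concurrency itself is not billed; you pay per DPU-hour per run, so N concurrent runs cost roughly the same as the same N runs executed one after another.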
Hello, thank you for the tutorial. It is fantastic as always. Regarding where the concurrent (parallel) job runs actually execute: do those runs share one serverless Glue compute cluster, or do they use multiple serverless Glue compute clusters? If it is the former, it is concurrent but not pure parallelization. If it is the latter, then the Glue job we are creating acts as a job definition, which can be deployed across multiple serverless compute clusters in parallel (within the max concurrency)?
It is like one job definition that can run as more than one instance at the same time, no matter how you start the job.
Hi Sir,
Is it possible to enable job bookmarks for concurrent runs of a single script orchestrated with Step Functions?
What happens if the Python script itself uses multiprocessing to achieve concurrency?
It turns out you can't really use Glue Workflows to run them in parallel. When you try to add a job multiple times in different nodes of the workflow, it throws an error that the "action contains duplicate job name", which prevents adding the same job more than once, whether in sequence or in parallel. Really silly, since Glue inherently allows concurrent runs. Luckily Step Functions works fine, but it's disappointing that Glue Workflows don't support this natively. Maybe I'm doing something wrong?
Thank you for the video by the way! It was really informative
You are right. Unfortunately, a workflow does not allow running the same job in parallel.
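For reference, the Step Functions alternative can be a `Map` state over the table list using the synchronous Glue integration; a sketch of the state machine definition, with a hypothetical job name, parameter name, and input shape (`{"tables": ["orders", "customers"]}`):

```json
{
  "StartAt": "RunGlueJobsInParallel",
  "States": {
    "RunGlueJobsInParallel": {
      "Type": "Map",
      "ItemsPath": "$.tables",
      "MaxConcurrency": 4,
      "Iterator": {
        "StartAt": "StartGlueJob",
        "States": {
          "StartGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
              "JobName": "my-ingest-job",
              "Arguments": {
                "--source_table.$": "$"
              }
            },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

The `MaxConcurrency` on the `Map` state should not exceed the Glue job's own max-concurrency setting, or some iterations will fail with concurrent-runs errors.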
I got some idea of what max concurrency = 4 is for.
Based on your example, you still need to create multiple runs (more precisely, 200 runs of one Glue job), since you set "Source Table Name" and "Target Table Name" on the same Glue job.
Basically, you can group runs under one job by increasing max concurrency, but you still need to start 200 runs of that job.
And you can still share one code base across those 200 runs.
I really appreciate your video.
It helped me get an idea of what the parameter is for.
Thank you.
Yeah. It is one code base and configuration for the job, but you are running multiple instances of it with different parameters.
Nice tutorial. One question here: how do you configure the Glue job to run multiple SQL queries in parallel instead of reading from multiple tables?
I think you are looking for this one - ruclips.net/video/QH1Jc9Wrp9Y/видео.html
@@AWSTutorialsOnline Thanks brother, I will check and let you know.
@@AWSTutorialsOnline Thanks. I looked into it, and it seems that video explains how to have parallel runs corresponding to one column. But what I need is to pass a SQL query as a job parameter, and via that parameter pass more than one SQL query, either through the CLI or a Step Function.
For example, my job concurrency is 2, so the job should run in parallel with queries like "select * from emp inner join students where std_id = 5" and "select * from emp inner join class where class_id = 10", and write the results to their respective S3 locations.
I also have a solution where I can run more than one SQL query inside my Glue job, but that approach works sequentially, not in parallel.
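A sketch of that driver side, assuming boto3, a hypothetical job name `my-sql-job`, and hypothetical job parameters `--sql_query` and `--output_path`: start one concurrent run per query (which requires the job's max concurrency to be at least the number of queries).

```python
def build_query_args(sql: str, output_path: str) -> dict:
    """Job arguments for one run (parameter names are hypothetical)."""
    return {"--sql_query": sql, "--output_path": output_path}

QUERIES = [
    ("select * from emp inner join students where std_id = 5",
     "s3://my-bucket/results/students/"),
    ("select * from emp inner join class where class_id = 10",
     "s3://my-bucket/results/class/"),
]

def start_parallel_query_runs(job_name="my-sql-job"):
    """Start one Glue job run per query; runs execute concurrently."""
    import boto3  # imported lazily so the helper above works without AWS deps
    glue = boto3.client("glue")
    return [
        glue.start_job_run(
            JobName=job_name,
            Arguments=build_query_args(sql, path),
        )["JobRunId"]
        for sql, path in QUERIES
    ]
```

Inside the Glue script, the parameters would then be read with `getResolvedOptions(sys.argv, ["sql_query", "output_path"])` and the query executed with `spark.sql(...)`, each run writing to its own S3 location.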
Nice tutorial, I just made 5 jobs, but I will try the third approach. My doubt is: what happens when the size of the table is variable? Can the number of workers change?
I don't think you can change job capacity at run time when calling the job from a Glue Workflow or Step Function. However, if you are starting the job via the CLI or code, you do have the opportunity to change the allocated capacity, max capacity, and worker type.
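For the CLI/code path, `start_job_run` accepts per-run `WorkerType` and `NumberOfWorkers` overrides; a sketch with a made-up sizing heuristic (one G.1X worker per ~10 GB is an assumption for illustration, not a recommendation):

```python
import math

def workers_for(table_size_gb: float, min_workers: int = 2,
                max_workers: int = 20) -> int:
    """Rough sizing heuristic (assumed): ~one G.1X worker per 10 GB,
    clamped between min_workers and max_workers."""
    return max(min_workers, min(max_workers, math.ceil(table_size_gb / 10)))

def start_sized_run(job_name, table, table_size_gb):
    """Override the job's configured capacity for this one run only."""
    import boto3  # imported lazily so the helper above works without AWS deps
    glue = boto3.client("glue")
    return glue.start_job_run(
        JobName=job_name,
        Arguments={"--source_table": table},      # hypothetical parameter
        WorkerType="G.1X",                        # per-run override
        NumberOfWorkers=workers_for(table_size_gb),  # per-run override
    )["JobRunId"]
```

The job's configured capacity stays the default; the overrides apply only to the run being started.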
Thank you for this video, very insightful. How does this work with job bookmarks (transformation_ctx)?
Can you please clarify: I have 15 datasets in one of the sources. How do I run concurrent runs from the raw layer to the cleansed layer, where the script may differ based on DQ? In this scenario, how do I run a concurrent job?
Is each job doing the same things between the raw and cleansed layers?
@@AWSTutorialsOnline yes
You are implementing this through Step Functions; can you please suggest how to do concurrent runs in a Glue Workflow?
But there is a drawback here in terms of pricing. Let's say you have 20 tables and you run them concurrently, and each job finishes in 1 minute. G.1X bills a minimum of 10 minutes, so you will pay for 20 × 10 minutes instead of 20 × 1 minute.
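That math can be sketched as below, assuming the $0.44 per DPU-hour list price (check your region) and the 10-minute billing minimum, which applies to older Glue versions; Glue 2.0 and later bill a 1-minute minimum, which largely removes this drawback.

```python
DPU_HOUR_RATE = 0.44   # USD per DPU-hour (assumed list price; varies by region)
DPUS_PER_G1X = 1       # one G.1X worker corresponds to 1 DPU

def run_cost(runtime_min, workers, billing_min_minutes):
    """Cost of one run: billed minutes (with minimum) x workers x DPU rate."""
    billed = max(runtime_min, billing_min_minutes)
    return billed / 60 * workers * DPUS_PER_G1X * DPU_HOUR_RATE

# 20 concurrent runs, each finishing in 1 minute, 2 G.1X workers each
with_10min_minimum = 20 * run_cost(1, 2, billing_min_minutes=10)
with_1min_minimum = 20 * run_cost(1, 2, billing_min_minutes=1)
print(round(with_10min_minimum, 2))  # → 2.93 (billed as 10 min per run)
print(round(with_1min_minimum, 2))   # → 0.29 (billed as 1 min per run)
```

So under a 10-minute minimum, many tiny concurrent runs are indeed billed at 10× their actual runtime; batching several small tables into one run is one way around it.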
Hello Sir,
In the case of concurrent runs, how are the resources shared between the different runs?
Each run is allocated the same capacity as configured in the job.
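In other words, capacity multiplies rather than being shared; a trivial illustration:

```python
def total_workers(workers_per_run: int, concurrent_runs: int) -> int:
    """Each concurrent run gets its own full allocation; nothing is shared,
    so total provisioned workers scale linearly with the number of runs."""
    return workers_per_run * concurrent_runs

# A job configured with 10 workers, running 3 concurrent instances:
print(total_workers(10, 3))  # → 30 workers provisioned in total
```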