AWS Tutorials - Using Concurrent AWS Glue Jobs

  • Published: 2 Jul 2024
  • Script Example - github.com/aws-dojo/analytics...
    Using concurrent Glue job runs to ingest data at scale is a scalable and maintainable approach. Learn how to configure and run Glue jobs for concurrent execution.
  • Science

Comments • 30

  • @trinath89
    @trinath89 1 year ago

    Thanks a lot; everything is put in as simple a format as possible for us to understand.

  •  2 years ago +1

    Pleased to return again, this time to point out an additional limitation to take into account: the IP addresses available in the VPC. The Glue job occupies EC2 instances, and if there are not enough IPs the job will crash, so it is important to verify the available IPs before parallelizing.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I agree. Glue will occupy IPs only if you are working with VPC-based resources.

  • @ashishvishwakarma8790
    @ashishvishwakarma8790 1 year ago +1

    Excellent explanation. I'm working on a similar use case; however, I need to run the same job multiple times for the same table (writing to different partitions). The problem I'm facing is that the moment one of the many parallel job runs finishes, it wipes the temporary directory (created by Spark) in the table directory. That deletes the temp data of the other runs still writing to the same table, which results in data loss, because those runs were still in progress when the first job to complete deleted the temp data. Do you have a solution to that problem?
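
    One common workaround for this collision (a sketch under assumptions, not from the video; the --partition_value argument and all paths are hypothetical) is to give each concurrent run its own output prefix, so Spark's shared _temporary directory is never created under the same path by two runs:

      import sys
      from awsglue.utils import getResolvedOptions
      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      # Hypothetical job argument: each concurrent run receives its own partition value.
      args = getResolvedOptions(sys.argv, ["JOB_NAME", "partition_value"])

      glue_context = GlueContext(SparkContext.getOrCreate())
      spark = glue_context.spark_session

      df = spark.read.parquet("s3://my-bucket/staging/input/")  # assumed input location

      # Writing directly to this run's partition path (instead of the shared table
      # root) gives each run its own _temporary directory, so a run that finishes
      # first cannot delete temp files belonging to runs still in progress.
      output_path = f"s3://my-bucket/my-table/dt={args['partition_value']}/"
      df.write.mode("overwrite").parquet(output_path)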

  • @rtzkdt
    @rtzkdt 2 months ago

    Nice tutorial, thanks.
    Can it run in sequence? I want to run the jobs with different parameters, but I want the second job to run after the first one has finished, like a queue. Or must we set max concurrency to 1 and handle the retry ourselves if a max-concurrency error occurs?
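
    A minimal boto3 sketch of the queue-style alternative (the job name and arguments are hypothetical): start one run, poll it to completion, then start the next, instead of relying on max-concurrency errors and retries:

      import time
      import boto3

      glue = boto3.client("glue")

      JOB_NAME = "my-glue-job"  # hypothetical job name
      runs = [{"--table_name": "orders"}, {"--table_name": "customers"}]

      for arguments in runs:
          run_id = glue.start_job_run(JobName=JOB_NAME, Arguments=arguments)["JobRunId"]
          # Poll until this run finishes before starting the next one (a simple queue).
          while True:
              state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
              if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
                  break
              time.sleep(30)
          if state != "SUCCEEDED":
              raise RuntimeError(f"Run {run_id} ended in state {state}")

    A Step Functions state machine with a sequential chain of glue:startJobRun.sync tasks achieves the same thing without polling code.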

  • @pulkitdikshit9474
    @pulkitdikshit9474 1 year ago

    Hi, I have a Lambda function where I pass a list of tables, and the Lambda triggers a Glue job.
    The Glue job is configured with 2 workers and max concurrency = 1.
    Later I saw that only one element (one table in the list of tables passed to the Lambda) gets processed.
    What is the reason for it?
    Will it cost more if I increase concurrency?
    In this case, is it important to keep max concurrency equal to the length of the list (the number of elements in the Python list)? If not, what is the best approach so that the Glue job processes all the table elements in the list passed from the Lambda?
    FYI, I am storing the results in an S3 bucket.
    Please do reply.
    Thanks in advance :)
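
    For the pattern described above, one approach (a sketch with hypothetical names, not the video's code) is to have the Lambda start one job run per table, which requires the job's max concurrency to be at least the number of tables:

      import boto3

      glue = boto3.client("glue")

      # Hypothetical job name; max concurrency must be >= len(tables), otherwise
      # start_job_run raises ConcurrentRunsExceededException for the extra runs.
      JOB_NAME = "my-glue-job"

      def lambda_handler(event, context):
          tables = event["tables"]  # e.g. ["orders", "customers", "products"]
          run_ids = []
          for table in tables:
              response = glue.start_job_run(
                  JobName=JOB_NAME,
                  Arguments={"--table_name": table},  # one run per table
              )
              run_ids.append(response["JobRunId"])
          return {"started_runs": run_ids}

    On cost: total spend is driven by DPU-time consumed across all runs, so higher concurrency mainly shortens wall-clock time rather than multiplying cost (billing minimums aside).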

  • @hsz7338
    @hsz7338 2 years ago +1

    Hello, thank you for the tutorial. It is fantastic as always. Where do the actual concurrent (parallel) job runs execute: in one serverless Glue compute cluster or in multiple serverless Glue compute clusters? If it is the former, it is concurrent but not pure parallelisation. If it is the latter, then the Glue job we create acts as a job definition, and that job definition can be deployed across multiple serverless compute clusters in parallel (within the max concurrency)?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      It is one job definition that can run as more than one instance at the same time, no matter how you start the job.
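
      A small illustration of that point (hypothetical job name and arguments): the same job definition started twice yields two independent run instances, each with its own run ID and its own compute capacity.

        import boto3

        glue = boto3.client("glue")

        # One job definition, multiple simultaneous instances.
        run_a = glue.start_job_run(JobName="my-glue-job", Arguments={"--table_name": "orders"})
        run_b = glue.start_job_run(JobName="my-glue-job", Arguments={"--table_name": "customers"})

        print(run_a["JobRunId"], run_b["JobRunId"])  # two distinct run IDs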

  • @zubinbal1880
    @zubinbal1880 3 months ago

    Hi Sir,
    Is it possible to enable a job bookmark for concurrent runs of a single script orchestrated through a Step Function?

  • @MahimDashoraHackR
    @MahimDashoraHackR 10 months ago

    What happens if the Python script itself uses multiprocessing to achieve concurrency?

  • @victorgueorguiev6500
    @victorgueorguiev6500 2 years ago +2

    It turns out you can't really use Glue Workflows to run them in parallel. When you try to add a job multiple times as different nodes in the workflow, it throws an error that the "action contains duplicate job name", which prevents adding the same job more than once, whether in sequence or in parallel. Really silly, since Glue inherently lets you have concurrent runs. Luckily Step Functions works fine, but it is really disappointing that Glue doesn't natively support this in Workflows. Maybe I'm doing something wrong?

    • @victorgueorguiev6500
      @victorgueorguiev6500 2 years ago

      Thank you for the video by the way! It was really informative

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      You are right. Unfortunately, a workflow does not allow running the same job in parallel.
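
      Since Step Functions does support this, here is a hedged sketch of a state machine definition (Amazon States Language built as a Python dict; the job name, input shape, and MaxConcurrency value are assumptions) that fans the same Glue job out over a list of tables:

        import json

        definition = {
            "StartAt": "RunGlueJobs",
            "States": {
                "RunGlueJobs": {
                    "Type": "Map",
                    "ItemsPath": "$.tables",  # input: {"tables": ["orders", ...]}
                    "MaxConcurrency": 4,      # parallel fan-out limit
                    "Iterator": {
                        "StartAt": "StartJobRun",
                        "States": {
                            "StartJobRun": {
                                "Type": "Task",
                                # .sync makes the task wait for the Glue run to finish.
                                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                                "Parameters": {
                                    "JobName": "my-glue-job",
                                    "Arguments": {"--table_name.$": "$"},
                                },
                                "End": True,
                            }
                        },
                    },
                    "End": True,
                }
            },
        }

        print(json.dumps(definition, indent=2))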

  • @gatsbylee2773
    @gatsbylee2773 2 years ago +2

    I got some idea of what max concurrency = 4 is for.
    Based on your example, you still need to start multiple executions of the AWS Glue job (more precisely, 200 runs of one job), since you set "Source Table Name" and "Target Table Name" on the same Glue job.
    Basically, you can group runs within a job by increasing max concurrency, but you still need to start 200 runs of the job.
    And you can still share one code base across the 200 runs.
    I really appreciate your video.
    It helped me get an idea of what the parameter is for.
    Thank you.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Yeah. It is one code base and one configuration for the job, but you are running multiple instances of it with different parameters.

  • @sheikirfan2652
    @sheikirfan2652 1 year ago +1

    Nice tutorial. One question here: how do I configure the Glue job to run multiple SQL queries in parallel instead of reading from multiple tables?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      I think you are looking for this one - ruclips.net/video/QH1Jc9Wrp9Y/видео.html

    • @sheikirfan2652
      @sheikirfan2652 1 year ago

      @@AWSTutorialsOnline Thanks, brother. I will check and let you know.

    • @sheikirfan2652
      @sheikirfan2652 1 year ago

      @@AWSTutorialsOnline Thanks. I looked into it, and it seems that video explains how we can have parallel runs corresponding to one column. But what I need is something like passing the SQL query as a job parameter, and through that job parameter passing more than one SQL query, either via the CLI or a Step Function.
      For example, my job concurrency is 2, so the job should run in parallel with queries like "select * from emp inner join students where std_id = 5" and "select * from emp inner join class where class_id = 10" and fetch the results into the respective locations (S3 locations).

    • @sheikirfan2652
      @sheikirfan2652 1 year ago

      I also have a solution where I can run more than one SQL query within my Glue job, but that approach works sequentially, not in parallel.
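
      One way to get what is described above (a sketch under assumptions, not from the video: the --sql_query and --output_path arguments are hypothetical, and the referenced tables are assumed to be in the Glue Data Catalog with --enable-glue-datacatalog set on the job) is a script that takes the query as a parameter and is started once per query, up to the job's max concurrency:

        import sys
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        # Each concurrent run receives its own query and output location,
        # e.g. from two start-job-run calls via the CLI or Step Functions.
        args = getResolvedOptions(sys.argv, ["JOB_NAME", "sql_query", "output_path"])

        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        result = spark.sql(args["sql_query"])  # runs this instance's query
        result.write.mode("overwrite").parquet(args["output_path"])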

  •  2 years ago +1

    Nice tutorial; I just made 5 jobs, but I will try the third approach. My doubt is what happens when the size of the table is variable... can the number of workers change?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I don't think you can change job capacity at run time when the job is called from a Glue Workflow or Step Function. However, if you are calling the job using the CLI or code, then you do have the opportunity to change the allocated capacity, max capacity, and worker type.
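
      A minimal boto3 sketch of that per-run override (the job name and the sizing rule are hypothetical): start_job_run accepts WorkerType and NumberOfWorkers, so capacity can be chosen per run based on table size.

        import boto3

        glue = boto3.client("glue")

        def start_run_sized_for(table_name: str, approx_gb: int) -> str:
            # Override the job's configured capacity for this run only.
            workers = 10 if approx_gb > 100 else 2  # hypothetical sizing rule
            response = glue.start_job_run(
                JobName="my-glue-job",
                Arguments={"--table_name": table_name},
                WorkerType="G.1X",
                NumberOfWorkers=workers,
            )
            return response["JobRunId"]

        print(start_run_sized_for("orders", approx_gb=250))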

  • @anti2117
    @anti2117 1 year ago

    Thank you for this video, very insightful. How does this work with job bookmarks (transformation_ctx)?
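
    The thread does not answer this, but one commonly used pattern (an assumption, not from the video; database and table names are hypothetical) is to fold the run's parameter into the transformation_ctx so each parameterized run tracks its own bookmark state:

      import sys
      from awsglue.utils import getResolvedOptions
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_name"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)  # required for bookmarks to be tracked

      # Including the table name in transformation_ctx keeps a separate bookmark
      # entry per table, so parameterized runs don't share bookmark state.
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="my_database",
          table_name=args["table_name"],
          transformation_ctx=f"read_{args['table_name']}",
      )

      # ... transform and write dyf ...

      job.commit()  # bookmarks advance only when the run commits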

  • @veerachegu
    @veerachegu 2 years ago +1

    Can you please clarify: I have 15 datasets in one source. How do I run concurrent runs from the raw layer to the cleansed layer when the script may be different based on DQ? In this scenario, how do I run a concurrent job?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Is each job doing the same things between the raw and cleansed layers?

    • @veerachegu
      @veerachegu 2 years ago

      @@AWSTutorialsOnline yes

    • @veerachegu
      @veerachegu 2 years ago

      You are implementing this through a Step Function; can you please suggest how to do a concurrent run in a Glue Workflow?

  • @IranianButterfly
    @IranianButterfly 2 years ago

    But there is a drawback here in terms of pricing. Let's say you have 20 tables and you run them with concurrency, and each job finishes in 1 minute. G.1X bills for a minimum of 10 minutes, which means you will pay for 20 × 10 minutes instead of 20 × 1 minute.
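
    A back-of-the-envelope check of that arithmetic (the $0.44 per DPU-hour rate and the 10-minute minimum are assumptions; newer Glue versions bill a 1-minute minimum, and a G.1X worker counts as 1 DPU):

      PRICE_PER_DPU_HOUR = 0.44  # assumed Glue rate
      WORKERS = 2                # DPUs per run (G.1X => 1 DPU per worker)
      RUNS = 20

      billed_minutes = RUNS * 10  # each 1-minute run billed at the 10-minute minimum
      actual_minutes = RUNS * 1   # what the runs actually took

      for label, minutes in [("billed (10-min minimum)", billed_minutes),
                             ("actual runtime", actual_minutes)]:
          cost = minutes / 60 * WORKERS * PRICE_PER_DPU_HOUR
          print(f"{label}: {minutes} run-minutes -> ${cost:.2f}")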

  • @siddharthsatapathy1366
    @siddharthsatapathy1366 2 years ago

    Hello Sir,
    In the case of concurrent runs, how are the resources shared between the different runs?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Each run is allocated the same capacity as configured in the job.