AWS Tutorials - Single AWS Glue Job & Multiple Transformations

  • Published: 16 May 2022
  • Related video: AWS Tutorials - AWS Glue Pipeline to Ingest Multiple SQL Tables
    Code Location - github.com/aws-dojo/analytics...
    There are scenarios where one has to ingest data from multiple SQL tables into the data lake. When ingesting data from different tables, one also needs to perform a different transformation per table. Learn how to create a single pipeline that uses a single Glue job to perform individual table-level transformations at ingestion time (a rough sketch follows below).
  • Science
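
    A minimal sketch of that pattern, assuming a job parameter named table_name and an in-script transform registry (the video keeps the transformation code in S3; this inlines it for brevity, and the catalog database and bucket names are made up):

      import sys
      from awsglue.utils import getResolvedOptions
      from awsglue.context import GlueContext
      from awsglue.dynamicframe import DynamicFrame
      from awsglue.job import Job
      from pyspark.context import SparkContext
      from pyspark.sql import functions as F

      args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_name"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)

      # One transformation function per source table.
      def transform_customers(df):
          return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

      def transform_orders(df):
          return df.filter(F.col("status") != "CANCELLED")

      TRANSFORMS = {"customers": transform_customers, "orders": transform_orders}

      # Read the table passed as a parameter, apply its transform, land it in S3.
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="sql_source_db", table_name=args["table_name"])
      df = TRANSFORMS[args["table_name"]](dyf.toDF())
      glue_context.write_dynamic_frame.from_options(
          frame=DynamicFrame.fromDF(df, glue_context, "out"),
          connection_type="s3",
          connection_options={"path": "s3://my-datalake/" + args["table_name"] + "/"},
          format="parquet")

      job.commit()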

Comments • 16

  • @tcsanimesh • 1 year ago

    Awesome!! Best in the entire YouTube inventory. Please don't stop making these types of videos.

  • @simij851 • 2 years ago +2

    Thank you, awesome video. Using the same concept but without Step Functions, would I be able to read the tables sequentially? I have 150 tables to read, and creating parallel tasks in Step Functions might be tedious, so I was wondering if we can read them in a loop.
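
    A sketch of what that sequential loop could look like inside one Glue job (the table list, catalog database, and bucket are illustrative assumptions):

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glue_context = GlueContext(SparkContext.getOrCreate())

      # The table list could also be loaded from a config file in S3.
      tables = ["customers", "orders", "products"]

      for table in tables:
          dyf = glue_context.create_dynamic_frame.from_catalog(
              database="sql_source_db", table_name=table)
          glue_context.write_dynamic_frame.from_options(
              frame=dyf,
              connection_type="s3",
              connection_options={"path": "s3://my-datalake/" + table + "/"},
              format="parquet")

    The trade-off is that one failed table can stop the whole run, and the tables no longer load in parallel.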

  • @tamaralazarevic2889 • 10 months ago

    How would you manage version control for the transformation code stored in S3?

  • @afjalahamad2465 • 1 year ago +1

    Please make videos on AWS Glue Schema Registries.

  • @sriadityab4794 • 2 years ago +1

    Can we assign Spark properties like driver and executor memory for a Glue job?

    • @AWSTutorialsOnline • 2 years ago +1

      You cannot set either, as Glue is an AWS managed service. However, you can select WorkerType and NumberOfWorkers as parameters, which decide the overall vCPU, memory, and disk space allocated to the job.
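
      For example, a boto3 sketch of setting those two parameters (the job name, role, and script location are made up):

        import boto3

        glue = boto3.client("glue")
        glue.update_job(
            JobName="multi-table-ingest",
            JobUpdate={
                "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
                "Command": {"Name": "glueetl",
                            "ScriptLocation": "s3://my-bucket/scripts/job.py"},
                "GlueVersion": "4.0",
                "WorkerType": "G.1X",        # 4 vCPU, 16 GB memory per worker
                "NumberOfWorkers": 10,
            })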

  • @gunjanagrawal7014 • 1 year ago +1

    Hi, that was a really nice explanation.
    Question: we have multiple in-house JSON source data files, which come with a header and footer, arrive at different times, and come from different sources.
    What do we need? We want to land these files in S3 and then run a Glue job to write the data to different Aurora PostgreSQL databases. We have 20 sources, so we are looking for a parameterized solution.
    Please guide us, or share a code snippet if you have one.

    • @AWSTutorialsOnline • 1 year ago

      Unless there is a common pattern across these files which can be parameterized, I would recommend you create separate jobs for each file.
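
      A rough sketch of the parameterized version, in case the files do share a pattern (the parameter name, connection name, and header/footer handling are assumptions, and it assumes one file per source per run):

        import sys
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from awsglue.dynamicframe import DynamicFrame
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["source_name"])
        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        # Read raw lines, drop the first (header) and last (footer) rows, parse JSON.
        raw = spark.read.text("s3://my-bucket/incoming/" + args["source_name"] + "/")
        n = raw.count()
        body = (raw.rdd.zipWithIndex()
                .filter(lambda pair: 0 < pair[1] < n - 1)
                .map(lambda pair: pair[0].value))
        df = spark.read.json(body)

        # "aurora-postgres-conn" is a pre-created Glue JDBC connection.
        glue_context.write_dynamic_frame.from_jdbc_conf(
            frame=DynamicFrame.fromDF(df, glue_context, args["source_name"]),
            catalog_connection="aurora-postgres-conn",
            connection_options={"dbtable": args["source_name"],
                                "database": "analytics"})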

  • @arvindsinha1566 • 8 months ago

    I have chart CSV files containing 1-minute-duration OHLC (open, high, low, close) data. I want to generate 5-minute, 30-minute, and 1-hour-duration OHLC data. How do I achieve this using Glue? I can have multiple CSV files.
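
    A PySpark sketch of one way to do that resampling in a Glue job (column names and the S3 path are assumptions; min_by/max_by need Spark 3.3+, i.e. Glue 4.0):

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      bars = (spark.read.option("header", "true")
              .option("inferSchema", "true")
              .csv("s3://my-bucket/ohlc/*.csv")
              .withColumn("ts", F.to_timestamp("ts")))

      # Roll 1-minute bars up into 5-minute bars; repeat with "30 minutes"
      # or "1 hour" for the other durations.
      five_min = (bars
          .groupBy(F.window("ts", "5 minutes").alias("w"))
          .agg(F.min_by("open", "ts").alias("open"),    # open of the earliest bar
               F.max("high").alias("high"),
               F.min("low").alias("low"),
               F.max_by("close", "ts").alias("close"))  # close of the latest bar
          .select(F.col("w.start").alias("ts"), "open", "high", "low", "close"))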

  • @faingtoku • 1 year ago +1

    Is it possible to do something similar while streaming different JSONs with Kinesis and storing them to a DB?

    • @AWSTutorialsOnline • 1 year ago

      It might not be possible to do it with streaming data, because it works with a fixed schema for the data coming in from Kinesis.

    • @faingtoku • 1 year ago

      @AWSTutorialsOnline Thank you for your response! Then how could I stream different JSONs from multiple sources to Kinesis and dump them into different DB tables with PySpark/Glue? Should I add a special key to each JSON so I can detect which transformation to use?
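
      One hedged way to implement that "special key" idea (not confirmed by the author; record_type, the transforms, and the paths are illustrative):

        from pyspark.sql import functions as F

        def transform_clicks(df):
            return df.withColumn("ingested_at", F.current_timestamp())

        def transform_payments(df):
            return df.filter(F.col("amount") > 0)

        TRANSFORMS = {"clicks": transform_clicks, "payments": transform_payments}

        # Branch each micro-batch on the record_type key the producers add.
        def process_batch(batch_df, batch_id):
            for record_type, transform in TRANSFORMS.items():
                subset = batch_df.filter(F.col("record_type") == record_type)
                (transform(subset).write.mode("append")
                    .parquet("s3://my-datalake/stream/" + record_type + "/"))

        # Wire it to the Kinesis-backed frame with:
        # glue_context.forEachBatch(frame=kinesis_df, batch_function=process_batch,
        #     options={"windowSize": "60 seconds",
        #              "checkpointLocation": "s3://my-bucket/checkpoints/"})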

  • @peterpan9361 • 2 years ago

    Can you make a video on how to move SharePoint data to AWS S3?
    This is a common requirement for many big companies, but I could find no automated solution.
    I believe we can do it using AWS Lambda, making API calls to SharePoint, but I am not sure how.
    Can you please assist :)
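
    A very rough Lambda sketch of that idea (the Microsoft Graph URL shape, the auth helper, and all names are untested assumptions, not a working recipe):

      import os
      import urllib.request
      import boto3

      s3 = boto3.client("s3")

      def handler(event, context):
          # get_graph_token is a hypothetical helper doing the OAuth2
          # client-credentials flow against Azure AD.
          token = get_graph_token()
          req = urllib.request.Request(
              "https://graph.microsoft.com/v1.0/sites/{site-id}/drive"
              "/root:/reports/data.xlsx:/content",
              headers={"Authorization": "Bearer " + token})
          with urllib.request.urlopen(req) as resp:
              s3.put_object(Bucket=os.environ["BUCKET"],
                            Key="sharepoint/data.xlsx", Body=resp.read())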

  • @simij851 • 2 years ago

    Thank you for doing this. I tried it, and it was super helpful. But randomly I would get this error:
    An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit. Continuation update failed due to version mismatch. Expected version 103 but found version 105
    The reason is that with concurrency and job bookmarks both enabled, Glue gets confused when parallel jobs complete and call job.commit(). If you know how you've handled this situation, that would be awesome.

    • @simij851 • 2 years ago

      Removing bookmarks helps resolve the error, but I need bookmarks enabled for all the tables that I'm running concurrently. I'm wondering if I should try changing the Glue job script to job.init(args["JOB_NAME"] + args["ctbl"], args), and within the Step Function specify the job name as "JobName": "JOBNAME+ctbl".
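
      A sketch of that idea (untested; ctbl is the table parameter from this thread, everything else is standard Glue boilerplate):

        import sys
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["JOB_NAME", "ctbl"])
        glue_context = GlueContext(SparkContext.getOrCreate())
        job = Job(glue_context)

        # One bookmark context per table, so concurrent runs of the same
        # job don't race on a shared bookmark version.
        job.init(args["JOB_NAME"] + args["ctbl"], args)

        # ... per-table read + transform + write here ...

        job.commit()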