AWS Tutorials - Building Event Based AWS Glue ETL Pipeline

  • Published: 2 Oct 2021
  • Python handler code - github.com/aws-dojo/analytics...
    AWS Glue pipelines are responsible for ingesting data into the data platform or data lake and for managing the data transformation lifecycle from the raw to the cleansed to the curated state. There are many ways to build such pipelines. In this video, you learn how to build an event-based ETL pipeline.

Comments • 39

  • @danieldani5495
    @danieldani5495 1 year ago

    Excellent video, you saved me a lot of time.

  • @user-fn3zs2wq5c
    @user-fn3zs2wq5c 10 months ago +1

    You've explained the execution flow well, but you haven't explained how to create the Glue database, the Data Catalog tables, the DynamoDB table, the Lambda function, or the EventBridge rule. You created the backend beforehand and are just explaining the flow again. Please explain the creation part as well.

  • @veerachegu
    @veerachegu 2 years ago +1

    Good to see this demo. Please also do a demo on incremental data upload into an S3 bucket.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Sure. Could you please suggest a few scenarios, such as data sources, and I will try to make one.

    • @veerachegu
      @veerachegu 2 years ago

      @@AWSTutorialsOnline Can you share your email id, please?

  • @DanielWeikert
    @DanielWeikert 2 years ago +1

    Great work. You got my sub, you deserved it. I highly appreciate your work.
    Could you do a workshop exercise for setting up such a pipeline?
    Can you also do a tutorial/workshop on setting up Glue job pipelines with CloudFormation?
    Thanks and best regards

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Thanks for the appreciation. What do you mean by workshop? Also, I am planning a tutorial on using CDK / CloudFormation to set up such a pipeline.

    • @DanielWeikert
      @DanielWeikert 2 years ago

      @@AWSTutorialsOnline Thanks, I was referring to the step-by-step exercises you provide on your homepage.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      @@DanielWeikert Ah, OK. I will plan for it.

  • @hirendra83
    @hirendra83 2 years ago

    Thanks, very helpful tutorial. Please continue your good work. Sir, can you cover how to create a monitoring or observability dashboard for such a pipeline using CloudWatch logs?

  • @aniket9602
    @aniket9602 1 year ago

    Can someone please explain the following code, which is written in the Lambda script?
    target = ddresp["Items"][0]["target"]["S"]
    targettype = ddresp["Items"][0]["targettype"]["S"]
    What is the expected output of these lines?
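    Those two lines unpack a low-level DynamoDB query response. A minimal sketch of what the surrounding Lambda code likely looks like follows; the table name, key names, and values are illustrative assumptions, not taken from the video:

      import boto3

      # The low-level DynamoDB client wraps every attribute value in a
      # type-descriptor map, e.g. {"S": "..."} for strings.
      ddclient = boto3.client("dynamodb")

      # Hypothetical query; table and attribute names are illustrative.
      # "source" is a DynamoDB reserved word, hence the name placeholder.
      ddresp = ddclient.query(
          TableName="pipeline-config",
          KeyConditionExpression="#src = :s",
          ExpressionAttributeNames={"#src": "source"},
          ExpressionAttributeValues={":s": {"S": "customer-raw-crawler"}},
      )

      # ddresp["Items"] is a list of matching items; each attribute is a
      # {type: value} map, so ["target"]["S"] unwraps the string value.
      target = ddresp["Items"][0]["target"]["S"]          # e.g. "customer-cleansed-job"
      targettype = ddresp["Items"][0]["targettype"]["S"]  # e.g. "job" or "crawler"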

  • @coldstone87
    @coldstone87 2 years ago +1

    Hello. Thanks for the tutorial. I have a small clarification: basically, every Glue job and Glue crawler by default writes an event to the default EventBridge bus, and then, based on rule filtering, we invoke the Lambda. Correct? I ask because I don't see any code or configuration in the job or crawler to publish an event to EventBridge. Please confirm my understanding.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      You are right. Most AWS services (including Glue jobs and crawlers) automatically publish events to the EventBridge default event bus. You then use rules to hook into a particular event and do what you want when that event is raised, as in the sketch below.
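      A minimal sketch of such a rule, created with boto3; the rule name, state filter, and Lambda ARN are illustrative assumptions (the Lambda also needs a resource-based policy allowing EventBridge to invoke it):

        import json
        import boto3

        events = boto3.client("events")

        # Match successful Glue job runs on the default event bus.
        events.put_rule(
            Name="glue-job-succeeded",
            EventPattern=json.dumps({
                "source": ["aws.glue"],
                "detail-type": ["Glue Job State Change"],
                "detail": {"state": ["SUCCEEDED"]},
            }),
        )

        # Point the rule at the handler Lambda.
        events.put_targets(
            Rule="glue-job-succeeded",
            Targets=[{
                "Id": "handler",
                "Arn": "arn:aws:lambda:us-east-1:111122223333:function:pipeline-handler",
            }],
        )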

  • @poojakarthik93
    @poojakarthik93 2 years ago +1

    Hello. Thanks a lot for this video, it is really helpful. I have one question: to run your second Glue job, how will we know that all our files have been copied to S3?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Good question - you need to watch this video - ruclips.net/video/UBhG_UMuFEo/видео.html

  • @spp3607
    @spp3607 2 years ago

    Thank you for the tutorials.
    I have a question on deployment: after developing this pipeline (Glue, crawler, Lambda, and EventBridge) in the development environment, how do we move/deploy all of it to production?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      You should not create these resources manually. Rather, use CloudFormation or CDK as infrastructure-as-code services to script the resource creation and move between environments, along the lines of the sketch below.
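      A minimal CDK (Python, v2) sketch of scripting one of the pipeline resources; the stack, job, role, and script names are illustrative assumptions, and the same stack would also declare the crawler, Lambda, and EventBridge rule:

        from aws_cdk import App, Stack
        from aws_cdk import aws_glue as glue
        from constructs import Construct

        class GluePipelineStack(Stack):
            def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
                super().__init__(scope, construct_id, **kwargs)

                # Low-level (L1) construct for a Glue job; the crawler, Lambda,
                # and EventBridge rule would be declared alongside it.
                glue.CfnJob(
                    self, "CleanseJob",
                    name="customer-cleanse-job",
                    role="arn:aws:iam::111122223333:role/glue-job-role",
                    command=glue.CfnJob.JobCommandProperty(
                        name="glueetl",
                        script_location="s3://my-pipeline-bucket/scripts/cleanse.py",
                    ),
                    glue_version="3.0",
                )

        app = App()
        # The same stack deploys unchanged to dev and prod.
        GluePipelineStack(app, "glue-pipeline-dev")
        GluePipelineStack(app, "glue-pipeline-prod")
        app.synth()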

  • @suneelkumar-kn4ds
    @suneelkumar-kn4ds 1 year ago

    Hi Sir, would you clarify one query? I have this doubt while you are explaining the data pipeline at 3:20: why are we using the Data Catalog here?

  • @ballusaikumar873
    @ballusaikumar873 1 year ago +1

    Thank you for making useful videos on AWS; I have learnt a lot by watching them. I have a use case where I need your input: a job writes multiple Parquet files (usually a single dataset split into multiple files due to Spark partitions) to an S3 bucket, and I want to send an event to EventBridge only when all the files have been written successfully. How do I implement this using S3 and EventBridge? Currently, I see multiple events getting triggered.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      I did create a video on that. Please check it out - hope you find it useful. Link - ruclips.net/video/UBhG_UMuFEo/видео.html

  • @arunt741
    @arunt741 2 years ago +1

    Thank you very much for your excellent work with this channel. If I have multiple Glue jobs but want to publish to EventBridge only for some of them, how do I handle that in the event pattern? If I am not wrong, with this event pattern the completion of any Glue job will trigger the Lambda, correct? Can we use some tokens in the event pattern, e.g. Glue job name starts with GJ_%? Thanks in advance.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      Hi, it seems there is no way to filter on the job name in an EventBridge rule. You will have to filter at the handler level. You can build a two-step handler: EventBridge to an SNS topic (the filter) to Lambda (the actual handler). At SNS, you can configure a subscription filter policy on messages to stop processing for certain Glue jobs - see the sketch below.
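      A sketch of the SNS subscription filter described above, assuming the EventBridge rule targets an SNS topic so the Glue event lands in the message body; the topic and Lambda ARNs are illustrative assumptions. (For what it's worth, EventBridge event patterns have also gained prefix matching on detail fields since this reply was written.)

        import json
        import boto3

        sns = boto3.client("sns")

        # Subscribe the handler Lambda to the topic, filtering on the message
        # body so only jobs whose name starts with "GJ_" reach the handler.
        sns.subscribe(
            TopicArn="arn:aws:sns:us-east-1:111122223333:glue-events",
            Protocol="lambda",
            Endpoint="arn:aws:lambda:us-east-1:111122223333:function:pipeline-handler",
            Attributes={
                # Filter on the message payload rather than message attributes.
                "FilterPolicyScope": "MessageBody",
                "FilterPolicy": json.dumps({
                    "detail": {"jobName": [{"prefix": "GJ_"}]}
                }),
            },
        )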

    • @arunt741
      @arunt741 2 years ago

      @@AWSTutorialsOnline Thank you very much. That is a great suggestion.

  • @canye1662
    @canye1662 2 years ago

    Nice video, but I would like to know if you have code that can be embedded in the Glue job script to prevent duplicate data if the job runs every hour.
    I know a bookmark will help, but I am looking for code that can be included in the script section.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Do you want to use a job bookmark, or do you want to build custom business logic for the incremental data? For the bookmark route, see the sketch below.
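      A minimal sketch of the job-bookmark route: the script-side hooks that make bookmarks work, assuming bookmarks are also enabled on the job itself. The database, table, and S3 path names are illustrative assumptions:

        import sys
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["JOB_NAME"])
        glueContext = GlueContext(SparkContext.getOrCreate())
        job = Job(glueContext)
        job.init(args["JOB_NAME"], args)  # job.init/commit bracket the bookmark state

        # transformation_ctx is the handle the bookmark uses to remember which
        # files were already processed, so hourly runs only pick up new data.
        dyf = glueContext.create_dynamic_frame.from_catalog(
            database="customer_db",
            table_name="raw_customers",
            transformation_ctx="read_raw_customers",
        )

        glueContext.write_dynamic_frame.from_options(
            frame=dyf,
            connection_type="s3",
            connection_options={"path": "s3://my-pipeline-bucket/cleansed/"},
            format="parquet",
            transformation_ctx="write_cleansed_customers",
        )

        job.commit()  # persists the bookmark state for the next run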

  • @veerachegu
    @veerachegu 2 years ago +1

    Can we use S3 instead of DynamoDB to hold the Lambda execution data?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      Technically you can. Use a JSON document so that it is easy to query, as in the sketch below.
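      A minimal sketch of the S3 alternative, keeping the source-to-target mapping in a JSON document; the bucket, key, and field names are illustrative assumptions:

        import json
        import boto3

        s3 = boto3.client("s3")

        # Read the whole mapping document instead of querying DynamoDB.
        obj = s3.get_object(Bucket="my-pipeline-bucket", Key="config/pipeline.json")
        config = json.loads(obj["Body"].read())

        # e.g. config == {"customer-raw-crawler":
        #                   {"target": "customer-cleansed-job", "targettype": "job"}}
        entry = config["customer-raw-crawler"]
        target, targettype = entry["target"], entry["targettype"]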

    • @veerachegu
      @veerachegu 2 years ago

      One more query from my end:
      In my project, we need to pull files into S3 through an API. Those files contain 25k records per day, and the next day they are updated along with the existing records. Will Lambda support this scenario? I think it can run for up to 15 minutes, but the API calls may take longer than that. Please guide me on the best way to store the data in S3 without any conflicts - is SQS or a Step Function a good option, or is some other service better?

  • @MrErPratikParab
    @MrErPratikParab 1 year ago +1

    How useful would Airflow be here?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      I see Apache Airflow as another workflow engine which could deliver the same result.

  • @DanielWeikert
    @DanielWeikert 2 years ago +1

    When I triggered a Glue workflow with Lambda to write CSV to another folder as Parquet, I received this error:
    Unsupported case of DataType: com.amazonaws.services.glue.schema.types.LongType@538fb895 and DynamicNode: stringnode
    I did not find any help on Google. Any ideas?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I cannot figure it out unless I see the data and work with it a little.
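      One common cause of this error is a column whose values are sometimes long and sometimes string, which cannot be written to Parquet as-is. A hedged sketch of the usual workaround, using resolveChoice on a DynamicFrame dyf as in the bookmark sketch above (the column name is an illustrative assumption):

        # Force the ambiguous column to a single type before writing Parquet.
        dyf_resolved = dyf.resolveChoice(specs=[("customer_id", "cast:string")])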

  • @pvchennareddy
    @pvchennareddy 2 years ago +1

    Can you please share the Glue job code?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I added the job code at the same link.

    • @pvchennareddy
      @pvchennareddy 2 years ago

      Thanks

    • @pvchennareddy
      @pvchennareddy 2 years ago +1

      Thanks for the code. I was working on the application side, and now I need to work on a data lake setup, which is new to me. As I understand it, the industry is moving towards the data lakehouse, and I am new to this. I want to know the difference between a data lake and a lakehouse: when should I go for a data lake, and when for a lakehouse? Let me know if you have done anything on this, or drop me a note if you do in the future. Thanks.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      @@pvchennareddy A data lake is more about managing and governing data in a single place; a data lakehouse goes beyond that. The following links should help you understand it -
      aws.amazon.com/blogs/big-data/harness-the-power-of-your-data-with-aws-analytics/
      aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/

  • @abhijeetjain8228
    @abhijeetjain8228 3 months ago

    The demo part is not good; things are not properly explained.
    You are just reading, not showing how to create things.
    Please focus on the practical part instead of theory.