AWS Tutorials - Using Glue Job ETL from REST API Source to Amazon S3 Bucket Destination

  • Published: 3 Jan 2025

Comments •

  • @ganeshmaximus9604
    @ganeshmaximus9604 2 years ago +2

    Nice work!! Your channel is unique. You deserve to have more subscribers.
    I have tested this with Lambda and it works great.

  • @Ankit-df8rw
    @Ankit-df8rw 3 months ago

    Amazing tutorial sir

  • @muqeempashamohammed3394
    @muqeempashamohammed3394 4 years ago +3

    Thank you, a very nice and easy-to-understand tutorial

  • @Ankit-df8rw
    @Ankit-df8rw 3 months ago

    Can you provide more tutorials, for example SCD Type 2 using RDS, Debezium and Glue?

  • @veerachegu
    @veerachegu 3 years ago +1

    Thanks, nice explanation

  • @josemanuelgutierrez4095
    @josemanuelgutierrez4095 1 year ago +1

    Do you have any videos about integrating APIGEE with AWS?

  • @savirawat6671
    @savirawat6671 2 years ago +1

    Is it possible to pull a JSON body from S3, use that JSON body to trigger a call (POST the JSON data to the external API),
    and load the data into my API via AWS Glue?
    Could you share some sample/reference?
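
    One way this is commonly done (a minimal sketch, not from the tutorial: the bucket, key and API URL below are placeholders) is a Glue Python shell job that reads the object with boto3 and POSTs it with requests:

    import json
    import boto3
    import requests

    # Read the JSON body from S3 (bucket and key are placeholders)
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='my-source-bucket', Key='payload.json')
    payload = json.loads(obj['Body'].read())

    # POST that body to the external API (URL is a placeholder)
    resp = requests.post('https://api.example.com/v1/load', json=payload)
    print(resp.status_code, resp.text)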

  • @veerachegu
    @veerachegu 3 years ago +1

    Is the same process applicable when I am pulling an on-prem Airtable to Redshift via API?

  • @devopsdigital2834
    @devopsdigital2834 1 year ago

    How have you used the public subnet in this demo?

  • @drew4849
    @drew4849 4 years ago +2

    Hello there. I'm new to AWS. We have to create something like this at my new job: we need an ETL to extract data from an API on the internet and save the data to S3. My question is about pricing. Since we will only use the ETL once a month, what do you suggest we do about the VPC, allocated IP, and NAT that are billed per hour? Sorry if I'm misunderstanding some things; these are all pretty new to me. Your help will be greatly appreciated. Thanks.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago +4

      Hello Drew. Apologies for the late reply. There are two approaches you can take -
      1) If you already have a VPC which is used for other workloads, you can leverage that for this external API call.
      2) Otherwise, create a CloudFormation template which sets up everything - VPC, subnets, NAT, Glue, etc. You schedule a Lambda function which runs every month. It first creates the whole infrastructure using the CloudFormation stack, then runs the Glue job, and finally deletes the CloudFormation stack.
      Hope it helps,
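
      A rough sketch of approach 2 as a scheduled Lambda function (not from the tutorial; the stack name, template URL and Glue job name are placeholders, and in practice you would poll the job run before deleting the stack):

      import boto3

      cfn = boto3.client('cloudformation')
      glue = boto3.client('glue')

      STACK_NAME = 'monthly-etl-infra'                                    # placeholder
      TEMPLATE_URL = 'https://s3.amazonaws.com/my-bucket/etl-infra.yaml'  # placeholder
      JOB_NAME = 'rest-api-to-s3-job'                                     # placeholder

      def lambda_handler(event, context):
          # 1. Create the VPC/NAT/Glue infrastructure and wait until it is ready
          cfn.create_stack(StackName=STACK_NAME, TemplateURL=TEMPLATE_URL,
                           Capabilities=['CAPABILITY_NAMED_IAM'])
          cfn.get_waiter('stack_create_complete').wait(StackName=STACK_NAME)

          # 2. Start the Glue job (poll glue.get_job_run until it finishes)
          run_id = glue.start_job_run(JobName=JOB_NAME)['JobRunId']

          # 3. Tear everything down again so nothing is billed per hour
          cfn.delete_stack(StackName=STACK_NAME)
          return {'jobRunId': run_id}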

    • @drew4849
      @drew4849 4 years ago

      @@AWSTutorialsOnline hey, thanks for the reply!

    • @alwayssporty8102
      @alwayssporty8102 1 year ago

      Hey, maybe I'm a couple of years late, but we have the exact same requirement and I'm quite new to some AWS services. Did you implement that?

  • @gowthamavinash9
    @gowthamavinash9 3 years ago +2

    Thank you for the tutorial. I am new to Glue and have a question. Is it possible to insert the data directly into the database instead of storing it in S3, in the same script?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +2

      Yes, it is possible. You can use AWS Data Wrangler to populate a dataframe with the API output and then use the AWS Data Wrangler RDS methods (to_sql) to write to the RDS database. Have a look at this link - github.com/awslabs/aws-data-wrangler
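
      A minimal sketch of that idea (not from the tutorial; it assumes a MySQL flavour of RDS, uses a sample public API, and the connection, schema and table names are placeholders):

      import awswrangler as wr
      import pandas as pd
      import requests

      # Load the REST API output into a pandas dataframe
      resp = requests.get('https://jsonplaceholder.typicode.com/todos')
      df = pd.DataFrame(resp.json())

      # Write it straight to RDS through a Glue Catalog connection (names are placeholders)
      con = wr.mysql.connect(connection='my-rds-connection')
      wr.mysql.to_sql(df=df, con=con, schema='mydb', table='todos', mode='overwrite')
      con.close()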

    • @gowthamavinash9
      @gowthamavinash9 3 years ago

      @@AWSTutorialsOnline Thank you for the response. I am trying to connect to a SQL Server database but I am receiving a "pyodbc module not found" error, even after I pointed to the .whl files for AWS Data Wrangler and pyodbc. Is it possible to connect to a SQL Server database with a Python shell job in Glue without using pip install pyodbc?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@gowthamavinash9 Where is your SQL Server located?

    • @gowthamavinash9
      @gowthamavinash9 3 years ago

      @@AWSTutorialsOnline It is an RDS database. I am able to connect to the db and import the tables using a crawler.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@gowthamavinash9 Ah ok. Then your job is very simple. Using PySpark itself, you can read data from SQL and write back to it. You don't need any other Python module. Please have a look at these two labs related to Redshift and RDS; they will give you an idea of what to do. aws-dojo.com/workshoplists/workshoplist33/
      aws-dojo.com/workshoplists/workshoplist30/
      Hope it helps. If you have more questions, please feel free to reach out to me.
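
      A minimal sketch of that flow in a Glue Spark job (not from the labs; the catalog database, table and connection names below are placeholders):

      from pyspark.context import SparkContext
      from awsglue.context import GlueContext

      glueContext = GlueContext(SparkContext.getOrCreate())

      # Read the crawled RDS table from the Glue Data Catalog (placeholder names)
      src = glueContext.create_dynamic_frame.from_catalog(
          database='my_catalog_db',
          table_name='dbo_source_table')

      # ...apply transforms here...

      # Write back to the RDS instance through the Glue JDBC connection (placeholder names)
      glueContext.write_dynamic_frame.from_jdbc_conf(
          frame=src,
          catalog_connection='my-sqlserver-connection',
          connection_options={'dbtable': 'dbo.target_table', 'database': 'mydb'})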

  • @sodiqafolayan4921
    @sodiqafolayan4921 4 years ago +1

    Hello, after completing the workshop, I tried to build it with CloudFormation. However, I keep getting "Validation for connection properties failed (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException)". See my Glue connection resource code below. Can you advise on how to correct this?
    MyGlueConnection:
      Type: AWS::Glue::Connection
      Properties:
        CatalogId: !Ref AWS::AccountId
        ConnectionInput:
          ConnectionProperties:
            JDBC_CONNECTION_URL: jdbc:dummy:80/dev
          ConnectionType: JDBC
          PhysicalConnectionRequirements:
            AvailabilityZone: us-east-1b
            SubnetId: !Ref GlueJobPrivateSubnet

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago +1

      Hi, I think I replied to you a few days back, but I am not sure what happened to that comment. I would recommend not using this fake JDBC method. AWS has now come up with a connection type of "Network". It enables connections to the internet as long as the subnet has internet access. Hope it helps.
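
      For reference, a rough sketch of what a Network-type connection could look like in the same CloudFormation template (not from the workshop; the security group reference is a placeholder and should allow all outbound traffic):

      MyGlueNetworkConnection:
        Type: AWS::Glue::Connection
        Properties:
          CatalogId: !Ref AWS::AccountId
          ConnectionInput:
            Name: network-connection
            ConnectionType: NETWORK
            PhysicalConnectionRequirements:
              AvailabilityZone: us-east-1b
              SecurityGroupIdList:
                - !Ref GlueSecurityGroup          # placeholder security group
              SubnetId: !Ref GlueJobPrivateSubnet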

  • @hsz7338
    @hsz7338 3 years ago +1

    Thank you for the tutorial. I have a question: what connection type should we use if we want to connect to an external Kafka such as Confluent Cloud Kafka?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      I have never worked with Kafka, but I think for external Kafka you can use a Network-type connection, because all you need is an outbound network connection to the external system. Please let me know if it helped.

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline Thank you. That is what I am thinking of testing out, because the Kafka connection type works for AWS MSK and I am not sure it works for external Kafka. I will let you know the outcome either way.

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline I couldn't make it work; it complains about an SSL handshake error.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@hsz7338 Hmm. For a Kafka-type connection, it does ask for certificate configuration. It could be that the external Kafka is also demanding the same. Is there no way you can provide a specific certificate in the code? I am thinking of using a Network connection just to set up network access and then kafka-python for the Python-based access, like here - kafka-python.readthedocs.io/en/master/apidoc/BrokerConnection.html
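
      A rough sketch of that idea, assuming the Network connection already provides outbound access and using kafka-python's KafkaConsumer over SASL_SSL (broker address, credentials and topic are placeholders); whether this works against Confluent Cloud from inside a Glue job is exactly what is being debugged below:

      from kafka import KafkaConsumer

      # Consume from the external cluster over SASL_SSL (all values are placeholders)
      consumer = KafkaConsumer(
          'my-topic',
          bootstrap_servers='pkc-xxxxx.us-east-1.aws.confluent.cloud:9092',
          security_protocol='SASL_SSL',
          sasl_mechanism='PLAIN',
          sasl_plain_username='API_KEY',
          sasl_plain_password='API_SECRET',
          auto_offset_reset='earliest')

      for message in consumer:
          print(message.value)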

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline I have tried to use the Kafka connection previously. Confluent Cloud Kafka doesn't provide a private CA (Confluent Platform does). The authentication protocol is SASL_SSL, which is the reason I had to use the Network connection type in Glue ETL. The kafka-python package works fine against a local Apache Kafka, but Confluent suggests using confluent-kafka-python in the Python client application, which I can't seem to make work in a Glue ETL job.

  • @pragtyagi6262
    @pragtyagi6262 4 years ago +1

    I am getting an error at the last step when running the Glue job, although I performed all the steps in the us-west-2 region. I am not able to see the logs either, as it says the log group is not available in the mentioned region, so I cannot see the error logs.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      The workshop uses the eu-west-3 region. Have you made sure that all the resources you created are in the us-west-2 region? Also, in the job code, the eu-west-3 region is hard-coded as a parameter. Have you updated that as well? Please let me know.

    • @pragtyagi6262
      @pragtyagi6262 4 years ago

      @@AWSTutorialsOnline Yes, all my resources are in the us-west-2 (Oregon) region. Also, in my code I have replaced the S3 bucket name with my S3 bucket and the region with us-west-2. Still it is failing. I checked and found that AWS Glue is supported in us-west-2, so I am not sure what the issue is. Any suggestions on how to enable CloudWatch logs in this POC?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      CloudWatch logging is enabled by default. Do you have permission to see the log? What permissions does your logged-in account have?

    • @pragtyagi6262
      @pragtyagi6262 4 years ago

      @@AWSTutorialsOnline I assigned the AWS administrator policy to the role I created for this task, along with the S3 full access policy.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      Thanks for the details. That role is for the Glue job, which means the Glue job has administrative permissions. What permissions does your AWS logged-in account have? Can it see the logs in general?

  • @luisg6965
    @luisg6965 3 years ago +1

    Hello, I am getting an "Error 110: timed out" error. Can you help me?
    Thanks for the video

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      It seems you have not configured the Glue connection properly. This blog is a little old, from before the Glue Network connection feature was available. Try this newer workshop, which uses a Network connection. Hope it also solves your problem.
      aws-dojo.com/workshoplists/workshoplist26/

  • @toshitmavle1707
    @toshitmavle1707 3 years ago +1

    Hi,
    Can you give a code example to call a POST API instead of GET?
    We have a requirement to call a couple of external REST POST endpoints --
    1. OAuth2 API
    2. Service API

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Does the code below help?

      import json
      from botocore.vendored import requests

      def lambda_handler(event, context):
          # TODO Change URL to the private API URL
          url = 'rest api url'
          r = requests.post(url)
          return {
              'statusCode': 200,
              'body': json.dumps(r.text)
          }
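
      For the two calls in the question, a rough sketch using the requests library (not from the tutorial; the token URL, client credentials and service endpoint are placeholders, and the exact grant type depends on the external provider):

      import requests

      # 1. OAuth2 API: exchange client credentials for an access token (placeholder values)
      token_resp = requests.post(
          'https://auth.example.com/oauth2/token',
          data={'grant_type': 'client_credentials',
                'client_id': 'MY_CLIENT_ID',
                'client_secret': 'MY_CLIENT_SECRET'})
      access_token = token_resp.json()['access_token']

      # 2. Service API: POST the payload with the bearer token
      service_resp = requests.post(
          'https://api.example.com/v1/records',
          headers={'Authorization': 'Bearer ' + access_token},
          json={'key': 'value'})
      print(service_resp.status_code, service_resp.text)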

  • @sodiqafolayan4921
    @sodiqafolayan4921 4 years ago

    Hi, thank you for this cool workshop. However, I recreated it on my own using a custom VPC and the job never ran successfully. Below are the steps I followed, and I would be glad if you could point out what else I need to do:
    1. I created an EIP (to be used by the NAT)
    2. I created a VPC
    3. I created a public and a private subnet
    4. I created an IGW and attached it to the VPC I created in step 2
    5. I created a route table, added the IGW as a route, and associated it with the public subnet
    6. I created a NAT gateway in the public subnet and associated the EIP created in step 1 with it
    7. I created another route table, opened the 0.0.0.0/0 route with the NAT as the target, then associated the route table with the private subnet
    8. I created an IAM role for Glue to access S3
    9. I created an S3 bucket
    10. I created a dummy JDBC connection and put it inside the VPC and private subnet that I created above.
    11. I created the Glue job accordingly, but after running it, it failed. Unfortunately, it did not give me any logs and I can't understand the reason it failed.
    Note that I edited the Python script, used the S3 bucket I created, and changed the region to the region I was working in, but it still did not work.
    Obviously there is something I am not doing right, but I couldn't figure it out.
    I will appreciate your feedback

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      Hi - your configuration seems right. It seems your code is not able to reach the S3 bucket. Can you please check two things -
      1) Is your job able to make an outbound call to the internet? Try the API I provided and see whether it works. This will ensure that your job and VPC configuration are working.
      2) If 1 works, then focus on the code that accesses the S3 bucket. It might have an error. Please send me the code - I can check as well.
      Hope it helps you move forward.

    • @sodiqafolayan4921
      @sodiqafolayan4921 4 years ago

      @@AWSTutorialsOnline I used exactly the code you provided, with the API you provided. I only changed the bucket name and region as expected. I also noticed that any time I run the job, the code gets stored in the bucket but I do not get the return text as expected.
      My question: how do I specifically check whether my job is able to make an outbound call? What I have done in this regard is create a route in the private subnet (0.0.0.0/0) with the NAT as the target, and I also gave the NAT access to the internet, so I think this should let it access the internet. I would be glad if there is any other way I can check whether it can make an internet call from the private subnet.
      Once again, thank you so much for the awesome job. I have already bookmarked about 5 of your workshops and am turning them into mini projects. You have really helped me to learn.
      NB: The code I used
      import requests
      import boto3
      URL = "https://jsonplaceholder.typicode.com/todos/1"
      r = requests.get(url=URL)
      s3_client = boto3.client('s3', region_name='i-inserted-my-region-here')
      s3_client.put_object(Body=r.text, Bucket='i-have-my-bucket-name-here', Key='mydata.txt')

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      @@sodiqafolayan4921 Your network configuration looks good. One way to test is to launch an EC2 instance in the same private subnet and access the URL from it. Once URL access is confirmed, a few more things to check: 1) the Glue job and S3 bucket are in the same region; 2) the Glue job role has permission to write to the S3 bucket.
      Please let me know if it helps.

    • @sodiqafolayan4921
      @sodiqafolayan4921 4 years ago

      @@AWSTutorialsOnline Thank you so much. I was able to complete the workshop successfully. I am now replicating it with CFN.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  4 years ago

      @@sodiqafolayan4921 Great. What fixed the problem?