AWS Tutorials - Using AWS Glue Workflow
- Published: 2 Dec 2020
- The Workshop URL - aws-dojo.com/workshoplists/wo...
AWS Glue Workflow helps create complex ETL activities involving multiple crawlers, jobs, and triggers. Each workflow manages the execution and monitoring of the components it orchestrates. The workflow records execution progress and status of its components, providing an overview of the larger task and the details of each step. The AWS Glue console also provides a visual representation of the workflow as a graph.
In this workshop, you create a workflow which orchestrates a Glue Crawler and a Glue Job.
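For readers who prefer the API to the console, a workflow like the one built in this workshop can be sketched with boto3. This is only a sketch: all resource names below are hypothetical, and the actual boto3 calls are left commented out.

```python
def build_workflow_calls():
    """Payloads for a minimal workflow: an on-demand trigger starts a
    crawler, then a conditional trigger runs a job once the crawler
    succeeds. All names are hypothetical."""
    return [
        ("create_workflow", {"Name": "dojo-workflow"}),
        ("create_trigger", {
            "Name": "start-crawl",
            "WorkflowName": "dojo-workflow",
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": "dojo-crawler"}],
        }),
        ("create_trigger", {
            "Name": "run-job-after-crawl",
            "WorkflowName": "dojo-workflow",
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": "dojo-crawler",
                "CrawlState": "SUCCEEDED",
            }]},
            "Actions": [{"JobName": "dojo-job"}],
        }),
    ]

# To actually create the workflow (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# for method, kwargs in build_workflow_calls():
#     getattr(glue, method)(**kwargs)
```

The console builds the same objects visually; the API form just makes the trigger/predicate structure explicit.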
Best and to the point, your channel should be the official AWS learning channel.
Many thanks for the appreciation
I just want to say thank you for all the tutorials you have done
Glad you like them!
Thanks a ton for all your effort in making this video
Thanks for the appreciation
Really like your videos, they are simple which helps us easily understand the concept.
Thanks for appreciation
I really like your video. Thank you so much for a wonderful video.
Glad you liked it
Thank you very much!! Amazing videos
Thanks for the appreciation
Very good explanation in detail...
Thanks
Nice video.. very easy steps thanks!
Glad it helped
You Are Awesome 🙂
You are doing great work. Please keep making videos on Glue. Your content is the best. Can you make a video on reading from RDS with a secure SSL connection using Glue?
sure - I will put it to the backlog.
Thank you so much for such a wonderful tutorial, really appreciate.
Can you please tell us how we can set a global variable in glue job. Thank you
Apologies for the late response due to my summer break.
There is no concept of global variables. But jobs can maintain state between them in the workflow - here is a video about it - ruclips.net/video/G6d6-abiQno/видео.html
Hope it helps,
Hello Sir, thanks for the wonderful session. I have a quick question: I was able to create two different data loads in the same Glue job and it successfully loaded two targets. But I would like to know how we can configure the target load plan (similar to Informatica) in an AWS Glue Studio job.
Glue jobs support parameters. You can parameterize the target location when running the Glue job.
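As a sketch of that parameterization: inside the job script, `awsglue.utils.getResolvedOptions` reads named arguments passed at run time. Since `awsglue` is only available inside Glue, the helper below is a minimal stand-in for the same lookup; the `TARGET_PATH` argument name and job name are assumptions, not anything from the video.

```python
def get_resolved_options(argv, names):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    returns a dict of the values following each --NAME flag."""
    opts = {}
    for name in names:
        flag = "--" + name
        i = argv.index(flag)  # raises ValueError if the flag is missing
        opts[name] = argv[i + 1]
    return opts

# Inside the Glue job script you would read the parameter like:
# import sys
# target = get_resolved_options(sys.argv, ["TARGET_PATH"])["TARGET_PATH"]

# And when starting the job, pass the parameter (hypothetical names):
# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(
#     JobName="load-job",
#     Arguments={"--TARGET_PATH": "s3://my-bucket/processed/"},
# )
```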
It was a good tutorial but I would recommend a better mic as it is hard to hear you at some times.
Good tutorial, but the audio fades in and out.
AWS Glue has been updated enough to make some of this information irrelevant. I would update with the latest UI and correct the audio issues.
Thank you.
Please, how do you make use of the properties? Is there another tutorial on that? Thanks
yes there is - ruclips.net/video/G6d6-abiQno/видео.html
You had better use a headset or earphones while speaking; otherwise the session is very good.
How can we add DPUs to a Glue job using a Glue workflow?
Not sure why you want to add DPUs to a Glue job from the workflow. When you configure the Glue job, you can configure its default DPUs there.
Hi, is it possible to move an S3 file (CSV), after it has been imported into an RDS MySQL table by a Glue job, to a processed S3 folder? Great content as always.
Sure, it is possible. I created a workshop for this scenario which might help you. aws-dojo.com/workshoplists/workshoplist33
Hope it helps,
@@AWSTutorialsOnline Thank you and much appreciated.
22:40 AWS Glue Workflows
Hi sir, my question is: whenever a push happens in S3, how can I make my workflow run automatically? Please help.
Configure an event on the S3 bucket. The event will call a Lambda function, and the Lambda function will start the Glue workflow using an SDK such as Python boto3.
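A minimal sketch of such a Lambda function, assuming a hypothetical workflow name. The pure helper just parses the S3 notification event; the boto3 import is deferred into the handler so the helper can be exercised without AWS.

```python
def extract_s3_object(event):
    """Pull bucket and key from an S3 put/post notification event."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def lambda_handler(event, context):
    import boto3  # deferred so extract_s3_object stays testable offline
    bucket, key = extract_s3_object(event)
    glue = boto3.client("glue")
    # Workflow name is a hypothetical placeholder:
    run = glue.start_workflow_run(Name="my-etl-workflow")
    return {"runId": run["RunId"], "source": f"s3://{bucket}/{key}"}
```

The Lambda's execution role would need `glue:StartWorkflowRun` permission.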
Sir, is there any way we can set a trigger between S3 and a Glue job?
What I mean is: whenever a new file is uploaded to S3, a trigger should activate and run the Glue job, and the same thing for the crawler.
So whenever a new file is uploaded to S3, it activates a trigger for the crawler and the job. Thank you
You can do it. Configure an event on the S3 bucket that triggers on put and post events. When the event is raised, call a Lambda function. In the Lambda function, use the Python boto3 API to start the Glue job and crawler.
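A hedged sketch of that Lambda, with hypothetical crawler and job names. Note that boto3 only starts the crawler and job; it does not make the job wait for the crawler to finish (for ordering, a workflow or polling would be needed).

```python
def planned_calls(crawler_name, job_name):
    """Boto3 method names and payloads to start a crawler and a job.
    Both names are hypothetical placeholders."""
    return [
        ("start_crawler", {"Name": crawler_name}),
        ("start_job_run", {"JobName": job_name}),
    ]

def lambda_handler(event, context):
    import boto3  # deferred so planned_calls stays testable offline
    glue = boto3.client("glue")
    for method, kwargs in planned_calls("dojo-crawler", "dojo-job"):
        getattr(glue, method)(**kwargs)
```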
Thanks for sharing knowledge. I am not sure why we should use a workflow instead of a Step Function; we have better control in Step Functions. Can you please advise?
You raised a very good question. The simple answer is: use Glue Workflow only when you are orchestrating jobs and crawlers. If you need to orchestrate other AWS services, Step Functions is better suited. I personally believe that, over time, Step Functions will become the main orchestration service for Glue as well.
@@AWSTutorialsOnline Thank you.
Nice and clear explanation. I have a query: how can we run one workflow after another (not jobs/crawlers), i.e. one workflow for the dimension and another for the fact? Once the dimension is loaded, it should start the other workflow for the fact.
Nested workflows are not available. The best approach is: at the end of the dimension workflow, run a job (using Python Shell) which simply starts the workflow for the fact.
You can also use other mechanisms, such as orchestration with Lambda-based business logic or Step Functions, but it will be a little complicated: between the dimension and fact workflows you need to make an API call to check that the dimension workflow ended successfully before you start the fact workflow.
So the first approach I talked about is probably the best way.
@@AWSTutorialsOnline Thanks for your time, I really appreciate it. You answered my query and I have an idea what to do; let me try to create one specific job that calls the fact workflow at the end of the dimension workflow using Python scripts.
Hi @@AWSTutorialsOnline, I tried some blogs and Google, but I can't find code to call an AWS workflow using Python Shell. Is it possible to share any blog or Git repo where I can find some info on executing the workflow using Python? Thanks in advance.
@@venkateshanganesan2606 Hi, basically you need to use the boto3 Python SDK in a Python Shell based job. You can google plenty of examples for that; if not, let me know. In this job, you use the Glue API to start the workflow. The API for this method is here - docs.aws.amazon.com/glue/latest/dg/aws-glue-api-workflow.html#aws-glue-api-workflow-StartWorkflowRun
Hope it helps. Otherwise - let me know,
@@AWSTutorialsOnline Thanks a lot, it works as you suggested. I used the piece of code below at the end of my dimension job to invoke the fact workflow. I really appreciate you sharing your knowledge.
import boto3

# Note: hardcoding credentials is not recommended; in a Glue job the
# attached IAM role supplies them automatically.
glueClient = boto3.client(
    service_name='glue',
    region_name='eu-west-1',
    aws_access_key_id='access_key',
    aws_secret_access_key='secret_access_key',
)
response = glueClient.start_workflow_run(Name='wfl_load_fact')
Thanks again for sharing your knowledge.
Thanks for sharing knowledge. Can you create a video on reading data from S3 and writing to a database, where we handle bad records while reading: insert only good records into the RDS table and write bad records to an S3 location?
How do you differentiate between good and bad records?
@@AWSTutorialsOnline If a record does not match the schema - I mean the data type: the column is int, so values like 1, 2, 3 are expected, but sometimes they come as "four", "five". I will share an example link.
Basically, whenever a corrupt record is found, I want to write it to an S3 path and write the normal records to the database. I don't want my job to stop when a corrupt record is found; the Glue job must continue running.
I need to see some examples of the corrupt data in order to understand how to check for it. But once you know whether a record is corrupt or not, you can use the dynamic frame write method to write to the S3 bucket or the database.
@@bhuneshwarsingh630 I am publishing a video in 1/2 days about doing data quality check. Please have a look. I think it might help you.
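The good/bad split described in this thread can be sketched in plain Python. The column name is hypothetical, and in the actual Glue job the two lists would be routed to the database and the S3 error path via DynamicFrame write methods rather than returned.

```python
def is_valid_int(value):
    """A record is 'good' here if the field parses as an integer;
    words like 'four' fail and go to the bad-records path."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False

def split_records(records, field):
    """Partition records into (good, bad) without stopping the job
    when a corrupt record is found."""
    good = [r for r in records if is_valid_int(r.get(field))]
    bad = [r for r in records if not is_valid_int(r.get(field))]
    return good, bad
```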
This video is really helpful but the audio is not good, please fix the audio if possible
Thanks for the feedback. I have improved audio in the later videos. Need to find time to fix these old ones.
Unfortunately, the audio is not good in this video
Please fix audio
Thanks for the feedback. I did it in the later videos.
I am getting this error: GlueArgumentError: the following arguments are required: --WORKFLOW_NAME, --WORKFLOW_RUN_ID.
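This error typically appears when a script calls getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']) but is run outside a workflow: Glue injects those two arguments only into jobs that a workflow starts. A guarded read, sketched as a stand-in for getResolvedOptions, lets the same script run both inside and outside a workflow:

```python
def read_workflow_args(argv):
    """Return WORKFLOW_NAME / WORKFLOW_RUN_ID if present, else an
    empty dict, instead of raising when run outside a workflow."""
    found = {}
    for name in ("WORKFLOW_NAME", "WORKFLOW_RUN_ID"):
        flag = "--" + name
        if flag in argv:
            found[name] = argv[argv.index(flag) + 1]
    return found

# In the job script:
# import sys
# wf_args = read_workflow_args(sys.argv)
# if wf_args:
#     ...  # workflow-specific logic, e.g. get_workflow_run_properties
```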