Orchestrate Glue Jobs With Step Functions
- Published: Jul 4, 2022
- This is a step-by-step tutorial on how to create a Step Functions state machine to orchestrate one or more Glue jobs and configure the IAM role.
#aws #awsglue #stepfunctions
IAM Permission Link: docs.aws.amazon.com/step-func...
Excellent video. To the point, called out common failure points. Well done all around.
Thanks for the comment Patrick, much appreciated!
I am really interested in Step Functions as well. Thanks for this, hope you do more!
Thanks! Absolutely! More videos to come!
Thank you very much, very well explained very precise. greetings from Chile
Thank you my Chilean friend!
The additional policy additions that you mentioned helped a lot. My machine was hanging.
You're welcome, glad you got it working.
could you please upload the complete AWS data engineering playlist?
It will be helpful for us.
your tutorials are easy to watch and grab things faster.
Thank you.
Hey, that's a good idea, I can put them all into 1 playlist. It will be a lot of videos though, I kind of broke them down into different aws services
Thanks Brother! You Great!
Thanks Nehal!
Well explained... Thanks 👍🏻
Glad it was helpful!
Really great video
Thank you for posting
Hope you don't get demotivated by view count 😭
Your videos are really good.
Thanks! Much appreciated!
As of today, there are about 6k views! That's a lot more people than you could reach through normal means. I think they're doing a great job!
Hi, great tutorial as usual, but I am struggling to get a Choice state working. I am not sure how to get the result (ResultPath) from the Glue job and then pass it on to the Choice state. If you know how to do this, I would really appreciate it.
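One possible way to wire this up, sketched as a Python dict that serializes to an Amazon States Language definition. It assumes the Glue task uses the `glue:startJobRun.sync` integration, that its output carries a `JobRunState` field, and that `ResultPath` stores that output under a hypothetical `glue_result` key. The job name `my_glue_job` and all state names are placeholders, not anything shown in the video.

```python
import json

# Hypothetical names: "my_glue_job", "glue_result" and the state names are placeholders.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my_glue_job"},
            # Keep the original input and attach the Glue output next to it.
            "ResultPath": "$.glue_result",
            "Next": "CheckJobState",
        },
        "CheckJobState": {
            "Type": "Choice",
            "Choices": [
                {
                    # Assumes the task output exposes a JobRunState field.
                    "Variable": "$.glue_result.JobRunState",
                    "StringEquals": "SUCCEEDED",
                    "Next": "JobSucceeded",
                }
            ],
            "Default": "JobNotSucceeded",
        },
        "JobSucceeded": {"Type": "Succeed"},
        "JobNotSucceeded": {"Type": "Fail", "Error": "GlueJobNotSucceeded"},
    },
}

# Paste the printed JSON into the state machine's definition editor.
print(json.dumps(definition, indent=2))
```

With the `.sync` integration a failed job run may surface as a task failure rather than as a normal output, in which case a `Catch` on the Glue task is the place to branch instead of the Choice state.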
Thank you so much for this video. It was a huge help to show the IAM permissions for the Glue job. Is there anything about the "permission_to_glue_topic" permission that we should know?
Also, in my Lambda invocation I'm pasting the Lambda "event" JSON object into the payload options, which seems to work beautifully. Is there a way to reference the event configuration in Lambda from the step function directly without having to copy and paste?
Hi Joe, You're welcome! If you are trying to pass your event payload to your lambda function through step functions, when you are running your step function execution in the console manually, you can paste your test payload there. You should set up your step function so the payload gets passed directly to your lambda function with the parameters your lambda needs. I hope this is what you are looking for.
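A minimal sketch of what passing the payload straight through can look like, assuming the Lambda task uses the `lambda:invoke` integration. The function name `my_function` is a placeholder; `"Payload.$": "$"` forwards the whole state input, so the handler receives the execution input as its `event`.

```python
import json

# Hypothetical function name; "Payload.$": "$" forwards the entire state input.
invoke_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "my_function",
        "Payload.$": "$",
    },
    "End": True,
}


def lambda_handler(event, context):
    # With the mapping above, `event` is the same JSON object that was given
    # to the state machine when the execution started.
    return {"received_keys": sorted(event.keys())}


if __name__ == "__main__":
    print(json.dumps(invoke_state, indent=2))
    print(lambda_handler({"bucket": "example", "key": "data.csv"}, None))
```

The test payload is then entered once, as the execution input, rather than pasted into the task's payload box.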
Thanks for the demo. Can you provide more details on what Glue publishes to SNS? So we don't have to write a custom JSON message to SNS from Glue, and Glue writes success or failure automatically depending on the run state?
Hi Pradeep, if you configure an EventBridge rule using the Glue sample event, it shows what the general payload passed to SNS looks like. For example:
{
  "version": "0",
  "id": "66fbc5e1-aac3-5e85-63d0-856ec669a050",
  "detail-type": "Glue Job Run Status",
  "source": "aws.glue",
  "account": "123456789012",
  "time": "2018-04-24T20:57:34Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "jobName": "MyJob",
    "severity": "INFO",
    "notificationCondition": {
      "NotifyDelayAfter": 1
    },
    "state": "STARTING",
    "jobRunId": "jr_6aa58e7a3aa44e2e4c7db2c50e2f7396cb57901729e4b702dcb2cfbbeb3f7a86",
    "message": "Job is in STARTING state",
    "startedOn": "2018-04-24T20:55:47.941Z"
  }
}
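As an illustration of how a subscriber might read that payload, here is a small Python sketch that pulls the job name, state and message out of an event shaped like the sample. The `summarize` helper is hypothetical, and the embedded event is a trimmed copy of the sample that keeps only the fields being read.

```python
import json

# Trimmed copy of the sample event; only the fields used below are kept.
sample_event = json.loads("""
{
  "detail-type": "Glue Job Run Status",
  "source": "aws.glue",
  "detail": {
    "jobName": "MyJob",
    "state": "STARTING",
    "message": "Job is in STARTING state"
  }
}
""")


def summarize(event: dict) -> str:
    """Build a one-line summary from a Glue job run status event."""
    detail = event.get("detail", {})
    job = detail.get("jobName", "unknown job")
    state = detail.get("state", "UNKNOWN")
    message = detail.get("message", "")
    return f"{job} is {state}: {message}"


if __name__ == "__main__":
    print(summarize(sample_event))
```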
Excellent video, thanks for sharing!
I have a question, I want to run a bash script and trigger it via Lambda with Step Functions. Is that possible?
Yes, you can “wrap” your bash script within a supported language like Node.js or Python. For example, in Node.js, you can use the child_process module to execute a bash script.
Remember to package your bash script and any other necessary files into a ZIP file and upload it to AWS Lambda. Also, ensure that your bash script has the appropriate permissions to be executable.
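A rough sketch of the same wrapping idea in Python, using the standard `subprocess` module. It assumes a `bash` binary is available in the Lambda runtime and that a script named `script.sh` (a placeholder) is packaged alongside the handler.

```python
import subprocess


def run_script(script_path: str, *args: str) -> dict:
    """Run a bash script and capture its exit code and output."""
    completed = subprocess.run(
        ["bash", script_path, *args],
        capture_output=True,
        text=True,
        check=False,  # inspect the return code ourselves
    )
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


def lambda_handler(event, context):
    # "script_path" is a hypothetical input key; defaults to a bundled script.
    result = run_script(event.get("script_path", "./script.sh"))
    if result["returncode"] != 0:
        # Raising makes the Lambda invocation fail, so the Step Functions
        # task fails too and can be handled with Retry or Catch.
        raise RuntimeError(f"Script failed: {result['stderr']}")
    return result
```

Invoking the script through `bash` explicitly means the file does not need its executable bit set inside the ZIP.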
Very useful, thanks, but, if I need to call 5 glue have bs for example, I can tell crate a workflow an then call whit workflow from this same way?
Hi Steven, can you edit your sentence? I don't understand what you are trying to do.
What would be the rationale for using Glue in Step Functions vs. Glue Orchestration?
If you're doing more than using GlueJob and GlueCrawler, Step Functions make sense, but is that all?
The choice between using AWS Glue in Step Functions vs. Glue Orchestration (Glue Workflows) depends on the complexity of your data pipeline and the services you’re using.
AWS Glue Workflows are beneficial when you’re chaining together multiple Glue jobs and/or crawlers. They are particularly useful for batch processing, where you can schedule workflows directly. However, Glue Workflows lack several features common in flow control tools, such as conditional branching, loops, dynamic maps, and custom steps.
On the other hand, AWS Step Functions are more suitable when the complexity exceeds simple triggers and the services used extend beyond Glue. Step Functions provide more advanced orchestration capabilities, including support for error handling, parallel execution, and conditional logic. They also integrate with over 220 AWS services, making them a more flexible choice for complex, multi-service workflows.
In addition, Step Functions executions start and stop quickly, which helps sustain a reasonable throughput. They also provide an explicit Parallel state for running jobs side by side within a single state machine.
I'm curious. How did you learn data engineering?
Working as a data engineer and in the data analytics field for 10 years. Also doing Udemy courses, AWS certifications and side projects to continue to learn as the field is changing so fast with new services coming out all the time.
What are the permissions for the publish_to_glue_topic?
Hi Oscar, It just had the sns:Publish action. The full statement looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:account#:glue_jobs"
    }
  ]
}
Is there any way to give an S3 path and database as input to the Glue JobRun step in Step Functions?
Yes, you can pass the S3 path and database as input parameters to an AWS Step Functions state machine that includes an AWS Glue StartJobRun step.
When you start an execution of the state machine, you supply the input data as key-value pairs in JSON format, and the state machine definition can pass those values on to the Glue job.
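A hedged sketch of how that input might be mapped onto Glue job arguments, written as a Python dict that serializes to the state definition. The job name `my_glue_job` and the input keys `s3_path` and `database` are assumptions made for illustration; Glue job arguments are conventionally named with a leading `--`.

```python
import json

# Example execution input (hypothetical keys and values).
execution_input = {"s3_path": "s3://example-bucket/input/", "database": "example_db"}

# Glue task that forwards values from the state input as job arguments.
glue_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
        "JobName": "my_glue_job",  # placeholder job name
        "Arguments": {
            # A key ending in ".$" takes its value from a JSONPath
            # evaluated against the state input.
            "--s3_path.$": "$.s3_path",
            "--database.$": "$.database",
        },
    },
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps(execution_input, indent=2))
    print(json.dumps(glue_state, indent=2))
```

Inside the Glue script, the values can then be read with `getResolvedOptions(sys.argv, ["s3_path", "database"])` from `awsglue.utils`.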
@DataEngUncomplicated thanks,
How do I enable the step function to run the jobs in parallel?
Hi Mallikarjun, there is a Parallel state that lets you run any number of jobs in parallel.
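For illustration, a Parallel state with one branch per Glue job could be sketched like this. The job names `job_a` and `job_b` and the `glue_branch` helper are hypothetical; each branch is a small state machine of its own, and the Parallel state completes once every branch has finished.

```python
import json


def glue_branch(job_name: str) -> dict:
    """Build a single-state branch that runs one Glue job to completion."""
    state_name = f"Run_{job_name}"
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            }
        },
    }


# Placeholder job names; add one branch per job that should run concurrently.
parallel_state = {
    "Type": "Parallel",
    "Branches": [glue_branch("job_a"), glue_branch("job_b")],
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps(parallel_state, indent=2))
```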
How do I personalize the message that SNS sends?
In the SNS step there should be a box where you can customize the message.
@DataEngUncomplicated then how do I use the parameters of the job? For example, if I want to send "The job state is (~SUCCEEDED~ or ~FAILED~). At this time ~endtime~". Thanks.
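One way this might be done is with the `States.Format` intrinsic function in the SNS publish step, sketched below as a Python dict. It assumes an earlier Glue task stored its output under a hypothetical `glue_result` key and that the output includes `JobRunState` and `CompletedOn` fields; the account ID in the topic ARN is a placeholder.

```python
import json

# Assumes a previous state saved the Glue output at $.glue_result and that it
# contains JobRunState and CompletedOn; both are assumptions, not confirmed.
notify_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:glue_jobs",  # placeholder account
        # Each {} in the template is filled from the matching JSONPath argument.
        "Message.$": (
            "States.Format('The job state is {}. At this time {}.', "
            "$.glue_result.JobRunState, $.glue_result.CompletedOn)"
        ),
    },
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps(notify_state, indent=2))
```

The `.$` suffix on `Message` tells Step Functions to evaluate the value instead of sending it as literal text.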