AWS Tutorials - Using Job Bookmarks in AWS Glue Jobs
- Published: 2 Oct 2024
- The exercise URL - aws-dojo.com/e...
AWS Glue uses job bookmarks to track data processing and ensure that data processed in a previous job run does not get processed again. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
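Conceptually, a bookmark behaves like this (a minimal local sketch in plain Python for illustration only, not Glue's actual implementation - the state store and filtering here are assumptions):

```python
# Toy illustration of job-bookmark semantics: each run records which inputs
# it has processed, and the next run skips anything already recorded.
def run_job(all_files, bookmark_state):
    """Process only files not seen in previous runs; return the processed list."""
    new_files = [f for f in all_files if f not in bookmark_state]
    bookmark_state.update(new_files)   # "commit" the bookmark after processing
    return new_files

state = set()                                        # persisted between runs
print(run_job(["a.csv", "b.csv"], state))            # first run: both files
print(run_job(["a.csv", "b.csv", "c.csv"], state))   # second run: only c.csv
```

In a real Glue job this state is kept by the service itself; the sketch only shows the skip-what-was-already-processed behaviour described above.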
Hi Sir,
Thanks a lot for this wonderful video.
I have a doubt. I am using AWS Glue as the ETL, reading data every day from an Oracle RDBMS.
But in Oracle I have updates and deletes along with inserts. You mentioned that we can do incremental reads using bookmarking, but what about the deletes and updates on the Oracle side?
How can we handle this situation?
Thank you sir in advance.
When I try to specify multiple keys in jobBookmarkKeys, it's not working; it always takes only the primary key of the JDBC table. Also, when there are modifications to existing records, they are not picked up - it just processes again. Am I missing something here?
You can use multiple keys as long as their values are increasing or decreasing. Is that happening in the table?
@AWSTutorialsOnline No - so you are saying the key field should be an auto-increment kind of field?
@tylerdurden8692 Yes, incrementing or decrementing. Please check this link; it has the rules for JDBC sources - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
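To illustrate why the keys must move monotonically: the bookmark only remembers the highest key combination seen, so a row is picked up only if its keys sort beyond the last bookmark. A rough local sketch of that filtering (the tuple comparison is an illustrative assumption, not Glue's internal code):

```python
def rows_after_bookmark(rows, key_cols, last_bookmark):
    """Keep only rows whose composite key is greater than the last bookmark.
    This only works if the key columns grow monotonically, as Glue requires."""
    def key(row):
        return tuple(row[c] for c in key_cols)
    return [r for r in rows if last_bookmark is None or key(r) > last_bookmark]

rows = [{"id": 1, "ver": 1}, {"id": 2, "ver": 1}, {"id": 3, "ver": 1}]
# The last run stopped at (2, 1). An UPDATE to id=1 does not raise its key,
# so it is never selected - matching the behaviour discussed above.
print(rows_after_bookmark(rows, ["id", "ver"], (2, 1)))  # only the id=3 row
```

This also shows why updated records are invisible to the bookmark unless the update bumps a monotonic column (e.g. a last-modified timestamp used as a bookmark key).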
Can I run the workflow recursively? I use boundedSize in my Glue job, so I need to run the job multiple times every month until the bookmark is done.
A job can start another instance of the same job from the job code as long as concurrency allows. But it is not a true recursive call, so think about the exit condition when doing so.
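The re-trigger-with-an-exit-condition pattern mentioned above can be sketched like this (a local simulation; the run cap, batch size, and names are illustrative assumptions, and the comment notes where the real `start_job_run` call would go):

```python
def process_bounded(remaining, bound):
    """One bounded run: consume up to `bound` items, return (batch, leftover)."""
    return remaining[:bound], remaining[bound:]

def run_until_done(data, bound, max_runs=10):
    """Drive repeated runs with an explicit exit condition (no true recursion)."""
    runs = 0
    while data and runs < max_runs:   # exit: nothing left, or safety cap hit
        batch, data = process_bounded(data, bound)
        runs += 1
        # In a real Glue job, this is where the code would call
        # boto3.client("glue").start_job_run(JobName=...) to launch the next run.
    return runs

print(run_until_done(list(range(25)), 10))  # 3 runs of up to 10 items each
```

The `max_runs` cap is the safeguard the reply is warning about: without it, a job that always finds more work re-triggers itself indefinitely.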
Hi Sir, it's very helpful, but how do I configure an S3 timestamp-based job bookmark instead of using a bookmark key?
I think when you just enable the job bookmark without mentioning any key, it uses the timestamp for bookmark purposes. Please check this link - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
Great, it works.
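The timestamp behaviour can be pictured like this: S3 objects whose last-modified time is newer than the previous run's high-water mark get selected (a simplified local sketch; Glue's actual logic also uses maxBand to tolerate eventual consistency, which this ignores):

```python
from datetime import datetime

def new_objects(objects, last_run_time):
    """Select S3 objects modified after the bookmark timestamp."""
    return [o["Key"] for o in objects if o["LastModified"] > last_run_time]

objs = [
    {"Key": "old.csv", "LastModified": datetime(2024, 1, 1)},
    {"Key": "new.csv", "LastModified": datetime(2024, 3, 1)},
]
print(new_objects(objs, datetime(2024, 2, 1)))  # ['new.csv']
```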
Thank you for the video. I have a question: does the job bookmark work with a DataFrame? Suppose I use glueContext.create_data_frame_from_catalog, then do some transformations on the DataFrame and write it to an S3 bucket.
Yes, it does.
Thank you so much, the explanation is very clear-cut.
Welcome 😊
How can I manage multiple joined tables through bookmarks? When joining tables I don't have a unique key, so I concatenate multiple IDs to get a unique key. I need to set the bookmark with multiple keys. Please tell me how we can do this.
Apologies for the late response due to my summer break.
Joining tables for the bookmark is not possible. You might want to create an ETL Glue job which merges these datasets together and creates a primary key, then run bookmark-based processing on the merged dataset. Hope it helps.
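Following that suggestion, the merge job could derive a single key for the merged dataset by concatenating the individual IDs with a separator (an illustrative sketch - the column names, separator, and `pk` field are assumptions):

```python
def add_composite_key(rows, id_cols, sep="#"):
    """Derive one primary-key column from several ID columns of a merged row."""
    for row in rows:
        row["pk"] = sep.join(str(row[c]) for c in id_cols)
    return rows

merged = [{"order_id": 10, "line_no": 2}]
print(add_composite_key(merged, ["order_id", "line_no"]))
# [{'order_id': 10, 'line_no': 2, 'pk': '10#2'}]
```

The separator matters: without one, IDs (1, 23) and (12, 3) would both collapse to "123".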
Hello,
Your videos are simply superb 👌. I have PGP-encrypted files in S3 and I need to implement bookmarks. Can you help with whether it will work or not? If not, is there another approach to follow?
Hi, sorry, I've never worked with PGP files. Hard to say without testing.
How does the job bookmark scale on a massive data set?
Sir, how do we handle updated records using the job bookmark? How does jobBookmarkKey identify that a given record has been updated? Once a particular record is processed and bookmarked, if for some reason that record gets updated in the source table, how do we handle this situation using the job bookmark?
Where is the state information stored? Is it persisted in any data store? What happens behind the scenes?
The information is not public, so I cannot say with confidence.
Hi, I hope you are doing great. Currently I'm saving the entries for new files in SQS and then reading those files from Glue, but now I want to use the bookmark option. I'm using a Python shell job, and bookmarks are not supported in it. Now I will move to a Spark job, but I will not use the Spark context there. Can you please guide me on how I can do this?
In order to use the job bookmark, you have to program in a certain way using the Spark context. This link might help - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
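The "certain way" in the docs is the `Job.init` / `job.commit()` pattern with a `transformation_ctx` on each source. The key behaviour is that the bookmark only advances when the job commits, which can be modelled locally like this (a toy model, not the awsglue API):

```python
class Bookmark:
    """Toy model: the bookmark state advances only when the run commits."""
    def __init__(self):
        self.committed = 0   # durable state kept between runs
        self.pending = 0     # state of the current, uncommitted run

    def read(self, data):
        """Return only records beyond the last committed position."""
        new = data[self.committed:]
        self.pending = len(data)
        return new

    def commit(self):
        """Analogous to job.commit() at the end of a Glue script."""
        self.committed = self.pending

bm = Bookmark()
print(bm.read([1, 2, 3]))     # first run sees everything
# Run fails before commit -> state unchanged, same data is re-read next run.
print(bm.read([1, 2, 3]))     # everything again
bm.commit()
print(bm.read([1, 2, 3, 4]))  # only the new record after a successful commit
```

This is why a Glue job that crashes before `job.commit()` reprocesses the same input on the next run.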
Please post a demo of JDBC with bookmarking for a table, showing only the daily updated records in the output.
Very Useful Video. Time saver
Glad to hear that
What if I want to delete a .csv? Can some process automatically delete the corresponding parquet file?
You need to use the boto3 S3 API to delete the file. Please check this link - boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_object
@AWSTutorialsOnline Hi, but how can I know which parquet file belongs to a deleted .csv?
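One answer is to make the link explicit yourself: have the job write each input CSV under a parquet prefix derived from its name, then delete that prefix when the CSV goes away. A sketch, assuming that naming convention (the `processed/` layout and bucket are assumptions, not something Glue does for you; the boto3 calls are the real S3 API):

```python
def parquet_prefix_for_csv(csv_key, out_root="processed/"):
    """Map an input CSV key to the parquet prefix the job wrote it under.
    Assumes the job encodes the source file name in its output path."""
    stem = csv_key.rsplit("/", 1)[-1].removesuffix(".csv")
    return f"{out_root}{stem}/"

def delete_parquet_for_csv(bucket, csv_key):
    """Delete every parquet object written for the given CSV (needs AWS creds)."""
    import boto3  # imported here so the pure helper above works offline too
    s3 = boto3.client("s3")
    prefix = parquet_prefix_for_csv(csv_key)
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])

print(parquet_prefix_for_csv("input/sales.csv"))  # processed/sales/
```

Without some convention like this, there is no built-in mapping from an output parquet file back to its source CSV.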
Do you have any video on incremental load in AWS Glue for newly inserted, updated, and deleted data from source to target?
I don't have any video on this. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation Blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3.
I have videos about Blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel.
Thanks for the help. I will check the options that you have suggested. 🙂
This is a great video. The visualization helped a lot also. Thank you so much!
If you have one, please provide me the link.
I don't have any video on incremental update. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation Blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3.
I have videos about Blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel and go through the AWS documentation to understand the incremental update part.
Hi Sir,
Thanks for sharing the information on bookmarks.
I have a task to build the bookmark functionality using PySpark, with the bookmark kept in DynamoDB.
I am new to big-data framework technologies, and we're moving from Glue bookmarking to our own customized code (written in PySpark or Java).
Can you please suggest any material or sample code I can use as a reference? We're trying to update based on lastUpdatedTime and DelayTime as mentioned by you in this tutorial. Please reply and help me. Thank you.
Hi, were you able to find a solution?
Hello, how can we reset the Glue job state?
Excellent, Sir. Very useful.
Thanks and welcome
Hi, when I try to reset the bookmark I am getting "EntityNotFoundException: continuation for job not found". The source is S3, and I have not altered the transformation_ctx either. What might be the error?
Not sure, I've never come across this error. Can you share more details about what you are doing - something from which I can reproduce it?
@AWSTutorialsOnline I fixed the issue. It was due to the job_name, which I had passed as a parameter and which shouldn't be given, as per the AWS documentation.
Very clear and helpful.
Glad it was helpful!
Best explanation of maxBand I have heard.
Thanks