AWS Tutorials - Using Job Bookmarks in AWS Glue Jobs

  • Published: 2 Oct 2024
  • The exercise URL - aws-dojo.com/e...
    AWS Glue uses job bookmarks to track the processing of data so that data processed in a previous job run does not get processed again. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
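
    The mechanism can be illustrated with a small, self-contained Python sketch (this is a toy simulation, not the Glue API): the "bookmark" remembers the highest key value processed so far, and the next run only picks up rows beyond it.

    ```python
    # Toy illustration of the job-bookmark idea (not the Glue API):
    # the bookmark stores the highest key processed so far, and each
    # run only reads rows with a strictly greater key.

    def run_with_bookmark(rows, bookmark):
        """Process rows with id > bookmark; return (processed, new bookmark)."""
        new_rows = [r for r in rows if r["id"] > bookmark]
        new_bookmark = max((r["id"] for r in new_rows), default=bookmark)
        return new_rows, new_bookmark

    table = [{"id": 1}, {"id": 2}, {"id": 3}]
    first, bm = run_with_bookmark(table, bookmark=0)   # processes ids 1-3
    table.append({"id": 4})                            # new data arrives
    second, bm = run_with_bookmark(table, bm)          # processes only id 4
    ```

    Note what this implies: rows whose key never advances (in-place updates, deletes) are invisible to the bookmark, which is the root of several questions in the comments below.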

Comments • 50

  • @VishalSharma-hv6ks
    @VishalSharma-hv6ks 2 years ago +2

    Hi Sir,
    Thanks a lot for this wonderful video.
    I have a doubt: I am using AWS Glue as an ETL tool, reading data every day from an Oracle RDBMS.
    But in Oracle I have updates and deletes along with inserts. You mentioned that we can do an incremental read using bookmarking, but what about the deletes and updates on the Oracle side?
    How can we handle this situation?
    Thank you in advance, sir.

  • @tylerdurden8692
    @tylerdurden8692 1 year ago +1

    When I try to specify multiple keys in jobBookmarkKeys, it is not working; it always takes only the primary key of the JDBC table. Even when there are modifications to existing records it does not pick them up, it just processes again. Is there anything I am missing here?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago +1

      You can use multiple keys as long as their values are increasing or decreasing. Is that happening in the table?

    • @tylerdurden8692
      @tylerdurden8692 1 year ago

      @@AWSTutorialsOnline No. So you are saying the key field should be an auto-increment kind of field?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      @@tylerdurden8692 Yes, increasing or decreasing. Please check this link; it has the rules for JDBC - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
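
      A sketch of how multiple bookmark keys are passed for a JDBC source may help here. `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` are the option names documented at the link above; the table and column names below are hypothetical.

      ```python
      # Sketch: passing multiple bookmark keys for a JDBC read.
      # Per the rules in the reply above, every listed key must be
      # strictly increasing (or, with "desc", decreasing) - Glue
      # compares key values between runs, it does not detect in-place
      # updates to already-processed rows.

      additional_options = {
          "jobBookmarkKeys": ["region_id", "order_id"],  # hypothetical columns
          "jobBookmarkKeysSortOrder": "asc",
      }

      # Inside a Glue job this dict would be passed along the lines of:
      #   glueContext.create_dynamic_frame.from_catalog(
      #       database="mydb", table_name="orders",
      #       additional_options=additional_options,
      #       transformation_ctx="read_orders")
      ```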

  • @yusnardo
    @yusnardo 2 years ago +2

    Can I run the workflow recursively? I use boundedSize in my Glue job, so I need to run the job multiple times every month until the bookmark is done.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      A job can start another instance of the same job from the job code, as long as concurrency allows. But it is not a true recursive call - so think about the exit condition when doing so.

  • @deepakshrikanttamhane285
    @deepakshrikanttamhane285 2 years ago +2

    Hi Sir, it is very helpful, but how do I configure an S3 timestamp-based job bookmark instead of using a bookmark key?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I think when you just enable the job bookmark without mentioning any key, it uses the timestamp for bookmark purposes. Please check this link - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

    • @deepakshrikanttamhane285
      @deepakshrikanttamhane285 2 years ago

      Great, it works.

  • @tiktok4372
    @tiktok4372 2 years ago +1

    Thank you for the video. I have a question: does the job bookmark work with a DataFrame? Suppose I use glueContext.create_data_frame_from_catalog, then do some transformations on the DataFrame and write the DataFrame to an S3 bucket.

  • @veerachegu
    @veerachegu 2 years ago +1

    Thank you so much, the explanation is very clear-cut.

  • @mohdshoeb5101
    @mohdshoeb5101 3 years ago +1

    How can I manage multiple joined tables through bookmarks? When joining tables I don't have a unique key, so I concatenate multiple IDs to get one. I need to set the bookmark with multiple keys. Please tell me how we can do this.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Apologies for the late response due to my summer break.
      Bookmarking across joined tables is not possible. You might want to create an ETL Glue job which merges these datasets together and creates a primary key, then run bookmark-based processing on the merged dataset. Hope it helps.

  • @creativeminds7397
    @creativeminds7397 2 years ago +1

    Hello,
    Your videos are simply superb 👌. I have PGP-encrypted files in S3 and I need to implement bookmarks. Can you tell me whether that will work or not? If not, is there another approach to follow?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Hi, sorry, I have never worked with PGP files. Hard to say without testing.

  • @abir95571
    @abir95571 3 months ago

    How does the job bookmark scale on a massive data set?

  • @deepakbhutekar5450
    @deepakbhutekar5450 1 year ago

    Sir, how do we handle updated records using the job bookmark? How does jobBookmarkKey identify that a given record has been updated? Once a particular record is processed and bookmarked, if for some reason that record gets updated in the source table, how do we handle this situation using the job bookmark?
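
    Job bookmarks only advance on new key values, so an in-place update to an already-bookmarked row is invisible to them. One common workaround, sketched below as plain Python (not the Glue API), is to bookmark on a monotonically growing last-updated column that the source sets on every insert and update; the column name here is hypothetical.

    ```python
    # Toy sketch: bookmarking on a "last_updated" column so that updates
    # to existing rows are re-read on the next run.

    def incremental_read(rows, bookmark_ts):
        """Return rows touched after bookmark_ts, plus the new bookmark."""
        changed = [r for r in rows if r["last_updated"] > bookmark_ts]
        new_ts = max((r["last_updated"] for r in changed), default=bookmark_ts)
        return changed, new_ts

    rows = [{"id": 1, "last_updated": 100}, {"id": 2, "last_updated": 150}]
    batch, ts = incremental_read(rows, 0)    # first run: both rows are new
    rows[0]["last_updated"] = 200            # row 1 gets updated at the source
    batch, ts = incremental_read(rows, ts)   # second run: only the updated row
    ```

    Deletes remain invisible to this scheme; capturing them typically needs CDC tooling such as DMS, as mentioned elsewhere in this thread.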

  • @YogithaVenna
    @YogithaVenna 3 years ago +1

    Where is the state information stored? Is it persisted in any data store? What happens behind the scenes?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      The information is not public, so I cannot say with confidence.

  • @abdulhaseeb4980
    @abdulhaseeb4980 3 years ago +1

    Hi, I hope you are doing great. Currently I'm saving the entries for new files in SQS and then reading those files from Glue, but now I want to use the bookmark option. I'm using a Python shell job and bookmarks are not supported in it. Now I will move to a Spark job, but I will not use the Spark context there. Can you please guide me on how I can do this?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      In order to use the job bookmark, you have to program in a certain way using the Spark context. This link might help - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
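
      The pattern the reply alludes to is sketched below. The awsglue/pyspark imports only resolve inside a Glue Spark job, so they are kept inside the function; the database, table, and bucket names are hypothetical.

      ```python
      # Skeleton of a bookmark-enabled Glue Spark job: job.init() restores
      # the bookmark state, every read/write carries a transformation_ctx,
      # and job.commit() persists the new state.
      import sys

      def main():
          # Only importable inside an AWS Glue job environment.
          from awsglue.utils import getResolvedOptions
          from awsglue.context import GlueContext
          from awsglue.job import Job
          from pyspark.context import SparkContext

          args = getResolvedOptions(sys.argv, ["JOB_NAME"])
          glue_ctx = GlueContext(SparkContext())
          job = Job(glue_ctx)
          job.init(args["JOB_NAME"], args)      # restores bookmark state

          # Each source read needs a stable transformation_ctx, otherwise
          # the bookmark has nothing to attach its state to.
          src = glue_ctx.create_dynamic_frame.from_catalog(
              database="mydb",                  # hypothetical names
              table_name="incoming_files",
              transformation_ctx="read_incoming_files",
          )
          glue_ctx.write_dynamic_frame.from_options(
              frame=src,
              connection_type="s3",
              connection_options={"path": "s3://my-bucket/out/"},
              format="parquet",
              transformation_ctx="write_parquet",
          )
          job.commit()                          # persists the new bookmark
      ```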

  • @mylikeskeyan2055
    @mylikeskeyan2055 1 year ago

    Please put up a demo of JDBC with bookmarking for a table, showing only the daily updated records in the output.

  • @harishnttdata2325
    @harishnttdata2325 3 years ago +2

    Very Useful Video. Time saver

  • @joseabzum3073
    @joseabzum3073 3 years ago +1

    What if I want to delete a .csv? Can some process automatically delete the parquet file?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      You need to use the boto3 S3 API to delete the file. Please check this link - boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_object

    • @joseabzum3073
      @joseabzum3073 3 years ago

      @@AWSTutorialsOnline Hi, but how can I know which parquet file belongs to a deleted .csv?
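
      Glue itself keeps no map from input files to output files, so the link has to be built into the job's output layout. One hypothetical convention, sketched below, is to partition the output by source file name; then deleting a .csv means deleting everything under its output prefix.

      ```python
      # Sketch: tying outputs back to their source .csv so a delete can
      # be mirrored. The naming convention (partitioning the output by
      # the source file's stem) is an assumption, not a Glue feature.
      from pathlib import PurePosixPath

      def output_prefix_for_csv(csv_key, out_root="processed"):
          """Map e.g. raw/orders/2024-01.csv -> processed/source=2024-01/ ."""
          stem = PurePosixPath(csv_key).stem
          return f"{out_root}/source={stem}/"

      # Removing the parquet for a deleted csv then means deleting every
      # object under that prefix, e.g. with boto3:
      #   s3 = boto3.client("s3")
      #   pages = s3.get_paginator("list_objects_v2").paginate(
      #       Bucket="my-bucket", Prefix=output_prefix_for_csv(key))
      #   for page in pages:
      #       for obj in page.get("Contents", []):
      #           s3.delete_object(Bucket="my-bucket", Key=obj["Key"])
      ```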

  • @vishalrajmane7649
    @vishalrajmane7649 3 years ago +1

    Do you have any video on incremental load in AWS Glue for newly inserted, updated, and deleted data from source to target?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      I don't have any video on this. But if you are ingesting data from a relational database, there are two methods which can work: 1) using a Lake Formation Blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3.
      I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel.

    • @vishalrajmane7649
      @vishalrajmane7649 3 years ago

      Thanks for the help. I will check the options that you have suggested. 🙂

  • @howards5205
    @howards5205 1 year ago

    This is a great video. The visualization helped a lot also. Thank you so much!

  • @vishalrajmane7649
    @vishalrajmane7649 3 years ago +1

    If you have one, please provide me the link.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      I don't have any video on incremental update. But if you are ingesting data from a relational database, there are two methods which can work: 1) using a Lake Formation Blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3.
      I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel and go through the AWS documentation to understand the incremental update part.

  • @kumark3176
    @kumark3176 2 years ago

    Hi Sir,
    Thanks for sharing the information on bookmarks.
    I have a task to build the bookmark functionality using PySpark, with the bookmark state kept in DynamoDB.
    I am new to big data framework technologies, and we're moving from Glue bookmarking to our own customized code (written in PySpark or Java).
    Can you please suggest any material or sample code that I can use as a reference? We're trying to update based on lastUpdatedTime and DelayTime, as mentioned by you in this tutorial. Please reply and help me. Thank you.

    • @sukanyabanu6785
      @sukanyabanu6785 2 years ago

      Hi, were you able to find a solution?

  • @भ्रमंती-ज5ज
    @भ्रमंती-ज5ज 1 year ago

    Hello, how can we reset the Glue job state?
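
    Resetting the bookmark state is done with the Glue `reset_job_bookmark` API (on the CLI: `aws glue reset-job-bookmark --job-name <name>`). The sketch below wraps the call with an injectable client so the shape can be shown without touching AWS; the job name and fake client are illustrative.

    ```python
    # Sketch: resetting a Glue job's bookmark state via the API.
    # reset_job_bookmark is the real boto3 Glue client method; the
    # injected FakeGlueClient below only demonstrates the call shape.

    def reset_bookmark(job_name, client=None):
        if client is None:          # lazy import: boto3 only needed for real use
            import boto3
            client = boto3.client("glue")
        return client.reset_job_bookmark(JobName=job_name)

    class FakeGlueClient:
        """Stand-in used here only to exercise the helper locally."""
        def reset_job_bookmark(self, JobName):
            return {"JobBookmarkEntry": {"JobName": JobName, "Run": 0}}

    resp = reset_bookmark("my-etl-job", client=FakeGlueClient())
    ```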

  • @sivahanuman4466
    @sivahanuman4466 1 year ago +1

    Excellent, Sir. Very useful.

  • @selvaganesh2529
    @selvaganesh2529 3 years ago

    Hi, when I try to reset the bookmark I am getting "EntityNotFoundException: continuation for job not found". The source is S3 and I have not altered the transformation_ctx either. What might be the error?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Not sure, I have never come across this error. Can you share more details about what you are doing - something with which I can reproduce it?

    • @selvaganesh2529
      @selvaganesh2529 3 years ago

      @@AWSTutorialsOnline I fixed the issue. It was due to the job_name, which I had given as a parameter and which shouldn't be given as per the AWS documentation.

  • @pulakhazra5792
    @pulakhazra5792 2 years ago +1

    Very clear and helpful.

  • @victorfeight9644
    @victorfeight9644 1 year ago

    Best explanation of maxBand I have heard.