AWS Tutorials - ETL Pipeline with Multiple Files Ingestion in S3

  • Published: 2 Oct 2024
  • The code link - github.com/aws...
    Handling multiple file ingestion in a Glue ETL pipeline is a challenge if you want to process all the ingested files at once. Learn how to build a pipeline that can handle processing of multiple files.

Comments • 37

  • @darkcodecamp1678
    @darkcodecamp1678 3 months ago

    What we use in production: when the Glue job puts data into the raw S3 bucket, it creates an SNS notification, to which an SQS queue is subscribed; the queue then triggers a Lambda :)
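
    For illustration, a minimal sketch of the Lambda end of that S3 -> SNS -> SQS chain (assuming raw message delivery is not enabled on the SNS subscription, so each layer wraps the previous one; all names are hypothetical):

    import json

    def handler(event, context):
        # Each SQS record's body is an SNS envelope whose Message field
        # carries the original S3 event notification as a JSON string
        for record in event["Records"]:
            sns_envelope = json.loads(record["body"])
            s3_event = json.loads(sns_envelope["Message"])
            for s3_record in s3_event.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                print(f"New object: s3://{bucket}/{key}")
                # ...decide here whether to start the Glue job...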

  • @swapnilkulkarni6719
    @swapnilkulkarni6719 2 years ago +3

    Really good. Thanks a lot for making such nice videos. Lots of learning from them.

  • @BhanuNatva
    @BhanuNatva 2 months ago

    Sir, can you also please make a video on using DynamoDB?

  • @thegeekyreview2916
    @thegeekyreview2916 2 months ago

    What happens to the S3 data in the next run? Is it overwritten or appended?

  • @udaynayak4788
    @udaynayak4788 1 year ago +1

    Thank you for the valuable information. Can you please cover incremental loads, where RDS is the source and Redshift the target with the SCD2 approach, and the PySpark script under Glue handles SCD2?
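
    For reference, a minimal sketch of just the SCD2 merge step in PySpark (hypothetical paths, key and tracked column; the RDS extract and Redshift load are out of scope here):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs: today's extract and the existing dimension
    src = spark.read.parquet("s3://bucket/raw/employee/")
    dim = spark.read.parquet("s3://bucket/dim/employee/")
    cur = dim.filter(F.col("is_current"))

    # Keys whose tracked attribute changed since the current version
    changed = (src.alias("s").join(cur.alias("d"), "emp_id")
               .filter("s.dept <> d.dept").select("emp_id"))

    # Expire the current versions of changed keys
    expired = (cur.join(changed, "emp_id", "leftsemi")
               .withColumn("end_date", F.current_date())
               .withColumn("is_current", F.lit(False)))

    # Open new versions for changed keys plus brand-new keys
    opened = (src.join(cur, "emp_id", "leftanti")
              .unionByName(src.join(changed, "emp_id", "leftsemi"))
              .withColumn("start_date", F.current_date())
              .withColumn("end_date", F.lit(None).cast("date"))
              .withColumn("is_current", F.lit(True)))

    merged = (dim.filter(~F.col("is_current"))
              .unionByName(cur.join(changed, "emp_id", "leftanti"))
              .unionByName(expired)
              .unionByName(opened))
    merged.write.mode("overwrite").parquet("s3://bucket/dim/employee_v2/")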

  • @helovesdata8483
    @helovesdata8483 1 year ago +1

    Why write five files from the database? Is that just to show how separate files would work in this example?

  • @deep6858
    @deep6858 2 years ago +1

    Excellent. I am new to AWS and its services. Related question: with multiple files in S3 we trigger Lambda, and Lambda in turn calls the Glue job, and we have kept the concurrency of both Lambda and the Glue job at 1. Will this work the same way or differently? Thanks.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Not sure about your question. But with concurrency 1 as well, the Lambda will trigger for each file upload. The only difference is that executions will queue up because of the concurrency limit.
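
      For reference, a sketch of how those two caps can be set with boto3 (function, job, role and script names are hypothetical):

      import boto3

      lambda_client = boto3.client("lambda")
      glue = boto3.client("glue")

      # Cap the Lambda at one concurrent execution; further S3 events are
      # queued and retried by the Lambda service rather than lost
      lambda_client.put_function_concurrency(
          FunctionName="start-glue-pipeline",
          ReservedConcurrentExecutions=1,
      )

      # On the Glue side, MaxConcurrentRuns lives in the job's
      # ExecutionProperty and is normally set when the job is defined;
      # a StartJobRun beyond the cap raises ConcurrentRunsExceededException
      glue.create_job(
          Name="process-raw-files",
          Role="GlueServiceRole",
          Command={"Name": "glueetl",
                   "ScriptLocation": "s3://bucket/scripts/process.py"},
          ExecutionProperty={"MaxConcurrentRuns": 1},
      )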

  • @lakshminarayanau3989
    @lakshminarayanau3989 2 years ago +1

    Thanks for your videos, this channel is a good learning source.
    Is there any video that covers JSON files with multiple nested arrays, i.e. arrays within arrays, flattening them and moving the result to Redshift?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I have the following videos on nested JSON; hope they help.
      ruclips.net/video/4AvBv-Rxrv4/видео.html
      ruclips.net/video/2ChiQ_2f97U/видео.html
      ruclips.net/video/PR15TVZDgy4/видео.html
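
      As a quick illustration of the flattening itself, a PySpark sketch for two levels of nesting (hypothetical schema: each order carries an items array whose elements carry a discounts array):

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()
      df = spark.read.json("s3://bucket/raw/orders/")

      flat = (df
          .withColumn("item", F.explode("items"))                     # outer array
          .withColumn("discount", F.explode_outer("item.discounts"))  # inner array
          .select("order_id",
                  F.col("item.sku").alias("sku"),
                  F.col("discount.code").alias("discount_code")))

      # The flat frame can then be loaded to Redshift, e.g. through a
      # Glue Redshift connection or JDBC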

  • @vivekjacobalex
    @vivekjacobalex 2 years ago +2

    Good video 👍. I have one doubt: while pulling data from PostgreSQL to the raw folder, where was it specified to split the employee records across multiple files?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I did not specify it. Once you choose parquet format with snappy compression, the output is automatically split into multiple files based on size.
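
      For context, that write looks roughly like this in a Glue script (hypothetical catalog and path names); the output lands as several part files rather than one:

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glue_context = GlueContext(SparkContext.getOrCreate())

      # Hypothetical read of the Postgres source table via the catalog
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="mydb", table_name="employee")

      # Snappy is the default parquet codec; one part file is written
      # per underlying Spark partition, hence multiple employee files
      glue_context.write_dynamic_frame.from_options(
          frame=dyf,
          connection_type="s3",
          connection_options={"path": "s3://bucket/raw/employee/"},
          format="parquet",
          format_options={"compression": "snappy"},
      )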

  • @cloudcomputingpl8102
    @cloudcomputingpl8102 2 years ago

    How can you run a Glue job only on new files and not on the full data set? If, for example, you have 700 GB, a job that takes a few hours will take ages if it reruns over every file. Can anyone point me to a resource?
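
    One mechanism worth looking at is Glue job bookmarks, which track already-processed files between runs. A minimal sketch (hypothetical catalog names), run with --job-bookmark-option job-bookmark-enable:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx is the key Glue uses to remember which input
    # files this step has already consumed
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="mydb", table_name="raw_files",
        transformation_ctx="read_raw")

    # ...transform and write only the new files here...

    job.commit()  # persist the bookmark state for the next run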

  • @markkinuthia6178
    @markkinuthia6178 1 year ago

    Thank you very much, Sir. I love how you teach using cases. My question is: can this approach be used in production, and can the same also be used with Redshift? Thanks.

  • @akshaybaura
    @akshaybaura 2 years ago

    This is acceptable if you have control over the first Glue process that is dumping the files for you. What is the intended solution if you can't create a token/indicator file?

  • @ladakshay
    @ladakshay 2 years ago +1

    Good, smart solution. We can also orchestrate the entire flow using a Glue workflow or Step Functions, so we don't have to depend on the S3 event and Lambda (see the sketch after this thread).

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +2

      Indeed you can. If you search my channel, I made two more videos about building the pipeline using a Glue workflow and Step Functions. But some of the audience were asking how to handle S3 events in the case of multiple file ingestion, so I made this video.

    • @ladakshay
      @ladakshay 2 years ago

      @@AWSTutorialsOnline Yes, this use case can come up in any pipeline where we want to trigger the next step after data is written to S3.
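
    For reference, the Glue-workflow variant of that orchestration can be wired with a conditional trigger, e.g. via boto3 (workflow, trigger and job names are all hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Start the transform job whenever the ingest job succeeds, with no
    # S3 event or Lambda in between
    glue.create_trigger(
        Name="after-ingest-run-transform",
        WorkflowName="etl-pipeline",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "ingest-to-raw",
            "State": "SUCCEEDED",
        }]},
        Actions=[{"JobName": "transform-raw"}],
    )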

  • @sivaprasanth5961
    @sivaprasanth5961 2 years ago +1

    How can I select my state machine as the destination?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Sorry, I could not get your question. Can you please elaborate a bit?

  • @tan2784
    @tan2784 1 year ago

    Interesting. Is there an alternative way to create a single Lambda function without the token? Suppose a user doesn't have control over how data is loaded to S3 but has to work with files that are loaded regularly, i.e. every hour, at the level of new S3 objects.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      There has to be some trigger to know that all files have arrived. It could be total file size or file count. You can configure an event to log all new/updated files arriving and let Lambda check their count/total size; if the threshold is reached, trigger the pipeline.
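
      A minimal sketch of that count/size check (bucket, prefix, job name and threshold are hypothetical):

      import boto3

      s3 = boto3.client("s3")
      glue = boto3.client("glue")

      EXPECTED_COUNT = 5  # assumption: the batch size is known upfront

      def handler(event, context):
          # Triggered per S3 object; only starts the pipeline once the
          # whole batch has landed
          resp = s3.list_objects_v2(Bucket="my-raw-bucket",
                                    Prefix="raw/employee/")
          objects = resp.get("Contents", [])
          total_size = sum(o["Size"] for o in objects)
          print(f"{len(objects)} files, {total_size} bytes so far")
          if len(objects) >= EXPECTED_COUNT:
              glue.start_job_run(JobName="process-raw-files")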

  • @saravninja
    @saravninja 2 years ago +1

    Thanks for the great explanation!

  • @imtiyazali7003
    @imtiyazali7003 1 year ago

    Great info and a great tutorial. Thank you!!

  • @arunasingh8617
    @arunasingh8617 2 years ago +1

    You are doing an excellent job! Keep going :)

  • @prakashr9221
    @prakashr9221 1 year ago

    I was looking for this use case, and this is helpful. Thank you.

  • @misekerbirega3510
    @misekerbirega3510 2 years ago

    Thanks a lot, Sir.

  • @SandeepKumar-ne1ln
    @SandeepKumar-ne1ln 2 years ago

    Given that Glue is serverless, is it really a problem to have multiple Glue jobs triggered for individual files in the raw zone?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      Not really. But sometimes, when you are doing aggregation-based processing, you want all the files to land before processing. Also, multiple instances of Glue will increase cost.

    • @SandeepKumar-ne1ln
      @SandeepKumar-ne1ln 2 years ago

      @@AWSTutorialsOnline Another question I have: if multiple files are being created, then instead of an S3 event triggering the Lambda function, can't we trigger the Lambda on a Glue event (when the Glue job completes writing all the files to S3)?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      @@SandeepKumar-ne1ln You can, using an EventBridge-based event. I talked about it in some other videos.
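
      For reference, a sketch of such a rule with boto3 (rule, job and function names are hypothetical; the target Lambda also needs a resource policy allowing events.amazonaws.com to invoke it):

      import json

      import boto3

      events = boto3.client("events")

      # Match the built-in "Glue Job State Change" event for a successful
      # run of the ingest job
      events.put_rule(
          Name="on-ingest-job-success",
          EventPattern=json.dumps({
              "source": ["aws.glue"],
              "detail-type": ["Glue Job State Change"],
              "detail": {"jobName": ["ingest-to-raw"],
                         "state": ["SUCCEEDED"]},
          }),
      )
      events.put_targets(
          Rule="on-ingest-job-success",
          Targets=[{"Id": "start-pipeline",
                    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-pipeline"}],
      )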