Use Airflow to orchestrate a parallel processing ETL pipeline on AWS EC2 | Data Engineering Project
- Published: 7 Aug 2024
- In this data engineering project, we will learn how to parallelize tasks. We will run Airflow on AWS EC2 and use an AWS RDS Postgres instance as the database.
If you have any questions or comments, feel free to ask or leave them in the comment section below.
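The fan-out/fan-in dependency pattern that lets Airflow run tasks in parallel can be sketched with Python's stdlib `graphlib`; the task names below are hypothetical stand-ins for the DAG built in the video, not the actual task IDs:

```python
from graphlib import TopologicalSorter

# Map each task to its upstream dependencies. Tasks in the same "ready"
# batch have no mutual dependency, so Airflow can run them in parallel.
dag = {
    "start_pipeline": set(),
    "extract_city_a": {"start_pipeline"},   # fan-out: this task and...
    "extract_city_b": {"start_pipeline"},   # ...this one run concurrently
    "join_and_load": {"extract_city_a", "extract_city_b"},  # fan-in
}

ts = TopologicalSorter(dag)
ts.prepare()
while ts.is_active():
    ready = tuple(ts.get_ready())  # all tasks whose upstreams are done
    print(ready)                   # each batch could execute in parallel
    ts.done(*ready)
```

Here the two extract tasks appear in the same batch, which is exactly the parallelism the video exploits when pulling data for multiple cities at once.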
Please don’t forget to LIKE, SHARE, COMMENT and SUBSCRIBE to our channel for more AWESOME videos.
*Books I recommend*
1. Grit: The Power of Passion and Perseverance: amzn.to/3EZKSgb
2. Think and Grow Rich!: The Original Version, Restored and Revised: amzn.to/3Q2K68s
3. The Book on Rental Property Investing: How to Create Wealth With Intelligent Buy and Hold Real Estate Investing: amzn.to/3LLpXRy
4. How to Invest in Real Estate: The Ultimate Beginner's Guide to Getting Started: amzn.to/48RbuOb
5. Introducing Python: Modern Computing in Simple Packages: amzn.to/3Q4driR
6. Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter 3rd Edition: amzn.to/3rGF73G
**************** Commands used in this video ****************
sudo apt update
sudo apt install python3-pip
sudo apt install python3.10-venv
python3 -m venv airflow_venv
source airflow_venv/bin/activate
pip install pandas
pip install s3fs
pip install fsspec
pip install apache-airflow
pip install apache-airflow-providers-postgres
psql -h rds-db-test-yml-4.cvzpgj7bczqy.us-west-2.rds.amazonaws.com -p 5432 -U postgres -W
CREATE EXTENSION aws_s3 CASCADE;
aws iam create-role \
--role-name postgresql-S3-Role-yml-4 \
--assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "rds.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
aws iam create-policy \
--policy-name postgresS3Policy-yml-4 \
--policy-document '{"Version": "2012-10-17", "Statement": [{"Sid": "s3import", "Action": ["s3:GetObject", "s3:ListBucket"], "Effect": "Allow", "Resource": ["arn:aws:s3:::testing-ymlo", "arn:aws:s3:::testing-ymlo/*"]}]}'
aws iam attach-role-policy \
--policy-arn arn:aws:iam::177571188737:policy/postgresS3Policy-yml-4 \
--role-name postgresql-S3-Role-yml-4
aws rds add-role-to-db-instance \
--db-instance-identifier rds-db-test-yml-4 \
--feature-name s3Import \
--role-arn arn:aws:iam::177571188737:role/postgresql-S3-Role-yml-4 \
--region us-west-2
airflow standalone
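Once the `aws_s3` extension is created and the import role is attached, the actual load from S3 into the RDS table is done with the `aws_s3.table_import_from_s3` function (documented in the AWS RDS guide linked below). A minimal sketch of building that SQL in Python; the bucket `testing-ymlo` and region `us-west-2` come from the commands above, while the table and file names are placeholders for illustration:

```python
def s3_import_sql(table: str, bucket: str, key: str, region: str,
                  columns: str = "",
                  options: str = "(format csv, header true)") -> str:
    """Build the aws_s3.table_import_from_s3 call that loads a CSV from
    S3 into an RDS Postgres table. Requires the aws_s3 extension and the
    s3Import role attached to the DB instance (set up above)."""
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '{columns}', '{options}', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'));"
    )

# "weather_data" and "current_weather.csv" are hypothetical names.
sql = s3_import_sql("weather_data", "testing-ymlo", "current_weather.csv", "us-west-2")
print(sql)
```

In the DAG this statement would be passed to a Postgres operator or executed via a psycopg2 cursor; note that the region argument must match the bucket's actual region, or the import fails.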
**************** USEFUL LINKS ****************
How to build and automate a python ETL pipeline with airflow on AWS EC2 | Data Engineering Project • How to build and autom...
Extract current weather data from Open Weather Map API using python on AWS EC2: • Extract current weathe...
How to remotely SSH (connect) Visual Studio Code to AWS EC2: • How to remotely SSH (c...
PostgreSQL Playlist: • Tutorial 1 - What is D...
Weather Map API: openweathermap.org/api
Github Repo: github.com/YemiOla/build_auto...
Linux downloads (Ubuntu): www.postgresql.org/download/l...
Importing data from Amazon S3 into an RDS for PostgreSQL DB instance docs.aws.amazon.com/AmazonRDS...
DISCLAIMER: This video and description contain affiliate links. This means when you buy through one of these links, we will receive a small commission, at no cost to you. This helps support us in continuing to make awesome and valuable content for you.
you did a fine job, I plan to watch the whole series.
Thanks so much.
Thank you very much for creating this project. I followed all 3 videos from this series and learnt a lot. Thank you!
Great to hear! You're welcome.
Thank you very much, your video lesson helped me build my first data pipeline.
Awesome. I'm glad you found it valuable and were able to build your first data pipeline.
phenomenal work, super informative and clear explanations. keep it up
Thanks a lot!
thank you for the true masterpiece tutorial again!!! following the video and practicing is truly a joy of learning!
P.S. Guys, if you change "houston" to another city, the SQL "JOIN" part will not work at all, so make sure to just use "houston".
Part 1 is worth watching, I learnt a lot. Looking forward to completing part 2. Thank you so much, please keep doing what you are doing :)
Thanks so much for the comment. I'm glad you found the videos valuable and learnt a lot.
Excellent Tutorial
Thanks for your comment.
thank you so much for this help, I really appreciate it. Please keep working, and don't forget the subtitles, they are so helpful for me
Thanks for the comment and feedback.
good stuff. Initial cost of building a DE channel is high but it's worth it. Keep up the good work.
Thank you!
Man, you are the best, thanks for this 🙌🏻
Thank you!
So cool!
Thanks so much!
hi, may I ask why in your previous video we were required to expose the AWS credentials (using a session token) to access S3 and load the final results into the bucket, but we do not need to do so in this video?
hello, nice work sir, this is highly resourceful. But my question is: when creating the tables, what about cases where the columns from the API change constantly? Do we always have to go and change our code, which is not good engineering practice?
One way to handle this is to have your code check the tables in your database, and if there are any columns in your incoming data that are not already in your table, alter the table and add those columns.
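The schema-drift check described in that reply can be sketched as a small pure function; the table and column names here are hypothetical, and in practice the `existing_cols` set would come from querying `information_schema.columns`:

```python
def alter_statements(table, existing_cols, incoming_cols, default_type="text"):
    """Generate ALTER TABLE ... ADD COLUMN statements for columns that
    appear in the incoming API payload but are not yet in the table."""
    new = [c for c in incoming_cols if c not in existing_cols]
    return [f'ALTER TABLE {table} ADD COLUMN "{c}" {default_type};' for c in new]

# Hypothetical column sets for illustration.
existing = {"city", "temperature", "humidity"}
incoming = ["city", "temperature", "humidity", "wind_gust"]
for stmt in alter_statements("weather_data", existing, incoming):
    print(stmt)
```

Each generated statement would then be executed against Postgres before the load step, so new API fields never break the insert.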
Does it work with instance type t2.small? It's cheaper than the medium one. I followed your last video and my Airflow ran smoothly on t2.small. I need to start working on this project, so I'm asking: should it go smoothly with this project too, or do I have to use the medium version? Please reply as soon as possible, and thanks a lot for making such great videos 🙏❣️
Airflow works better on t2.medium than on t2.small; Airflow has frozen a couple of times for me on t2.small. If you are thinking about the cost, you can give t2.small a try and see how it works for you.
the "failed to connect" SSH error is showing
I am getting an error while connecting with psql; the postgres=> prompt is not starting
I have a question regarding the CSV file. Why do you create a CSV file in the first place? Isn't it better to just upload "df_data" to Postgres? For example, let's say we run this DAG every day; there will be too many CSV files in the folder. So why create the CSV file at all?
You are correct, you may not need to produce CSV files. Your architecture depends on your requirements, so what I have taught here is for educational purposes.
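As that reply notes, the on-disk CSV step is optional. A minimal stdlib sketch of building the CSV in memory instead, assuming hypothetical row data standing in for the transformed DataFrame records:

```python
import csv
import io

# Rows standing in for the transformed weather records in the video.
rows = [
    {"city": "houston", "temp_f": 91.4},
    {"city": "houston", "temp_f": 88.2},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "temp_f"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()  # CSV text, never written to disk
print(csv_text)
```

The in-memory text could then be streamed straight into Postgres (for example via psycopg2's `copy_expert` with `COPY ... FROM STDIN`), avoiding the pile-up of daily CSV files entirely.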
My DAG is failing at 'tsk_uploadS3_to_postgres' with error 'HTTP 301. No response body.' - any ideas? I uploaded the same .csv from your GitHub to my S3 bucket and followed all the steps for importing S3 data into RDS PostgreSQL
Ok, super simple fix, but I'll leave it here in case anyone runs into this: just make sure the region specified in the SQL for this task matches the S3 bucket's region (in my case, changing us-east-1 to us-east-2)
Btw thanks for an amazing tutorial! Been learning a lot
I'm stuck at 1:25:49. My CSV will not upload to Postgres; I am getting an "extra data after last expected column" log error. I've even copy-pasted your code and tried saving the Excel file several ways, and still nothing
I'm guessing your CSV has extra column data. Did you use the CSV in my GitHub? Also ensure the file is a .csv
@@tuplespectra I figured it out. I had a coding error on my first run of the Postgres table and didn't have all the correct things loaded in; once I deleted the table out of Postgres with DROP TABLE and reran it, it worked!
Hi, I am getting an error while importing airflow.providers.postgres:
from airflow.providers.postgres.operators.postgres import PostgresOperator
ModuleNotFoundError: No module named 'airflow.providers.postgres'
I followed the same installation approach but am getting this error. I've checked all the possibilities; can you give me a solution?
After debugging, I found out that PostgresOperator is deprecated, so we should use SQLExecuteQueryOperator and pass the conn_id as postgres_conn 🙂
Thanks for your comment and the knowledge sharing.
Hello sir, please help me. I'm getting billed for using RDS, and it has gone up to $75. I'm a student and I don't understand how to stop these bills. Yesterday I terminated all my EC2 instances and the RDS instance, and I still got a $62 RDS bill. The day before yesterday it was $20, but at that time I hadn't stopped RDS; yesterday I deleted RDS and stopped all related services at 7 in the evening, and still this morning I see a bill of $75. Please help me find a way to stop these bills. I'm a student and have to ask my parents for money; INR 6000 is a big amount for me, and I hope the bill will not increase further. Please reply as soon as possible.