Use Airflow to orchestrate a parallel processing ETL pipeline on AWS EC2 | Data Engineering Project

  • Published: 7 Aug 2024
  • In this data engineering project, we will learn how to parallelize tasks. We will run Airflow on AWS EC2 and use an AWS RDS Postgres instance as the database.
    If you have any questions or comments, feel free to ask or leave them in the comment section below.
    Please don’t forget to LIKE, SHARE, COMMENT and SUBSCRIBE to our channel for more AWESOME videos.
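    To make "parallelize tasks" concrete, here is a minimal sketch of an Airflow DAG that fans two extract tasks out in parallel and then joins them. It is an illustration only; the DAG name, task names, cities, and schedule are assumptions, not the exact DAG built in the video (assumes Airflow 2.4+):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_city(city):
        # Stand-in for the real extract step (e.g. calling a weather API).
        print(f"extracting weather data for {city}")

    with DAG("parallel_etl_sketch", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False) as dag:
        start = PythonOperator(task_id="start",
                               python_callable=lambda: print("starting pipeline"))
        extract_a = PythonOperator(task_id="extract_city_a",
                                   python_callable=extract_city, op_args=["houston"])
        extract_b = PythonOperator(task_id="extract_city_b",
                                   python_callable=extract_city, op_args=["portland"])
        join = PythonOperator(task_id="join_results",
                              python_callable=lambda: print("joining results"))
        # Listing the two extract tasks side by side makes them run in parallel;
        # join_results waits for both to finish.
        start >> [extract_a, extract_b] >> join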
    *Books I recommend*
    1. Grit: The Power of Passion and Perseverance amzn.to/3EZKSgb
    2. Think and Grow Rich!: The Original Version, Restored and Revised: amzn.to/3Q2K68s
    3. The Book on Rental Property Investing: How to Create Wealth With Intelligent Buy and Hold Real Estate Investing: amzn.to/3LLpXRy
    4. How to Invest in Real Estate: The Ultimate Beginner's Guide to Getting Started: amzn.to/48RbuOb
    5. Introducing Python: Modern Computing in Simple Packages amzn.to/3Q4driR
    6. Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter 3rd Edition: amzn.to/3rGF73G
    **************** Commands used in this video ****************
    sudo apt update
    sudo apt install python3-pip
    sudo apt install python3.10-venv
    python3 -m venv airflow_venv
    source airflow_venv/bin/activate
    pip install pandas
    pip install s3fs
    pip install fsspec
    pip install apache-airflow
    pip install apache-airflow-providers-postgres
    psql -h rds-db-test-yml-4.cvzpgj7bczqy.us-west-2.rds.amazonaws.com -p 5432 -U postgres -W
    CREATE EXTENSION aws_s3 CASCADE;
    aws iam create-role \
    --role-name postgresql-S3-Role-yml-4 \
    --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "rds.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
    aws iam create-policy \
    --policy-name postgresS3Policy-yml-4 \
    --policy-document '{"Version": "2012-10-17", "Statement": [{"Sid": "s3import", "Action": ["s3:GetObject", "s3:ListBucket"], "Effect": "Allow", "Resource": ["arn:aws:s3:::testing-ymlo", "arn:aws:s3:::testing-ymlo/*"]}]}'
    aws iam attach-role-policy \
    --policy-arn arn:aws:iam::177571188737:policy/postgresS3Policy-yml-4 \
    --role-name postgresql-S3-Role-yml-4
    aws rds add-role-to-db-instance \
    --db-instance-identifier rds-db-test-yml-4 \
    --feature-name s3Import \
    --role-arn arn:aws:iam::177571188737:role/postgresql-S3-Role-yml-4 \
    --region us-west-2
    airflow standalone
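    The aws_s3 extension and IAM role configured above enable the SQL-level import from S3 into RDS Postgres. As a hedged sketch of what the import task can look like (the table name and object key are assumptions; the bucket, region, and task id come from the commands and comments on this page):

    from airflow.providers.postgres.operators.postgres import PostgresOperator

    upload_s3_to_postgres = PostgresOperator(
        task_id="tsk_uploadS3_to_postgres",
        postgres_conn_id="postgres_conn",
        sql="""
            SELECT aws_s3.table_import_from_s3(
                'weather_data',                  -- target table (assumed name)
                '',                              -- empty string = import all columns
                '(format csv, header true)',
                aws_commons.create_s3_uri(
                    'testing-ymlo',              -- bucket from the IAM policy above
                    'weather.csv',               -- object key (assumed)
                    'us-west-2'                  -- must match the bucket's region
                )
            );
        """,
    )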
    **************** USEFUL LINKS ****************
    How to build and automate a python ETL pipeline with airflow on AWS EC2 | Data Engineering Project • How to build and autom...
    Extract current weather data from Open Weather Map API using python on AWS EC2: • Extract current weathe...
    How to remotely SSH (connect) Visual Studio Code to AWS EC2: • How to remotely SSH (c...
    PostgreSQL Playlist: • Tutorial 1 - What is D...
    Weather Map API: openweathermap.org/api
    Github Repo: github.com/YemiOla/build_auto...
    Linux downloads (Ubuntu)
    www.postgresql.org/download/l...
    Importing data from Amazon S3 into an RDS for PostgreSQL DB instance docs.aws.amazon.com/AmazonRDS...
    DISCLAIMER: This video and description contain affiliate links, which means that when you buy through one of these links, we will receive a small commission at no cost to you. This helps support us in continuing to make awesome and valuable content for you.

Comments • 39

  • @zuesbenz
    @zuesbenz 4 months ago +2

    You did a fine job; I plan to watch the whole series.

  • @go27sia
    @go27sia 2 months ago +1

    Thank you very much for creating this project. I followed all 3 videos from this series and learnt a lot. Thank you!

    • @tuplespectra
      @tuplespectra  2 months ago

      Great to hear! You're welcome.

  • @bethuelthipe-moukangwe7786
    @bethuelthipe-moukangwe7786 a year ago +2

    Thank you very much, your video lesson helped me build my first data pipeline.

    • @tuplespectra
      @tuplespectra  a year ago

      Awesome. I'm glad you found it valuable and were able to build your first data pipeline.

  • @AjaySinghTomar05
    @AjaySinghTomar05 8 months ago

    Phenomenal work, super informative and clear explanations. Keep it up!

  • @gyungyoonpark
    @gyungyoonpark 6 months ago

    Thank you for the true masterpiece tutorial again!!! Following the video and practicing is truly a joy of learning!
    P.S. Guys, if you change "houston" to another city, the SQL JOIN part will not work at all, so make sure to just use "houston".

  • @vaibhavverma1340
    @vaibhavverma1340 a year ago

    Part 1 is worth watching; I learnt a lot and am looking forward to completing Part 2. Thank you so much, please keep doing what you are doing :)

    • @tuplespectra
      @tuplespectra  a year ago

      Thanks so much for the comment. I'm glad you found the videos valuable and learnt a lot.

  • @viishhnudatta5124
    @viishhnudatta5124 2 months ago +2

    Excellent Tutorial

  • @fatimaezzahrasoubari5928
    @fatimaezzahrasoubari5928 a year ago +2

    Thank you so much for this help, I really appreciate it. Please keep working, and don't forget subtitles; they are so helpful for me.

    • @tuplespectra
      @tuplespectra  a year ago +1

      Thanks for the comment and feedback.

  • @yixiangzhang2834
    @yixiangzhang2834 a year ago +1

    Good stuff. The initial cost of building a DE channel is high, but it's worth it. Keep up the good work.

  • @facuoppi
    @facuoppi a year ago +1

    Man, you are the best, thanks for this 🙌🏻

  • @TvsCar30
    @TvsCar30 9 months ago

    So cool!

  • @user-zb5jl6tj1v
    @user-zb5jl6tj1v 11 months ago

    Hi, may I ask why, in your previous video, we were required to expose AWS credentials (using a session token) to access S3 and load the final results into the bucket, but we do not need to do so in this video?

  • @femiotitolaiye1531
    @femiotitolaiye1531 11 months ago

    Hello, nice work sir, this is highly resourceful. My question: when creating the tables, what about cases where the columns from the API change constantly? Do we always have to go and change our code, which is not a good engineering practice?

    • @tuplespectra
      @tuplespectra  10 months ago

      One way to handle this is to have your code check the tables in your database; if any column in your incoming data is not already in the table, alter the table and add that column, as sketched below.
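      A minimal sketch of that idea, assuming psycopg2 and an illustrative table name (this is not code from the video):

      import psycopg2  # assumed driver; conn below is an open psycopg2 connection

      def add_missing_columns(conn, table, df):
          # Compare the incoming DataFrame's columns against the live table
          # and ALTER TABLE for anything new. TEXT is a safe default type;
          # real code should map pandas dtypes to proper SQL types.
          with conn.cursor() as cur:
              cur.execute(
                  "SELECT column_name FROM information_schema.columns "
                  "WHERE table_name = %s", (table,))
              existing = {row[0] for row in cur.fetchall()}
              for col in df.columns:
                  if col not in existing:
                      cur.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" TEXT')
          conn.commit()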

  • @atharvbajare7398
    @atharvbajare7398 4 months ago

    Does this work with the t2.small instance type? It's cheaper than the medium one.
    I followed your last video and my Airflow ran smoothly on t2.small.
    I need to start working on this project, so I'm asking: should it go smoothly on t2.small, or do I have to use the medium version?
    Please reply as soon as possible, and
    thanks a lot for making such great videos 🙏❣️

    • @tuplespectra
      @tuplespectra  4 months ago +1

      Airflow works better on t2.medium than on t2.small; Airflow has frozen a couple of times for me on t2.small. If you are thinking about the cost, you can give t2.small a try and see how it works for you.

  • @latabharti8175
    @latabharti8175 8 months ago

    A "failed to connect" SSH error is showing.

  • @atharvbajare7398
    @atharvbajare7398 4 months ago

    I am getting an error while connecting with psql; the postgres=> prompt is not starting.

  • @gyungyoonpark
    @gyungyoonpark 6 months ago

    I have a question regarding the CSV file: why do you create a CSV file in the first place? Isn't it better to just upload "df_data" to Postgres?
    For example, say we run this DAG every day; there will be too many CSV files in the folder. So why create the CSV file at all?

    • @tuplespectra
      @tuplespectra  6 months ago

      You are correct, you may not need to produce CSV files. Your architecture depends on the requirements; what I have taught here is for educational purposes. One alternative is sketched below.
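      A sketch of the no-CSV alternative being discussed, using pandas.DataFrame.to_sql with SQLAlchemy (the connection string, table name, and sample data are placeholders):

      import pandas as pd
      from sqlalchemy import create_engine

      # Placeholder connection string; point it at your own RDS endpoint.
      engine = create_engine("postgresql+psycopg2://postgres:password@your-rds-endpoint:5432/postgres")
      df_data = pd.DataFrame({"city": ["houston"], "temp_f": [88.2]})  # stand-in data
      # Appends the rows straight into the table, with no intermediate CSV on disk.
      df_data.to_sql("weather_data", engine, if_exists="append", index=False)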

  • @josephostrow4876
    @josephostrow4876 5 months ago

    My DAG is failing at 'tsk_uploadS3_to_postgres' with error 'HTTP 301. No response body.' - any ideas? I uploaded the same .csv from your GitHub to my S3 bucket and followed all the steps for importing S3 data into RDS PostgreSQL

    • @josephostrow4876
      @josephostrow4876 5 months ago

      OK, super simple fix, but I'll leave it here in case anyone runs into this: I just had to make sure the region specified in the SQL for this task aligns with the S3 region (in my case, changing us-east-1 to us-east-2). See the sketch below.
      Btw, thanks for an amazing tutorial! Been learning a lot.
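      To illustrate the fix (the bucket and key names here are hypothetical): the region argument passed to aws_commons.create_s3_uri in the task's SQL must match the bucket's actual region.

      import_sql = """
      SELECT aws_s3.table_import_from_s3(
          'weather_data', '', '(format csv, header true)',
          aws_commons.create_s3_uri('my-bucket', 'weather.csv', 'us-east-2')  -- not us-east-1
      );
      """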

  • @JuanCruz-nu4mg
    @JuanCruz-nu4mg a year ago

    I'm stuck at 1:25:49. My CSV will not upload to Postgres; I am getting an 'extra data after last expected column' log error. I've even copy-pasted your code and tried saving the Excel file several ways, and still nothing.

    • @tuplespectra
      @tuplespectra  a year ago +1

      I'm guessing your CSV has an extra column of data. Did you use the CSV in my GitHub? Also ensure the file is a .csv.

    • @JuanCruz-nu4mg
      @JuanCruz-nu4mg a year ago

      @tuplespectra I figured it out. I had a coding error on my first run of the Postgres table and didn't have all the correct things loaded in; once I deleted the table out of Postgres with DROP TABLE and reran it, it worked!

  • @vasudevreddy3527
    @vasudevreddy3527 7 months ago

    Hi, I am getting an error while importing airflow.providers.postgres:
    from airflow.providers.postgres.operators.postgres import PostgresOperator
    ModuleNotFoundError: No module named 'airflow.providers.postgres'
    I followed the same installation approach but am getting the error. I've checked all the possibilities; can you give me a solution?

    • @vasudevreddy3527
      @vasudevreddy3527 7 months ago +1

      After debugging, I found out that PostgresOperator is deprecated, so we should use SQLExecuteQueryOperator and pass the conn_id as postgres_conn 🙂
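      A sketch of that replacement, assuming the common-sql provider package (apache-airflow-providers-common-sql) is installed; the table definition is illustrative:

      from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

      create_table = SQLExecuteQueryOperator(
          task_id="tsk_create_table",
          conn_id="postgres_conn",  # note: conn_id here, not postgres_conn_id
          sql="CREATE TABLE IF NOT EXISTS weather_data (city TEXT, temp_f REAL);",
      )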

    • @tuplespectra
      @tuplespectra  7 months ago

      Thanks for your comment and the knowledge sharing.

  • @atharvbajare7398
    @atharvbajare7398 4 months ago

    Hello sir, please help me.
    I'm getting a bill for using RDS; it has gone up to $75.
    I'm a student and I don't understand how to stop these bills. Yesterday I terminated all my EC2 and RDS instances, and I still got a $62 RDS bill.
    The day before yesterday it was $20, but at that point I hadn't stopped RDS. Yesterday I deleted RDS and stopped all related services at 7 in the evening, and still this morning I see a bill of $75.
    Please help me stop these bills. I'm a student and have to ask my parents for money; INR 6000 is a big amount for me, and I hope this bill will not grow any further.
    Please reply as soon as possible.
    I'm begging here for help; please help me find a way to stop these bills.