@@beastmaroc7585 You can horizontally scale your set up to cluster if your data exceeds the resources of a single machine. This will allow you to process large datasets as cluster nodes will offer more compute and memory resources. You can also use a distributed engine like Spark and/or kafka to process large datasets. I have discussed the Airflow execution and cluster based approach here. Feel free to check it out. ruclips.net/video/In7zwp0FDX4/видео.html&ab_channel=BIInsightsInc
Thank you sir, you helped me understand airflow, and I did the same thing following the same process but from mysql - extract-load -> transformation -> load with free employees database and I did share it on my github and linkedin tagging this video.
Thank you, i have a question If the task is scheduled to run daily and new data has been inserted into the source since the last transfer, will just new data get transferred next task or all data again
This would bring in all data. This is a truncate and load approach. If you need to bring only newly inserted data then you want to look into the incremental data load approache(s). I have covered those in the following videos: ruclips.net/video/a_T8xRaCO60/видео.html ruclips.net/video/32ErvH_m_no/видео.html
Hey, this is really helpful. It would be even more insightful if you provided or suggested ways to run this process (along with those described in this recent series of tutorials) in the cloud or in a server less environment. Thanks!
Hey Giovanni thanks for stopping by. We've covered AWS data lake in the last few videos and will cover ETL topics going forward. I will be sure to include Airflow and how to process data from S3 to Redshift with it. Stay tuned.
Was able to connect to import airflow.providers.oracle and oracle hook. However, when I use OracelHook, it keeps throwing an error saying ‘conn_id’ not found even thought the connection has been configured fine via the airflow UI. Do you have any idea? What could go wrong ?
is it better to extract the entire tables to the stagings and then final table with consolidated data ?? why? wouldn't recommend to have queried data first (extract) and then transforming and loading it to the final table? I am trying to understand why the first approach is better over the latter....
If you are referring to the technique showcased in the video then it is following the dimensional modeling approach. After staging we are building a proper dimension table. With the basic ETL you would extract, transform and load the tables. I'd advise to pick up a good book on ETL, dimensional modeling if you're curious. Hope this helps.
@@BiInsightsInc Thank you so much for the explanation and advice. Can you please recommend the books/tutorials/courses which have practical hands on projects and where I can delve into data engineering space. I have worked as a quant/data analytics for about 5 + years and expanding my skills/knowledge in data engineering
in load_src_data, can we not use postgres connection object(conn) to load the data instead of using create_engine? because we need to know all the same connection details again which we used to create connection id in airflow
Hi Ulhas, when we use df.to_sql, we need to pass in a SQLAlchemy engine and not a standard connection object (and not a cursor either). If you try it on your end it will throw an error. Hope this helps.
Hi, Sir! Thanks for the insightful video! However I'd like to ask if we need to place the ETL python file in a particular folder for it to be recognized as a DAG by Airflow?
Can you ask me a question? I need to transfer data from Azure SQL server to MSSQL Server (cloud to server). I can do it directly from source and load to destiny, or need to land in postgre?
@@BiInsightsInc can you indicate to me a content that explain this, i loved your vídeo, but is so much complex to me, i just need to transfer data, from source to destiny. (sorry about my english).
It's better to break the each operation in a different task so it is easier to manage and troubleshoot. This in turns results in a neater dag structure extract >> transform >> load. You can write the entire operation in a single function but it would be hard to maintain and debug.
do you have any suggestions for me to run etl python in command line ubuntu server? to read a csv file or a connection with a pre-existing mysql database?
Hi Dzakwan, I have done a video on querying existing mysql database. You can follow along to establish database connection and query the database. Once you write and test the script then save as Python file i.e. etl.py. You can type: Python etl.py in the terminal to execute the script. Hope this helps. ruclips.net/video/mKnMY2GqaGI/видео.html
Hi Shruti, I think we can reverse the order of the source and destination and it will give you the code to move from Postgres to SQL Server. Hope this helps!
Great playlist. Your method of building the videos is very practical and lovable. One Question: How can you perform the "paste" line by line in the recording? is it ctrl+y after so many ctrl+z ops?
Glad you find it helpful. Yes, you're right. I write the code and then it's ctrl+y and ctrl+z as I am going through the code. It takes practice and sometimes few retakes to get it right.
I have done a video on this topic sinc a lot of folks were raising this. Also, look into big data processing frameworks design for large data processing i.e., Dask, Kafka, Spark and Flink. I have covered Spark and Kafka on this channel. ruclips.net/video/8Awk8CpfeGc/видео.html&ab_channel=BIInsightsInc
Thank you so much for the video, I have one question ,suppose i have 1000 different source files how to compare this file with one table present in database
Hi Varsha thanks for stopping by. If the files structure is the same then you can combine them into a DataFrame. Then query the database. From there is just comparing the two DataFrames. Hope this helps. Also, if you are trying to do change detection for incremental load then I will be doing a video on how to perform incremental load. Stay tuned.
@@BiInsightsInc Thank you for your reply . I will wait for your video and request u to make more videos on operations which can be performed using pandas dataframe
@@varshachauhan5185 I have a video on the Pandas library with various tips and trips. Feel free to check it out: ruclips.net/video/-jerzfh2bS0/видео.html
Thanks for the video. One question is why you didn't use Airflow's built-in PostgresHook or PostgreSQLOperator instead of SQLAlchemy and Pandas. I think this would simplify the code and make it more consistent with the way the SQL server connection is established using MsSqlHook.
This is a continuation of the Python ETL series. This session orchestrate what we developed previously and the (E*T*L) transformations are carried out with Pandas. I am not sure if PostgresHook offers same capabilities as Pandas therefore, I went with Pandas. I could've used the PostgresHook to get a dataframe from Postgres but not sure if I can persist dataframe as easily as SQLAlchemy. In addition, If I were to use the PostgreSQLOperator then I would've to switch to SQL as oppose to Python.
@@BiInsightsInc I guess this is just an example of how to use the hooks, cause airflow is not a processing framework but an orchestrator. Transformation should be handle by an external source like Spark, Database engine etc
@@alonsom.donayre1992 Yes, it is an orchestrator however with TaskFlow API 2.0, which is used here, they're claiming you can carry out the execution (transformation) within Airflow. I am not sold on it because it is designed for small to medium jobs, unless you are working with a cluster. I am waiting for Dagster to mature as it can handle orchestration and processing. More to come on Dagster. So yeah I would advise to process large datasets with an external database engine or distributed engines like Spark & Glue. airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html
As you can see that I am using SQL Server as the source and Postgres as the target therefore, MySQL is out of the scope. This is why it is not covered. Here is screenshot on how to connect to MySQL via Airflow: github.com/hnawaz007/pythondataanalysis/blob/main/AirflowSession2/airflow_mysql.png
@@BiInsightsInc Thanks, how about sql server? Can you make that as well? And how did you host your db to that ip address in the picture? Im asking these because there really isnt any good tutorials out there for stuff like these. Many people underestimate how difficult sometimes it is to make connections work between these tools, while the actual development on the other hand is most of times easy. That is a really gap that you could fill by making some videos on how to make it work between docker, microsoft sql server/myssql etc., airflow, and write back to connected database
@@mycousins896 Check out the Airflow installation with SQL Server provider. I go over on how to install SQL Server provider and how to create a SQL Server connection. How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker ruclips.net/video/t4h4vsULwFE/видео.html
I have covered the Airflow's installation as a docker container. You can follow the steps outline in the following video. Docker takes the OS system out of the equation and you will get the same Airflow functionaly not matter what system you are running. ruclips.net/video/t4h4vsULwFE/видео.html
Hello Sir Thanks for such a great tutorial everting you made smooth like butter thanks for that ,just one question whenever we made new DAG ( we will have to add docker-compose-CeleryExecutor, docker-compose-LocalExecutor, and Config for that particular DAG )
Hi Parikshit, if you have mounted the host folder “dag” to the docker container, just as shown in the tutorial, then dags created in this folder will automatically appear in your docker UI and will be copied over to appropriate containers.
I think you all you need to is connect to orcale database and rest of the process should be similar. Here is a link to oracledb library that showcases how to establish connection to oracle database. python-oracledb.readthedocs.io/en/latest/user_guide/connection_handling.html
@@BiInsightsIncthanks for the reply. I’ve already know how to connect to oracle db in Python itself but was trying to connect oracle directly to airflow by using providers. No luck so far. Can you please make a video how to install airflow provider for oracle and how to extract data from oracle and insert data to Postgres ?
@@BiInsightsInc thanks you so much! actually, I've figured it out. It was just including airflow.providers.oracle in the docker-compose.yaml file... but I suppose this is not recommended for production. Can you show better approach to install the oracle provider like how to install cx_Oracle with instant client etc...? or alternatives... Also, after connection to oracle in airflow is established, I am having an error of UnicodeDecode error UTF-8 when I try to get the entire table from oracle... do you know how to fix it?
@@eunheechoi3745 You're on the right track to install the oracle providers. However, don't include it in the docker compose file rather build a custom image with additional libraries. I have covered how to build a custom image with additional libraries/providers here: ruclips.net/video/In7zwp0FDX4/видео.html&t
You can use sybase library in Python to connect to a Sybase database. If you want to use SQLAlchemey then use the following method to connect to a Sybase database. # Sybase library import Sybase db = Sybase.connect('server','name','pass','database') c = db.cursor() c.execute("sql statement") list1 = c.fetchall() #SQLAlchemey params = ( "DRIVER = "+driver+";"\ "SERVER = "+server+";"\ "DATABASE = "+database+";"\ "PORT = "+port+";"\ "UID = "+user+";"\ "PWD= "+password+";" params = urllib.parse.quote_plus(params) connexion_string = 'sybase+pyodbc:///?odbc_connect = %s'%params)
Check out the Airflow installation with SQL Server provider. I go over on how to create a SQL Server connection. How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker ruclips.net/video/t4h4vsULwFE/видео.html
Videos in this series:
Build ETL pipeline: ruclips.net/video/dfouoh9QdUw/видео.html&t
ETL Load Reference Data: ruclips.net/video/W-8tEFAWD5A/видео.html
ETL Incremental Data Load (Source Change Detection): ruclips.net/video/32ErvH_m_no/видео.html&t
ETL Incremental Data Load (Destination Change Comparison): ruclips.net/video/a_T8xRaCO60/видео.html
How to install Apache Airflow: ruclips.net/video/t4h4vsULwFE/видео.html
How to load big tables example of 30M rows from SQL to PG ? Trying to solve load speed and memory usage
@@beastmaroc7585 You can horizontally scale your set up to cluster if your data exceeds the resources of a single machine. This will allow you to process large datasets as cluster nodes will offer more compute and memory resources. You can also use a distributed engine like Spark and/or kafka to process large datasets.
I have discussed the Airflow execution and cluster based approach here. Feel free to check it out.
ruclips.net/video/In7zwp0FDX4/видео.html&ab_channel=BIInsightsInc
You have perfect cadence and clarity in your training. Short yet complex. Very cool and subscribed.
simple and godlike understandable 10/10
I want to thank you for posting this content. It is helpful in many ways.
Fantastic presentation!
Excellent video sir, thank you.
Very nice videos and blog! Keep up the good work!
Thank you sir, you helped me understand airflow, and I did the same thing following the same process but from mysql - extract-load -> transformation -> load with free employees database and I did share it on my github and linkedin tagging this video.
This video was a great resource. Thanks for the tutelage and your take on it.
Great content, so pragmatic!
Very good video. Thank you !
Thank you, i have a question If the task is scheduled to run daily and new data has been inserted into the source since the last transfer, will just new data get transferred next task or all data again
This would bring in all data. This is a truncate and load approach. If you need to bring only newly inserted data then you want to look into the incremental data load approache(s). I have covered those in the following videos:
ruclips.net/video/a_T8xRaCO60/видео.html
ruclips.net/video/32ErvH_m_no/видео.html
@@BiInsightsInc i can make scripte that merge between the incremental load and aiflow ?
Great explanation! Thanks!
Hey, this is really helpful. It would be even more insightful if you provided or suggested ways to run this process (along with those described in this recent series of tutorials) in the cloud or in a server less environment. Thanks!
Hey Giovanni thanks for stopping by. We've covered AWS data lake in the last few videos and will cover ETL topics going forward. I will be sure to include Airflow and how to process data from S3 to Redshift with it. Stay tuned.
@@BiInsightsInc thank you indeed. I completely missed the data lake video. Thanks that is really helpful!
Can you please explain how to setup microsoft Sql server in the Airflow UI so that i csn select it in the Webserver
Sure. SQL Server provider installation and connection in Airlfow UI is covered in the following video:
ruclips.net/video/t4h4vsULwFE/видео.html
Amazing
Was able to connect to import airflow.providers.oracle and oracle hook. However, when I use OracelHook, it keeps throwing an error saying ‘conn_id’ not found even thought the connection has been configured fine via the airflow UI. Do you have any idea? What could go wrong ?
It's not able to find the connection configured in the Airflow. You need to define the connection, test and try agian.
is it better to extract the entire tables to the stagings and then final table with consolidated data ?? why?
wouldn't recommend to have queried data first (extract) and then transforming and loading it to the final table?
I am trying to understand why the first approach is better over the latter....
If you are referring to the technique showcased in the video then it is following the dimensional modeling approach. After staging we are building a proper dimension table. With the basic ETL you would extract, transform and load the tables. I'd advise to pick up a good book on ETL, dimensional modeling if you're curious. Hope this helps.
@@BiInsightsInc Thank you so much for the explanation and advice. Can you please recommend the books/tutorials/courses which have practical hands on projects and where I can delve into data engineering space. I have worked as a quant/data analytics for about 5 + years and expanding my skills/knowledge in data engineering
What happens if you kill the airflow web server, or localhost? Will the DAG still run on the schedule you specified?
If the services are down then DAG won’t run. You want to make sure your server remains on for the DAG to execute on schedule.
in load_src_data, can we not use postgres connection object(conn) to load the data instead of using create_engine? because we need to know all the same connection details again which we used to create connection id in airflow
Hi Ulhas, when we use df.to_sql, we need to pass in a SQLAlchemy engine and not a standard connection object (and not a cursor either). If you try it on your end it will throw an error. Hope this helps.
excellent tutorial, thank you ! it would be great if you could split the tasks in several files, need to learn how to do this
Hi, Sir! Thanks for the insightful video! However I'd like to ask if we need to place the ETL python file in a particular folder for it to be recognized as a DAG by Airflow?
Yes, you need to put the DAG files in the dags folder. This folder gets mounted in the docker image.
Understood. Thank you, Sir! @@BiInsightsInc
Can you ask me a question? I need to transfer data from Azure SQL server to MSSQL Server (cloud to server). I can do it directly from source and load to destiny, or need to land in postgre?
Hi Alan, you can directly load the data from Azure SQL to SQL Server. There’s no need to load data to PostgreSQL.
@@BiInsightsInc can you indicate to me a content that explain this, i loved your vídeo, but is so much complex to me, i just need to transfer data, from source to destiny. (sorry about my english).
@@BiInsightsInc i tried with generic transfer, but this limit the transfer to 1000 rows por json file and take a lot of time to transfer.
why didn't you transform the data after you extract the data from mssql itself and then load the final data to postgresql ?
It's better to break the each operation in a different task so it is easier to manage and troubleshoot. This in turns results in a neater dag structure extract >> transform >> load. You can write the entire operation in a single function but it would be hard to maintain and debug.
do you have any suggestions for me to run etl python in command line ubuntu server? to read a csv file or a connection with a pre-existing mysql database?
Hi Dzakwan, I have done a video on querying existing mysql database. You can follow along to establish database connection and query the database. Once you write and test the script then save as Python file i.e. etl.py. You can type: Python etl.py in the terminal to execute the script. Hope this helps.
ruclips.net/video/mKnMY2GqaGI/видео.html
Hi, how dags airflow mssql to gcs please, i will to build data warehouse in bigquery.
Do you use dokcer for airflow or not? is there an installation video?
I have Airflow running on Docker. Here is the link to Airflow installation video:
ruclips.net/video/t4h4vsULwFE/видео.html
Question: Do you have a video to build pipeline to move data from Postgres Server to SQL Server?
Hi Shruti, I think we can reverse the order of the source and destination and it will give you the code to move from Postgres to SQL Server. Hope this helps!
Great playlist. Your method of building the videos is very practical and lovable.
One Question: How can you perform the "paste" line by line in the recording? is it ctrl+y after so many ctrl+z ops?
Glad you find it helpful. Yes, you're right. I write the code and then it's ctrl+y and ctrl+z as I am going through the code. It takes practice and sometimes few retakes to get it right.
@@BiInsightsInc that's so nice. I asked because I found a very useful way to build the code during the video
this is powerful knwoledge
Could you please provide more details on what you mean by "table fact of the datawarehouse"?
Can you please provide the timeline in the video for the context.
Thank you a lot. I'm trying to understand hot to create pipeline. I want to be expert on this and be a good Data Engineering. Professional.
Can you share a tip for big tables 30M rows ... I have an issue with Df and memory usage
I have done a video on this topic sinc a lot of folks were raising this. Also, look into big data processing frameworks design for large data processing i.e., Dask, Kafka, Spark and Flink. I have covered Spark and Kafka on this channel.
ruclips.net/video/8Awk8CpfeGc/видео.html&ab_channel=BIInsightsInc
Thanks !!❤
Thank you so much for the video, I have one question ,suppose i have 1000 different source files how to compare this file with one table present in database
Hi Varsha thanks for stopping by. If the files structure is the same then you can combine them into a DataFrame. Then query the database. From there is just comparing the two DataFrames. Hope this helps.
Also, if you are trying to do change detection for incremental load then I will be doing a video on how to perform incremental load. Stay tuned.
@@BiInsightsInc Thank you for your reply . I will wait for your video and request u to make more videos on operations which can be performed using pandas dataframe
@@varshachauhan5185 I have a video on the Pandas library with various tips and trips. Feel free to check it out: ruclips.net/video/-jerzfh2bS0/видео.html
Great great great
Thanks for the video. One question is why you didn't use Airflow's built-in PostgresHook or PostgreSQLOperator instead of SQLAlchemy and Pandas. I think this would simplify the code and make it more consistent with the way the SQL server connection is established using MsSqlHook.
This is a continuation of the Python ETL series. This session orchestrate what we developed previously and the (E*T*L) transformations are carried out with Pandas. I am not sure if PostgresHook offers same capabilities as Pandas therefore, I went with Pandas. I could've used the PostgresHook to get a dataframe from Postgres but not sure if I can persist dataframe as easily as SQLAlchemy. In addition, If I were to use the PostgreSQLOperator then I would've to switch to SQL as oppose to Python.
@@BiInsightsInc I guess this is just an example of how to use the hooks, cause airflow is not a processing framework but an orchestrator. Transformation should be handle by an external source like Spark, Database engine etc
@@alonsom.donayre1992 Yes, it is an orchestrator however with TaskFlow API 2.0, which is used here, they're claiming you can carry out the execution (transformation) within Airflow. I am not sold on it because it is designed for small to medium jobs, unless you are working with a cluster. I am waiting for Dagster to mature as it can handle orchestration and processing. More to come on Dagster. So yeah I would advise to process large datasets with an external database engine or distributed engines like Spark & Glue.
airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html
why didnt you show how you connect to mysql? please make a video on that!
As you can see that I am using SQL Server as the source and Postgres as the target therefore, MySQL is out of the scope. This is why it is not covered. Here is screenshot on how to connect to MySQL via Airflow:
github.com/hnawaz007/pythondataanalysis/blob/main/AirflowSession2/airflow_mysql.png
@@BiInsightsInc Thanks, how about sql server? Can you make that as well? And how did you host your db to that ip address in the picture?
Im asking these because there really isnt any good tutorials out there for stuff like these. Many people underestimate how difficult sometimes it is to make connections work between these tools, while the actual development on the other hand is most of times easy. That is a really gap that you could fill by making some videos on how to make it work between docker, microsoft sql server/myssql etc., airflow, and write back to connected database
@@mycousins896 Check out the Airflow installation with SQL Server provider. I go over on how to install SQL Server provider and how to create a SQL Server connection.
How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker
ruclips.net/video/t4h4vsULwFE/видео.html
Plz make video about installation of airflow .I faced many issues in windows 10/11 like localhost:8080 not open properly lacking many GUI feature
I have covered the Airflow's installation as a docker container. You can follow the steps outline in the following video. Docker takes the OS system out of the equation and you will get the same Airflow functionaly not matter what system you are running.
ruclips.net/video/t4h4vsULwFE/видео.html
it was awesome, I had a blast, Can a video be created for ETL SQL Server Tables to Azure Blobs with Airflow, you never worked around Azure Blob
Thanks Babak! I will try and explore Azure.
@@BiInsightsInc Thank you so much! Looking forward to it with great anticipation. Please ping me here when it done.
Hello Sir Thanks for such a great tutorial everting you made smooth like butter thanks for that ,just one question whenever we made new DAG ( we will have to add docker-compose-CeleryExecutor, docker-compose-LocalExecutor, and Config for that particular DAG )
Hi Parikshit, if you have mounted the host folder “dag” to the docker container, just as shown in the tutorial, then dags created in this folder will automatically appear in your docker UI and will be copied over to appropriate containers.
Thanks sir for clearing confusion please keep uploading videos 😀🙏
Hi, can you please make a video about python ETL pipeline with Airflow for extracting data from oracle sql and loading to postgres?
I think you all you need to is connect to orcale database and rest of the process should be similar. Here is a link to oracledb library that showcases how to establish connection to oracle database.
python-oracledb.readthedocs.io/en/latest/user_guide/connection_handling.html
@@BiInsightsIncthanks for the reply. I’ve already know how to connect to oracle db in Python itself but was trying to connect oracle directly to airflow by using providers. No luck so far. Can you please make a video how to install airflow provider for oracle and how to extract data from oracle and insert data to Postgres ?
@@eunheechoi3745 I will cover the oracle provider with airflow soon.
@@BiInsightsInc thanks you so much! actually, I've figured it out. It was just including airflow.providers.oracle in the docker-compose.yaml file... but I suppose this is not recommended for production. Can you show better approach to install the oracle provider like how to install cx_Oracle with instant client etc...? or alternatives...
Also, after connection to oracle in airflow is established, I am having an error of UnicodeDecode error UTF-8 when I try to get the entire table from oracle... do you know how to fix it?
@@eunheechoi3745 You're on the right track to install the oracle providers. However, don't include it in the docker compose file rather build a custom image with additional libraries. I have covered how to build a custom image with additional libraries/providers here: ruclips.net/video/In7zwp0FDX4/видео.html&t
How to connect to Sybase db
You can use sybase library in Python to connect to a Sybase database. If you want to use SQLAlchemey then use the following method to connect to a Sybase database.
# Sybase library
import Sybase
db = Sybase.connect('server','name','pass','database')
c = db.cursor()
c.execute("sql statement")
list1 = c.fetchall()
#SQLAlchemey
params = (
"DRIVER = "+driver+";"\
"SERVER = "+server+";"\
"DATABASE = "+database+";"\
"PORT = "+port+";"\
"UID = "+user+";"\
"PWD= "+password+";"
params = urllib.parse.quote_plus(params)
connexion_string = 'sybase+pyodbc:///?odbc_connect = %s'%params)
Thanks
sql server to gcs with airflow please
Will make a video on this topic soon. Thanks
@@BiInsightsInc
I'm really waiting for that
@@WEN_the_cat_Astonot SQL Server to GCP video is up.
ruclips.net/video/Nq8Td8h_szk/видео.html&t
Hi, Can you share the source code ?
Please check the description of the video. All the resources including link to the repo is there. Thanks
@@BiInsightsInc yes.. I commented too fast :) found them and forgot to fix my comment!! thanks!
but why you importing pandas
Pandas is used to query the database and carry out the transformation.
👍
how to add the "Conn Type" of mssql? ruclips.net/video/eZfD6x9FJ4E/видео.html
Check out the Airflow installation with SQL Server provider. I go over on how to create a SQL Server connection.
How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker
ruclips.net/video/t4h4vsULwFE/видео.html