- Videos: 187
- Views: 433,319
Astronomer
United States
Joined 3 Jan 2017
Astronomer is the commercial developer of Apache Airflow, a community-driven open-source tool that's leading the market in data orchestration. We're a global, venture-backed team of learners, innovators and collaborators working to build an enterprise-grade product that makes it easy for data teams at Fortune 500s and startups alike to adopt Apache Airflow.
Quickstart ETL with Airflow (Step 9 of 9)
Sign up for a free Astro trial!
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?).
See this GitHub repository for the full code (github.com/astronomer/quickstart-etl-with-airflow-videos).
Views: 17
Videos
Quickstart ETL with Airflow (Step 8 of 9)
13 views · 2 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). CLI commands: psql -p 5434 -U postgres -d airflow_db (this will vary based on your Postgres setup); SELECT * FROM weather_data.sunset_table; See this GitHub repository for the full code (github...
Quickstart ETL with Airflow (Step 7 of 9)
27 views · 4 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). See this GitHub repository for the full code (github.com/astronomer/quickstart-etl-with-airflow-videos/blob/main/include/vid7_code.py).
Exploring the Power of Airflow 3 at Astronomer with Amogh Desai
97 views · 7 hours ago
What does it take to go from fixing a broken link to becoming a committer for one of the world’s leading open-source projects? Amogh Desai, Senior Software Engineer at Astronomer, takes us through his journey with Apache Airflow. From small contributions to building meaningful connections in the open-source community, Amogh’s story provides actionable insights for anyone on the cusp of their op...
Quickstart ETL with Airflow (Step 6 of 9)
26 views · 7 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). See this GitHub repository for the full DAG code (github.com/astronomer/quickstart-etl-with-airflow-videos/blob/main/include/vid6_code.py) and SQL statement (github.com/astronomer/quickstart-e...
Quickstart ETL with Airflow (Step 5 of 9)
20 views · 9 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). Connection UI fields: Connection ID: my_postgres_conn; Connection Type: Postgres; Host: your Postgres host; Database: your Postgres database; Login: your Postgres login; Password: your Postgres pas...
Quickstart ETL with Airflow (Step 4 of 9)
31 views · 12 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). See this GitHub repository for the full code (github.com/astronomer/quickstart-etl-with-airflow-videos/blob/main/include/vid4_code.py).
Quickstart ETL with Airflow (Step 3 of 9)
38 views · 14 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). See this GitHub repository for the full code (github.com/astronomer/quickstart-etl-with-airflow-videos/blob/main/include/vid3_code.py).
Quickstart ETL with Airflow (Step 2 of 9)
49 views · 16 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). See this GitHub repository for the full code (github.com/astronomer/quickstart-etl-with-airflow-videos/blob/main/include/vid2_code.py).
Quickstart ETL with Airflow (Step 1 of 9)
113 views · 19 hours ago
To learn more about best practices for ETL with Airflow, get the Apache Airflow® Best Practices for ETL and ELT Pipelines eBook (www.astronomer.io/ebooks/apache-airflow-best-practices-etl-elt-pipelines/?). Commands and code from the video: docker ps; brew install astro; astro dev init; astro dev start. Log in at localhost:8080 with admin for the username and password.
Using Airflow To Power Machine Learning Pipelines at Optimove with Vasyl Vasyuta
76 views · 1 day ago
Data orchestration and machine learning are shaping how organizations handle massive datasets and drive customer-focused strategies. Tools like Apache Airflow are central to this transformation. In this episode, Vasyl Vasyuta, R&D Team Leader at Optimove, joins us to discuss how his team leverages Airflow to optimize data processing, orchestrate machine learning models and create personalized c...
Maximizing Business Impact Through Data at GlossGenius with Katie Bauer
74 views · 14 days ago
Bridging the gap between data teams and business priorities is essential for maximizing impact and building value-driven workflows. Katie Bauer, Senior Director of Data at GlossGenius, joins us to share her principles for creating effective, aligned data teams. In this episode, Katie draws from her experience at GlossGenius, Reddit and Twitter to highlight the common pitfalls data teams face an...
Optimizing Large-Scale Deployments at LinkedIn with Rahul Gade
85 views · 21 days ago
Scaling deployments for a billion users demands innovation, precision and resilience. In this episode, we dive into how LinkedIn optimizes its continuous deployment process using Apache Airflow. Rahul Gade, Staff Software Engineer at LinkedIn, shares his insights on building scalable systems and democratizing deployments for over 10,000 engineers. Rahul discusses the challenges of managing larg...
How Uber Manages 1 Million Daily Tasks Using Airflow, with Shobhit Shah and Sumit Maheshwari
106 views · 1 month ago
When data orchestration reaches Uber’s scale, innovation becomes a necessity, not a luxury. In this episode, we discuss the innovations behind Uber’s unique Airflow setup. With our guests Shobhit Shah and Sumit Maheshwari, both Staff Software Engineers at Uber, we explore how their team manages one of the largest data workflow systems in the world. Shobhit and Sumit walk us through the evolutio...
Building Resilient Data Systems for Modern Enterprises at Astrafy with Andrea Bombino
96 views · 1 month ago
Efficient data orchestration is the backbone of modern analytics and AI-driven workflows. Without the right tools, even the best data can fall short of its potential. In this episode, Andrea Bombino, Co-Founder and Head of Analytics Engineering at Astrafy, shares insights into his team’s approach to optimizing data transformation and orchestration using tools like datasets and Pub/Sub to drive ...
Actionable Pipeline Insights with Astro Observe
109 views · 1 month ago
Inside Airflow 3: Redefining Data Engineering with Vikram Koka
174 views · 1 month ago
Building a Data-Driven HR Platform at 15Five with Guy Dassa
70 views · 2 months ago
The Intersection of AI and Data Management at Dosu with Devin Stein
108 views · 2 months ago
AI-Powered Vehicle Automation at Ford Motor Company with Serjesh Sharma
144 views · 3 months ago
From Task Failures to Operational Excellence at GumGum with Brendan Frick
137 views · 3 months ago
Building Modern Data Apps: Choosing the Right Foundation and Tools
84 views · 3 months ago
From Sensors to Datasets: Enhancing Airflow at Astronomer with Maggie Stark and Marion Azoulai
145 views · 3 months ago
Mastering Data Orchestration with Airflow at M Science with Ben Tallman
123 views · 3 months ago
Enhancing Business Metrics With Airflow at Artlist with Hannan Kravitz
100 views · 4 months ago
Cutting-Edge Data Engineering at Teya with Alexandre Magno Lima Martins
435 views · 4 months ago
Airflow Strategies for Business Efficiency at Campbell with Larry Komenda
764 views · 4 months ago
The title is misleading. This video is not about Airflow 3. It's about an individual contributor's journey to becoming an Airflow committer. That's cool, but not what I came here for. Please do better in the future.
Ma'am, when I run the command astro dev start I get the following error: Error: error building, (re)creating or starting project containers: Error response from daemon: error while creating mount source path '/host_mnt/Users/Bingumalla Likith/Desktop/MLOPS/airflow-astro/dags': mkdir /host_mnt/Users/Bingumalla Likith/Desktop: operation not permitted. Can you help me out with it? I'm using a Mac.
How about the logs?
I am able to install astro, but I get an access denied error when using astro dev init or any other astro command.
Insightful 🙏
@Astronomer, Hello! Could you please tell me how you open the .html documentation that is generated inside the Airflow Docker container through the web interface? When I navigate to the "data_docs_url" (file:///opt/airflow/gx/uncommitted/data_docs/local_site/index.html), I get a 404 error.
The issue with the KubernetesExecutor is that you cannot view the task logs in the Airflow UI because, with KubernetesExecutor, workers are terminated after their job finishes. This issue is not present with the Celery or CeleryKubernetesExecutor. I tried different solutions with Persistent Volumes (PV) and Persistent Volume Claims (PVC), but they didn’t work for me. At the end of the video, Marc also presented the issue, but no solution was provided. Does anyone here know how to resolve it?
Hey there, thanks for commenting. It is absolutely possible to get task logs in the Airflow UI when using K8s Executor. If you're working with OSS Airflow, you will need to either enable remote logging so Airflow grabs the logs before the pod spins down, or use a persistent volume to store them. With Astronomer, this is all handled automatically in our Astro Runtime. I'd recommend reading more in the docs here: airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/kubernetes_executor.html
Say I have a DAG run happening and I dynamically update the DAG's tasks. Will it break the existing DAG run? Say that particular DAG run has 10 tasks to do, and I update the DAG while it's doing the first task. Will it complete with the old tasks, and will a newly triggered DAG run use the new, updated tasks?
What I would love to see is how to do this when dbt and Airflow are in separate repositories. Given the different dependencies between Airflow and dbt, this seems like a common use case.
How cool, my son!!! Congratulations.
The best❤️🔥
Excellent 👏👏
Hi, does dag.test() also work for complex tasks such as SparkSubmitOperator()? 🙏
Great job, thanks for sharing it.
Are the presented examples available in any code repo?
Thank you very much for the great presentation and hands-on session. We are going to use Airflow in EKS, and our development team needed a way to simulate their local environment to test their DAGs during development and become familiar with Airflow on Kubernetes. Your guide was extremely helpful.
Thanks
After adding the Python file and HTML file and restarting the web server, the plugin details are visible under Admin > Plugins. But the view is not populating in Cloud Composer. Is there anything else that needs to be done?
After adding the Python file and HTML file and restarting the web server and Postgres from Docker, the view is still not populating in my local Airflow. Is there anything else that needs to be done? I'm running Airflow from a Docker setup; my Airflow version is 1.10.15, pretty old, but I can't switch to a newer version right now.
Hi, is it possible to schedule a task using a dataset, or is that controlled at the DAG level? I mean, if I have 2 tasks in a downstream DAG, do I have the option to customize the schedule based on a task's upstream dataset?
Do you have LinkedIn?
Great!
Can you provide more context on the batch inference pipeline? Airflow is an orchestrator, so do you need a different framework to perform the batch inference?
Very informative, thank you!
Really helpful, thank you 😍
Tons of information. Any chance this can be put in a GitHub repo for those of us engineers who need more time to digest it?
Thank you, well explained. I created an Express application to create DAGs programmatically, but the endpoints are not working.
This is really awesome. I love the entire video and always love content from you all, but could I please give some constructive feedback?
Is it possible to get the list of variables pushed through xcom_push in the first task (the extract step, let's say)? And can we pull that variables list with xcom_pull and have it as a group dynamically (instead of A, B, C)?
What if any of the subtasks fails? How do you trigger the error handling then, while still letting the remaining parallel tasks run?
@Astronomer Please share a direct link to the CLI library you mention (for the proper file structure): ruclips.net/video/zVzBVpbgw1A/видео.htmlsi=HiJa9Afi-53yLZOG&t=873
You can find documentation on the Astro CLI, including download instructions, here: docs.astronomer.io/astro/cli/overview
Do we have a video on how to run Airflow using Docker on cloud containers? Running locally is fine to learn and test, but the real work is seeing how it runs in the cloud. I am a consultant and for my clients an easy setup is the goal; with Airflow I don't see that.
Astronomer provides a managed service for running Airflow at scale and in the cloud. You can learn more at astronomer.io/try-astro
great work, keep it up.
Hi, this video is really beneficial. I have some questions about the best practice for handling data transmission between tasks. I am building MLOps pipelines using Airflow. My model-training DAG contains data preprocessing -> model training, so there would be massive data transmission between these 2 tasks. I am using XCom to transmit data between them, but there's roughly a 2 GB limitation in XCom. So what's the best practice to deal with this problem? Using S3 to send/pull data between tasks? Or should I simply combine these 2 tasks (data preprocessing -> model training)? Thank you.
Thank you! For passing larger amounts of data between tasks you have two main options: a custom XCom backend or writing to intermediary storage directly from within the tasks. In general we recommend a custom XCom backend as a best practice in these situations, because you can keep your DAG code the same, the change happens in how the data sent to and retrieved from XCom is processed. You can find a tutorial on how to set up a custom XCom backend here: docs.astronomer.io/learn/xcom-backend-tutorial. Merging the tasks is generally not recommended because it makes it harder to get observability and rerun individual actions.
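For illustration, here is a minimal sketch of what such a custom XCom backend could look like, assuming S3 as the intermediary store. The class name, bucket name, and file layout are hypothetical, and the exact BaseXCom method signatures vary slightly between Airflow versions, so treat this as a starting point rather than the tutorial's exact code:

import tempfile
import uuid

import pandas as pd
from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3XComBackend(BaseXCom):
    # Enable by pointing the core xcom_backend config (e.g. the
    # AIRFLOW__CORE__XCOM_BACKEND environment variable) at the dotted path of this class.
    PREFIX = "xcom_s3://"
    BUCKET_NAME = "my-xcom-bucket"  # hypothetical bucket name

    @staticmethod
    def serialize_value(value, **kwargs):
        # Offload DataFrames to S3 and store only a reference string in the metadata db.
        if isinstance(value, pd.DataFrame):
            hook = S3Hook()
            key = f"data_{uuid.uuid4()}.csv"
            with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
                value.to_csv(tmp.name, index=False)
                hook.load_file(filename=tmp.name, key=key, bucket_name=S3XComBackend.BUCKET_NAME, replace=True)
            value = S3XComBackend.PREFIX + key
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        # Resolve the reference back into a DataFrame when a downstream task pulls it.
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(S3XComBackend.PREFIX):
            hook = S3Hook()
            key = value.replace(S3XComBackend.PREFIX, "")
            with tempfile.TemporaryDirectory() as tmp_dir:
                local_path = hook.download_file(key=key, bucket_name=S3XComBackend.BUCKET_NAME, local_path=tmp_dir)
                value = pd.read_csv(local_path)
        return value

With a backend like this in place the DAG code itself stays unchanged; only where the XCom data physically lives is different, which is why this approach is preferred over merging tasks.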
@Astronomer Hi, thanks for your valuable reply. I would also like to ask what level of granularity we should aim for when allocating tasks, since the more tasks there are, the more pushing and pulling of data from external storage happens, and when the data is large this brings some network overhead.
Great video. Would also be interested in a webinar regarding scaling the Airflow database since I'm having some difficulties of my own with that.
Noted, thanks for the suggestion! If it's helpful, you can check out our guide on the metadata db docs.astronomer.io/learn/airflow-database. Using a managed service like Astro is also one way many companies avoid scaling issues with Airflow.
Great video. I'm trying to make this work with LivyOperator; do you know if it can be expanded, or partial arguments supplied to it?
It should work. Generally you can map over any type of operator, but note that some parameters can't be mapped over (e.g. BaseOperator params). More here: docs.astronomer.io/learn/dynamic-tasks
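As a rough sketch of the pattern (assuming Airflow 2.4 or later), the .partial()/.expand() calls look like this; BashOperator is used here only to keep the example self-contained, and the DAG id and commands are illustrative, but the same structure applies to LivyOperator with its own parameters:

from pendulum import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def mapped_operator_example():
    # Constant arguments go into .partial(), mapped arguments into .expand();
    # Airflow creates one task instance per element of the mapped list at runtime.
    BashOperator.partial(task_id="process_file").expand(
        bash_command=[
            "echo processing a.csv",
            "echo processing b.csv",
            "echo processing c.csv",
        ]
    )


mapped_operator_example()

The list passed to .expand() can also come from an upstream task's output instead of a literal, which is how you would feed arguments generated at runtime; the linked guide covers which parameters of a given operator are mappable.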
32:29 why "test' connection button is disabled. SO frustrating. Aifrflow makes it so hard to connect to anything. Not intuitive at all. And your video just skipped on how to enable "test". And ask me to contact my deployment admin. lol, I am the deployment admin. Can you show me how? I checked its website and the documentation is not helpful at all. I have been stuck for over a week on how to connect airflow to an MSSQL Sever.
The `test` connection button is disabled by default starting in Airflow 2.7 for security reasons. You can enable it by setting the test_connection core config to Enabled. docs.astronomer.io/learn/connections#test-a-connection. We also have some guidance on connecting to an MSSQL server, although the process can vary depending on your exact setup: docs.astronomer.io/learn/connections/ms-sqlserver
@Astronomer Hi, where can I find the core config to make this update? I'm currently using the Astro CLI. I'm not seeing this setting in the two .yaml files in the project. Thank you.
Is it good to return a DataFrame many times in Airflow?
It's generally fine to pass dataframes in between your Airflow tasks, as long as you make sure your infrastructure can support the size of your data. If you use XCom, it's a good idea to consider a custom XCom backend for managing dataframes as Airflow's metadata db isn't set up for this specifically.
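As a minimal sketch of what that hand-off looks like with the TaskFlow API (the DAG name and sample data are illustrative, and serializing DataFrames via plain XCom depends on your Airflow version and XCom settings, so a custom XCom backend is the safer bet for anything large):

import pandas as pd
from pendulum import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dataframe_handoff():
    @task
    def extract() -> pd.DataFrame:
        # Small illustrative frame; a real task might read from an API or a database.
        return pd.DataFrame({"city": ["Berlin", "Lima"], "temp_c": [21.5, 18.0]})

    @task
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # The returned value of extract() is passed to transform() via XCom.
        df["temp_f"] = df["temp_c"] * 9 / 5 + 32
        return df

    transform(extract())


dataframe_handoff()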
Hi, I already have an existing Airflow project, so how can I use the Astro CLI to run my project?
Is the Git repository public?
Yes! You can find it here: github.com/astronomer/webinar-demos/tree/best-practices-prod
Thanks!! 🙂 @Astronomer
Please share the repository.
The repo is here: github.com/astronomer/webinar-demos/tree/best-practices-prod
That is a great intro and overview of Airflow for beginners! I very much like the datasets concept and the ability to see data lineage. However, I haven't found a solution for how to make a triggered, dataset-aware pipeline execute with the parent DAG's execution date. Is that even possible at the moment?
Thanks! And that is a great question. It is not possible to have the downstream Dataset-triggered DAG share the same logical_date (the new parameter equivalent to the old execution_date) as the DAG that caused the update to the dataset, but it is possible to pull that date in the downstream DAG by accessing context["triggering_dataset_events"]:

@task
def print_triggering_dataset_events(**context):
    triggering_dataset_events = context["triggering_dataset_events"]
    for dataset, dataset_list in triggering_dataset_events.items():
        print(dataset, dataset_list)
        print(dataset_list[0].source_dag_run.logical_date)

print_triggering_dataset_events()

If you use the above in your downstream DAG you can get that logical_date/execution_date to use in your Airflow tasks. For more info and an example with Jinja templating see: airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html#fetching-information-from-a-triggering-dataset-event
@Astronomer That is amazing! You are my hero for life! Thank you!
Hi, thank you for the detailed demo. I just started exploring dynamic task mapping and I have the following requirement, where I need to get the data from a metadata table and create a list of dictionaries:

[
    {'colA': 'valueA', 'colB': 'valueB', 'colC': 'valueC', 'colD': 'valueD'},
    {'colA': 'valueA', 'colB': 'valueB', 'colC': 'valueC', 'colD': 'valueD'},
    {'colA': 'valueA', 'colB': 'valueB', 'colC': 'valueC', 'colD': 'valueD'},
]

The above structure can be generated using fetch_metadata_task (a combination of BigQueryHook and PythonOperator). Now the question is: how do I generate the dynamic tasks from the above list of dictionaries? For each dictionary I want to perform a set of tasks, e.g. GCSToBigQueryOperator, BigQueryValueCheckOperator, BigQueryToBigQueryCopyOperator, etc. The sample DAG dependencies look like this:

start_task >> fetch_metadata_task
fetch_metadata_task >> [GCSToBigQueryOperator_table1 >> BigQueryValueCheckOperator_table1 >> BigQueryToBigQueryCopyOperator_table1 >> connecting_dummy_task]
fetch_metadata_task >> [GCSToBigQueryOperator_table2 >> BigQueryValueCheckOperator_table2 >> BigQueryToBigQueryCopyOperator_table2 >> connecting_dummy_task]
fetch_metadata_task >> [GCSToBigQueryOperator_table3 >> BigQueryValueCheckOperator_table3 >> BigQueryToBigQueryCopyOperator_table3 >> connecting_dummy_task]
connecting_dummy_task >> BigQueryExecuteTask >> end_task
Is there any option available in the Airflow UI to auto-trigger?
Great video, with many examples, much appreciated!