A main issue with defining a function inside another function is that it's impossible to unit test. But testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
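A minimal sketch of that abstraction-layer idea, assuming the Airflow 2.x TaskFlow API and a made-up `transform_orders` function: the business logic lives in a plain module that pytest can import directly, and the Airflow-specific part is only a thin wrapper.

```python
# transforms.py -- plain Python, unit-testable, no Airflow imports
def transform_orders(rows: list[dict]) -> list[dict]:
    """Keep only shipped orders and normalize the amount field."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("status") == "shipped"
    ]

# orders_dag.py -- thin Airflow adapter around the tested function
from datetime import datetime
from airflow.decorators import dag, task
from transforms import transform_orders

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def transform(rows: list[dict]) -> list[dict]:
        # All real logic lives outside Airflow and can be tested without it.
        return transform_orders(rows)

    transform([{"status": "shipped", "amount": "10.50"}])

orders_pipeline()
```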
Thank you for the POV. Take a look at dbt from Fishtown Analytics too. I think version control needs to be a core requirement for any tool that is responsible for moving data. This might be a problem if the solution isn't code-based.
Thanks for your feedback. Do you use dbt or work for Fishtown? Source code control can take many forms. SSIS stores its programs as XML which can be placed under SSC. The level of and need for SSC depends on the project requirements. For example, in a small shop where one person maintains the code, ease of use and a GUI may outweigh the need for SSC assuming the ETL object snapshots can be stored.
Airflow is great. Coupled with Kubernetes you don't have to stick to Python anymore. The only drawback I saw was that DAGs don't scale when they have a huge number of tasks, though that's easy to solve by splitting the DAG.
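One common way to split an oversized DAG, sketched below with the stock TriggerDagRunOperator on a recent Airflow 2.x install; the DAG ids (`ingest_raw`, `build_marts`) are made-up placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# A small "controller" DAG that chains two smaller DAGs instead of
# holding thousands of tasks itself.
with DAG(
    dag_id="controller",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    ingest = TriggerDagRunOperator(
        task_id="trigger_ingest",
        trigger_dag_id="ingest_raw",      # placeholder downstream DAG
        wait_for_completion=True,         # block until it finishes
    )
    marts = TriggerDagRunOperator(
        task_id="trigger_marts",
        trigger_dag_id="build_marts",     # placeholder downstream DAG
        wait_for_completion=True,
    )
    ingest >> marts
```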
Thanks for the comment. You do have to define the DAGs in Python. What do you use Airflow for?
There are different ways to use Airflow; you can rely on Kubernetes Pods to run Docker containers. In recent versions, you can scale out the schedulers to solve task scheduling issues. Nowadays, anyone can give an opinion just by reading Wikipedia and some basic examples. It's not an ETL solution. It's just an orchestrator with batteries.
It definitely has a steep learning curve. I was not able to deploy with Airflow, so I switched to Dagster, which was way simpler; I was able to spin up a task schedule within a day.
I think there's a misunderstanding. Airflow is NOT an ETL tool, and I don't think it was ever meant to be, or marketed as such. It's rather an unfortunate confusion in the minds of many between workflow management / orchestration (which Airflow DOES) and the ETL tasks that actually implement the data transformations making up the ETL pipeline (which should not be, and usually are not, Airflow tasks). With Airflow on an AWS service we run nightly ingestion of RDBMS data into AWS S3; all the tables in a given schema are processed in parallel Airflow tasks, but each of these tasks just calls an Informatica script which actually does the job of ingesting a given DB table. So, again, if people don't understand the meaning of orchestration, don't blame the tool 😁
In Airflow the correct pattern is not to write top-level code. In the Postgres example, in production you could have a dbt model or a .sql file that contains the queries. Airflow specifically says not to do processing on Airflow itself; it's used for kicking off jobs elsewhere, monitoring them, and dealing with the results. Some examples would be running a Glue job, running Apache Spark, starting an EMR cluster, making a RESTful call to an API, or training a model on a Ray cluster. Most of your code to do the data processing should live on those platforms and can be in Scala, Java, Golang, or C. Apache NiFi is also good for ETL, but parts of it require that data move through its processors, such as converting from one file format to another or regexing columns. So some parts of it need more compute to process the data, requiring you to scale the NiFi cluster. NiFi 1.x is Java only, and only recently, with 2.x, is Python supported.
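A hedged sketch of that "kick off the work elsewhere" pattern, assuming the Amazon and common-sql provider packages are installed; the job name, connection id, and file path are invented for illustration. The DAG only triggers and monitors; the heavy lifting runs on Glue and in the database.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="nightly_load",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Heavy processing happens on AWS Glue, not on the Airflow workers.
    run_glue = GlueJobOperator(
        task_id="run_glue_job",
        job_name="clean_orders",          # hypothetical Glue job
    )
    # SQL lives in a separate file, not in top-level DAG code.
    load_marts = SQLExecuteQueryOperator(
        task_id="load_marts",
        conn_id="warehouse",              # hypothetical connection
        sql="sql/load_marts.sql",
    )
    run_glue >> load_marts
```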
Thank you, Bryan, for your videos. They are really useful. It would be very kind of you to make lessons about Apache NiFi, especially how to choose processors for the needed actions.
"you are only limited to python" -- I don't think this is a bad thing. Python is a stable and versatile language with libraries for everything. "it's complex" --- If the developer already knows Python, imo, airflow isn't difficult to learn. "Requires 100% coding" -- I see this as an advantage. I'm using both Airflow and Pentaho. With Pentaho, code review is just painful because the raw code is in XML which makes it difficult to read and to keep track of. Also, there's not a huge user base like python or airflow. So there isn't much help out there on stackoverflow.
I'm not against visual programming or low-code tools, but if your team are experienced developers, they'll be more productive and happier using all code. Nevertheless, ETL tools have their place. Let your teams use the tools that are most suited for the job and use Airflow as a centralized orchestrator. You can orchestrate ADF pipelines, for instance.
This is a cool vid. I personally love Airflow; I use it mostly as an interface to K8s and run applications on pods. I think in terms of "if I can write a container for the task, then I can orchestrate it in Airflow". You're not wrong though, it did take time to learn the intricacies of Airflow (both in code and the UI). Our company practice is to make reusable functions that generate DAGs, which reduces the code for creating workflows per use case down to just a function call. Thanks for putting this video together. I learned about some good alternatives.
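A rough sketch of that DAG-factory pattern under assumed names; each call stamps out a full DAG, so the per-use-case code shrinks to one line.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

def make_job_dag(dag_id: str, command: str, schedule: str = "@daily") -> DAG:
    """Factory that builds a standard two-step DAG for a given workload."""
    with DAG(dag_id=dag_id, schedule=schedule,
             start_date=datetime(2024, 1, 1), catchup=False) as dag:
        prepare = BashOperator(task_id="prepare", bash_command="echo preparing")
        run = BashOperator(task_id="run", bash_command=command)
        prepare >> run
    return dag

# One line per use case; module-level assignments let Airflow discover the DAGs.
etl_daily = make_job_dag("etl_daily", "python3 /opt/jobs/etl.py")
ml_weekly = make_job_dag("ml_weekly", "python3 /opt/jobs/train.py", "@weekly")
```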
I use Airflow, too, but I totally agree with Bryan's point of view. Airflow is a powerful tool, but the other side of the coin is its steep learning curve, especially for new Airflow users. Most of the time I just need to do simple stuff, and I find using Airflow leads to over-engineering. Lots of people use the Kubernetes operator; the biggest problem I see with it is that a lot of the time I have one common base Docker image, but I need to bundle different code into that Docker image just for the sake of using the Kubernetes operator.
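One possible workaround, sketched under assumptions: keep a single base image and vary only the command per task instead of baking new images. The namespace, registry path, and module name are invented, and the import path can differ slightly between versions of the cncf.kubernetes provider.

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Defined inside a DAG as usual. Same base image for every task; only the
# entrypoint arguments change, so no per-use-case image rebuild is needed.
extract = KubernetesPodOperator(
    task_id="extract",
    name="extract",
    namespace="data-jobs",                           # hypothetical namespace
    image="registry.example.com/etl-base:latest",    # shared base image
    cmds=["python", "-m", "jobs.extract"],           # hypothetical module baked into the image
    get_logs=True,
)
```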
@@BryanCafferky Thank you for posting this video. It's THE best video that I've encountered that explains what Airflow is. A lot of people in my company use Airflow for use cases that are not a fit for it.
Very nice video, Bryan! What is your take on Prefect? They highlighted a few shortcomings in Airflow, hence Prefect. But Airflow in its recent versions came up with less boilerplate. Happy to hear back from you on Prefect. Thanks!
As far as I know, Airflow is used for "scheduling" the ETLs, not "creating" the ETLs. So, can you perform both "creating" and "scheduling" operations via AWS Glue?
I've not used Glue but the docs say you can. "AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs." For time based scheduling see docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
@@BryanCafferky Thank you so much for your attention. Then, if you are working for a company that uses a cloud platform, you do not actually even need Airflow.
@@halildurmaz7827 YW. If you just need to do ETL work, you don't need Airflow. If you need complex task orchestration, i.e. workflows, Airflow might be a good option.
You cannot compare Apache Airflow with Apache NiFi. These two tools aren't mutually exclusive, and they both offer some exciting capabilities and can help with resolving data silos. It's a bit like comparing oranges to apples - they are both (tasty!) fruits but can serve very different purposes.
YW. Yeah. I wonder if they all use it or just list it in job ads. But it could be. The best tool is often not the one selected for the job. Thanks for watching.
We are actually considering this to replace some old batch processes we inherited. These processes are created in a no code solution and we cannot stand it.
Have you looked at Dagster? It addresses most of the issues I mentioned and has an excellent data-object-centric model. It's all Python based too. See dagster.io/ It's code centric but provides a lot of value for the code you write.
I think you are mistakenly comparing Airflow to AWS Glue where AWS Step Functions (maybe also with AWS Glue) are a better representation of what it seems you get from Airflow. I'm not an expert in Airflow, but based on what is shown here in this video, that is the impression I get.
Thanks for the feedback. My intent was that ETL-focused tools include SSIS, Informatica, Databricks, Azure Data Factory, NiFi, Pentaho, etc. Airflow is a workflow orchestrator. I saw many places where it is promoted as an ETL service. It is not an ETL tool, although it can be used to orchestrate ETL work. However, unless there are many task dependencies, it is probably overkill.
Just to make sure I’m understanding correctly: 1. If a company uses Azure or AWS they can just use ADF or AWS Glue instead of Airflow? If so it seems that Airflow is more for companies who do end-to-end python ETL/ELTs and don’t wanna pay for ADF or AWS Glue? 2. I’m a bit confused between what you said because a couple articles on the web and answers on Quora say that Apache NiFi is NOT a replacement for Apache Airflow. So are there things that Apache NiFi can’t do that Airflow can? 3. I really don’t wanna learn Airflow because of the learning curve but some jobs do require it :/ so if Apache NiFi can replace it I’d rather just use that. Do you know of any good resources to learn Apache NiFi or do you plan on making videos on it?
Thanks for your thoughts. Airflow can be a good solution, but my point was that it is not an ETL tool. It is a job scheduler or orchestrator. It is often promoted as an ETL solution, which I think is misleading. But yes, I too see jobs that ask for it. For complex workflows, it may make sense, especially streaming or something with complex dependencies. Bear in mind a given workflow cannot run concurrently with itself, i.e. each run must go from start to finish before it can start again. I would google NiFi or check Amazon for books. The documentation online looks pretty good. NiFi videos might be something I'll do in the future. It looks pretty cool.
Hello, I have Airflow running on my machine with PostgreSQL as the scheduler's backend and the LocalExecutor, but when I put my DAGs to run it consumes a lot of server CPU. How could I solve this high consumption problem?
If you are running it all on your machine, then it sounds like your machine may not have enough power to support it. You could deploy Airflow to cloud VMs or Kubernetes cluster to get more resources. This Stack Overflow talks about limiting Airflow memory consumption. stackoverflow.com/questions/52140942/airflow-how-to-specify-quantitative-usage-of-a-resource-pool This blog discusses how to configure Airflow with setting for max_threads, worker_concurrency, etc. medium.com/@sdubey.slb/high-performance-airflow-dags-7ad163a9f764
Which tool is recommended for a project where you have to be calling these jobs every 20 seconds? I suppose this is better for tasks that run once or twice a day and not in a constant loop, right? Only 10% of my tasks are daily or weekly. Any recommendations?
I think a huge characteristic of Airflow is that it is a static tool (which I personally really don't like, but let's try to keep it neutral). If you want to change something, you'll need to change the underlying code deployed to the server where Airflow is running. This means first going through the whole procedure to change the code, review it, run it through CI/CD, and then ship the code to the server (or probably just redeploy the server). That's a long process; you can take some shortcuts, but you'll never have an experimental mode or fast prototyping. Even when working with a test instance, it's still a slow process.

For some use cases, that's great, because there is always a definite, reliable, and versioned description of what's going on. But if you need to change workflows and aren't sure whether they work fine (e.g. because the production cluster is different in terms of performance from your development cluster, or whatever), the development speed goes down drastically. Even if you don't want to try it out live, you either have a lot of latency going to the development cluster or you need a huge machine, as you need to run it locally in a K8s setup (for realistic scenarios in enterprises). There are benefits to having everything in code and inside GitOps, but it's certainly not fast prototyping.

The comparison to cron is very true. The only way to really check that it runs is to deploy it (like for cron, too), but you should only deploy what you are sure runs, so it's a chicken-and-egg problem. You can run tests, but they don't look the same as the usual tests in Python or in ETL or in SQL databases or in pandas, they are complex to write, and failure modes might be difficult to understand (especially checking all possible triggering rules). I personally would in most cases prefer a dynamic tool that I could easily change while it is running. (You might still want to block changes on the production system, but at least for the development or staging environment, this is what I really missed when working with Airflow.) But yeah, the visualizations are awesome, and explaining the complexity of a system to stakeholders works much more easily. So, in practice, you'll get a lot of acceptance even if the work is slow, which counterbalances it significantly.
First, thanks for this video and for your insights. I understand why you said some things, but I don't agree with most of it. You're right that Airflow is a great job scheduler, not an ETL/ELT tool. But from my experience, neither is NiFi, not if you want to do some long, complex batch jobs; each block is autonomous and they don't wait for the previous one to complete (the others I have very little experience with, so apart from being pretty expensive...). I think the strength of Airflow, the reason I chose to use it, is the level of control you get and the diversity of jobs/tools you can use. It can start with a bash script calling a Talend job that loads your DB and then a dbt job that processes it. You can further split your dbt into tests and loads, and when there's a failure, rerun from the point of failure. These are features I have seen in expensive enterprise tools such as Control-M. It does have a steep learning curve, but looking at the trends in the market today and the way teams are being structured - engineers for infrastructure and analysts for the BI part - I think it's a good choice.
Thanks. NiFi is documented as only an ETL tool and seems to fit that from what I read, though I have not used it. As I discuss later in the video, Airflow can be a good choice as a scheduler if you need the sophistication, i.e. DAGs, it offers. I purposely titled the video to alert people who think Airflow is an ETL tool that it is not. That's what I wanted to use it for, and after reading a book on it, I realized it's not an ETL tool. It is a workflow engine. There's a similar one in Windows that works with C#. It's fine if that's what you need. Airflow seems great for complex ML pipelines. On SQL Server, I have used SQL Server Agent, which worked well for that environment. It had sufficient dependency management and control for most jobs. The best ETL service to use depends on your environment and requirements: Databricks Notebooks for Spark, Azure Data Factory for the Azure cloud, Pentaho, Informatica, etc. I appreciate the feedback. Good thoughts.
A social network is a perfect example of a graph that is neither directed (Facebook connections are not unilateral relations) nor acyclic (Tom knows Sally, Sally knows Bob, Bob knows Tom - there's your cycle). Airflow not having ETL functionality is a perfect example of the Unix philosophy: do one thing, and do it well. It not being low code is a strength rather than a limitation. Whenever I'm constrained by the whims of a low-code platform, I always end up having to struggle immensely to find some way to get the job done whenever my use case doesn't match exactly with what the platform expects. A lot of misses at the start of the video already. Not sure if the rest of the video is worth anyone's time.
Coming from legacy ETL, I am kind of confused. As you said, a lot of people marketed it as an ETL tool, and when I look closely, I totally agree with you that it is cron on steroids. I guess it is marketed as ETL because Python has pandas, which makes it relatively easy to ingest data compared to other frameworks, but doing real heavy ETL in pandas is not a perfect way to do it.
Depending on your needs and environment, you can use different tools: Azure Data Factory for Azure, AWS Glue, Databricks Notebooks for Databricks which runs on Spark, Pentaho, Informatica, etc. Lots of choices.
Sorry, but there are many misleading statements here. Firstly, you are not coerced to use Python in your tasks; you can perfectly well orchestrate almost anything if you put your code in an image (so yeah, you can use NodeJS, Java, etc.). The learning curve is no more complicated than the one for learning any other framework, like Django (obviously we are in the "data processing" domain here). Most of all, it is a powerful tool to organize your tasks when using a bunch of cron jobs in microservices is not an option.
Thanks for your comment. Your code to orchestrate must be Python, which is a limitation. Parsimony is key. For a given project, the question is 'Do I need to take on the overhead of creating and maintaining code just to orchestrate work?' Code which can break. Absorb the learning curve time and the future skill set needed for employees. It is powerful, but with great power comes great responsibility. I don't think most data movement/transformation cases need Airflow.
@@BryanCafferky The validity of the reasoning is best observed if you compare Airflow with Alteryx and contrast them. Then we really see the difference in the learning curve. Alteryx and Kettle allow non-devs to make ETL pipelines, and the learning curve for non-devs is shorter. Am I correct in my assumptions? Thanks for the video. It was a time saver really.
I don't really understand Python being a limitation here. It's just the technology and ecosystem Airflow is using. SSIS, ADF, Pentaho... they all have their limitations in the ecosystems they sit in. As for maintaining code... the same applies to SSIS, ADF, etc.; only there you build the logic using a visual tool instead of all code. Airflow has lots of pre-built provider packages for database actions, ADF, Databricks, non-data-related stuff, and more, which you can use so you don't need to build tasks from scratch. Thanks for the vid btw. Your other points were valid. Airflow is indeed an orchestrator, not an ETL tool. 😊
DAG is a bad name for a task schedule. 1. Directed graphs are obviously necessary if you want to define an execution flow, so duh. 2. The fact that you have a schedule interval means that your DAG isn't really acyclic, because it loops onto its own start node at the end node. 3. Acyclic graphs are good for reducing complexity and dependency between tasks, and that's a great thing. But that's an actual restriction, a lack of functionality, so it isn't really a feature.
Yes. I had to look up n8n, but it seems to be better focused on ETL work and has many connectors. However, it does not appear to run on Spark, so you would need to configure a Docker/K8s environment or use their cloud service, which is in the Azure Marketplace.
I liked this vid. I understand your view, but I'm not sure I would call Airflow a scheduler... Anyway, it's late 2024 - which open source, on-prem tool (bare metal and private cloud) would you use for ETL processes? (The more options the merrier.) Thanks!
Good question. When I did on prem ETL, we used SQL Server Integration Services (SSIS) which is proprietary but an excellent ETL tool. For open-source, I really like Dagster (dagster.io/) which is a Python based data orchestration framework. While it has some of the same issues as Airflow, the data centric nature of Dagster including integrated data validation, data lineage tracking, and composability make it far superior to Airflow in my opinion. I have not used it in production so I would recommend a POC and pilot before committing to Dagster. The Dagster university free online training is good to get started. Dagster is still an orchestrator, not an ETL service. I've seen and looked at some open source ETL tools but have not dug in enough to recommend any. Does that help?
We have used Airflow extensively; it is AMAZING. I think the whole video revolves around "workflow orchestration is not that complicated and is of secondary importance", which is not usually the case. For "complex" workflows, using configurations is not any simpler or neater than writing Python scripts. It is also important that you test your workflow; Airflow has that functionality. The UI is very handy: restarting jobs, clear visibility into what happened, etc. It also scales really well! This video is a little misleading!
Very interesting video. What would be a suitable orchestrator to use if, e.g., our stack for ELT is Fivetran and dbt? While yes, we might be able to hook up these individual tools directly, I feel an overarching orchestrator ("DAG job scheduler") is needed. So I am not interested in using Airflow as ETL/ELT, but I always thought it would only be an orchestrator tool. Cheers
Nicely done. Airflow is being explored by one of our team members. I have a question for you - is it possible to debug the code on my local workstation before running it on Airflow?
Well, you can run Airflow locally; see airflow.apache.org/docs/apache-airflow/stable/start/local.html. To test without Airflow, remember that Airflow just runs Python code in the specified sequence, so you should be able to test that code. Just run it in the order it will run when it is in Airflow.
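For what it's worth, newer Airflow 2.x versions (2.5+) also let you exercise a whole DAG in-process without the scheduler; a minimal sketch with made-up task names is below, which you can step through in an ordinary debugger.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def debug_me():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    load(extract())

d = debug_me()

if __name__ == "__main__":
    # Runs all tasks sequentially in the current process so breakpoints work.
    d.test()
```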
Hi, thank you very much for this video. The project where I work plans to replace Apache Oozie with Airflow, so I think it is pretty useful to watch videos like this one. I don't have any prior knowledge of Airflow, yet it is very easy to understand the main ideas behind this framework.
I'm having a problem handling around 50 scripts that generate reports that are sent to the users. I would like to schedule them but also trigger them from some microservices. Is there any suggestion for that? :(
50 scripts generating and sending reports is probably not the ideal solution. A reporting tool would make more sense. However, you could use Azure Automation, which supports Python and PowerShell, to do the scheduling and run the scripts. Azure Functions could also be used.
The entire functionality of Airflow is already available in tools like Control-M, Zeke, AutoSys, etc., which have been present in the market for more than two decades. What is it that Airflow is doing differently? It seems the programming cult has taken over the data processing and data management world and is rewriting all the tools the way they were in the 1980s. We intentionally moved away from the code-heavy data processing/management model because of its heavy and expensive maintenance costs. Almost 18 years ago, in the early days of my career, I worked on an "ELT" tool called Sunopsis (later acquired by Oracle). Today we are lauding a similar technology called "dbt" which is doing exactly what Sunopsis did 20 years ago. What's going on, folks?
I don't think social media is a good example of a DAG since, in general, if a is connected to b, then b is connected to a - these are bidirectional (non-directed) edges. I suppose if you impose who connected to whom first then you could keep it directed, but that seems artificial.
True, with the BashOperator it can do an OS call-out to run a script, but that's not tight integration. Your DAGs are defined in Python, and Airflow is a Python framework. Thanks for your feedback.
I agree Apache Airflow is a pain in the butt to learn, install, and figure out the code for... One big limitation is that it doesn't support Windows systems unless you run it inside a Docker container... I would rather use Apache NiFi since it can run on Windows, supports multiple scripting languages, and is UI-oriented vs. code, which makes it more productive and much easier to use.
@@BryanCafferky Nope. Don't work for Dagster. Just a mild-mannered data engineer trying to weed out all the noise in the tech world, finding the right gems so I can focus on exploiting those gems to be productive and be ahead in the game. Unfortunately, most of my time is spent weeding out noise. I thank you for your service in doing the same. I think the road to take in discovering new tools is to ask the question "why is this tool bad" rather than "why is this tool good".
Your channel is awesome (and I'm very picky). I've recommended it to my whole team and I'll try to get our company to help you on Patreon as well. Keep it up!
I think the right title should be "Don't Use Apache Airflow if You Are a Data Scientist", because as a junior DevOps engineer, Airflow looks awesome compared to cron scripts, at least in the project I'm in.
@@adibauI Thanks. I was wondering if it was me. :-) Usually, just getting a basic dev environment for a tool is easy, but not with this one. Python Dask is a piece of cake, and for Spark, you can just use Databricks Community Edition.
@@BryanCafferky It works perfectly when you build the system based on it, but the thing is that I need to execute Python modules from outside Airflow's container. I think the best way will be to define every single dependency I need in Airflow's Dockerfile so it can run the tasks.
I evaluated Airflow and Luigi (which you didn't mention), and I feel that Airflow is the one that has enough extensibility to work with my company's compute resources/environment. It seems you just went through the tutorials and didn't implement anything significant in Airflow. The limitations you mention seem a little arbitrary (most people like Python), and I don't understand how these are resolved with the other options or what associated tradeoffs I would be making. Still going to use Airflow; this is clickbait.
Thanks for your thoughts, but I have asked colleagues who have used Airflow extensively, and they agreed with my points. Also, most of the viewers of this video who left comments agreed and confirmed with their experiences. It's not about Python; it's about the best solution to a problem. Sometimes that will be Airflow, but for most use cases, I don't think it is, and I get concerned when people get defensive about a given technology. BTW: It's not clickbait when you follow through with content that is consistent with the title. Live Long and Prosper.
@@BryanCafferky Mainly Kubeflow, but our company is not big enough to use fully dedicated Kubernetes clusters. Any tool you would recommend? Interesting video btw.
@@llorulez Thanks for the info. It all depends on what you need to do. The video was meant to get people to stop and think before jumping in, as Airflow is pretty complex but can be a great solution. For ETL/data movement, if the workflow is sequential, I would use a simpler tool, which I mention in the video. Databricks notebooks/jobs can work well, but it depends on whether you need the scale. Dask looks good for non-Spark loads and is really easy to start with, but gets complex with the scale-out. Each public cloud has its own ETL PaaS services as well. My focus is parsimony, i.e. just enough to do what you need and no more.
@@BryanCafferky Maybe it was easy for me because we use Docker extensively and it was quick using DockerOperators, but as you mention, it can be really challenging.
Hi, please allow me to add that I used Airflow to run complex queries in Impala using .sql files (that contain the Impala queries), run inside the DAG tasks in the needed order. This might be useful; for me it was. I agree that NiFi is best and my favorite. Thanks
@@BryanCafferky Nope, it is a commercial scheduling tool I have used, and based on your presentation, everything you mentioned is exactly what Control-M does. A task or job scheduling tool!
Dear lord... please don't use ADF over Airflow unless you are doing a deployment pipeline. Unless you enjoy working DEEEP under the covers doing things like spinning up PowerShell jobs to complete tasks in an environment that is not really strongly backed by source control... unless you want to link it to a git repo and stare at JSON blobs to figure out what's wrong with the underlying "code". I do agree with you that there is a finite set of things that Airflow is good at and things that it shouldn't be used for out of the box. I wholeheartedly disagree that the "Python" needed for many of the simpler DAG use cases is difficult to accomplish, as most of the out-of-the-box operators are pretty thoroughly documented and example code on how to use them lives everywhere on the internet. I would say that even in the case that you want to do something that Airflow doesn't do directly out of the box, there is always the ability to use the numerous Python operators to run custom code, or the ability to spin up Kubernetes Pod Operators and allow them to scale in the cluster for heavier ML tasks. "Use Databricks"... yes you can... but Databricks is a potentially expensive way to orchestrate one thing, whereas Airflow can not only orchestrate Spark but do many things that Databricks can't do. Also, at the end of the day, Databricks just winds up being a bunch of JSON. I think the ETL code you show is a fairly OK example of example code... but not really an example of how an ETL process would be set up in the real world. Nor are you showing many of the purely built-in operators that will allow you to orchestrate jobs across a tremendous number of services in one centralized place. Mostly, IMO, yes... if you don't want to write any code, don't use Airflow... If you are OK with some mostly cut/paste code for many basic DAGs and functions, but also want the ability to do things that none of the other mentioned tools that I have personally looked at can do... I would give Airflow a shot... Or, if you aren't into doing ANY of the management work, look at a managed Airflow service.
Thanks. Excellent video. I recently moved to a data engineering project that uses Airflow with dbt (and Cosmos). I'm finding it difficult to understand why you would use Airflow, especially with an ELT tool like dbt. For any task, there is a dependency on available operators if you want to use Airflow. The Python code is tightly coupled with Airflow, and, as the video says, you have to code everything. It's not that you can't get work done with Airflow and dbt, but with something like Pentaho you would have done it with half the effort.
Thanks for your comment. I would suggest also looking at Dagster. They address many of my concerns with Airflow. dagster.io/ Not sure how well it works with Databricks clusters though.
- I thought it was obvious that Airflow's use case is to be the orchestrator of a data pipeline, not the executor. Whoever uses Airflow for ETL/ELT is using it wrong.
- I don't see a problem with it being a code-oriented tool, as Python is very easy to learn. It's almost low code.
- The comparisons with "best options" were meaningless. The use cases are different. It would have been more logical to have cited Prefect, or perhaps Dagster.
I would argue that the useful functions should be called into the Airflow context from a separate module. With this methodology, Python could be used to run the code outside of Airflow as well. Am I missing something?
@@BryanCafferky Reusability of the code utilized by Airflow. For context, I landed here while listening to arguments for and against Airflow because I'm trying to figure out whether I'm going to learn it or Prefect. I don't know much about either, hence the question at the bottom of my comment.
@@jamescaldwell3207 Did you watch the video? I have no issue with reuse. Whichever fits your requirements with the least cost/effort to maintain is probably the best tool.
@@BryanCafferky Of course I did. My comment was regarding right around 13:50, where you state that generic functions cannot be used anywhere else because of the decorators. I would think non-specific functions would be in a separate module and imported for use inside a task. If a function is specific to Airflow but generic within the operational capacity of Airflow, then one could create an Airflow-specific library for use across multiple jobs. As stated, I'm deciding whether to learn one of two tools, and my comment was an assumption which posed the question of whether I was missing something. Having now looked it up, in the spirit of ending what is starting to feel like a combative exchange, I've learned my assumption was correct.
@@jamescaldwell3207 Sorry. No worries. Glad to get the question. I recorded this video five months ago, so not all the details are still fresh in my mind. The reference time was helpful. Your point is valid. In fact, you could create non-Airflow generic function libraries too. As I look back at this, I can see that when using the decorator, only the outermost function is decorated. Also, you can write code that does not use the decorators, although I think the decorators are intuitive. See these docs for more details: airflow.apache.org/docs/apache-airflow/stable/modules_management.html
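A small sketch of that layout under assumed file names: the generic function lives in a plain library module with no Airflow dependency, and the DAG file is the only Airflow-aware layer, here shown without decorators at all by pointing a PythonOperator at the imported function.

```python
# my_lib/cleaning.py -- generic, importable anywhere, no Airflow dependency
def drop_nulls(rows: list[dict], key: str) -> list[dict]:
    """Remove records where the given key is missing or None."""
    return [r for r in rows if r.get(key) is not None]

# dags/clean_dag.py -- the only Airflow-aware layer
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from my_lib.cleaning import drop_nulls

with DAG(dag_id="clean", schedule="@daily",
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    # Classic style: no decorator, just point the operator at the library function.
    clean = PythonOperator(
        task_id="drop_nulls",
        python_callable=drop_nulls,
        op_kwargs={"rows": [{"id": 1}, {"id": None}], "key": "id"},
    )
```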
Thanks. It is totally true that to use Airflow as a great ETL tool you need to focus your effort on Python. When you are a developer who uses Python and can prepare SQL queries, it turns out perfect. Anyway, I will consider NiFi because I don't know it. Let me read about it.
Seriously, why should it be a downside that it is 100% code and requires DevOps to manage? Mostly these days that's what we want, because it gives 100% control over what happens in the flows and how it happens. I see tons of ETL tools (which often are also job schedulers) being used mainly as schedulers with a lot of workarounds or custom SQL around them, simply because they don't facilitate or support the most basic things needed to do heavy transformations, etc. If these out-of-the-box "drag and drop" tools are so good, why do I keep seeing people implement workarounds on them? It doesn't make any sense these days. IMO I would rather see more code in ETL work than less, and have well-written code do the job for us - code that I can audit and move around as I want. Also, I think you aren't doing justice to Airflow at all. Saying that it is complicated to use is really hilarious. All things take time to learn, mate. When you know the tool well, it takes no time to do things in it. That's how all things work in this world.
Thanks for your comments. The gist of my thoughts is parsimony. Use the solution that requires no more work to maintain than necessary while meeting the requirements. If Airflow is needed, great! I think it not needed in most situations.
I wasn't using it but after this video I just changed my mind. I'm gonna schedule some jobs using Airflow next sprint.
How is your use of Airflow going? What use cases did you use it for?
I was recently brought onto a team to convert our ETLs from Apache NiFi over to Airflow, and while your assessment is fine, I think there are a few areas where I would have structured this differently.
1. Airflow is not an ETL tool; you're right in calling it a job scheduler - it's technically referred to as a task scheduler. In your ETL processes you really have five things that you're trying to do:
a. trigger when an event happens (an email is received, x amount of time has passed, someone put a file in your fileshare or S3 bucket, some notification prompts you to start).
b. extract your data from one location.
c. transform your data. This is where the bulk of your coding comes into play.
d. put your data into its appropriate database or storage.
e. make sure a-d go off without an issue.
The reason why Airflow is a great ETL tool is because it does A and E by itself really well, and it facilitates B and D. Hooks and sensors are built into Airflow and are fully customizable. If your project is reliant on programs like Glue, then you can do all of this in the AWS suite (or Azure or GCP), but otherwise Airflow very cleanly packages up your connection points and your custom ETL and runs that sequence of tasks beautifully. Should you default to Airflow? If your data engineers are already experts, it's fine; if not, then no. Is it the magic tool for ETL? No - watch for AWS and fellow tech giants to come out with something like that in the next 5-10 years. Is it the best task scheduler? Due to its support it's miles ahead of its competitors, so yes.
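A rough sketch of points (a) and (e) with a stock sensor, assuming the Amazon provider is installed and using made-up bucket, path, and script names; the retry settings cover the "goes off without an issue" part, while the actual extract/transform/load runs in an external script.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="fileshare_etl",
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # (a) trigger: wait until someone drops a file in the bucket
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="incoming-data",      # hypothetical bucket
        bucket_key="exports/*.csv",
        wildcard_match=True,
        poke_interval=300,
    )
    # (b)-(d): run the actual extract/transform/load outside Airflow
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="python3 /opt/etl/load_exports.py",  # hypothetical script
    )
    wait_for_file >> run_etl
```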
I agree. Furthermore, Airflow is a workflow/orchestration [management] tool/platform, which is why it includes job/task scheduling, monitoring, retries, and other features. On the other hand, there are things I don't like about Airflow, like the lack of a declarative way (via JSON or YAML) to define DAGs and tasks.
I've recently started to use Airflow, and while I agree with many points, the thing with Airflow, at least in the current 2.9 version, is that by now it has all the operators, so the "Python" code you write *is* the description.
For many tasks you do nothing but call operators like SqlExecuteQuery('file.sql')
If you had to write the task descriptions in YAML, you would just be copy/pasting the same Python declarations with nothing much taken away, as by now they're terse, and unless you do need some Python code, they're just declarations strung together.
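For what that looks like in practice, a tiny sketch using the stock SQLExecuteQueryOperator from the common-sql provider (the connection id and .sql file names are assumptions); the whole "description" really is just declarations strung together.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(dag_id="sql_pipeline", schedule="@daily",
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    # Each task just points at a SQL file; no Python logic required.
    staging = SQLExecuteQueryOperator(
        task_id="load_staging", conn_id="warehouse", sql="staging.sql")
    marts = SQLExecuteQueryOperator(
        task_id="build_marts", conn_id="warehouse", sql="marts.sql")
    staging >> marts
```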
@@viewpointzero1420 But that's kind of the point, isn't it? Do you wanna spend your time re-inventing the wheel and patting yourself on the back? Or do you wanna deliver the product you are working on? That's the whole point of these tools: they make the parts of the job that aren't actually "your problem" easier.
If you just want to write Python code, which is a really uncool language if you ask me, then you shouldn't be using Airflow.
I'm a developer on a data analytics team, and now I'm setting up Apache Airflow for my team. They will create DAGs using JupyterLab, and it will be very comfortable.
Airflow is a scheduler, and it doesn't care about what code you run. The easiest approach is to pack all your Golang/Rust/Python code into Docker containers and scale with that.
Dear Bryan, thank you for your informative video! For me personally it is actually great news that Airflow IS NOT a full-fledged ETL tool; this is exactly what I need. I honestly don't see the mentioned limitations (no ETL functionality) as a disadvantage. ETL as a concept is also becoming outdated in the wake of new approaches such as data mesh and service mesh solutions. What is definitely a no-no is the amount of code overhead and the strong coupling. Will definitely look into the suggested tools.
Yeah. It is good for orchestration and it can work with Databricks.
As someone coming from SSIS who literally hates it for being far too much graphical interface, I have to say you did a good job of describing the problems with Airflow.
Thanks
The only thing I hate in SSIS is the variables. If you follow the ELT pattern and do minimal/no data transformation in the package, it is nice, scalable, and most importantly easy to administer/manage, without tons of code.
@@AP-nq4pe Best to do most work in SQL Server T-SQL, but SSIS does orchestrate well. Package parameters are also a nice feature.
Python code is often Keep It Stupid Simple compared to SSIS or other tools, for that matter.
Although I agree with most of what was said in this video I do have some comments that would likely change someone's mind as it pertains to using Airflow in a real world business scenario. I agree Airflow is not an ETL/ELT tool. I would agree that it is a scheduler. I disagree that code is not reusable. That's one of the reasons why providers and operators exist. If you want to use the same set of tasks multiple times inside the current project or across multiple projects, create a custom operator and use it where you wish.
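A minimal sketch of that reuse route with an invented operator name: subclass BaseOperator once, publish it in an internal package, and any project can import it.

```python
from airflow.models.baseoperator import BaseOperator

class ArchiveToS3Operator(BaseOperator):
    """Hypothetical reusable operator shared across projects."""

    def __init__(self, *, source_path: str, bucket: str, **kwargs):
        super().__init__(**kwargs)
        self.source_path = source_path
        self.bucket = bucket

    def execute(self, context):
        # A real implementation would move the file (e.g. via a hook);
        # kept as a logging stub for the sketch.
        self.log.info("Archiving %s to s3://%s", self.source_path, self.bucket)

# In any DAG file:
# archive = ArchiveToS3Operator(task_id="archive",
#                               source_path="/data/out.csv", bucket="backups")
```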
If you are running a medium to large business and the company/IT philosophy is to adopt products that have vendor support, then NiFi and Kettle are not going to be for you. There is no one to call for support when your production instance of either of those goes down. With Airflow a business has the ability to go with Astronomer for a fully vendor supported and highly automated solution which doesn't require the heavy lift of setup.
Anyone saying they use AWS Glue and love it has either not used it or is lying to you. Simply put, it's got a long way to go to catch up with most orchestrator-type tools like Azure Data Factory. If you are in a situation where your company has chosen AWS as its cloud provider and Snowflake as its cloud data warehouse, your options are limited for orchestration of workflow, which is a major player in a complete data pipeline strategy. Products like Matillion are great for drag-and-drop functionality but are expensive and have a huge deficiency in deployment pipelines and CI/CD implementation. If you are living in the cloud data space and don't know Python at least at a basic level, there is a good chance you are either entry level and will need to learn it at some point, or not very effective at putting together data pipelines. One of the most powerful modules/libraries available to someone in the data space is the pandas Python module. This becomes a very powerful tool in Airflow or any other orchestration engine dealing with data movement.
Just my 2 cents. Again, I don't disagree with what was said. I just think there are way more valid use cases and reasons to use Airflow than insinuated.
Been using Airflow a little over a year now and totally agree with most of your points. Appreciate it for the logging, monitoring of pipelines, and the visualizations. Also the good K8s integrations and active community. Would recommend it if most of your code to orchestrate is Python or dockerized. It does come with some downsides, like the lack of pipeline version management or the complex setup. There are managed versions though, e.g. Cloud Composer.
On most points I agree... Airflow is not an ELT tool. It's an orchestrator - in my opinion the best in the world. In the company where I work I built up BI for online activities. I tried a lot of tools; I don't want to mention them all, but they all had a lot of drawbacks and were expensive. I ended up using Airflow and I'm pretty happy with it. Sure, it's all code! That's what you have to keep in mind. Other tools like dbt, Airbyte, and so on integrate perfectly into Airflow, so scheduling and monitoring the entire pipeline is absolutely great. On the other hand, I had to struggle with a lot of data sources where out-of-the-box tools had problems understanding the data. In the end I had to program middleware in Python to make the data compatible with these tools; now it works inside the Airflow environment. Due to the fact that Airflow delivers a lot of good operators, the code got even smaller. Furthermore, the Docker (Compose) images are great and the Helm charts are good... So yes: it's not a native ELT tool. You have to use code only... But with code only comes a lot of flexibility. I don't want to go back to Kettle, Talend, or SAP Data Services. What looks interesting is NiFi...
Thanks for the feedback. I agree if you need a high degree of control and have a lot of dependency/complexity it can be a good option. It does not fit most of the use cases I have done over several decades of data engineering work though.
I've been using Airflow for a little over a year and your video really confirmed that a lot of the things that have been bugging me about it are not really a me problem.
I really love how powerful it is, but having been using it mostly for ETL, I've often found myself overwhelmed with all the coupling and the little "gotchas" in the form of how specifically things have to be set up. It adds a lot of overhead from the get-go, and importantly, means that no matter how well designed the business code is, whenever something breaks or needs to be changed I always need to re-learn all of the Airflow-specific code. I can see why it's a favorite for specialized data teams whose main job is maintaining data pipelines, but not for use cases like mine in which the data flow management is just a small part of the job. So not really anything wrong with Airflow, just that it might be overkill for users like myself.
I'm going to look into some of the ETL tools you mentioned, and one thing I'm very interested in using Airflow for soon is managing 3D rendering pipelines. I think it's going to be fantastic for coordinating render jobs and their individual frames, which are often in the thousands.
Yeah. After the video, I came to the conclusion that there are job schedulers and orchestrators and often you just need a good job scheduler. When the complexity requires an orchestrator, I recommend you look at Dagster. It is much more extensible, testable, and adds a ton of features over Airflow. I've been studying up on it for months to be sure I liked it. dagster.io/
Love the video. Definitely made me think and gave me some good tools to look into.
A few notes here (I'm an Airflow noob, but I've at least used it...)
1. It doesn't really work on Windows like it says in the screenshot at the beginning - unless you're using Docker or WSL. It only works on Linux.
2. It does not only support Python. As you mention, there's a BashOperator, which means it can run anything via a bash script (Python, JavaScript, PHP scripts, Java apps, C# console apps, etc.).
3. I think it's a bit disingenuous to say your DAG code could be more than your actual code running - the DAG definitions are insanely simple... your examples are probably about as complex as 70% of jobs (outside of the actual logic).
4. All the alternate solutions you present also have overhead to learn and their own proprietary outputs (that can't be reused anywhere else - except maybe Data Factory, which might be able to port into SSIS on-prem or whatever). A Python script (or whatever script - Powershell, C# app, etc) can run just about anywhere.
5. Instead of putting your Python logic inside the DAG script, you can just use a BashOperator to run the Python script (i.e.: "python3 path/to/thescript.py") - which means you can decouple and use the script part anywhere, and the DAG definition is the only thing specific to Airflow (which is... trivial most of the time). This might not work if you have complex dependencies between your scripts - mine were always fairly linear jobs like: move data to cloud, train ML model, run batch model outputs, do something with the outputs, update some API. (A sketch of this pattern follows below.)
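For readers who want to see what point 5 looks like in practice, here is a minimal sketch, assuming Airflow 2.x and the airflow.operators.bash import path; the DAG id, schedule, and script paths are placeholders, not anything from the video.

```python
# The DAG only wires up calls to scripts that live (and can be run/tested) outside Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    move_data = BashOperator(
        task_id="move_data_to_cloud",
        bash_command="python3 /opt/jobs/move_data.py",   # hypothetical script path
    )
    train_model = BashOperator(
        task_id="train_ml_model",
        bash_command="python3 /opt/jobs/train_model.py",
    )
    score = BashOperator(
        task_id="run_batch_scoring",
        bash_command="python3 /opt/jobs/score_batch.py",
    )

    # Linear dependency chain, like the jobs described above.
    move_data >> train_model >> score
```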
I'll just say... if you're currently running C# console and Python script jobs on Windows Server Scheduler (which is where I'm coming from, lol!), Airflow is an awesome tool that's super easy to get started with. We didn't end up using it because it was Linux-only and our infra team is scared of Linux (and Docker... and WSL2...).
Thanks. Lots of good comments. My point is about parsimony. Do only as much as needed and keep maintenance in mind. To create the DAGs, I believe Python must be used, but from there you can call other languages. Not sure how tightly integrated the other operators are, i.e. they seem to just shell out, but OK. I've used SQL Server Agent for ETL scheduling and it worked great with no coding required. But in the cloud, I need to use other options like Azure Data Factory, etc. Azure also has Azure Automation, but I wish Azure had a good job scheduler.
@@BryanCafferky Having worked extensively on Airflow in recent months, on multiple proofs of concept, I will admit it has a fair bit of complexity to it. However, it does provide a lot of operators out of the box, e.g. DockerOperator, K8sPodOperator. Working with those in a managed environment like AWS MWAA (Managed Airflow) has made things very straightforward for us. We have been using our pre-cooked Spark Docker images to carry out all the tasks at runtime. It does require a fair bit of training to understand how to best use it. And COST, yes. The COST is expensive. But we were able to get started with Airflow on AWS in a couple of hours and were testing out Spark modules that very day.
A main issue of defining a function inside another function is that it's impossible to unit test. But testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
Thank you for the POV. Take a look at dbt too, from Fishtown Analytics. I think version control needs to be a core requirement for any tool that is responsible for moving data. This might be a problem if the solution isn't code-based.
Thanks for your feedback. Do you use dbt or work for Fishtown? Source code control can take many forms. SSIS stores its programs as XML which can be placed under SSC. The level of and need for SSC depends on the project requirements. For example, in a small shop where one person maintains the code, ease of use and a GUI may outweigh the need for SSC assuming the ETL object snapshots can be stored.
@@BryanCafferky Latest version of SSIS I checked, does version control and CI/CD like a pro!
Beautifully explained! I love how you dive into the code without getting lost in the weeds. Very helpful, thank you :)
Thanks!
Airflow is great. Coupled with Kubernetes you don't have to stick to Python anymore. The only drawback I saw was that DAGs don't scale when they have huge amounts of tasks. Though it's easy to solve by splitting the DAG.
Thanks for the comment. You do have to define the DAGs in Python. What do you use Airflow for?
There are different ways to use Airflow, you can rely on Kubernetes Pods to run Docker instances.
In recent versions, you can scale schedulers to solve task issues.
Nowadays, anyone can give an opinion just by reading Wikipedia and some basic examples.
It's not an ETL solution.
It's just an orchestrator with batteries.
Writing 800 lines of code to schedule a job in Airflow... I totally agree with you. It's a pain in the wrong place.
This is amazing. Rarely is anyone so fair in evaluating a popular tool like Airflow.
Thank you. There are some who disagree but I was trying to be fair.
Recently an idiot on Reddit argued with me by saying Airflow is better than Data Factory. This video says it all. Thanks a lot, Bryan 🙏
Well, Airflow may be better at some things but not data movement/transformations in most use cases. ADF is a solid choice if you are on Azure.
whether I end up using airflow or not, this is a great video that clearly explains how to use the tool and your perspective. thank you!
Thanks for your kind words. Glad it is helpful!
It definitely has a steep learning curve. I was not able to deploy with airflow so I switched to dagster and it was way simpler, was able to spin up a task schedule within a day
I think there's a misunderstanding. Airflow is NOT an ETL tool, and I don't think it was ever meant to be, or marketed as such.
It's rather an unfortunate confusion in the minds of many, between the workflow management / orchestration (which Airflow DOES), and the ETL tasks that actually implement the data transformations which make up the ETL pipeline (which should Not and usually are not airflow tasks).
With Airflow on an AWS service we run nightly ingestion of RDBMS data into AWS S3; all the tables in a given schema are processed in parallel Airflow tasks, but each of these tasks just calls an Informatica script which actually does the job of ingesting a given DB table.
So, again, if people don't understand the meaning of orchestration, don't blame the tool 😁
In Airflow the correct pattern is not to write top level code. In the Postgres example, in production you could have a DBT file or .sql file that contains the queries.
Airflow specifically says not to do processing on Airflow itself, it's used for kicking off jobs elsewhere, monitoring them and dealing with the results.
Some examples would be running a Glue job, Apache Spark, starting an EMR cluster, making a RESTful call to an API, or training a model on a Ray cluster.
Most of your code to do the data processing should be on those platforms and can be in Scala, Java, Golang, C.
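To illustrate the "keep the queries in a .sql file" pattern mentioned above, here is a minimal sketch; the connection id, file names, search path, and the common.sql provider import path are assumptions that depend on which Airflow and provider versions you have installed.

```python
# The DAG only hands a templated .sql file to the database; no top-level Python logic.
from datetime import datetime
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="load_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    template_searchpath="/opt/airflow/sql",   # hypothetical folder holding the .sql files
) as dag:
    load_orders = SQLExecuteQueryOperator(
        task_id="load_orders",
        conn_id="postgres_dw",     # hypothetical Airflow connection
        sql="load_orders.sql",     # rendered from template_searchpath
    )
```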
Apache NiFi is also good for ETL, but parts of it require that data move along its processors, such as converting from one file format to another or regexing columns. So, some parts of it need more compute to process the data, requiring you to scale the NiFi cluster. NiFi 1.x is Java only, and only recently, with 2.x, is Python supported.
Thanks for your thoughts.
Thank you, Bryan, for your videos. They are really useful. It will be very kind of you to make lessons about Apache NiFi, especially how to choose processors for needed actions.
Possibly. So many tools out there. Thanks for the suggestion.
I agreed with you on this.
Thanks Mr Bryan.
"you are only limited to python" -- I don't think this is a bad thing. Python is a stable and versatile language with libraries for everything.
"it's complex" --- If the developer already knows Python, imo, airflow isn't difficult to learn.
"Requires 100% coding" -- I see this as an advantage. I'm using both Airflow and Pentaho. With Pentaho, code review is just painful because the raw code is in XML which makes it difficult to read and to keep track of. Also, there's not a huge user base like python or airflow. So there isn't much help out there on stackoverflow.
Thanks for the feedback.
I'm not against visual programming or low code tools, but if your team are experienced developers, they'll be more productive and happier using all code.
Nevertheless. ETL tools have their place. Let your teams use the tools that are most suited for the job and use Airflow as a centralized orchestrator. You can orchestrate ADF pipelines for instance.
I am studying Apache NiFi now, it looks like a good tool for ETL purpose, thanks for your comments.
This is a cool vid. I personally love Airflow, I use it mostly as an interface to k8s and run applications on pods.
I think in terms of "if I can write a container for the task, then I can orchestrate it in Airflow".
You're not wrong though, it did take time to learn the intricacies of Airflow (both in code and UI). Our company practice is to make reusable functions that generate DAGs, which reduces the code for creating workflows per use case down to just a function call.
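One hedged sketch of what such a DAG-generating function might look like; the function name, parameters, script paths, and schedules are made up for illustration, not taken from the commenter's setup.

```python
# A "DAG factory": one function stamps out a standard workflow per data source.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator


def make_ingest_dag(source: str, schedule: str) -> DAG:
    """Build a standard ingest DAG for one data source."""
    with DAG(
        dag_id=f"ingest_{source}",
        start_date=datetime(2024, 1, 1),
        schedule=schedule,
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command=f"python3 /opt/jobs/extract.py --source {source}",
        )
        load = BashOperator(
            task_id="load",
            bash_command=f"python3 /opt/jobs/load.py --source {source}",
        )
        extract >> load
    return dag


# Each workflow becomes a one-liner; Airflow picks up the module-level DAG objects.
orders_dag = make_ingest_dag("orders", "@daily")
customers_dag = make_ingest_dag("customers", "@hourly")
```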
Thanks for putting this video together. I learned about some good alternatives.
You're welcome and thanks for your feedback.
I use Airflow, too, but I totally agree with Bryan's point of view.
Airflow is a powerful tool, but the other side of the coin is its steep learning curve especially for new Airflow users.
Most of the time I just need to do simple stuff and I find using Airflow leads to over-engineering.
Lots of people use the Kubernetes Operator; the biggest problem I see with it is that a lot of times I have one common base Docker image, but I need to bundle different code into that Docker image just for the sake of using the Kubernetes Operator.
If you are doing Databricks, the new Workflows are pretty easy to use to create task orchestration. Thanks for your comment.
@@BryanCafferky Thank you for posting this video. It's THE best video that I've encountered that explains what Airflow is. A lot of people in my company use Airflow for use cases that are not a fit for Airflow.
@@davidgao4333 Glad you liked it.
Very good explanation! It's good that other options (products from AWS or MS, etc.) were mentioned.
Very nice video Bryan! What is your take on Prefect? They highlighted a few shortcomings in Airflow, hence Prefect. But Airflow in its recent versions came with less boilerplate. Happy to hear back from you on Prefect. Thanks!
As I know, Airflow is used for "scheduling" the ETLs; not "creating" the ETLs. So, can you perform both "creating" and "scheduling" operations via AWS Glue?
I've not used Glue but the docs say you can. "AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs." For time based scheduling see docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
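To make the quoted docs concrete, here is a hedged sketch of that event-driven pattern: an S3 event invokes a Lambda, which starts a named Glue ETL job via boto3. The Glue job name and argument names are assumptions, and the S3 trigger itself would be configured on the Lambda separately.

```python
# Lambda handler that starts a Glue job when new objects land in S3.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each S3 record corresponds to a newly arrived object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="nightly-etl-job",                       # hypothetical Glue job
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "started"}
```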
@@BryanCafferky Thank you so much for your attention. Then, if you are working for a company that uses a Cloud Platform, actually you do not even need Airflow.
@@halildurmaz7827 YW. If you just need to do ETL work, you don't need Airflow. If you need complex task orchestration, i.e. workflows, Airflow might be a good option.
Can not compare Apache Airflow With Apache Nifi
These two tools aren’t mutually exclusive, and they both offer some exciting capabilities and can help with resolving data silos. It’s a bit like comparing oranges to apples - they are both (tasty!) fruits but can serve very different purposes.
Thanks for the clear explanations. I haven't used Airflow yet but it is in nearly all the job posts :) Companies like to use it, actually.
YW. Yeah. I wonder if they all use it or just list it in job ads. But it could be. The best tool is often not the one selected for the job. Thanks for watching.
Thanks! I struggled getting Airflow up and running - it seems like a really complex system. I'll take a look at Apache NiFi instead :)
We are actually considering this to replace some old batch processes we inherited. These processes are created in a no code solution and we cannot stand it.
Have you looked at Dagster? It addresses most of the issues I mentioned and has an excellent data-object-centric model. It's all Python based too. See dagster.io/ It's code centric but provides a lot of value for the code you write.
@@BryanCafferky I'll check it out. Thanks!
I think you are mistakenly comparing Airflow to AWS Glue where AWS Step Functions (maybe also with AWS Glue) are a better representation of what it seems you get from Airflow. I'm not an expert in Airflow, but based on what is shown here in this video, that is the impression I get.
Thanks for the feedback. My intent was that ETL-focused tools include SSIS, Informatica, Databricks, Azure Data Factory, NiFi, Pentaho, etc. Airflow is a workflow orchestrator. I saw many places where it is promoted as an ETL service. It is not an ETL tool although it can be used to orchestrate ETL work. However, unless there are many task dependencies, it is probably overkill.
Does AWS StepFunctions Service fit anywhere within those alternative options?
I have not used them but from the docs, yes, it looks like Step Functions would be a good option.
Just to make sure I’m understanding correctly:
1. If a company uses Azure or AWS they can just use ADF or AWS Glue instead of Airflow? If so it seems that Airflow is more for companies who do end-to-end python ETL/ELTs and don’t wanna pay for ADF or AWS Glue?
2. I’m a bit confused between what you said because a couple articles on the web and answers on Quora say that Apache NiFi is NOT a replacement for Apache Airflow. So are there things that Apache NiFi can’t do that Airflow can?
3. I really don’t wanna learn Airflow because of the learning curve but some jobs do require it :/ so if Apache NiFi can replace it I’d rather just use that. Do you know of any good resources to learn Apache NiFi or do you plan on making videos on it?
Thanks for your thoughts. Airflow can be a good solution but my point was that it is not an ETL tool. It is a job scheduler or orchestrator. It is promoted often as an ETL solution which I think is misleading. But yes, I too see jobs that ask for it. For complex workflows, it may make sense, especially streaming or something with complex dependencies. Bear in mind a given workflow cannot run concurrently with itself, i.e. each run must go from start to finish before it can start again.
I would google NiFi or check Amazon for books. The documentation online looks pretty good. NiFi videos might be something I'll do in the future. It looks pretty cool.
Hello, I have Airflow running on my machine with PostgreSQL as the scheduler's backend and the LocalExecutor, but when I run my DAGs it consumes a lot of server CPU. How could I solve this high consumption problem?
If you are running it all on your machine, then it sounds like your machine may not have enough power to support it. You could deploy Airflow to cloud VMs or a Kubernetes cluster to get more resources. This Stack Overflow post talks about limiting Airflow resource consumption: stackoverflow.com/questions/52140942/airflow-how-to-specify-quantitative-usage-of-a-resource-pool This blog discusses how to configure Airflow with settings for max_threads, worker_concurrency, etc.: medium.com/@sdubey.slb/high-performance-airflow-dags-7ad163a9f764
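For illustration, these are the kinds of knobs those links discuss, shown as an airflow.cfg fragment. Exact option names differ between Airflow versions (for example, the scheduler's max_threads was renamed parsing_processes), so treat this purely as a sketch, not a recommended configuration.

```ini
[core]
# Max task instances running across the whole Airflow instance.
parallelism = 8
# Cap per DAG (older versions call this dag_concurrency).
max_active_tasks_per_dag = 4

[scheduler]
# How many processes parse DAG files (older versions: max_threads).
parsing_processes = 1

[celery]
# Only relevant if you use the CeleryExecutor.
worker_concurrency = 4
```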
Good comparisons. Python has the best/easiest frameworks (pandas, pyspark, et al.) for data transformation so that isn't a limitation.
Which tool is recommended for a project where you have to call these jobs every 20 seconds? I suppose this is better for tasks that run once or twice a day and not in a constant loop, right? Only 10% of my tasks are daily or weekly. Any recommendations?
If the job is constantly running, then an orchestration service seems to be unnecessary. Perhaps you should consider using a streaming source.
The KubernetesPodOperator can be used to run any Docker image with Airflow.
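A minimal sketch of that idea is below; note that the provider import path has moved between versions of the cncf.kubernetes provider, and the image, namespace, and registry names here are placeholders.

```python
# The DAG just launches a container, so the job logic can be in any language baked into the image.
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="run_go_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    crunch = KubernetesPodOperator(
        task_id="crunch_numbers",
        name="crunch-numbers",
        namespace="data-jobs",                              # hypothetical namespace
        image="registry.example.com/etl/crunch:1.4.2",      # hypothetical image
        arguments=["--date", "{{ ds }}"],                   # templated logical date
        get_logs=True,
    )
```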
I think a huge characteristic of Airflow is that it is a static tool (which I personally really don't like, but let's try to keep it neutral).
If you want to change something, you'll need to change the underlying code deployed to the server where Airflow is running.
This means first going through the whole procedure to change the code, review it, run it through CI/CD in development, and then ship the code to the server (or probably just redeploy the server). That's a long process; you can take some shortcuts, but you'll never have an experimental mode or fast prototyping. Even when working with a test instance, it's still a slow process.
For some use cases, that's great, because there is always a definite and reliable and versioned description of what's going on.
But if you need to change workflows and aren't sure whether they work fine (e.g. because the production cluster is different in terms of performance than your development cluster, or w/e), the development speed goes down drastically. Even if you don't want to try it out live, you either have a lot of latency going to the development cluster or you need a huge machine as you need to put it locally in a K8s setup (for realistic scenarios in enterprises).
There are benefits having everything in code and inside GitOps, but it's certainly not fast prototyping for sure.
The comparison to cron is very true.
The only way to really check that it runs is to deploy it (like for cron, too), but you should only deploy what you are sure runs, so it's a chicken-and-egg problem. You can run tests, but they don't look the same way as usual in Python or in ETL or in SQL databases or in pandas, they are complex to write, and failure modes can be difficult to understand (especially checking all possible triggering rules).
I personally would in most cases prefer a dynamic tool I could easily change while running. (You might still want to block changes on the production system, but at least for the development or staging environment, this is what I really missed when working with Airflow.)
But yeah, the visualizations are awesome, and explaining the complexity of a system to stakeholders becomes much easier. So, in practice, you'll get a lot of acceptance even if work is slow, which counterbalances it significantly.
Thanks for the info.
First, Thanks for this video and thanks for your insights.
I understand why you said some things but I don't agree with most of it. You're right Airflow is a great job scheduler, not an ETL/ELT tool.
But from my experience, neither is NiFi, not if you want to do long, complex batch jobs; each block is autonomous and they don't wait for the previous one to complete. (The others I have very little experience with, so apart from being pretty expensive...)
I think the strength of Airflow, the reason I chose to use it, is the level of control you get and the diversity of jobs/tools you can use.
It can start with a bash calling a Talend Job that loads your DB and then a DBT job that processes it.
You can further split your DBT into tests and loads and when there's a failure rerun from the point of failure.
These are features I saw in expensive enterprise tools such as Control-M.
It does have a steep learning curve, but looking at the trends in the market today and the way teams are being structured - engineers for infrastructure and analysts for the BI part - I think it's a good choice.
Thanks. NiFi is documented as only an ETL tool and seems to fit that from what I read, though I have not used it. As I discuss later in the video, Airflow can be a good choice as a scheduler if you need the sophistication, i.e. DAGs, it offers. I purposely titled the video to alert people who think Airflow is an ETL tool that it is not. That's what I wanted to use it for and, after reading a book on it, realized it's not an ETL tool. It is a workflow engine. There's a similar one on Windows that works with C#. It's fine if that's what you need. Airflow seems great for complex ML pipelines. On SQL Server, I have used SQL Server Agent, which worked well for that environment. It had sufficient dependency management and control for most jobs. The best ETL service to use depends on your environment and requirements: Databricks Notebooks for Spark, Azure Data Factory for the Azure cloud, Pentaho, Informatica, etc.
I appreciate the feedback. Good thoughts.
A social network is a perfect example of a graph that is neither directed (facebook connections are not unilateral relations), nor acyclic (Tom knows Sally knows Bob knows Tom there's your cycle). Airflow not having ETL functionality is a perfect example of the Unix philosophy. Do one thing, and do it well. It not being low code is a strength rather than a limitation. Whenever I'm constrained by the whims of a low code platform, I always end up having to struggle immensely to find some way to get the job done whenever my use case doesn't match exactly with what the platform expects. Lot of misses at the start of the video already. Not sure if the rest of the video is worth anyone's time.
Coming from legacy ETL.
I am kind of confused; as you said, a lot market it as an ETL tool, and when I look closely, I totally agree with you that it is cron on steroids. I guess it is marketed as ETL because Python has pandas, which makes it relatively easy to ingest data compared to other frameworks, but doing really heavy ETL in pandas is not a perfect way to do it.
Depending on your needs and environment, you can use different tools: Azure Data Factory for Azure, AWS Glue, Databricks Notebooks for Databricks which runs on Spark, Pentaho, Informatica, etc. Lots of choices.
Sorry, but there are many misleading statements here. Firstly, you are not coerced to use Python in your tasks; you can perfectly well orchestrate almost anything if you put your code in an image (so yeah, you can use NodeJS, Java, etc.). The learning curve is no more complicated than the one for any other framework, like Django (obviously we are in the "data processing" domain here). Most of all, it is a powerful tool to organize your tasks when using a bunch of cron jobs in microservices is not an option.
Thanks for your comment. Your code to orchestrate must be Python, which is a limitation. Parsimony is key. For a given project, the question is 'Do I need to take on the overhead of creating and maintaining code just to orchestrate work?' Code which can break. Absorb the learning curve time and the future skill set needed for employees. It is powerful, but with great power comes great responsibility. I don't think most data movement/transformation cases need Airflow.
@@BryanCafferky The validity of the reasoning is best observed if you compare Airflow with Alteryx and contrast them. Then we really see the difference in the learning curve. Alteryx and Kettle allow non-devs to make ETL pipelines, and the learning curve for non-devs is shorter. Am I correct in my assumptions? Thanks for the video. It was a time saver really.
@@OgnyanDimitrov Yes. You got it! Thanks
I don't really understand Python being a limitation here. It's just the technology and ecosystem Airflow is using.
SSIS, ADF, Pentaho,... They all have their limitations in the ecosystems they are sitting in.
As for maintaining code... the same applies to SSIS, ADF,... only you build logic using a visual tool instead of all code. Airflow has lots of pre-built provider packages for database actions, ADF, Databricks, non-data related stuff,... which you can use so you don't need to build tasks from scratch.
Thanks for the vid btw. Your other points were valid. Airflow is indeed an orchestrator, not an ETL tool. 😊
Hi, I am working with Pentaho. Can you make a video on it?
DAG is a bad name for a task schedule.
1. Directed graphs are obviously necessary if you want to define an execution flow, so duh.
2. The fact that you have a schedule interval means that your DAG isn't really acyclic, because it loops onto its own start node at the end node.
3. Acyclic graphs are good for reducing complexity and dependency between tasks, and that's a great thing. But that's an actual restriction, a lack of functionality, so it isn't really a feature.
Hey, can we compare it to n8n, or absolutely not?
Yes. I had to look up n8n but it seems to be better focused on ETL work and has many connectors. However, it does not appear to run on Spark, so you would need to configure a Docker/K8s environment or use their cloud service, which is in the Azure Marketplace.
I liked this vid. I understand your view but I'm not sure I would call Airflow a scheduler...
Anyway, it's late 2024; which open source, on-prem tool (bare metal and private cloud) would you use for ETL processes? (the more options the merrier)
10x!
Good question. When I did on prem ETL, we used SQL Server Integration Services (SSIS) which is proprietary but an excellent ETL tool. For open-source, I really like Dagster (dagster.io/) which is a Python based data orchestration framework. While it has some of the same issues as Airflow, the data centric nature of Dagster including integrated data validation, data lineage tracking, and composability make it far superior to Airflow in my opinion. I have not used it in production so I would recommend a POC and pilot before committing to Dagster. The Dagster university free online training is good to get started. Dagster is still an orchestrator, not an ETL service. I've seen and looked at some open source ETL tools but have not dug in enough to recommend any. Does that help?
We have extensively used Airflow; it is AMAZING. I think the whole video revolves around "workflow orchestration is not that complicated and is of secondary importance", which is not usually the case. For "complex" workflows, using configurations is not any simpler or neater than writing Python scripts. It is also important that you test your workflow; Airflow has that functionality. The UI feature is very handy: restarting jobs, clear visibility into what happened, etc. It also scales really well! This video is a little misleading!
Thanks for your comments. Did you watch the video? That's not what I said.
I have been reading on their website, and I just can't understand what airflow even is or does.
Very interesting video. What would be a suitable orchestrator to use if e.g. our stack for ELT is Fivetran and dbt.
While yes, we might be able to hook up these individual tools directly, I feel an overarching orchestrator ("dag job scheduler") is needed.
So, I am not interested in using Airflow as ETL/ELT but I always thought it would only be an orchestrator tool.
Cheers
Nicely done. Airflow is being explored by one of our team members. I have a question for you - is it possible to debug the code on my local workstation before running it on Airflow?
Well, you can run Airflow locally, see airflow.apache.org/docs/apache-airflow/stable/start/local.html
To test without Airflow, remember that Airflow just runs Python code in the specified sequence so you should be able to test that code. Just run it in the order it will run when it is in Airflow.
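One hedged way to follow that advice: keep the business logic in plain functions that can be unit tested without Airflow, then wire them into a DAG; on Airflow 2.5+ you can also call dag.test() to run the whole DAG in-process for local debugging. The names below are illustrative, not from the video.

```python
# Plain function first, thin Airflow wrapper second.
from datetime import datetime
from airflow.decorators import dag, task


def clean_rows(rows: list[dict]) -> list[dict]:
    """Pure business logic; unit-testable with pytest, no Airflow required."""
    return [r for r in rows if r.get("amount", 0) > 0]


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def demo_pipeline():
    @task
    def extract() -> list[dict]:
        return [{"amount": 10}, {"amount": -3}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        return clean_rows(rows)  # thin wrapper around the testable function

    transform(extract())


d = demo_pipeline()

if __name__ == "__main__":
    d.test()  # Airflow 2.5+: run the DAG locally, in order, without the scheduler
```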
Hi, thank you very much for this video. The project where I work plans to replace Apache Oozie with Airflow, so I think it is pretty useful to watch a video like this one. I don't have any prior knowledge of Airflow, and it was very easy to understand the main ideas behind this framework.
I'm having a problem handling around 50 scripts that generate reports that are sent to the users. I would like to schedule them but also activate them from some microservices. Is there any suggestion for that? :(
50 scripts generating and sending reports is probably not the ideal solution. A reporting tool would make more sense. However, you could use Azure Automation, which supports Python and PowerShell, to do scheduling and run the scripts. Azure Functions could also be used.
The entire functionality of Airflow is already available in tools like Control-M, Zeke, AutoSys, etc., which have been present in the market for more than two decades. What is it that Airflow is doing differently? It seems the programming cult has taken over the data processing and data management world and is rewriting all the tools the way they were in the 1980s. We intentionally moved away from the code-heavy data processing/management model because of its heavy and expensive maintenance costs. Almost 18 years ago, in the early days of my career, I worked on an "ELT" tool called Sunopsis (later acquired by Oracle). Today we are lauding a similar technology called "dbt" which is doing exactly what Sunopsis did 20 years ago. What's going on, folks?
Good feedback. Not sure about dbt. It seems to offer quite a bit for ETL, less so for scheduling/orchestration.
I don't think social media is a good example of a DAG since in general if a is connected to b, b is connected to a, which are bidirectional (non-directed) edges. I suppose if you impose who connected to who first then you could keep it directed but that seems artificial.
Thanks for this. It is easy to understand things sometimes in the context of when you should not use it rather than what it's for.
YW
With the BashOperator and custom operators I find it hard to understand how it only supports python.
True, with the bash operator it can do an OS call out to run a script, but that's not tight integration. Your DAGs are defined in Python and Airflow is a Python framework. Thanks for your feedback.
I agree Apache Airflow is a pain in the butt to learn, install, and figure out the code for... One big limitation is that it doesn't support Windows systems unless you run it inside a Docker container... I would rather use Apache NiFi since it can run on Windows, supports multiple scripting languages, and is UI oriented vs. code, which makes it more productive and much easier to use.
Nifi may be a good option. Databricks Workflows are also a good one. See my video on it. ruclips.net/video/tMH3K8Rncmk/видео.html
I’ve worked with the alternates you mentioned and you’re missing one other product that surpasses all of them. That is: Dagster
Yeah. Looks interesting . Do you work for Dagster?
@@BryanCafferky Nope. Don't work for Dagster. Just a mild-mannered data engineer trying to weed out all the noise in the tech world, finding the right gems so I can focus on exploiting those gems to be productive and be ahead in the game. Unfortunately, most of my time is spent weeding out noise. I thank you for your service in doing the same. I think the road to take when discovering new tools is to ask the question "why is this tool bad", rather than "why is this tool good".
What does ETL stand for in ETL Service? (at min 4:02)
It stands for Extract, Transform, and Load.
Dear Bryan, thank you very much for this video! Very valuable and straight to the point content. Congrats!
Thanks You!
Your channel is awesome (and I'm very picky). I've recommended it to my whole team and I'll try to get our company to help you on Patreon as well. Keep it up!
Thank you so much!
I think the right title should be "Don't Use Apache Airflow if you are a Data Scientist" Bc as a DevOps Junior, Airflow looks awesome compared to cron scripts, at least in the project I'm in
Could be. I am finding that installing and configuring Airflow can be challenging. I only see one SaaS offering on Azure for it and it starts at 45K.
@@BryanCafferky Yeah you are totally right, I'm trying to implement it in a docker project with conda deps, and hell, this is hard
@@adibauI Thanks. I was wondering if it was me. :-) Usually, just to get a basic dev environment for a tool is easy but not this. Python Dask is a piece of cake and for Spark, you can just use Databricks Community Edition.
@@BryanCafferky Works perfectly when you build the system based on it, but the thing is that I need to execute python modules from outside airflow's container. Which I think the best way will be to define every single dep I need in Airflow's Dockerfile so it can run the tasks
@@adibauI Yeah. I think that makes sense. Reach out on LinkedIn if you would like to connect. I'd be interested in following you progress on this.
Many times all-code is much better than no-code - much better version management, code reviews, and existing code readability.
But what if you can do no-code faster, cheaper, and with fewer bugs?
ur channel is underrated.
I didn't know it was rated but hope you find it useful.
@@BryanCafferky Def. It's a hidden gem. Thanks for the content!
@@rick-kv1gl Thanks. Please let others know about my channel.
You, sir, are a master in title marketing 😂
did you really delete my comment.. wow.. I didn't even say anything bad.. just that I disagreed and thought you were wrong..
I evaluated Airflow and Luigi (which you didn't mention), and I feel that Airflow is the one that has enough extensibility to work with my company's compute resources/environment. It seems you just went through the tutorials and didn't implement anything significant in Airflow. The limitations you mention seem a little arbitrary (most people like Python) and I don't understand how these are resolved with the other options or what associated tradeoffs I would be making. Still going to use Airflow; this is clickbait.
Thanks for your thoughts, but I have asked colleagues who have used Airflow extensively and they agreed with my points. Also, most of the viewers of this video who left comments also agreed and confirmed with their experiences. It's not about Python; it's about the best solution to a problem. Sometimes that will be Airflow, but for most use cases I don't think it is, and I get concerned when people get defensive about a given technology. BTW: It's not clickbait when you follow through with content that is consistent with the title. Live Long and Prosper.
In the current startup where I work it is good enough, not expensive, and easy to use.
Thanks for the comment. Yes. It does do a lot. What other Workflow engines were considered?
@@BryanCafferky Mainly Kubeflow, but our company is not big enough to use fully dedicated Kubernetes clusters. Any tool you would recommend? Interesting video btw.
@@llorulez Thanks for the info. It all depends on what you need to do. The video was meant to get people to stop and think before jumping in, as Airflow is pretty complex but can be a great solution. For ETL/data movement, if the workflow is sequential, I would use a simpler tool, which I mention in the video. Databricks notebooks/jobs can work well but it depends on whether you need the scale. Dask looks good for non-Spark loads and is really easy to start with but gets complex with the scale-out. Each public cloud has its own ETL PaaS services as well. My focus is parsimony, i.e. just enough to do what you need and no more.
@@BryanCafferky Maybe it was easy for me because we extensively use Docker and it was quick using DockerOperators, but as you mention it can be really challenging.
Got fucking boomed by the title
Hi, please allow me to add that I used Airflow to run complex queries in Impala using .sql files (that contain the Impala query), run inside the DAG tasks in the needed order. This might be useful; for me it was. I agree that NiFi is the best and my favorite. Thanks
Thanks. Yes. Sounds like you had a good use case for Airflow.
Can it compete with Control M?
Don't know. Never heard of Control M. Do you work for them?
@@BryanCafferky Nope, it is a commercial scheduling tool I have used, and based on your presentation everything you mentioned is exactly like what Control-M does. A task or job scheduling tool!
cron job + make?
Hmm... not sure about the idea of throwing away configuration that is written down with a bunch of non-documented, non-recreatable jobs.
Dear lord... Please don't use ADF over Airflow unless you are doing a deployment pipeline. Unless you enjoy working DEEEP under the covers doing things like spinning up PowerShell jobs to complete tasks in an environment that is not really strongly backed by source control... unless you want to link it to a git repo and stare at JSON blobs to figure out what's wrong with the underlying "code". I do agree with you that there is a finite set of things that Airflow is good at and things that it shouldn't be used for out of the box.
I wholeheartedly disagree that the "Python" needed for many of the simpler DAG use cases is difficult to accomplish, as most of the out-of-the-box operators are pretty thoroughly documented and example code on how to use them lives everywhere on the internet. I would say that even in the case where you want to do something Airflow doesn't do directly out of the box, there is always the ability to use the numerous Python operators to run custom code, or the ability to spin up Kubernetes Pod Operators and allow them to scale in the cluster for heavier ML tasks.
"Use Databricks" ... yes you can ... but databricks is a potentially expensive way to orchestrate one thing. Whereas airflow can not only orchestrate Spark but do many things that Databricks cant do. Also ... at the end of the day Databricks just winds up being a bunch of JSON.
I think the ETL code you show is a fairly OK example of example code... but not really an example of how an ETL process would be set up in the real world. Nor are you showing many of the purely built-in operators that allow you to orchestrate jobs across a tremendous number of services in one centralized place.
Mostly, IMO, yes... if you don't want to write any code, don't use Airflow... If you are OK with some mostly cut/paste code for many basic DAGs and functions, but also want the ability to do things that none of the other mentioned tools I have personally looked at can do... I would give Airflow a shot... Or... if you aren't into doing ANY of the management work, look at a managed Airflow service.
Thanks. Excellent video. I recently moved to a data engineering project that uses Airflow with DBT (and Cosmos). Finding it difficult to understand why to use Airflow, especially with an ELT tool like DBT. For any task, there is a dependency on available operators if you want to use Airflow, and the Python code is tightly coupled with Airflow. And as the video says - you have to code everything. It's not that you can't get work done with Airflow and DBT, but with something like Pentaho you could have done it with half the effort.
Thanks for your comment. I would suggest also looking at Dagster. It addresses many of my concerns with Airflow. dagster.io/ Not sure how well it works with Databricks clusters though.
@@BryanCafferky Thanks. Will check !!
Good video! Besides, singular of "vertices" is "vertex." Because it is Latin.
- I thought it was obvious that Airflow's use case is to be the orchestrator of a data pipeline, not the executor. Whoever uses Airflow for ETL/ELT is using it wrong.
- I don't see a problem with it being a code-oriented tool, as Python is very easy to learn. It's almost low code.
- The comparisons with "best options" were meaningless. The use cases are different. It would have been more logical to have cited Prefect, perhaps Dagster.
Python as low code was a good laugh to start my morning, thanks.
I am actually looking for a scheduler to run Python scripts, but if that means I have to write MORE Python... good lord.
Thanks Bryan This is a good video.
Thanks. Very informative.
I would argue that the useful functions should be called into the Airflow context from a separate module. With this methodology, the Python could also be used to run the code outside of Airflow.
Am I missing something?
What are you responding to specifically?
@@BryanCafferky Reusability of code utilized by airflow.
For context, I landed here while listening to arguments for and against airflow because I'm trying to figure out if I'm going to learn it or Prefect. I don't know much about either, hence the question at the bottom of my comment.
@@jamescaldwell3207 Did you watch the video? I have no issue with reuse. Whichever fits your requirements with the least cost/effort to maintain is probably the best tool.
@@BryanCafferky Of course I did. My comment was regarding right around 13:50 where you state that generic functions cannot be used anywhere else because of the decorators.
I would think non-specific functions would be in a separate module and imported for use inside a task. If a function is specific to Airflow but generic within the operational capacity of Airflow, then one could create an Airflow-specific library for use across multiple jobs.
As stated, I'm deciding whether to learn one of two tools and my comment was an assumption which posed the question if I was missing something. Having now looked it up in the spirit of ending what is starting to feel like a combative exchange, I've learned my assumption was correct.
@@jamescaldwell3207 Sorry. No worries. Glad to get the question. I recorded this video 5 months ago so not all the details are still fresh in my mind. The reference time was helpful. Your point is valid. In fact, you could create non airflow generic function libraries too. As I look back at this, I can see that when using the decorator, only the outermost function is decorated. Also, you can write code that does not use the decorators although I think the decorators are intuitive. See this blog for more details. airflow.apache.org/docs/apache-airflow/stable/modules_management.html
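To make the point in this exchange concrete, here is a minimal sketch of the separate-module approach being discussed; the module, function, and DAG names are hypothetical, and it assumes the Airflow 2.x TaskFlow decorators. The generic function has no Airflow imports, so it can be unit tested and reused anywhere, while only the thin wrapper is decorated.

```python
# etl_lib/cleaning.py (conceptually) -- no Airflow imports at all
def deduplicate(records: list[dict], key: str) -> list[dict]:
    """Generic, reusable, unit-testable logic."""
    seen, out = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            out.append(rec)
    return out


# dags/dedupe_dag.py -- the only Airflow-specific piece
from datetime import datetime
from airflow.decorators import dag, task
# from etl_lib.cleaning import deduplicate  # in a real layout, import the shared module


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def dedupe_pipeline():
    @task
    def run_dedupe() -> int:
        records = [{"id": 1}, {"id": 1}, {"id": 2}]
        # The decorated wrapper just delegates to the plain function.
        return len(deduplicate(records, key="id"))

    run_dedupe()


dedupe_pipeline()
```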
Thanks. It is totally true that to use Airflow as a great ETL tool you need to focus effort on Python. When you are a developer that uses Python and can prepare SQL queries, the result is perfect. Anyway, I will consider NiFi because I don't know it. Let me read about it.
clear and great explanation
Surprised you didn't mention Argo as an alternative.
There are many alternatives. Too many to cover them all. Thanks for the suggestion.
Thanks, best explanation ever!
You're welcome!
Hi Bryan, the gcp Gui etl option is datafusion.
Thanks for that. Good to know.
You are correct 👍
My exact thoughts... literally a job scheduler on roids...
great video!! thank you
Agree with the video.
Azure data factory is far superior in my experience. Airflow isn't terrible though.
Thank you so much
You're Welcome!
Wonderful video, myth busted.
Can you please throw some light on the dbt tool? It's also being promoted as an ETL tool, but I am not sure of its use case.
Seriously, why should it be a downside that it is 100% code and requires DevOps to manage? Mostly these days that's what we want, because it gives 100% control over what happens in the flows and how it happens. I see tons of ETL tools (which often are also job schedulers) being used mainly as schedulers with a lot of workarounds or custom SQL around them, simply because they don't facilitate or support the most basic things needed to do heavy transformations, etc. If these out-of-the-box "drag and drop" tools are so good, why do I keep seeing people implement workarounds on them? It doesn't make any sense these days. IMO I would rather see more code in ETL work than less, and have well-written code doing the job for us - code that I can audit and move around as I want. Also, I think you aren't doing justice to Airflow at all. Saying that it is complicated to use is really hilarious. All things take time to learn, mate. When you know the tool well, it takes no time to do things in it. That's how all things work in this world.
Thanks for your comments. The gist of my thoughts is parsimony. Use the solution that requires no more work to maintain than necessary while meeting the requirements. If Airflow is needed, great! I think it not needed in most situations.
thanks
Very true..
very good!!