Introduction to Apache Airflow

dmayle · on May 29, 2020

I use Airflow, and am a big fan. I don't think it's particularly clear, however, as to when to use airflow.

The single best reason to use airflow is that you have some data source with a time-based axis that you want to transfer or process. For example, you might want to ingest daily web logs into a database. Or maybe you want weekly statistics generated on your database, etc.

The next best reason to use airflow is that you have a recurring job that you want not only to happen, but to track it's successes and failures. For example, maybe you want to garbage-collect some files on a remote server with spotty connectivity, and you want to be emailed if it fails for more than two days in a row.

Beyond those two, Airflow might be very useful, but you'll be shoehorning your use case into Airflow's capabilities.

Airflow is basically a distributed cron daemon with support for reruns and SLAs. If you're using Python for your tasks, it also includes a large collection of data abstraction layers such that Airflow can manage the named connections to the different sources, and you only have to code the transfer or transform rules.

javajosh · on May 29, 2020

Yes, this seems to be yet another tool that falls prey to what I think of as "The Bisquick Problem". Bisquick is a product that is basically pre-mixed flour, salt, baking powder that you can use to make pancakes, biscuits, and waffles. But why would you buy this instead of its constituent parts? Does Bisquick really save that much time? Is it worth the loss of component flexibility?

Worst of all, if you accept Bisquick, then you open the door to an explosion of Bisquick options. Its a combinatorial explosion of pre-mixed ingredients. In a dystopian future, perhaps people stop buying flour or salt, and the ONLY way you can make food is to buy the right kind of Bisquick. Might make a kind of mash up of a baking show and Black Mirror.

Anyway, yeah, Airflow (and so many other tools) feel like Bisquick. It has all the strengths, but also all the weaknesses, of that model.

jacobr1 · on May 29, 2020

The art of software engineering is all about finding the right abstractions.

Higher-order abstractions can be a productivity boon but have costs when you fight their paradigm or need to regularly interact with lower layers (in ways the designs didn't presume).

Airflow and similar tools are doing four things:

A) Centralized cron for distributed systems. If you don't have a unified runtime for your system, the old ways of using Unix cron, or a "job system" become complex because you don't have centralized management or clarity for when developers should use one given scheduling tool vs another.

B) Job state management. Job can fail and may need to be retried, people alerted, etc ... Most scheduling system has some way to do deal with failure too, but these tools are now treating this as stored state

C) DAGs, complex batch jobs are often composed of many stages with dependencies. And you need the state to track and retry stages independently (especially if they are costly)

D) What many of these tools also try to do, is tie the computation performing a given job to the scheduling tool. This now seems to be an antipattern. They also try to have "premade" job stages or "operators" for common tasks. These are a mix of wrappers to talk to different compute systems and actual compute mechanisms themselves.

If you have the kind of system that is either sufficiently distributed, or heterogeneous enough that you can't use existing schedulers, you need something with #A, but if you also need complex job management, you need #A, #B and #C, and having rebuilt my own my times, using a standard system is better when coordinating between many engineers. What seems necessary in general is #D.

jacobr1 · on June 3, 2020

I meant to say D seems unnecessary

andrewflnr · on May 30, 2020

Just playing devil's advocate for a bit: the horror of your Bisquick scenario depends in part on the assumption that salt, flour, etc are fungible across applications, which is not quite true. Flour, sugar, and probably other trace ingredients for managing texture benefit from using different types in different recipes. If any of those benefit from economies of scale, it could well be optimal in some sense to have mixes for everything. This is much closer to being true in software, where different circumstances demand different concrete implementations of abstractions like, say, "scheduler" (analogous to grade/type of abstract ingredient like "flour").

Ed: I should say, I really like this metaphor, and I expect it will crop up in my thinking in the future.

vorpalhex · on May 29, 2020

I realize this is a metaphor and I'm answering the metaphor and not the underlying problem, but: camping. Seriously, want some quick pancakes or donuts when you're out in the field? Bisquick and just change up how much water you add.

jstarfish · on May 29, 2020

Also disaster survival.

Couscous, Bisquick and other low-or-no-heat, premixed, just-add-water solutions are a godsend to have when a tornado takes out your gas line or electric grid.

eeZah7Ux · on May 29, 2020

On top of that, such Bisquick takes hours of learning, deployment, troubleshooting and has complex failure modes compared to few cronjobs and trivial scripts.

awhitby · on May 29, 2020

Assume that future kinds of Bisquick can have negative amounts of flour, salt or baking powder. Now recipes in the dystopian future just require a simple change of basis.

javajosh · on May 29, 2020

Yes, that is true. If you allow negative ingredients you can indeed reach all points in the characteristic state-space of baking, even when limited to picking from a huge set of proprietary Bisquicks. Which is a hopeful thought. I think.

rywalker · on May 29, 2020

Implying that Airflow is a simple mixture of a few ingredients is selling it quite short. There are a lot of knobs and switches in Airflow (i.e. features) that have been built and battle-tested over a lot of users. It has quite a lot of dependencies across the scheduler, webserver, cli, and worker modes. And there is a lot of new development going into Airflow in recent months (new API, DAG serialization, making the scheduler HA).

michaelcampbell · on May 29, 2020

Brawndo. It's what plants crave.

_mzl1 · on May 29, 2020

Your comment doesn't provide insight of when and when not to use it.

javajosh · on May 29, 2020

I guess as I've grown older I've grown wary of black-and-white thinking. The insight I would share with you is to be wary of Bisquick, but do not dismiss it outright. All creation is combination, and you won't succeed saying no to all combinations. In the same way, you won't succeed saying yes to every combination.

nickpeterson · on May 29, 2020

I think what you’re saying is, you cannot start from first principles if you want to accomplish most things, but you need to understand first principles to not misuse those things.

_mzl1 · on June 1, 2020

I guess so, like everything comes with pros and cons you know

simo7 · on May 29, 2020

I would add these gotchas/recommendations:

- Airflow the ETL framework is quite bad. Just use Airflow the scheduler/orchestrator: delegate the actual data transformation to external services (serverless, kubernetes etc.).

- Don't use it for tasks that don't require idempotency (eg. a job that uses a bookmark).

- Don't use it for latency-sensitive jobs (this one should be obvious).

- Don't use sensors or cross-DAG dependencies.

So yeah unfortunately it's not a good fit for all the use cases, but it has the right set of features for some of the most common batch workloads.

Also python as the DAG configuration language was a very successful idea, maybe the most important contributor to Airflow success.

prions · on May 29, 2020

> - Don't use it for tasks that don't require idempotency (eg. a job that uses a bookmark).

You can totally design your tasks to be idempotent - but its up to you to make them that way. The scheduler or executor doesn't have any context into your job.

This is why I encourage people to use a unified base operator and then pass their own docker containers to it. Aka like how https://medium.com/bluecore-engineering/were-all-using-airfl... outlines it.

> - Don't use it for latency-sensitive jobs (this one should be obvious).

IIRC this is being addressed in Airflow 2.0

> - Don't use sensors or cross-DAG dependencies.

This is a little extreme. I've never ran into issues with cross dag dependencies or sensors. They make managing my DAGs way easier because wee can separate computation dags from loading dags.

context: I built/manage my company's Airflow platform. Everything is managed on k8s.

simo7 · on May 29, 2020

> You can totally design your tasks to be idempotent

Yes of course, I mean Airflow is not a good fit for the tasks you don't want to be idempotent (I think most but not all tasks should be idempotent).

> I've never ran into issues with cross dag dependencies

I believe Airflow docs advice against them when possible. I see why from my experience: less visibility and more complexity, especially for backfills.

domenp · on May 29, 2020

> context: I built/manage my company's Airflow platform. Everything is managed on k8s

My team is running Airflow on a single node but we're slowly outgrowing this setup. We're considering running jobs on k8s.

Curious what's your setup like? Is your cluster of a fixed size or does it scale with the load?

ForHackernews · on May 29, 2020

Using the KubernetesPodOperator for everything adds a huge amount of overhead. You still need Airflow worker nodes, but they're just babysitting the K8S pods doing the real work.

I know it's 2020 and memory is cheap or whatever, but Airflow is shockingly wasteful of system resources.

rywalker · on May 29, 2020

re: ETL framework - you can get a lot done with the built-in Airflow Operators, including the PythonOperator (and bring in any python dependency you like) and BashOperator (call any CLI, etc.) - it's not drag-and-drop, but I've found it to be quite versitile.

re: idempotency - yes, make your workflow tasks idempotent.

re: latency - this is being worked on very actively. Ash (PMC member) has committed to working on task latency almost exclusively until it's resolved

re: sensors, there is some great work from Airbnb to improve: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+A...

hn2017 · on May 29, 2020

If using a docker or kubernetes operator, you can't make use of operators or Airflow connections. Do you just work without those?

simo7 · on May 29, 2020

Correct, I use operators only to delegate the actual workload to an external service.

hn2017 · on May 29, 2020

Got it, to be clear - do you still use the Airflow for storing connections, or no? If no, how do you store your credentials? We've only done a POC and we've discovered a higher than expected learning curve.

IanCal · on May 29, 2020

This is handy thanks - I deal mostly with luigi and this helps me place airflow a bit better.

contravariant · on May 29, 2020

Sensors are quite helpful when you don't know the exact moment your data will come in.

simo7 · on May 29, 2020

I think it's better to place everything in one DAG if that solves the problem. If it doesn't then sensors are ok I guess, but I would try to avoid them otherwise.

prions · on May 29, 2020

I really disagree here. Monolithic dags are a bigger pain to manage.

Breaking them out into smaller dags makes retrying/backfilling/etc a lot more straightforward. It also lets you reuse those pieces more easily.

We have compute DAGs that are the upstream dependency for many other dags. Originally, this dag was monolithic and loaded data into one table. But because the dag is split into computation and loading we can easily add more downstream dags without changing how the first one operated.

bosie · on May 29, 2020

What do you mean by bookmark in this context?

vorpalhex · on May 29, 2020

"I have processed 5/15 records, next run I need to start at record 6". Bookmarking is a common concept for working through a large job in several small runs.

devonkim · on May 29, 2020

I wonder if this difference in jargon has origins in different sub-fields. I recognize that concept as "checkpoints" across different companies but I also remember seeing the term from data science folks and thinking that I just don't know the concept.

nickpeterson · on May 29, 2020

Often called a Waterline/Watermark as well.

sk5t · on May 29, 2020

I'd call it a watermark too; "checkpoint" can mean an intermediate point in a multi-step database transaction.

jstarfish · on May 29, 2020

I've always called them offsets.

wiml · on May 29, 2020

I would call it "a read cursor (that's updated within the transaction)", but I'm boring.

unixhero · on June 1, 2020

Cursor nomenclature is used for describing where in a sequence the operator is during a code execution pass in database query or similar right?

I think I kind of immediately understand "cursor" the best in this context. I also agree, it's a little boring and definitely old school:).

gravitas · on May 29, 2020

> The next best reason to use airflow is that you have a recurring job that you want not only to happen, but to track it's successes and failures.

For this specific use case, I use healthchecks.io - trivial to deploy in almost any context which can ping a public URL. Generous free tier limits so I've got that going for me which is nice.

chrischen · on May 29, 2020

I thought the point of airflow is for orchestration in an event driven microservices architecture? That’s what Uber uses Cadence for at least.

aequitas · on May 29, 2020

I recently started investigating Airflow for our use-case and it seems exactly what you describe and not more. But in it niche it excels featurewise, at least regarding the features I need to expose to the users.

sdepablos · on May 29, 2020

There're so many alternatives to Airflow nowadays that you really need to make sure that Airflow is the best solution (or even a solution) to your use case. There's plenty of use cases better resolved with tools like Prefect or Dagster, but I suppose the inertia to install the tool everyone knows about is really big.

BTW, here's https://github.com/pditommaso/awesome-pipeline a list of almost 200 pipeline toolkits.

walleeee · on May 29, 2020

I've had a wonderful experience with Dagster so far. I love that it can deploy to Airflow, Celery, Dask, etc, I love the Dagit server and UI and that I can orchestrate pipelines over HTTP, I love the notebook integration via Papermill, I love that it's all free (looking at Prefect here...), and the team is extremely responsive on both Slack and GitHub

_frkl · on May 29, 2020

Didn't Prefect open source their orchestration component recently, or am I mistaken? What part of Prefect is still closed?

walleeee · on May 29, 2020

Oh, I was saying it wasn't free. I think you're right and it is fully open source

rywalker · on May 29, 2020

It's not quite "open source", better labelled as "source available"

dmayle · on May 29, 2020

I've gone through a large number of these, and I think that Airflow is the best on Kubernetes for managed orchestration. The things I like are:

  * Source control for workflows/DAGs (using git-sync)
  * Tracking/retries with SLAs
  * Jobs run in Kubernetes
  * Web UI for management
  * Fully open source

I also use Argo Workflows, because I like its native handling of Kubernetes objects (e.g. the ability to manage and update a deployment as one of the steps), but it just doesn't have the orchestration/tracking side of things very well managed yet

tiagod · on May 29, 2020

I absolutely love Cadence/Temporal. It can do anything the others can do (and you could implement any other engine with it), without any DSLs.

bjacobso · on May 29, 2020

I'd also like to +1 on Cadence/Temporal. It's a really great mental model and framework for thinking about workflows.

This is more of a proof of concept but it can also support DSLs (although we found the go client is easier to understand than DSLs): https://github.com/checkr/states-language-cadence

mfateev · on May 29, 2020

As a team lead of the Cadence/Temporal it is really amazing to hear such a feedback. Thank you!

BTW. You can AMA about the technology here.

tiagod · on May 29, 2020

Thank you for your work! I believe once it's properly documented and marketed, it will be very popular.

cpdean · on May 29, 2020

I think this is the link to cadence: https://github.com/uber/cadence

and temporal: https://github.com/temporalio/temporal

tiagod · on May 29, 2020

Yep, this is it. They're developed by the same team, Temporal is still beta but I believe the production version is coming out late-June.

I can't find an easy way to explain everything it does, but it pretty much allows you to write naive functions with no error handling, with month-long sleeps, auto-retries on unreliable function calls, etc etc.

It also gives you a web interface when you can inspect the running functions, and allows for external code (and other workflow functions) to signal/query the running workflows.

silasb · on May 29, 2020

Just found out about Temporal and it looks interesting. I eager to jump in but our organization primarily uses Ruby. I know the big difference between Cadence vs Temporal is the fact they are using GRPC which seems much easier to adopt.

tiagod · on May 29, 2020

I also find the API to be more consistent on Temporal (I began implementing a project in Cadence before finding out about Temporal)

mfateev · on May 29, 2020

Temporal has an externally contributed Ruby client. It is going to be open sourced very soon.

silasb · on May 29, 2020

Very exciting news. Is there a way that I can subscribe to get info on when it'll be released?

mfateev · on May 29, 2020

There is #ruby-sdk channel in the Temporal Slack: https://join.slack.com/t/temporalio/shared_invite/zt-c1e99p8...

autogenerated · on May 29, 2020

Moving off of Airflow and to Cadence/Temporal was the single biggest relief in terms of maintainability, operational ease and scalability. Also +1 on being free of any DSL.

tiagod · on May 29, 2020

I'm currently moving from a custom yaml DSL-based engine to Temporal and it's the best architectural decision I've taken in a long time. I researched a lot and couldn't find anything that even came close to the freedom it provides.

alexg123 · on May 29, 2020

Curious about this. Can you elaborate more? Also happy to hop on a call/zoom if you don't want to share publicly (email me at saguziel@gmail.com). I'm working on something similar.

sequoia · on May 29, 2020

> A curated list of awesome pipeline toolkits

Is this "curated"? It seems like an exhaustive "dump" of toolkits.

unixhero · on June 1, 2020

What curation do you feel it lacks?

sequoia · on June 2, 2020

To curate a collection is to be an editor, determining what to include and what to exclude. From a curated selection of tools, I expect to see a selection of the best tools, chosen by a knowledgable curator who has evaluated the tools in some way.

So for example, if you said "I want to start a library, donate any books you have at my house" that would not be a "curated" collection in my opionion. If you went through & evaluated the books, selecting only those that you'd personally recommend and discarding the rest, that would be a "curated" collection.

(Strictly speaking any maintenance of a collection even just "dump everything at my door at I'll put it on the pile" could be considered "curation" but to call a list of tools "curated" suggests there's some selection going on, and there does not appear to be on this list.)

EDIT: In fact, all of the definitions of "curate" here[0] start with the word "select," for example "Select, organize, and look after the items in (a collection or exhibition)." This fits my meaning. This definition[1] of "curator" requires one to merely have "care and superintendence of something," so in that sense the list is "curated" insofar as someone looks at PRs and clicks the "merge" button.

The Miriam-Webster definition, however, speaks of curating a more general "something" rather than a "collection." If you "curate" a statue by cleaning & protecting it, fine, you don't need to select the statue. However, when a collection is curated, in my opinion, this necessarily implies selection, not just maintenance.

0: https://www.lexico.com/en/definition/curate 1: https://www.merriam-webster.com/dictionary/curator

Peteris · on May 29, 2020

One way to get around this is to use Kedro https://github.com/quantumblacklabs/kedro, which is the most minimal possible pipeline interface, yet allows you to export to other pipeline formats and/or build your own exporters.

sdepablos · on May 29, 2020

Interesting (and in the list). Reading the documentation looks like the only other format it allows to be exported is Airflow, isn't it?

adolph · on May 29, 2020

Here's a nice podcast about Prefect that talks about its and Airflow's approachs:

https://softwareengineeringdaily.com/2020/04/29/prefect-data...

eyeball · on May 29, 2020

We used Luigi because airflow was to complicated to get an unsupportive IT department to install.

anordin95 · on May 29, 2020

I've been using Airflow for nearly a year in production and I'm surprised by all the positive commentary in this thread about the tool. To be fair, I've been using the GCP offered version of Airflow - Composer. I've found various components to be flaky and frustrating. For instance, large scale backfills don't seem to be well supported. I find various components of the system break when trying to large scale backfills, for instance 1M DAG runs. As another note, the scheduler also seems rather fragile and prone to crashing. My team has generally found the system good, but not rock-solid nor something to be confident in.

rywalker · on May 29, 2020

Yes, we need to improve the backfills in Airflow.

We're working on making the scheduler HA and more performant, reach out to me if you'd like to collaborate on your use case (ry at astronomer dot io)

chrisjc · on May 29, 2020

This mirrors my experience with Airflow by another SaaS. Ended up getting rid of it when our Data Warehouse introduced tasks and stored procedures, since most of our work was ELT.

Peteris · on May 29, 2020

Airflow is an incredibly powerful framework to use in production, but a little unweildy for anything else.

You can use something like Kedro (https://github.com/quantumblacklabs/kedro) to get started building pipelines with pure Python functions. Kedro has its own pipeline visualiser and also has an Airflow plugin that can automatically help you generate airflow pipelines from Kedro pipelines.

https://github.com/quantumblacklabs/kedro-airflow

zukzuk · on May 29, 2020

Airflow is great, right up to the point where you try to feed date/time-based arguments to your operators (a crucial bit of functionality not covered in the linked article). The built-in API for that is a random assortment of odd macros and poorly designed python snippets, with scoping that never quite makes sense, and patchy and sometimes misleading documentation.

diogofranco · on May 29, 2020

Agree that this is a bit confusing. I ended up writing a small guide on how date arguments work in airflow (https://diogoalexandrefranco.github.io/about-airflow-date-ma...) and I always end up consulting it myself, as I just can't seem to memorize any of these macros.

ianbutler · on May 29, 2020

Airflow is great, honestly the biggest gotchas are passing time to operators which someone has mentioned in the thread already and setting up the initial infra is a bit annoying too. Other than that though as a batch-etl scheduler and all around job scheduler it's pretty great, it's really a very user friendly and it's graphical interface simplifies a lot of the management process. I see a lot of people prefer the non graphical libs here like Luigi or Prefect and to each there own but I really do prefer having that interface in addition to the pipelines as code line of thinking.

I also see a lot of people saying it's a solution for big companies and the like, I heavily disagree it's useful for any size company that wants to have better organization of their pipelines and provide an easy way for non technical users to check on their health.

hn2017 · on May 29, 2020

FYI - Prefect has a GUI too.

I also agree Airflow is good for smaller companies too if they're familiar with Python.

ianbutler · on May 29, 2020

That's good to know, I had been trying to figure that out but the doc pages I landed on didn't really make it clear when I was looking at it.

naveedn · on May 29, 2020

I don’t think this blogpost provides any value over the official documentation. You can run through the airflow tutorial in about 30 minutes and understand all the main principles pretty quickly.

bthomas · on May 29, 2020

I recently went through the airflow docs for the first time - agree. But this comment thread has much more helpful than any of the docs!

zomglings · on May 29, 2020

I'm actually considering using Airflow. Have never used it before, and I have the impression that setting up the required infrastructure could be problematic.

Since a lot of you use Airflow, I am curious about your experience with it:

1. Are you hosting Airflow yourselves or using a managed service?

1. a. If managed, which one? (Google Cloud Composer, Astronomer.io, something else?)

1. b. If self-hosted, how difficult was the setup? It seems daunting to get a stable setup (external database, rabbit or redis, etc.).

2. Do you use one operator (DockerOperator looks like the right choice) or do you allow yourself freedom in operators? Do you build your own?

3. How do you pass data from one task to the next? Do the tasks themselves have to be aware of external storage conventions or do you use the built-in xcom mechanism? It seems like xcom stores messages in the database, so you run the risk of blowing through storage capacity this way?

pyrophane · on May 29, 2020

1. Managed, Cloud Composer. Cloud Composer is getting there. It feels much less buggy then just 8 months ago when I started using it, and it is improving rather quickly.

One downside with Composer, though, is that it must be run in its own GKE cluster, and it deploys the Airflow UI to App Engine. These two things can make it a bit of a pain to use alongside infrastructure deployed into another GKE cluster if you need the two to interact.

I would probably still recommend Composer over deploying your own Airflow into GKE, as having it managed is nice.

2. Freedom. For some tasks we run containers in GKE, for other we use things like the PythonOperator or PostgresOperator.

A note here: Using containers with Airflow is not trivial. In addition to needing some CI process to manage image building/deployment, having the ability to develop and test DAGs locally takes some extra work. I would only recommend it if you are already invested in containers and are willing to devote the time to ops to get it all working.

3. X-com is useful for small amounts of data, like if one task needs to pass a file path, IDs, or other parameters to a downstream task. For everything else have a task write its output to something like S3 or a database that another task will read from.

All in all, I would say use Airflow if you need the visibility and dependency management. Don't use it if you could get away with something like cron and some scripts or a simple pool of celery workers.

Also, don't use it if your workflows are highly dynamic. For example, if you have a situation where you need to run a task to get a list of things, then span x downstream tasks based on the contents of the list. Airflow wants the shape of the DAG to be defined before it is run.

Hope that helps.

hn2017 · on May 29, 2020

Your last point about highly dynamic workflows was a particular pain point for me and I think for many others. One recommendation for Airflow is to create a list of use cases with sample DAGs to show best practices.

zomglings · on May 29, 2020

Thank you, it helps a lot!

mrshu · on May 29, 2020

1. Managed, on AWS ECS -- it was not that difficult to set up using AWS services (Aurora for the DB, ElastiCache for Redis)

2. Freedom. We generally use PythonOperators but it is not uncommon to run containers as well. I agree with pyrophane, setting up containerized operators really is a non-trivial and far from straightforward to test locally. Still, it seems worth doing, particularly if you do not want various execution to affect one another.

3. Again, echoing what pyrophane said, a custom solution is needed for anything that's more than a couple hundred bytes in size. There even exists (now mostly abandoned) plugin that allows you to streamline the whole process: https://github.com/industrydive/fileflow Writing directly to something like S3 is almost always sufficient in combination with passing the path for the file from one task to another.

Having said that, I would encourage you to try it out, even if the setup may sound daunting at first. If you can model your tasks as DAGs whose graphs are known in advance, I would argue it almost always makes sense because you get many things "for free" that were not mentioned before: logging, backfill, standard handling of connections/secrets and a ton of metrics in a nice UI that helps a lot with the visibility part.

ies7 · on May 30, 2020

1. self hosted on ec2.

1b. Pip install airflow[all|what you need] The airflow itself is easy to install. I’d say that installing the external tools is also easy. I believe installing pg, redis, or celery should be categorized in easy. It’s not the kafka or k8s level of installation.

2. Freedom.

3. Custom scripts

nojito · on May 29, 2020

Airflow is a great example of technology being used at a massive company for massive company problems...which is now being pushed as a solution to everything

Papermill is another example.

bhntr3 · on May 29, 2020

When Max started building airflow Airbnb's engineering team was probably 80 or so people. The data engineering and data infrastructure teams couldn't have been more than ten combined. I'm not sure that's massive but it is the point where the previous solution (Chronos: https://medium.com/airbnb-engineering/chronos-a-replacement-...) was starting to strain.

Airflow may not be necessary for a ten person team but it could be if that team has complicated data orchestration needs and doesn't want to incrementally replace their infrastructure every couple years.

mliker · on May 29, 2020

Airflow can be used to process data that only adds up to 10s or 100s of MBs. In aggregate, a big tech company processes petabytes or exabytes, but remember, technology like Airflow isn't processing all of that simultaneously within one task. So no, I would not characterize Airflow as a solution to just "massive company problems".

pgoggijr · on May 29, 2020

Been using airflow in Production at a small start-up for almost 4 years now and been very happy with it (small dockerized deployment, and simple syncing with github) - what big company problems does it solve that make it unsuitable for use at small companies?

detaro · on May 29, 2020

Any hints at solutions that help if you're a small company having small company sized problems but don't want to DIY the entire flow execution logic?

walleeee · on May 29, 2020

I'm the lone developer on a project which will likely never scale beyond a few thousand users and I'm really liking Dagster. You can deploy it on a bunch of other platforms (Airflow, Dask, Celery, K8s) which is really nice for my use case (automating workflows in HPC environments from the browser) or run it standalone

rywalker · on May 29, 2020

We were a very small company when we discovered Airflow, just saying, and it provided a lot of value to us vs. alternatives.

We fell in love with Airflow's capability to allow us to write one small python program that generated hundreds of tailored dags by iterating over the configuration of each of our customers. This dynamism is very powerful for similar use cases.

higeorge13 · on May 29, 2020

I have set it up, configured and performed all the upgrades just by myself in a small startup, performing very complex daily tasks.

I miss features, sure. The whole execution timing might be confusing the first time, sure. But I can't figure out why i see so many comments regarding deployment difficulties.

slap_shot · on May 29, 2020

> Airflow is an ETL(Extract, Transform, Load) workflow orchestration tool, used in data transformation pipelines.

Apologies if this is pedantic, but the orchestration of jobs transcends ETL workflows. There's countless usecases of scheduling dependent jobs that aren't ETL workloads.

carlosf · on May 29, 2020

I've been quite happy with the following pattern:

- Encapsulate your business logic in microservices and expose ETL actions with APIs.

- Call your microservices using Airflow.

That way my Airflow jobs are very lightweight (they only call https APIs) and only contain logic of when doing things, not how. All the core business logic for a specific domain lives in a single container that can be tested and deployed independently.

Doing ETL using Airflow jobs exclusively or Lambda would spread business logic and make it a nightmare to test and reason about.

unixhero · on May 29, 2020

Thread a few weeks back with Apache Airflow's cousin Apache Nifi [0] . A lot of great discussion in that thread, just like in this one.

[0] https://news.ycombinator.com/item?id=23144450

chartpath · on May 29, 2020

Happy user of Prefect here. I prefer it for being more programmable and able to run on Dask. If you just want dynamic distributable DAGs and not necessarily an "ops" appliance feel (like Airflow), check them out: https://docs.prefect.io/core/getting_started/why-not-airflow...

Not knocking Airflow, it is great. Luigi too.

hn2017 · on May 29, 2020

Airflow isn't perfect but it's in active development and one of the biggest pros compared to other toolkits is that it's an Apache high-level project AND it's being offered by Google as Cloud Composer. This will make sure it sticks around and maintains development for some time.

https://cloud.google.com/composer/

ForHackernews · on May 29, 2020

Airflow has major limitations that don't become obvious until you're already deep into it. I'd advise avoiding it myself.

It's only useful if you have workloads that are very strictly time-bounded (Every day, do X for all the data from yesterday). It's virtually impossible to manage an event-driven or for-each-file-do-Y style workflow with Airflow.

gtrubetskoy · on May 29, 2020

If you are using BigQuery and your "workflow" amounts to importing data from Postgres/MySQL databases into BQ and then running a series of SQL statements into other BigQuery tables - you might want to look at Maestro, it written in Go and is SQL-centric, there is no Python dependency hell to sort out:

https://github.com/voxmedia/maestro/

With the SQL-centric approach you do not need to specify a DAG because it can be inferred automatically, all you do is maintain your SQL and Maestro takes care of executing it in correct order.

throwaway7281 · on May 29, 2020

Having used a variety of modern ETL frameworks in the past years, I consider writing a hands-on book about what I have learned on the way.

If I may ask, what questions do you find most difficult to solve in the context of real-world ETL setups?

knite · on May 30, 2020

There are too many different tools in the space. I've been heavily researching workflow / ETL frameworks this week, and even after culling the ones that seemed like poor fits, I'm still left with:

- https://github.com/getpopper/popper

- https://docs.pachyderm.com/

- https://github.com/lyft/flyte

- https://aws.amazon.com/step-functions/

- https://github.com/spotify/luigi

- https://docs.metaflow.org/

- https://github.com/dagster-io/dagster

- https://github.com/argoproj/argo

- https://github.com/prefecthq/prefect

ramraj07 · on May 29, 2020

Primary problem for me is spending so much time setting up these monsters to do what's basically a set of cronjobs. What's the most simplest system out there that can be highly available and deployed as easily as possible?

Another question is, I strongly feel like the definition of pipelines should not be in code, but in the database. I keep coming back to that design pattern every time I start coding my own simple scheduling solution. Is there merit to this thought?

throwaway7281 · on May 29, 2020

Yes, cron is a bit undervalued in that for one off (well locked) tasks it's perfectly fine to create a crontab entry. And simplicity is king. I feel people throw frameworks at problems where a simple shell/go script in a cron would be just enough.

As for the pipeline definition. One goal is to have a notion of pipelines that is both comprehensive and declarative.

As for a database, what would you store there? Container image to run? Past execution data (e.g. output path, time, errors)?

The software world has many pipeline-y things, such as CI definitions and these definitions usually live in configuration files.

What is difficult at time is the tracking of done tasks. Is the output a file or a new row in some database or many files or many rows or anything else?

aequitas · on May 29, 2020

> KubernetesExecutor runs each task in an individual Kubernetes pod. Unlike CeleryCelery, it spins up worker pods on demand, hence enabling maximum usage of resources.

You'll probably use up a lot of resources indeed, depending on how big your tasks are you will have quite some overhead to run each and every one in a seperate pod, compared to running them in a Celery Multiprocessing "thread" on an already running worker container.

Vaslo · on May 29, 2020

Has anyone who has used this also used SSIS? Curious as to how the two compare as I use SSIS currently and have gained some experience with Python.

kfk · on May 29, 2020

I always felt neither Airflow nor Superset solved any of the foundational problems with data analytics today. If we take Airflow, it is relatively easy to schedule runs of scripts using cron (or more fancy Nomad jobs with a period stanza). What else does Airflow give me that cron doesn't? Is the parallelization stuff working? Dask is built from the ground up with parallelization in mind, sure it seems to solve a more foundational problem than Airflow. Is triggering and listening to events working? Doesn't look like. Is collaboration working? Doesn't seem to be the case since after writing your python script you need to basically rewrite it into an Airflow dag.

naveedn · on May 29, 2020

Airflow is 100% better at chaining jobs together than cron. The Airflow scheduler makes it so you don’t have to worry about putting “sleep” calls in your bash scripts to wait for some conditions to be met, and allows for non-linear orchestration of jobs. You might have different flows that need to run on different schedules, and with airflow, you can wait until one flow is done before the other starts.

bosie · on May 29, 2020

dependency management. if task #3 fails, any task depending on it shouldn't run. not easy to do with cron based triggers

kfk · on May 29, 2020

Yes but that is a feature of parallel and Dask does it well.

bosie · on May 29, 2020

I don't know too much about dask, how would you build a node in a Dask DAG to execute a java app and analyse its results (i.e. say database entries) to evalutate the success of that step?

pgwhalen · on May 29, 2020

Airflow is generally not aware of data flow, it just runs something once its upstream dependencies have finished. Less useful in a way, but also much more broadly applicable.

jillesvangurp · on May 29, 2020

I've spent the past month+ setting airflow up. To be honest, I don't like it for a lot of reasons:

1) it's not cloud native in the sense that running this on e.g. AWS is an easy and well trodden path. Cloud is left as an exercise to the reader of the documentation and at best vaguely hinted at as a possibility. This is weird because that kind of is the whole point of this product. Sure, it has lots of things that are highly useful in the cloud (like an ECS operator or EMR operator); but the documentation is aimed at python hackers running this on their laptop; all the defaults are aimed at this as well. This is a problem because essentially all of that is wrong for a proper cloud native type environment. We've looked at quite a few third party repos for terraform, kubernetes, cloudformation, etc that try to fix this. Ultimately we ended up spending non trivial amounts of time on devops. Basically, this involved lots of problem solving for things that a combination of wrong, poorly documented, or misguided by default. Also, we're not done by a long shot.

2) The UX/UI is terrible and I don't use this word lightly. Think hudson/jenkins, 15 years ago (and technically that's unfair to good old Hudson because it never was this bad). It's a fair comparison because Jenkins kind of is a drop in replacement or at least a significant overlap in feature set. And it arguably has a better ecosystem for things like plugins. Absolutely everything in Airflow requires multiple clicks. Also you'll be doing CMD+R a lot as there is no concept of autorefresh. Lots of fiddly icons. And then there's this obsession with graphs and this being the most important thing ever. There are two separate graph views, only one of which has useful ways of getting to the logs (which never requires less than 4-5 mouse clicks). And of course the other view is the default under most links so you have to learn to click the tiny graph icon to get to the good stuff.

3) A lot of the defaults are wrong/misguided/annoying. Like catch up defaulting to true. There's this weird notion of tasks (dags in airflow speak) running on a cron pattern and requiring a start date in the past. Using a dynamic date is not recommended (i.e. now would be a sane default). So typically you just pick whatever fixed time in the past. When you turn a dag on it tries to 'backfill' from that date unless you set catchup to false. I don't know in what universe that's a sane default. Sure, I want to run this task 1000 times just because I unpaused it (everything is paused by default). There is no way to unschedule that. Did I mention the default parallism is 32. That in combination with the docker operator is a great way to instantly run out of memory (yep that happened to us).

4) The UI lacks ways to group tasks like by tag or folders, etc. This gets annoying quickly.

5) Dag configs as code in a weakly typed language without a good test harness leads to obvious problems. We've sort of gobbled together our own tests to somewhat mitigate repeated deploy screw ups.

6) implementing a worker architecture in a language that is still burdened with the global interpreter lock and that has no good support for either threading or light weight threads (aka co-routines) or doing things asynchronously, leads to a lot of complexity. The celery worker is a PITA to debug.

7) IMHO the python operator is a bad idea because it gives data scientists the wrong idea about, oh just install this library on every airflow host please so I can run my thingy. We use the Docker operator a lot and are switching to the ECS operator as soon as we can figure out how to run airflow in ECS (we currently have a snow flaky AMI running on ec2).

8) the logging UI is terrible compared to what I would normally use for logging. Looking at logs of task runs is kind of the core business the UI has to do.

9) Airflow has a DB where it keeps track of state. Any change to dags basically means this state gets stale pretty quickly. There's no sane way to get rid of this stale data other than a lot of command-line fiddling or just running some sql scripts directly against this db. I've manually deleted hundreds of jobs in the last month. Also there's no notion of having a sane default for number of execution runs to preserve. Likewise the there is built in way to clean up logs. Again, Jenkins/Hudson had that always. I have jobs that run every 10 minutes and absolutely no need to keep months of history on that.

There are more things I could list. Also, there are quite a few competing products; this is a very crowded space. I've given serious thought to using Spring Batch or even just firing up a Jenkins. Frankly the only reason we chose airflow is that it's easier for data scientists who are mostly only comfortable with python. So far, I've been disappointed with how complex and flaky this setup is.

If you go down the path of using it, think hard about which operators you are going to use and why. IMHO dockerizing tasks means that most of what Airflow does is just ensuring your dockerized tasks run. Limiting what it does is a good thing. Just because you can doesn't mean you should in airflow. IMHO most of the operators naturally lead to your airflow installs being snow flakes.

Not dockerizing means you are mixing code and orchestration. Just like installing dependencies on CI servers is not a great idea is also the reason why doing the same on an airflow system is a bad idea.

rywalker · on May 29, 2020

1) An official Helm chart is coming very soon. We (Astronomer) have a commercial platform that aims to solve this completely including observability, configuration and deployment. Happy to team up to improve Airflow's open-source k8s story if you have some ideas.

2) Yes the UI is outdated, and not responsive. We're going to kick off a process to build a new modern UI in Q3 (a full-featured Swagger API is being built now, which the new UI will rely upon.)

3) I personally think catchup true is a fine default, but whatever. Generally when I launch a new DAG I want to generate some historical data using the DAG.

4) Airflow has tags now since 1.10.8 https://airflow.readthedocs.io/en/latest/howto/add-dag-tags.... - we decided not to do folders.

5) That's true, but it also provides a low bar to entry. There are some guides written on unit testing DAGs, but I agree we should be a test-first community. On my roadmap.

6) Celery w/ KEDA is pretty nice - check out https://www.astronomer.io/blog/the-keda-autoscaler/

7) I personally love the PythonOperator for simple DAGs. Agree that DockerOperator (KubernetesPodOperator if you're running Airflow in K8s)

8) Yes, logging UI will be improved in the UI rewrite. What's your favorite UI for this?

9) That feature is in the queue https://github.com/apache/airflow/issues/7911

opportune · on May 29, 2020

Other than 5+6 this seems like basically a spec for a managed airflow product. So basically run Airflow on public cloud, manage all the operational bits, create a better UI, and fix some upstream bugs.

rotten · on May 29, 2020

Which is something both Google Cloud Composer and Astronomer.io are trying to do.

blakeburch · on May 29, 2020

I couldn't agree more with you on most of these points. You may be interested in trying out Shipyard (www.shipyardapp.com). Fair disclosure, I'm the co-founder. While we don't address all of these issues, we're building with these key focuses.

- Simplicity is key. Data Teams should focus on creating solutions, not fighting infrastructure and limitations. - Workflows shouldn't change how code is written. Your code should run the same locally as on our platform, with no extra packages or proprietary setup files required. - Templates are a first-class object. The modern data pipeline should be built with repeatability in mind. - Data solutions should be usable and visible beyond the walls of technical teams.

We're in a private beta and rapidly trying to improve the product. I would love to chat more if you're interested. Details in profile.

For your specific problems:

1) We're cloud-native and handle hosting/scaling on our side. You don't have to worry about setup. Just log in and launch your code.

2) Our UI is pretty slick (built in AntD) and built to reduce the overwhelming options when setting jobs up.

3) If you want to run a job on-demand, just press "Run now". If you schedule a job, then change the schedule or status, we'll automatically add/update/remove the schedules. Other technical defaults aren't options right now because we're trying to abstract those choices away so Data Teams can just focus on building solutions that work.

4) We let you group your scripts into "Projects" (essentially folders) with a high level overview of the quantity of jobs, as well as how many recently failed or succeeded.

5) Workflows get made directly in the UI. This makes it easier for any less technical users to set up jobs on their own. We still have a lot of improvement to go in this area though.

6) Our product is written in Go (the language of the cloud). We don't force the worker to be in a language and managed by a process in that language. We manage at the process level.

7) Every job creates a new container on the fly, installing package dependencies that the user specifies. You can connect your scripts together without worrying about conflicting packages, conflicting language versions, or without needing to know how to make and manage Docker containers.

8) Not sure how we compare on the logging front. However, we separate out logs for each time a script runs so you're not having to search for a needle in a haystack. You can filter and search for specific logs in the UI.

zomglings · on May 29, 2020

How much does Shipyard cost? The website doesn't mention pricing at all and I am wary of spending time on anything like this without having a sense of how much it will cost me down the road.

blakeburch · on May 29, 2020

We're still working on defining the exact pricing, but it would be a usage-based monthly subscription model, since our core costs are related to infrastructure scaling.

The good news is that if you had to switch workflow tools, our product design makes your code easily portable (no proprietary config files, no packages to add to your code). You would just lose access to the scaled workflow and execution logic.

I'd love to discuss more with you to see what your team would see as reasonable. Happy to work something out.

zomglings · on May 29, 2020

Sure. My email is in my profile.

We are a 2-person startup and are looking to make a decision for our scheduling/orchestration infrastructure over the next month.

slt2021 · on May 29, 2020

couple questions re airflow from a guy coming from Informatica/SSIS world:

1. Does airflow have native (read high speed) connectors to destination databases (oracle, mysql, mssql) ?

2. How the typical ETL in Airflow compares to one in Informatica/SSIS in terms of speed of development, performance (throughput and latency), memory consumptipon? Is it the same speed, or slower due to using Python interpreter?

3. Is it easy or hard to use parallel transformations with processes/threads/async ? For example, ingest data from your source in 20 threads at once, as opposed to serial processing

oxfordmale · on May 29, 2020

1. Airflow uses the "default" connectors for destination databases, for example psycopg2 for Postgres. You can easily write your own hooks using whatever connector you fancy 2. It depends on your set up. I move most of the heavy lifting to SQL or Spark, so it is as performance as Informatica/SSIS 3. I have written multi threaded ETL processes using the PythonOperator. That basically starts off any Python script you want, allowing you the full flexibility of Python.

slt2021 · on May 29, 2020

can I ask a question about your #2. It seems you do ELT with spark/sql doing the T part after loading. Is your loading part high performance or do people even care whether it is fast or not? In my experience, when I extract and load data as is (for example into SQL Server) - it is kinda slow, because the columns have to be wide and generic, to accommodate all the crap that can come in. For example, I noticed that loading 1M rows into nvarchar(2048) is way slower, than into varchar(50). Let's say you have one column that usually does not exceed 50 chars, but sometimes it can be crap data and be 2000 chars. What is the best scenario to ELT it quickly?

What I found is that if data is high quality - then ELT is totally fine, often times it ends up being just EL without much T. But if the data is crap, and you have a lot of wide columns, then even loading it takes time, before we even get to processing stage. In this scenario ETL works much faster.

oxfordmale · on May 29, 2020

There is two approaches we follow. The first approach, which is quite slow, bulk extract the data to S3 and then run our transforms on top of that. If we need high performance, we write a delta streamer that only streams modified records.

aouyang2 · on May 29, 2020

Has anyone tried using airflow to build out an entirely new az/region in aws? Most common use cases have been for data pipelines, but how about deployments?

rywalker · on May 29, 2020

You could certainly execute a Terraform script from an Airflow DAG, but I wouldn't say it's a common use case.

unixhero · on June 1, 2020

It's a pretty recursive idea. Creator of worlds.

niyazpk · on May 29, 2020

We have a lot of spark applications that run on AWS EMR. Right now we use Oozie to create and coordinate the workflows. Any reason to switch to Airflow?

alexchantavy · on May 30, 2020

The article says a webserver is involved; are there other dependencies that I need to deploy as well?

memosstilvi · on May 29, 2020

How on earth did this post reach #1?

unixhero · on June 1, 2020

Airflow is pretty interesting

rb808 · on May 29, 2020

I've tried to use Airflow but was way more complicated than I expected. I just want to run a couple hundred jobs, why do I need a database? Surely a few files would suffice.

It seems a big gap in the market. I can't rely on cron as its a single point of failure. I have my own hardware so dont want to use AWS Batch or GCP Cloud Scheduler, any other ideas?

BiteCode_dev · on May 29, 2020

Wait, you only have a couble hundred jobs, but don't want a single point of failure, but think a database is too much, but talk about cloud hosting?

This all seems contradictory.

Personnaly, using Python, I go for Celery (www.celeryproject.org): it's a persistant daemon that can run tasks, provide queues and shedule work like cron .

A lot of people prefer Python-RQ, as it seems simpler, but the truth is you can start using celery with just the file system for storing tasks and result:

https://www.distributedpython.com/2018/07/03/simple-celery-s...

If your needs grow, you can plug it to redis, rabbit MQ and/or a database later.

It can expose an API so that other languages can talk to it and trigger tasks or retrieve results (but not write tasks, they must be in python).

ramraj07 · on May 29, 2020

Celery for scheduled jobs seem to not be a supported design pattern at all, and any job that starts to potentially come close to the 1 hour timeout seems to get annoying to work with in celery. It seems primarily designed to send emails in response to web requests, which is not the use case most people are discussing here.

BiteCode_dev · on May 29, 2020

I don't see how you came to this idea. The jobs can be as long as you want, you can have retry, persistant queues, priorities, and dependancies.

Of course, I would advice to put a dedicated queue for very long running tasks, and set worker_prefetch_multiplier to 1 as the doc recommand for long running tasks: https://docs.celeryproject.org/en/stable/userguide/optimizin...

With flowers (https://flower.readthedocs.io/en/latest/), you can even monitor the whole thing or deal with it manually.

I assume your comment is reporting on other comments, but not direct experience?

ramraj07 · on May 29, 2020

Direct experience very fresh in memory :)

The issue with long running tasks is that you have to change the timeout to longer than the default value of one hour (otherwise the scheduler assumes the job is lost and requeues it). But this is a global parameter across all queues so this means we essentially loose the one good feature of celery for small tasks which is retrying lost tasks within some acceptable timeframe.

Further flower seems weird - half the panels don't work when connecting through our servers; our vpc settings are a bit bespoke but not completely out there, so it's not fully useful. Also flower only keeps track of tasks queued after you start the dashboard (but then it accumulates a laundry list of dead workers across deployments if you keep it running continuosly).

We were also excited to use it's chaining and chord features but went into a series of bugs we couldn't dig ourselves out of when tasks crashed inside a chord (went into permanent loops). I just declared bankruptcy on these features and we implemented chaining ourselves.

Point is, I'm sure we got some parameters wrong, but I and another engineers spent WEEKS wrangling with celery to at least get it running somewhat acceptably. That seems a bit too much. We are not L10 Google engineers for sure but we aren't stupid either. The only stupid decision we made was probably choosing celery from what I can see.

In the end we still keep celery for the on demand async tasks that run in a few minutes. For scheduled tasks that run weekly, we just implemented our own scheduler (that runs in the background in our webservers in the same elastic beanstalk deployment) that uses regular rdbms backend and does things as we want. Turns out it's just a few hundred lines of simple python.

BiteCode_dev · on May 29, 2020

Fair enough and very honest.

> But this is a global parameter across all queues so this means we essentially loose the one good feature of celery for small tasks which is retrying lost tasks within some acceptable timeframe.

Oh, for this you just setup two celery deamon, each one with their own queues and config. I usually don't want my long running task on the same instance than the short ones anyway.

> We were also excited to use it's chaining and chord features but went into a series of bugs we couldn't dig ourselves out of when tasks crashed inside a chord (went into permanent loops). I just declared bankruptcy on these features and we implemented chaining ourselves.

Granted on that one, they not the best part of celery.

Just out of curiosity, which broken and result backend did you use for celery?

I mostly use Redis as I had plenty of problems with Rabbit MQ, and wonder if you didn't have those because of it.

ramraj07 · on May 29, 2020

Our use case was that the timescale of any of our tasks (depending how complex a query the user makes) can go from 1 minute to 45 minutes. We demoed a new task that occassionally went over the 1 hour mark. It's definitely annoying to have separate queues for these tasks but that might be what we need to do!

We use redis. FWIW within the narrow limits of the task properties it's remarkably stable,so no complaints on that!

sillycube · on May 29, 2020

Yes, I feel the same after spending several weeks to deal with celery, redis, docker compose config, flower, set up celery workflow, rate limit, worker, etc. Testing for a really long time...

When there is a workflow to play with chord & chain, it became unintuitive to find out the issue. I was stuck in an issue and finally I posted on SO to ask for help. Luckily I got an answer

I thought it's my problem due to no experience in scheduling stuffs. I hope there is something which is simpler

slt2021 · on May 29, 2020

Windows Task Scheduler, it is way more powerful and robust than many people think

rb808 · on May 29, 2020

Agreed its very good, but has same problem as Cron with a single point of failure. I want to add I can handle a single point of failure if all job definitions are in git or a directory of flat files.

ramraj07 · on May 29, 2020

Someone else who has the same problem!

Looks like airflow, celery and every other workflow orchestrator doesn't want to deal with it and just asks you to deal with it using shit like k8s.

I decided to write a simple scheduler that runs in a separate thread in the background of our webapp, so that it can be parceled into our existing elastic beanstalk deployment.

I use two database tables to pick up tasks and run them, and have some amount of failover. Just need to be a bit careful around deadlocks, but thats a cakewalk compared to the dumpster fire that is configuring these Babylon tower frameworks.

If you're interested I can write up some sample scheduler code and publish it which shouldn't be more than a few hundred lines and do what we're looking for.

slt2021 · on May 29, 2020

i dont really understand concept of SPoF. My windows servers have never really failed me. In case machine reboots in the middle of your batch - Task scheduler can restart your job. In case the job fails, it can retry after a certain interval.

you can configure your job to retry unless it returns success - and unless someone nukes your windows machine - it will get executed. if you are afraid that your windows machine will get nuked - then you can use SQL Server Agent on High Availability Cluster - and it will do the job.

if you want your jobs as a code - you can code them in Powershell and store in git/azure devops repo. You can deploy your jobs with the same powershell, or even use a CI/CD pipeline to do that.

ramraj07 · on May 29, 2020

It's probably no more safe to assume a single server will never fail when we're talking about AWS or GCP.

Honestly maybe we can, given I see ec2 instances running for four years without even a restart, but it's still not in the general dev philosophy of cloud at the least.

ibains · on May 29, 2020

Has anyone used Enterprise schedulers - ControlM, Atom and know how it compares to Airflow?

frankdilo · on May 30, 2020

How is this different from Celery?

fxtentacle · on May 29, 2020

I can't figure out if this is satire or not.

I believe that says a lot about open source projects released in recent years.

fmakunbound · on May 29, 2020

From out of nowhere it jumps into a list of problems with cron being cron. I kind of see your point.

fxtentacle · on May 29, 2020

I think looking at the discussions here is even more surreal.

Airflow, Prefect, Dagster, Kedro,... it appears there are now a lot of tools that I never heard of and never needed, despite me doing exactly what all of them try to solve with Hadoop and MapReduce.

diehunde · on May 29, 2020

Hadoop? MapReduce? Never heard of those and sound crazy. I solve my data problems just using unix tools.

mattmcknight · on May 29, 2020

Airflow is analogous to Oozie in the Hadoop ecosystem.

fxtentacle · on May 31, 2020

Thanks for clarifying :)

EdwardDiego · on May 29, 2020

I introduced Airflow to replace a precarious collection of cronned scripts in our ETL workflows and have never looked back.

The ability to rerun a failed DAG from a given task was ideal for the situation we found ourselves in at the time with a dependency on a Hadoop cluster that was starting to fail.