Airflow in Practice: How I Learned to Stop Worrying and Love DAGs

Sarah Schattschneider: Hi everyone, today I'm going to be talking about Airflow in practice. My name is Sarah Schattschneider, and I'm a software engineer at Blue Apron on our data operations team. We're in charge of our data pipeline, which includes ingesting data and events into our data warehouse. Some fun facts about me: my Airflow interests are figuring out cool ways to use it beyond just ETL processes, though a lot of my work does include that. We've used it for BigQuery permissioning, so we can programmatically set up our permissions. I also like testing within Airflow. I've been working with it for about two years, and I'm hoping to contribute more to open source shortly.

Today I'm going to go over the background of some core concepts, then go into Airflow itself, then talk about evaluating Airflow and its challenges, and finally cover common use cases and best practices.

Now this is when you get to raise your hands, because I want to know who's in the crowd. Please raise your hand if you use Airflow on a regular basis. A few of you, awesome. And who's new to Airflow? Hello, friendly new faces. For those of you who are advanced and know most of the core concepts, bear with me until the end; I get into some more in-depth topics then.

So let's dive into the background of what Airflow is and why you should care. Some of you may recognize this as a crontab, depending on when you entered the industry. In a crontab you determine when a job runs with the stars and numbers on the left-hand side, and on the right-hand side you have what is going to run at that time, which can be a script or a single command (for example, "*/30 * * * * /path/to/snapshot.sh" would run that script every 30 minutes). For those of you who are unfamiliar with cron, one of its downsides is that it has very little alerting; it's really on the user. I heard a war story from Blue Apron from before I joined (we've since transitioned everything): we had a snapshot process, and the only way to know it was working was that one engineer got an email every half hour saying it was working. To figure out it had broken, you needed that one engineer to notice that those emails had stopped. Not the best way to handle your alerting. Airflow is built on top of cron scheduling, which we still use and which is fantastic, but it comes with a lot of helpful things off the bat, such as monitoring, alerting, and a very rich UI and CLI.
An important concept to understand is the directed acyclic graph, or DAG. On the slide you can see that the graph is made up of nodes and directed edges, and there is only one way to flow through it: we start at A and we end at J.

But come on, we need a better example than that, so let's go to a foodie metaphor: chocolate cake, because who doesn't like chocolate cake? When you're making your cake, you get your bowl, you get your ingredients, you put all the ingredients in the bowl, and once they're all in, you mix the batter, and so forth. When we get to putting the batter in the cake pan, we want to make sure our assumptions are met, and that is enforced through the dependency graph. We will never run "put cake batter in cake pan" until the batter is properly mixed and the cake pan is properly set up. We don't want to pour batter into an ungreased pan, because then you're going to have a very sad cake at the end. Not pretty.

Now that we know the basics of cron and DAGs, let's get into the specifics of Airflow. What is Airflow, and as a Python developer, why would you be interested in it? First off, it was open sourced by Airbnb in 2015. It's a way to author workflows: your DAGs are made up of different tasks, which are the nodes, and they're scheduled to run at specific times, and all of this is written in Python. We also get stateful scheduling within Airflow, which means the "put batter in the cake pan" step only runs once its dependencies have been met, so you can sleep easy at night.

Airflow comes with a really rich CLI and UI, which is very helpful for understanding what is currently happening with your different workflows. We also get logging, monitoring, and alerting right off the bat, so you're able to troubleshoot issues when they arise. And because the structure is a DAG made up of tasks, everything stays very modular, which increases the testability and maintainability of these workflows. Overall, Airflow solves many common problems with batch processing, but it can also be used in other circumstances.

Now that we know the fundamentals, the next step is to talk about code. Some of you may not see this as totally Pythonic; there is some boilerplate at the DAG level. At the top of the example we have a very simple hello-world Python function. Then we have the default arguments we are declaring: when this process should start running, the owner, the number of retries, and a few other things. Next we actually initialize the DAG: this is where you give the DAG a name, pass in the default arguments, and say when to run; again, we see those numbers and stars, and we'll get into that a little later. Below the DAG initialization we see what looks like a docstring but in a different format; some of you may recognize it as Markdown. You can use Markdown for your documentation at both the DAG level and the task level, which is very helpful for other programmers trying to understand what is happening in your workflow. Then we're into the actual tasks. We have two here, start and hello. Start is a DummyOperator; these are really useful when you have a larger DAG. It isn't necessary in this case, but I wanted to introduce the concept. Then we have a PythonOperator, whose python_callable is the hello-world function from the top. And on the very last line we have start, the bitshift operator, then hello, which defines the dependency structure for the DAG. A sketch along these lines follows.
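The slide code itself isn't reproduced in the transcript, so here is a minimal sketch of a DAG along the lines described above, assuming an Airflow 1.x-era API; the DAG name, owner, and schedule are illustrative rather than taken from the talk.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def hello_world():
    # The simple Python function the task will call.
    print("Hello, world!")


default_args = {
    "owner": "data-ops",              # illustrative owner
    "start_date": datetime(2018, 1, 1),
    "retries": 1,
}

dag = DAG(
    dag_id="hello_world_example",     # illustrative DAG name
    default_args=default_args,
    schedule_interval="0 4 * * 1",    # cron-style: Mondays at 04:00 UTC
)

# Markdown documentation, rendered in the Airflow UI at the DAG level.
dag.doc_md = """
### Hello world example
A minimal DAG: a DummyOperator feeding a PythonOperator.
"""

start = DummyOperator(task_id="start", dag=dag)

hello = PythonOperator(
    task_id="hello",
    python_callable=hello_world,
    dag=dag,
)

# The bitshift operator defines the dependency: start runs before hello.
start >> hello
```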
So what does this look like in the UI? It looks pretty nice. At the bottom left of the screen you see two blocks, start pointing to hello; that's your DAG and the dependency structure built from the code we just saw. Here we can also see some really interesting things about what's happening: the different colors around each task show whether it's running, successful, failed, up for retry, and so forth. On the upper right-hand side we have the schedule interval, so right away we know how frequently this runs, and across the top we have tabs with different metrics and insights into what's going on.

I want to highlight two of those views for those who are new to Airflow. On the left-hand side we have the Gantt chart, which shows what's currently running and what has succeeded: light green means running, dark green means successful, so you can see when things finish. On the right-hand side we have the task duration graph. These two are interesting because, if there were a failure in your system, you could use them to give an ETA to your downstream consumers: see what's still running, look at how long it has historically taken, do some math, and get an estimate to them.

Now let's move into more in-depth Airflow concepts: hooks, operators, and tasks. Again, all of this is written in Python. Hooks encapsulate all interactions with a particular system; they can wrap any third-party Python API, REST calls, and things like that. An operator instantiates one or two hooks to complete one specific goal. With your DAG you want each task to complete only one goal; if you have a whole workflow crammed into a single PythonOperator, that's not good, and you should take another look at the code. And finally we have a task, which initializes an operator and gives it the specific variables it needs to run.

Going back to our cake example, our hooks would be a cake pan and an oven. The cake pan hook has functions such as getting the pan size, putting batter in it, and greasing it, because we need to make sure it's prepared. Then we have the oven hook: we can open the door, close the door, preheat it, things like that. The "put cake pan in oven" operator initializes these two hooks and runs all of the functions necessary to complete its goal: preheat the oven, open the oven door, put the cake pan in, close the oven door. And finally we have the task, which initializes the operator with specifics: use Sarah's oven, use the round cake pan, and use these connection IDs so we know where we're authenticating. To bring this back to the DAG we saw at the beginning, this would be the last task in that DAG.

Real quick, let's give this a real example that a data engineer might see. This one is on the Google Cloud Platform, but it applies just as easily to AWS and other platforms. You have two hooks: one for Google Cloud Storage, which is similar to AWS S3, and one for Google BigQuery. These hooks already exist, and they use the Python client APIs to interact with each system. The operator uses the GCS hook to get the file and then ingests it into BigQuery. And finally we have our task, where we specify the source path, the destination table, and which connections to use, so we properly authenticate with each system. As I mentioned, a lot of hooks and operators already exist in the Airflow ecosystem, so you might recognize parts of your tech stack there already and be able to turn them into workflows easily. Among the many operators, the one I'd point out is the PythonOperator: you can run any Python function within the Airflow ecosystem, so you get that logging, monitoring, and alerting on failures to understand more of what's happening. A small sketch of what a custom hook-and-operator pairing can look like follows.
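None of this code appears in the transcript, so here is a minimal, purely hypothetical sketch of the hook/operator/task split using the cake metaphor, assuming the Airflow 1.x base classes; OvenHook, CakePanHook, and all of their methods are invented for illustration.

```python
from airflow.hooks.base_hook import BaseHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class OvenHook(BaseHook):
    """Hypothetical hook encapsulating all interaction with an oven."""

    def __init__(self, oven_conn_id):
        self.oven_conn_id = oven_conn_id  # where we authenticate

    def preheat(self, temperature_f):
        self.log.info("Preheating oven %s to %s F", self.oven_conn_id, temperature_f)

    def open_door(self):
        self.log.info("Opening oven door")

    def close_door(self):
        self.log.info("Closing oven door")


class CakePanHook(BaseHook):
    """Hypothetical hook encapsulating all interaction with a cake pan."""

    def __init__(self, pan_conn_id):
        self.pan_conn_id = pan_conn_id

    def grease(self):
        self.log.info("Greasing pan %s", self.pan_conn_id)


class PutCakePanInOvenOperator(BaseOperator):
    """Hypothetical operator: instantiates the two hooks to complete one goal."""

    @apply_defaults
    def __init__(self, oven_conn_id, pan_conn_id, temperature_f=350, *args, **kwargs):
        super(PutCakePanInOvenOperator, self).__init__(*args, **kwargs)
        self.oven_conn_id = oven_conn_id
        self.pan_conn_id = pan_conn_id
        self.temperature_f = temperature_f

    def execute(self, context):
        oven = OvenHook(self.oven_conn_id)
        pan = CakePanHook(self.pan_conn_id)
        pan.grease()
        oven.preheat(self.temperature_f)
        oven.open_door()
        self.log.info("Putting pan %s into oven %s", self.pan_conn_id, self.oven_conn_id)
        oven.close_door()


# The task level: inside a DAG file, the operator is initialized with specifics.
# put_in_oven = PutCakePanInOvenOperator(
#     task_id="put_cake_pan_in_oven",
#     oven_conn_id="sarahs_oven",
#     pan_conn_id="round_cake_pan",
#     dag=dag,
# )
```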
Now that we understand Airflow and its core concepts, let's evaluate it. What value does Airflow add? First, Airflow is able to elegantly retry on errors. That might not sound so interesting, but if you have a workflow that calls any other service, which I'm assuming you will, and that service returns a transient 503 error at 3:00 a.m., we don't want to be paged, right? We want to keep sleeping. With Airflow you can define how many times to retry a task, and hopefully that gets you past the transient error. You can also alert on failures within Airflow, through an email operator, a Slack operator, and many others. You can rerun a specific part of a DAG: say you have a DAG with 20 tasks and one fails, with three tasks downstream of it. Looking at the DAG you see four tasks marked as failed, three of them only because of the upstream failure. We only want to rerun those four, so you clear the failed task together with its downstream tasks, and just those four run again. That saves time, because instead of rerunning the whole process you rerun only the part that failed. And finally, Airflow also supports distributed execution, so multiple machines can contribute to the same jobs as workers. Overall, Airflow makes your ETL jobs a lot more fun to develop, and more readable and maintainable. A sketch of the retry and alerting settings follows.
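The talk doesn't show the configuration for this, but as a hedged sketch of the retry and alerting behavior described above (the specific values and the email address are illustrative), per-DAG default arguments might look like this:

```python
from datetime import datetime, timedelta

default_args = {
    "owner": "data-ops",
    "start_date": datetime(2018, 1, 1),
    # Retry transient failures (e.g. a 503 from a downstream service)
    # a few times before waking anyone up.
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    # Alert on failure; a Slack operator or callback could be used instead.
    "email": ["data-ops-alerts@example.com"],   # illustrative address
    "email_on_failure": True,
    "email_on_retry": False,
}
```

These would be passed to the DAG constructor as default_args, as in the earlier hello-world sketch, and apply to every task in that DAG unless a task overrides them.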

Now let's dive into Airflow development and how you would get this set up for yourself. At Blue Apron our Airflow instance is deployed in a Docker container in a Kubernetes cluster, which makes it just as easy to run as any other service at Blue Apron, so we haven't had any issues there. To add the Airflow connections we have a pair of scripts: a bash script copies the credentials, either JSON files or environment variables, into the Docker container, and a Python script then creates the Airflow connections and adds them to the Airflow database. I recently put out a blog post on this, so you can look into it in more detail later.

In terms of how easy it is to get started: because Airflow has such an active community, so many third parties are already integrated with hooks and operators that a lot of what you're interested in doing is already possible, which is great. Another advantage is that continuing to use cron scheduling gives you a lot of flexibility over when tasks run. In the first example here the job runs Mondays at 4:00 a.m. (a cron expression like 0 4 * * 1), and in the second it runs on the third day of every month at 10:05 (5 10 3 * *). That flexibility over day of week and day of month is helpful for different business processes. With cron scheduling we're also timezone safe, because Airflow uses UTC.

Now, is there a bad side to Airflow? One challenge we found was upgrades, largely because we were an early adopter and created a lot of custom hooks and operators. There are a few ways to mitigate this pain. We're trying to contribute more regularly upstream, so our code becomes part of the package and upgrading is no problem. We also have a staging environment DAG with realistic data running through it, to catch any of our operators that will no longer work after an upgrade and figure out what we need to fix. And finally, better unit testing: for any custom hook or operator we create, we want a unit test that pins down the expected behavior.

Another concept that was challenging at the beginning, but that we figured out, is data sharing. There are two Airflow-specific ways of sharing data. The first is XComs, which is cross-communication between tasks: if one task returns a value, another task can access it, which keeps each node in your DAG focused on one goal. The second is Airflow Variables, which are updated through the UI and stored as entries in the Airflow metadata database. And since this is all written in Python, we still have plain Python variables, which we use for working memory, and environment variables, which are used for passing secrets to Airflow and so forth. Looking at this from a traceability standpoint, we wanted to know when these different pieces of data get updated. Where traceability matters, we use XComs and Python variables, because that usage is checked into our code base; where we care less about traceability and more about ease of updating, we use an Airflow Variable or an environment variable. And depending on your setup, environment variables might be the more attractive option. A small XCom sketch follows.
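Here is a minimal sketch of the XCom pattern described above, assuming the Airflow 1.x PythonOperator API; the DAG id, task ids, and the file-count example are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_example",           # illustrative name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)


def extract(**context):
    # The return value is pushed to XCom automatically
    # (under the key "return_value").
    return 42                        # e.g. a count of files found


def report(**context):
    # Pull the value pushed by the upstream task.
    count = context["ti"].xcom_pull(task_ids="extract")
    print("Upstream task found %s files" % count)


extract_task = PythonOperator(
    task_id="extract",
    python_callable=extract,
    provide_context=True,
    dag=dag,
)

report_task = PythonOperator(
    task_id="report",
    python_callable=report,
    provide_context=True,
    dag=dag,
)

extract_task >> report_task
```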
Let's go into some common, and fun, use cases. The main use case for Airflow is extract-transform-load jobs, because it's very easy to move data between different sources and get it where it needs to go. You can create custom hooks and operators for third-party APIs if they don't already exist, which has been very easy to do. As one example of an extract-transform-load job: say your company sends out regular surveys to customers to get a Net Promoter Score, and you want to ingest that data into your data warehouse so teams can track the score. You would use a hook inside an operator to access that data, and initialize the operator in the task; a sketch of this shape follows.
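The talk doesn't show code for this use case, so here is a hypothetical sketch that assumes the survey responses have already been exported as files to Google Cloud Storage, tying back to the GCS-to-BigQuery example earlier; the bucket, paths, table, and schedule are made up, and the contrib operator's exact module path and parameter names vary across Airflow 1.x releases.

```python
from datetime import datetime

from airflow import DAG
# Contrib operator that pairs a GCS hook with a BigQuery hook.
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG(
    dag_id="nps_survey_to_warehouse",        # illustrative name
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 6 * * *",           # illustrative: daily at 06:00 UTC
)

load_survey_responses = GoogleCloudStorageToBigQueryOperator(
    task_id="load_survey_responses",
    bucket="example-survey-exports",         # hypothetical bucket
    source_objects=["nps/{{ ds }}.json"],    # hypothetical source path
    destination_project_dataset_table="analytics.nps_responses",  # hypothetical table
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_APPEND",
    google_cloud_storage_conn_id="google_cloud_default",  # which connections to use
    bigquery_conn_id="bigquery_default",
    dag=dag,
)
```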

Another cool feature is the Slack integration. It lets you tell downstream consumers when a task has finished, when there's been an error, and things like that, so people stay informed about what's happening, which is great. One place we use this at Blue Apron is our BigQuery permissioning DAG: the start operator sends a Slack message to the downstream team so they know they're about to see some intermittent issues with their BigQuery permissions, and at the end of the process we send another message letting them know they can resume their day-to-day operations and BigQuery should be all set. Otherwise that DAG uses the BigQuery hooks and operators to create the datasets, create the date-partitioned tables and views, and finally add the permissions to those datasets programmatically. Another use case we've found very helpful at Blue Apron is that our ML squad uses Airflow for feature extraction: one DAG exports the data they're interested in, and another uses it in Spark to either train models or make predictions.

Finally, let's get into some best practices. I said at the beginning that I'm interested in testing, so we're going to talk about testing. It's important to have unit tests for all of your supporting Python functions: if you're doing ETL jobs, you want to make sure your transformation behaves as you expect, and a unit test ensures that the expected behavior is met now and in the future if someone goes to update it. You can also write unit tests for custom hooks and operators, to make sure you're interacting with those APIs as expected. And finally, something we've found helpful is an acceptance test that runs an Airflow command to check that all of your DAGs compile properly. So if you've added a new DAG along with a new connection ID for the new workflow, but you haven't added that connection to staging or production yet, you get a failure here and know you need to update something before pushing to production. A sketch of one common way to write that check follows.
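The talk doesn't specify exactly how that acceptance test is implemented; one common pattern, assuming pytest and the Airflow 1.x DagBag API, looks roughly like this (the dags folder path is illustrative):

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Parse every DAG file in the project, without loading Airflow's examples.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_all_dags_compile(dag_bag):
    # import_errors maps file paths to the exception raised while parsing them,
    # so a syntax error, missing import, or similar shows up here.
    assert dag_bag.import_errors == {}


def test_dags_are_not_empty(dag_bag):
    # A sanity check that the folder actually produced DAG objects.
    assert len(dag_bag.dags) > 0
```

Running something like this in CI before deploying catches DAG files that fail to import, which is the "do all my DAGs compile" check described above.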
The final topic I want to talk about is documentation in Airflow. It's amazing to be able to use the Markdown templates in the DAG. At Blue Apron we've come up with six questions we like to answer for every workflow: a quick summary of what the workflow is doing; whether it's mission critical, which helps us decide whether to alert in the middle of the night or whether it can wait until the next day; what to do on failure, so if someone is on call and there's a failure, they know what needs to be looked at; the points of contact; the business use cases; and any other documentation. This lives right in your code base, directly under the DAG initialization, so as things change over time we know if there are different points of contact, or if the data now needs to move from one place to another, and so forth. It then gets rendered in the UI, so when the on-call person comes to look at the DAG they see it right away: they don't need to open up the code base, they see the error, and they go directly to the link.
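As a minimal sketch of what that documentation looks like in code, here are the six questions attached via the DAG's doc_md attribute; the DAG itself and all of the answers are placeholders, not Blue Apron's actual content.

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="nps_survey_to_warehouse",      # illustrative, as in the earlier sketch
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 6 * * *",
)

# Rendered by the Airflow UI on the DAG's detail page.
dag.doc_md = """
### Summary
Loads survey responses into the data warehouse (placeholder description).

### Mission critical?
No: failures can wait until the next business day (placeholder).

### What to do on failure
Clear the failed task and its downstream tasks; check the provider's status page (placeholder).

### Points of contact
Data operations team (placeholder).

### Business use cases
NPS reporting for product teams (placeholder).

### Other documentation
Link to the runbook or blog post (placeholder).
"""

# The same attribute exists at the task level:
# some_task.doc_md = "Short note about what this specific task does."
```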

So in conclusion, I hope you now see that Airflow is a fun way to programmatically schedule your workflows: it's a great way to create custom hooks and operators, and it's fun to develop with. Thank you very much. [Applause] I do want to give a quick shout-out to all of the organizers who made this possible. Having the ASL interpreters has been an amazing opportunity at this conference, and the live captioners too, so thank you very much. [Applause] I think we're out of time for questions here, so I'll take them in the hallway; also, we're hiring, so if this sounded interesting to you, you can join our team. Thanks!