TensorFlow 2.0: Transitioning to production – Kirkland ML Summit ‘19

ANDREW FERLITSCH: Hello, I’m Andrew Ferlitsch. I’m part of Google Cloud Developer Relations. If you’re not familiar with Developer Relations, we’re the organization that does outreach to developers, whether that’s a developer community or developers within our client companies. We’re sort of the two-way interface between the developers and the Google product people.

Today I’m going to be talking about transitioning into production, and part of that intro is about infrastructure for production. Let’s start with a little timeline of why I’m bringing up transitioning into production. In 2017, most of my interactions with people were about planning. They hadn’t really started ML yet. They were trying to figure out: how are AI and ML going to fit into our business? What kind of skills do we need? How do we organize a data science team? Then I found that in 2018, most of the people I was interacting with were in what I call exploration. They were experimenting with models, building proofs of concept, MVPs, and the business people were trying to figure out the value proposition, how it’s going to fit in, and what the projected ROI is. All of those models are like the undeployed models Hannes referred to earlier. Back then, every data scientist’s laptop was like the Island of Undeployed Models, a really, really sad place. But today, I’m finding more and more of our clients and people in the developer communities are ready to take those models into a real production environment. So we’re in this transition phase, and that’s what I’m going to talk about.

Part of being in production is having, like anything, the right things. One of them is having the right framework, a production-ready AI/ML framework, and that’s what TF 2.0 provides. I always want to make this little note here: it’s not to say that
TF 1.x didn’t do production; it’s that TF 2.0 was designed for production. If you think about the earlier versions of TF, back in 2016 and ’17, people were doing planning, so it was meeting those people’s needs. In 2018, people were doing development and prototyping; it was meeting those needs. Now that we’re in production, you again see a revamp in TF 2.0. But I’m not going to talk about the new features in TF 2.0, because that’s already well covered by the TensorFlow team.

Another important item is, of course, your pipelines: production-ready pipelines for training, versioning, and serving your models. That’s what TFX is. If you’re not familiar with the history of TFX, it’s what Google used internally across all its own business organizations to serve its own models for six years before it was even open sourced. It was already a very well-developed production system for pipelines.

And then finally, what we’re finding is you need a platform, and your platform is your managed services. This is particularly important to the DevOps people, but also the product owners. It handles your retraining, your versioning, the deployment of the models, the distribution, and everything. Here at Google, our AI platform is the GCP AI Platform, so just a quick little plug there: if you haven’t looked at it, please do.

This depicts what a modern production flow looks like. At your company, you’re going to have some kind of data repository; this is where all your data is. You’re going to have data engineers who come up with methods, identify what data is relevant and what data you’re going to use, and come up with data distribution strategies; that is, getting the data from the repository and feeding the right data to model training at the right time. Then we have model training. We’ve long gone past the old days of being on a laptop and serially training one model at a time. You’re either using large hardware, or you’re on cloud, and you’re training multiple models in parallel.

You might even be training multiple instances of the same model in parallel. The next step is, as we train the models, we go through a validation step. We’re going to decide whether this new version of the model is better than the old version. If it’s not, we obviously go back and continue to train. And we’re going to have some kind of trained-model repo so that when we decide something is better and version it, we’ve got a way to store it. Now it’s just like source code: version one of the source code, version two of the source code, except now it’s the model. And that’s going to have metadata that we use to evaluate whether this is a better version.

Assuming it’s a better version, we’re going to deploy that model. And even once it’s deployed, we’re still not done, because what we think in-house, that one model is better than the other, we don’t really know until it’s actually out in the wild being used by people. So there’s always some form of checking; we usually call it A/B testing. Some number of people get the previous version, some number get the new version, and the company has some metric to measure which one is performing better. That’s our A/B testing. From that, insights are collected, and those insights lead to additional data being collected, labeled, and added to the data repo. And this cycle just continuously goes.

So I’m going to talk about a few things that have a lot to do with TF 2.0. They’re more recommendations than features, about what’s good for production. One I really like is the idea of moving data pre-processing into the graph, so I’m going to be highlighting things like tf.Transform, the tf.function decorator, and subclassing. Prior to this recommendation in TF 2.0, we generally did our data pre-processing upstream from the model. If you’re not familiar with data pre-processing, it would be like
your feature engineering, your normalization for an image, things like that. It’s really asking: how do I take my data from its native format and get it into a machine-learning-ready format that can be fed to the model during training? The problem is, when you did it upstream, it happened on the CPU, while the training happens on the GPU. You have to be able to process your data fast enough on the CPU that you’re not becoming the bottleneck and starving the GPUs. Of course, we’ve had a lot of great things in TensorFlow, like tf.data. We built in prefetch, assuming you had a multi-core CPU, so that you’re actually using multiple cores to get the next batch pre-processed. But still, as you scaled up on a single instance and added more GPUs, sooner or later you would end up starving your GPUs.

So what they came up with is the idea of additional graph operations that support transformations of your data, which can then be added to your model. In the old way, where we did it on the CPU, we wrote what I call the data distribution as a separate input function; you all remember writing the model, then the input function. Well, we don’t write the input function as part of the data distribution anymore; that’s why I show the icon smaller. Instead, it’s now part of the model, so it’s going to run on the GPU.

If we look at the even bigger picture: in the old practice, when we had that separate input function, we also had to re-implement it on the deployment side. That is, when you fed something in for a prediction, it had to be pre-processed in exactly the same way as for training, so we had this duplication. In the new version, by building this right into the model, it’s already there in the model when we deploy it, and we don’t have to re-implement it in our production pipeline. So all of that pre-processing is transformed to graph operations. And there’s a program built into TensorFlow 2.0 called AutoGraph. It’s a compiler, and it does several things. It knows how to optimize graph operations to make your graph

run as fast as possible. But it also knows about a lot of other things that can be converted into graph operations, which we’ll get into in a little bit. By doing so, our data pre-processing is now happening on the GPUs, so they’re not I/O-bottlenecked waiting for upstream pre-processing, and we don’t have to re-implement it on the deployment side.

Another thing I really like about the way they’ve implemented tf.keras and subclassing is that we can now do what are called pre-stems. That is, we can add layers in front of existing models without rebuilding or retraining them, and we can implement this through a pre-stem. Conceptually, today’s modern architecture, and I’m just going to use a convolutional neural network architecture, looks effectively like this. Ever since about 2015, a residual network has followed this general architecture: a standard stem convolution group, which takes the initial images coming in and builds the first feature maps; then a sequence of convolutional groups; and finally a classifier. That’s our macro-architecture. How these are implemented inside is what we call the micro-architecture. What we’re finding today is that some modern stems now have a pre-stem, and that’s what we’re going to talk about.

So let’s start with what subclassing is. Keras, or tf.keras, is built on OOP, object-oriented programming, so it supports everything from polymorphism to abstraction to inheritance. We can actually use inheritance: the underlying structure for layers is the Layer object in Keras, so we’re just going to subclass it and make a new one. The three important built-in methods we have to redefine are these: init, build, and call. Init is just your instantiation. Build is what happens when you call compile and it assembles your model; for every layer, it’s going to call the build method. And when you’re actually executing your
model, feeding data through it, it’s the call method that’s called. OK, I’ve kind of already covered that.

So let’s build one. I’m going to create a new layer called Normalize, and I’m assuming this is for image pre-processing. We know our image data is coming in as raw pixel data: 8-bit integers between 0 and 255. I want to convert it into floating point and divide by 255, so every value is between 0 and 1. Real simple. So I can easily build a layer here. I call it Normalize, and I inherit from Layer. A few things I know: this is a straightforward algorithm with no learnable parameters, therefore I have no kernel. That’s why I set kernel equal to None; that tells the compile process not to do backward propagation and try to update the values in here, because there are no trainable ones. Then in the call function is where I implement it: this is the input tensor coming in, and I just divide every element of my input tensor by 255. Then I add this decorator, and the decorator tells AutoGraph to change this into graph operations so they get added to the graph. None of this will occur on the CPU. I’ve kind of already covered that.

So let’s look at how I might use it. Think of this, minus this one new line, as a very simple MNIST model. I’m going to use the functional API here. So first I declare my inputs, right?
Then in my first layer, I’m going to have to flatten them. After flattening, there are one or more hidden dense layers, and finally an output dense layer for classifying the 10 digits, and then we pull it all together. Well, in between, I’ve just added this new layer: I go from input to Normalize, then Flatten. So now I can feed this model the original MNIST images, straight as-is; I don’t have to do any pre-processing. You can do this, as I mentioned, with a pre-stem.
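Put together, the Normalize layer and the simple MNIST model just described might look like this sketch (the layer name, hidden-layer size, and optimizer are illustrative assumptions, not verbatim from the talk):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class Normalize(layers.Layer):
    """Scales 8-bit pixel values in [0, 255] into floats in [0, 1]."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # No learnable parameters, so there is no kernel to update
        # during backward propagation.
        self.kernel = None

    @tf.function
    def call(self, inputs):
        # The decorator asks AutoGraph to turn this into graph ops,
        # so the division runs in the graph, not upstream on the CPU.
        return tf.cast(inputs, tf.float32) / 255.0

# Functional-API MNIST model with the pre-processing built in.
inputs = layers.Input(shape=(28, 28))
x = Normalize()(inputs)        # input -> normalize -> flatten
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

With this in place, the raw `uint8` MNIST images can be fed to `model.fit` directly, with no upstream normalization step.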

Here’s the simplest way to add a pre-stem. Think of it this way: if the output of your pre-stem exactly matches your model’s input, you can use the sequential API to just attach them together. So I have some existing model; let’s assume it’s that MNIST model. I’m going to make what I call a wrapper model around my existing model, which can be trained or untrained. As the first layer, I put in the Normalize one, and then I just add my existing model to that. I’ve now attached a pre-stem to the stem entry point of my existing model. I’ve got a wrapper model; go ahead and compile it, and we can use it as-is.

So that’s moving data pre-processing. But once you’re moving data pre-processing, one of the goals in TF 2.0 is to get people thinking: what else can I move into the graph? How can I make my graph even more of the application? The one I like is to start thinking about data augmentation when you’re training. Why not build the data augmentation right into the graph instead of making it an upstream process?
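The sequential-API wrapper just described might be sketched like this (the existing model here is a stand-in, and the explicit `InputLayer` is an implementation assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class Normalize(layers.Layer):
    """Pre-stem: scales [0, 255] pixels into [0, 1] floats."""
    @tf.function
    def call(self, inputs):
        return tf.cast(inputs, tf.float32) / 255.0

# Stand-in for "some existing model" -- it can be trained or untrained,
# as long as its input matches the pre-stem's output shape.
existing_model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(10, activation='softmax'),
])

# Wrapper model: the pre-stem first, then the whole existing model
# attached at its stem entry point as a single "layer".
wrapper = models.Sequential([
    layers.InputLayer(input_shape=(28, 28)),
    Normalize(),
    existing_model,
])
wrapper.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

Because a Keras model can itself be used as a layer, the attachment is just one `Sequential` with two entries.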
One of the benefits of getting it into the graph is that I now have consistency in the data augmentation when I’m retraining. I know for certain that when I retrain the model, it’s going to be retrained with the exact same data augmentation as before. If I’m going to reuse some pipeline, it eliminates my need to re-implement the data augmentation on that new pipeline, because in effect the data augmentation is now built into the graph. If I’m going to do transfer learning, the data augmentation comes with it. And by implementing it as a pre-stem, which I’ll show you in a moment, I now have a kind of plug-and-play augmentation pipeline: I can build these little data-augmentation mini-graphs and start putting them on different models, reusing them.

In this case, I want to subclass a new layer called Mirror. That just says: when I get an image, I want to make a mirror image of it, a left-right flip. Of course, I don’t want to do it on every one. I’m not going into the code in too much detail, but I use a simple toggle here: every other input, you see, I set a flag and then toggle it. So it will flip every other batch it sees: the first batch won’t be flipped, the next batch flipped, the next one not flipped, and so forth. The key thing is, I only want this layer to be active during training, right?
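A minimal sketch of a Mirror layer like the one described, assuming the toggle is held in a non-trainable `tf.Variable` and the training/inference switch is the Keras backend’s `in_train_phase` function:

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import backend as K

class Mirror(layers.Layer):
    """Flips every other batch left-right, and only during training."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Toggle state; starts True so the FIRST training batch is not
        # flipped, the second is, and so on.
        self.flip = tf.Variable(True, trainable=False)

    def call(self, inputs, training=None):
        def augmented():
            # Toggle on each training batch: skip, flip, skip, flip, ...
            self.flip.assign(tf.logical_not(self.flip))
            return tf.cond(self.flip,
                           lambda: tf.image.flip_left_right(inputs),
                           lambda: inputs)
        # Run the augmentation only in the training phase; in evaluation
        # or prediction this is an identity pass-through -- the same
        # mechanism Dropout uses.
        return K.in_train_phase(augmented, inputs, training=training)
```

In a deployed model the layer is effectively not there, since the non-training branch simply returns its input.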
It’s in my model; when I deploy it, I don’t want it doing data augmentation. There’s a backend function called in_train_phase, and it only executes its first branch when you’re in training. So I put all my code in this little embedded function and pass it in here. What happens is that during training, this code gets executed and the data augmentation occurs. But in the evaluation or prediction phase, it becomes disabled, and the layer is effectively not there. It’s actually the same way that dropout is implemented. You know how, if you put dropout in your model, it occurs during training but somehow magically goes away? If you read the source code for dropout, you’ll see this exact same line.

AUDIENCE: [INAUDIBLE] image generator, when you implement it, it does the same thing, really [INAUDIBLE]

ANDREW FERLITSCH: Right. Yeah.

AUDIENCE: But what’s that [INAUDIBLE] then about not using image generator?

ANDREW FERLITSCH: Well, if you look at how they merged Keras into TensorFlow, they merged the model API. So one could think, well, the Keras data management and data feeding system is still around. Really, the emphasis is to go with the TensorFlow ecosystem around the Keras API.

So here’s an example where I’ve gone a step further. Making my model, I have my inputs; this is for CIFAR-10. I normalize, so I don’t have to pre-process. I put my image data augmentation step here, then the rest of the model, finish the model, and go train it. While training, this layer is executing; in evaluation, it becomes just a direct pass-through.

Warmups.

Here’s another thing we’re really encouraging: the idea of doing a little bit of pre-training on your model before you go into full training. In the early days, you’d pick one set of hyperparameters, do a full training, and see what you got; if you didn’t like it, you tried another set of hyperparameters and did another full training. Then we evolved to the idea of hyperparameter search, where I take some set of parameters and train for a while, but not to completion, right? Then I look at what I got, and I keep doing that, more efficiently, until I find the best set of hyperparameters, and then I do my full training.

The problem with that approach is it overlooks the importance of the initial weights. We all have a lot of experience with how important getting the right learning rate and learning rate schedule is for training. But what we have found, and there’s a great paper on it called “The Lottery Ticket Hypothesis,” is that when your model is initialized, it’s taking a draw from some random distribution. Every draw is different, some draws are better than others, and they can make the difference in how fast your model converges and whether it converges on the best, or global, optimum. In the lottery ticket paper, they looked at why very large models work. When they studied them, they found that the large models were really a composition of lots of little models, models within models, where each one had its own initialization, like a separate draw from that random distribution. So it was as if each one held a lottery ticket, and one of them was the winning ticket. In the paper, they were trying to figure out how to cut that little model out, pull it out, and say: that’s the winner.

We’ve taken that concept into warmups. Generally the recommendation is,
before we start our hyperparameter search, let’s take several copies of the model, each with a different initialization, start with a really low learning rate, do very short runs on each one, and look closely at how stably the accuracy is increasing and the validation loss is going down. We use that to pick the best initialization; that’s my winning lottery ticket. I start with that, instead of a random one, before going into my hyperparameter search. So that’s how things are changing: the old general practice was to start with some arbitrary initialization, and the new general practice is to use the warmup phase to find the best initialization, use that model, and then move on to your next step.

Here’s how I would do it by hand. I’ve created some model here; it’s not compiled, and I’m going to save it. Remember, I don’t care what the default weights in this model are, because I’m not going to use them. This is one of my favorite little tricks. I’ve got a really low learning rate, and I’m going to make five copies by just repeatedly reloading the saved copy. Here’s the thing: I want to give each one of these a unique set of weights. So I can call the model’s get_weights to get its original weight set, and I could also feed that right back and reset all those weights, right?
But I want to change them. I know this model used a he_normal distribution, so I’m just going to take a new random draw from that he_normal distribution, different from the previous draw, and reset the weights with it. Here’s the key thing, and sometimes I see people in blogs or on Stack Overflow miss this line right here: what you get back is really a combination of the weights and the biases. The weights are going to be small floating-point numbers, while the biases are initialized to 0 or 1. If you start replacing those 0s and 1s with little numbers like 0.32 or 0.85, it’s not going to work: your model is going to take longer to learn, and it may not even converge. A nice trick is that in these models, what I find is the biases are always 1D vectors

and everything else is 2D or 3D. So I just have this little embedded if statement: only if it’s greater than one dimension do I know for certain it’s weights [INAUDIBLE] Otherwise I retain the 0 and the 1. That’s how I can get a unique distribution each time I reload the model from disk. I compile it, I do my short warmup test, just a short number of epochs, a short number of steps, and I keep the history. I’m going to save this history, and based on the history, I’m going to– somebody’s telling me I’ve got five minutes.

Deconvolution. [LAUGHS]

Here’s another one. A lot of people are used to convolution; deconvolution is really the reverse, the transpose, and you can use it in a pre-stem to solve problems. In a convolution, when it’s strided, meaning your strides are greater than 1, you’re actually making your input feature map smaller. In a deconvolution, you’re doing the reverse: instead of using a static algorithm to upsample, you’re learning the best algorithm for upsampling, for your data and your model. So I can use this. If I have an existing model that was trained for one input size, let’s say 128 by 128 by 3, and it’s already trained and I want to reuse it, but I’ve got a new input size that’s a lot smaller, what would I do?
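The warmup-by-hand procedure described a moment ago might be sketched as follows (this sketch clones the model in memory rather than saving and reloading it from disk, and the layer sizes, candidate count, and dummy data are all illustrative assumptions):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def fresh_copy(model):
    """Clone the model and redraw every weight matrix from he_normal,
    leaving 1-D bias vectors untouched."""
    copy = models.clone_model(model)  # same architecture, fresh build
    initializer = tf.keras.initializers.HeNormal()
    new_weights = []
    for w in copy.get_weights():
        if w.ndim > 1:
            # More than one dimension: a kernel/weight matrix -- redraw.
            new_weights.append(initializer(shape=w.shape).numpy())
        else:
            # 1-D: a bias vector -- retain its 0/1 initialization.
            new_weights.append(w)
    copy.set_weights(new_weights)
    return copy

# An uncompiled base model; its default weights don't matter.
base = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])

# Warmup: short, low-learning-rate runs on several candidate draws;
# keep the draw whose short-run history looks best.
x = np.random.rand(64, 28, 28).astype('float32')   # dummy data
y = np.random.randint(0, 10, size=(64,))
best, best_loss = None, float('inf')
for _ in range(3):
    candidate = fresh_copy(base)
    candidate.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
                      loss='sparse_categorical_crossentropy')
    history = candidate.fit(x, y, epochs=1, batch_size=16, verbose=0)
    final_loss = history.history['loss'][-1]
    if final_loss < best_loss:
        best_loss, best = final_loss, candidate
```

The `w.ndim > 1` check is the line the talk warns people not to miss: it keeps the 0/1 bias vectors intact while every weight matrix gets a fresh draw.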
Well, I can again use a pre-stem, and I can use deconvolution to scale the 32 by 32 by 3 up to that size: not statically, but actually learning the best algorithm for doing it for that data and that model. It’s real simple. Let’s say I’ve got a ResNet50 here, configured to output for CIFAR-10, for the 10 classes. As you can see, I’ve got a version that takes 128 by 128 by 3, and that’s just to show, hey, that’s really what it is. Now if I want to add to that, again I make a wrapper, a sequential model. I do two deconvolutions; by setting my stride to 2, I double the size, so it goes to 64 by 64. I keep the same number of channels, because I want the thing to match. I repeat it, and now I’m at 128 by 128 by 3. That now matches the input of the ResNet, so I combine them together. And now when I train, it will learn the best way to upsample those images so that this ResNet50 can learn.

What’s interesting is that these concepts weren’t around in 2015 and ’16 when the ResNet v1 and v2 papers were published. When they tried to use the original ResNet architectures, they couldn’t train on CIFAR-10; they couldn’t get it to converge. They actually had to make a reduced, specialized version, which they called ResNet CIFAR-10, and on that one they reported getting about 92% to 93% accuracy back then over 200 epochs. But remember, they couldn’t do it with a ResNet50. I took a ResNet50, used my deconvolutional pre-stem, had no problem with training, and got the same accuracy in 50 epochs. So using deconvolution in a pre-stem, I can accomplish what they couldn’t accomplish back then, with something as accurate, trained faster, and with no redesign of the architecture.

And this is probably a good point to quit, so some people have time to ask questions.

[LAUGHTER]

[APPLAUSE]
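As a closing sketch, the deconvolutional pre-stem walked through above might look like this (the kernel sizes and activations are assumptions; the ResNet50 here is untrained):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# An untrained ResNet50 expecting 128x128x3 inputs, 10 output classes.
resnet = tf.keras.applications.ResNet50(
    weights=None, input_shape=(128, 128, 3), classes=10)

# Wrapper with a learned-upsampling pre-stem: two stride-2 transposed
# convolutions take 32x32x3 CIFAR-10 images up to 128x128x3, keeping
# 3 channels so the output matches the ResNet's input.
prestem_model = models.Sequential([
    layers.InputLayer(input_shape=(32, 32, 3)),
    layers.Conv2DTranspose(3, (3, 3), strides=2, padding='same',
                           activation='relu'),   # 32x32 -> 64x64
    layers.Conv2DTranspose(3, (3, 3), strides=2, padding='same',
                           activation='relu'),   # 64x64 -> 128x128
    resnet,
])
prestem_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy')
```

During training, the transposed convolutions learn the upsampling jointly with the rest of the network, rather than using a fixed resize algorithm.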