Migrating Existing Open Source Machine Learning to Azure : Build 2018

>> All right. It looks like we’re all set. Thank you for being here this late in the day and making the treacherous journey across 7th Avenue. Thank you for coming. I am David Smith. I am a member of the Cloud Developer Advocates team at Microsoft, and you can see many of my fellow team members there on the right. The font is quite small, as you can tell, but if you want to check us out, you can have a look at this first link right there. Something you might notice if you do take a look at that link is that many of us on that team work in open source technologies, whether Kubernetes or Docker, you name the project, and we’re all quite involved in that. Part of our mission is to talk to folks about how to use open source technologies in the Microsoft ecosystem. And, in fact, that is going to be the topic of my talk today, which is about using open source tools for data science and machine learning within the Azure environment.

That’s a particular area of specialty for me because I’m a data scientist. I’ve been working with data for over 20 years at this point, as you might be able to guess from my gray hair. My tool of choice for doing data science and machine learning is the R language. Just for a quick show of hands, who here is familiar with R? More than I was expecting. So for those of you familiar with R, you will see a few examples. For those of you who are not, I will also be using other tools like Python, TensorFlow, and Spark in the examples today. But as I said, my tool of choice is R. I’ve been working in the R project for a long time, I’m a member of the R board, and I also write about lots of other aspects of machine learning and data science at the Revolutions blog, where you can have a look at what I write about. You can also follow me on Twitter. I tweet quite regularly about the things that I see.
So, moving on. This particular talk is not going to be very surprising, especially if you’re already using open source tools to do machine learning and data science. In fact, my intent with this talk is to be as unsurprising as I can, because what I want to show you is that you’ve already got your own workflow working with open source tools. You have the editor you like. You have the engines you like. You have the ways you like to do graphics and integrate with other systems. What I want to demonstrate to you today is that you can carry over all those same experiences and skills and bring them into the Azure environment with very few changes, except being able to make use of much greater processing power, the availability of things like powerful GPUs, and clusters, so you can take your machine learning development and scale it up to big datasets and big compute jobs within Azure. I’ll show you how to do that and try to make it as easy as I possibly can.

So in the talk today, I’m going to focus primarily on open source tools that you’re already familiar with. I’m actively going to avoid a lot of the Microsoft-specific stuff that you might be seeing in the other talks, and stick to the things you might be familiar and comfortable with. Me personally, I come from a Linux background, so I’m particularly happy now that I can use Bash within my Windows environment, and I’ll show you how you can do the same thing: use Bash within Windows, or any Unix or Linux-like environment, with the Azure command-line interface. Then I’m going to dive into some fairly detailed discussion of the Data Science Virtual Machine. This is a virtual machine image that we provide on Azure that has a whole bunch of tools set up and ready to go. It’s available on Windows, but I’ll be using the Linux image in the session today. We’ll also take a little look at Jupyter notebooks.
I’m sure many of you are familiar with using Jupyter notebooks, and I’ll show you how to do the same kind of thing in the Azure environment. I’ll also show you something that is a little bit more Microsoft-specific: the Azure Notebooks that are available within the Azure service, which is a great way to share notebooks you’ve developed. Then I’ll switch tracks a little bit and start having a look at GPU-enabled compute environments. Specifically, I’ll give you a quick demo of taking a Python script that you’ve written in your own environment and that uses TensorFlow computation, and show how easily you can transition that over to a batch cluster running GPUs to scale up that analysis. I’m going to show you another instance of scaling as well, in this case not to big compute but rather to big data, by scaling your analytics over into a Spark cluster, and how you can run R within a Spark cluster to do machine learning on data that lives in that cluster.

I’ll be popping up various links during the talk. Don’t panic if you don’t catch them. I will pop them up again at the end so you can grab them with your camera, and these slides will become available later on. I’ll also point out that this is quite a code-heavy talk. I do make the font as big as I can, but it can only go so far. You might consider moving a little bit closer because some of the fonts do get small. With my glasses, I need to do a bit of that myself.

So, first of all, let’s dive into how you can interact with the resources available to you in the Azure cloud. When I first started working at Microsoft a few years ago, this was basically the only option for interfacing with Azure. It’s a web-based portal. You can point and click your way around to set up virtual machines, for instance. I can do this with a small group: has anybody used the Azure portal so far? About half the people. That’s quite a few people. You should be familiar, then, with this idea of filling out a bunch of fields in a form to create a resource. It’s a really good way of discovering services.
It’s great for one-offs. It’s not so great when you need to do the same thing over and over again, because it’s quite tedious to repeat that process. One of the things I recommend you check out, if you haven’t, is at the very top of the Azure portal there. That’s the way to get to the Cloud Shell, which will basically give you a command line. It can be a Windows command line, PowerShell, or Bash, where you can run command-line commands to do those kinds of things again.

The way I actually like to do that is directly within my desktop environment. You can install the Azure command-line interface in your local shell. As I mentioned earlier on, I prefer Bash. So in a minute, I can show you how I deploy a virtual machine using a series of command-line incantations and then just wait for it to run and launch into that machine. This has been really good for me when I do big projects that I need to share with other people and I need to replicate the processes I’ve used to spin up services in the cloud. It’s also great for demos; as you’ll see in a minute, I can just run a line of code and get things done. Importantly, it’s available and consistent across all the different platforms, whether you use it on Windows or Mac or any of the various Linux environments, or within the Windows Subsystem for Linux, which is what I’m going to use in just a moment. You can find lots of information about the Azure CLI at this link, which I’ll provide to you again at the end of the talk.

I’m going to show you a little bit about the desktop environment that I tend to go for. I’m using a Surface Book 2 here. It was given to me when I joined Microsoft a few years ago. Prior to that, I typically used Macs.
And I do quite like it. One of the reasons I like it is Visual Studio Code. I was a big Emacs guy for a long, long time. In fact, if you have a look at the contributors file in Emacs, you will still find my name there. But I have moved to Visual Studio Code in the last couple of years because it has the same feel of being an editor with lots of extensibility. Most importantly to me, it’s cross-platform, so I get the same experience whether I use it on Windows or Mac or within a Linux environment, and it’s a really great team that works on it. So, again, just a show of hands, since I can do this with a small group: who’s used Visual Studio Code? A few people. A few people haven’t. One thing to remember is that Visual Studio Code is an entirely different thing from Visual Studio. They are designed to have similar kinds of interfaces, but Visual Studio Code is the open source variant and has lots of extensions and plugins created by the community for you to use.

I’ll be using it quite a bit in my demos, so let me point out some of the interface elements. We’ve got source code there with syntax highlighting; in this case I’m looking at a Python file. You can have an embedded terminal, and this is where I’ll be running several of my scripts to interface with the Azure CLI. There are various tools over on the left-hand side, depending on what kinds of plugins you’ve installed into Visual Studio Code. One thing I’m going to mention but not use in my demo is something called the AI Explorer. It comes from something you might want to check out called Visual Studio Tools for AI; if you use Visual Studio Code, it’s called Visual Studio Code Tools for AI. It provides a lot of streamlined interfaces to take scripts you’re developing locally, automatically move them over to a virtual machine that’s running in Azure, do the computations there, and return the results back locally, so you don’t have to mess around with a lot of the file-moving and data-moving side of things. It’s really useful. I’m just not going to show it to you here because I wanted to show you a process that works in any environment. But if you do want to check out Visual Studio Tools for AI, there’s a great talk coming up on Wednesday by Chris in the expo hall, so I do recommend you check that out.

So the first thing I wanted to do is give you a little tour of doing data science with the Azure Data Science Virtual Machine. Now, what is the Data Science Virtual Machine? It’s basically an image that a team of data scientists working at Microsoft have put together with lots and lots of data science and machine learning tools. Many of them are open source. There are a few proprietary tools, and I’ll mention those briefly. One of the really nice things about the Data Science Virtual Machine is that everything just works, and it just works together. If you’ve ever fallen into Python package hell, where one framework conflicts with another, the really nice thing about the Data Science Virtual Machine is that everything there works together straight out of the box. It is available in both a Windows flavor and a Linux flavor, and I will be using the Linux flavor for this particular session here today. Another nice thing about the Data Science Virtual Machine is that even though it does include some proprietary components, you don’t actually get charged for them.
All you actually pay for is the underlying virtual machine infrastructure rate, and I’ll talk a little bit about that in a second. Just a pro tip: if you want to do something with SQL Server or Microsoft R or some of those other tools there, you essentially get free developer access to all those tools via the Data Science Virtual Machine. But I won’t be talking about those in detail today. There are a couple of links there where you can find more info about the Data Science VM.

But let me zoom into that little chart there just so you can see some of the things that are available to you within the Data Science VM. All the languages that you might be familiar with for doing machine learning and data science are available to you. The two big ones, of course, are R and Python, available as complete builds there, plus many of the common packages and libraries that you use with them already come preinstalled. There are a few other interesting things too. There’s of course C#, and there’s Julia. If you haven’t checked out Julia, it’s a really interesting up-and-coming language for machine learning, already installed in that environment as well. When it comes to your developer tools, Visual Studio Code is available in there, PyCharm is there, RStudio is in there; all the IDEs and tools you use for developing applications with these languages are already available right there. There are lots of high-level data science tools allowing you to explore data within those environments, and that data could be stored within just about any kind of data store. If you want to play around with loading data into MySQL, or if you want to play around with loading data into Spark or Hadoop, all those instances come available to you directly within the Data Science Virtual Machine.
Plus, the modern machine learning and AI frameworks, TensorFlow, CNTK, PyTorch, are all built right into the Data Science Virtual Machine, along with lots of sample datasets and tutorials. It’s a great place to learn the new environments, if that’s something you’re thinking about as well. And it’s designed to work with GPU workloads. So if you’re working with TensorFlow and want to speed up the training of models using its GPU acceleration and GPU distribution, that all happens out of the box. In fact, the versions of TensorFlow and other frameworks that are installed on the Data Science Virtual Machine are already preconfigured to use GPUs. There’s actually nothing you need to do. It happens automatically when you run it on the GPU-enabled virtual machines. Those are the NC-class virtual machines, which have NVIDIA Tesla GPUs attached in various strengths and flavors, as you can see here. You also might find, as you’re looking through the Azure Marketplace, something called the Deep Learning Virtual Machine. Don’t get confused. That’s exactly the same as the Data Science Virtual Machine, set up to work on these NC-class virtual machines, and it includes extra datasets and tutorials to get you started with these AI frameworks.

So the whole idea that I’m going to show you is to do development first in your own environment, then scale up your workloads with the Data Science Virtual Machine, and then scale out to multi-user applications. You can either connect to a Data Science Virtual Machine or do this in your own environment. Use your standard development tools, do your debugging, and do that iterative process of trying out different models to get to a better one as quickly as you can. Then, when you scale that up to production-level datasets, put it on a larger virtual machine with more memory and faster CPUs, or, for those applications that are enabled for it, with GPUs to speed things up even further. And then scale that out to applications that have many, many endpoints, perhaps AI applications living at the back of a mobile application or a highly trafficked web-based application. You can then move to multi-node clusters, to Spark-based clusters, or even scale out into VM scale sets that are automatically scaled to accommodate the workload that your developers build around your data science workloads.

Just a couple of recommendations for when you’re looking at the various types of virtual machine instances that are available to use with the Data Science Virtual Machine. Really, there are many, many options here. One of the most common ones that I use, the most basic one that I tend to jump into, is this DS3 v2 instance, with four cores. It has a decent amount of RAM.
It’s decent for very active development. In my demos today for the Data Science Virtual Machine, I’ll be using a DS4 v2, which doubles those amounts, so I will have plenty of room there. If you really want performance, look at the F-series v2 virtual machine. That is a compute-optimized virtual machine with faster processing capabilities. It has a little less memory than those general-purpose machines, but that hasn’t been a problem for me when I want to go there and get the speed. When you’re going to the GPU-enabled virtual machines, I tend to gravitate toward the Standard NC6. That tends to be the one that has the most availability as well, for trying out TensorFlow workloads. Then there is a higher version, the NC6s v2, with double the RAM and a much faster GPU.

I just put some representative costs over here on the right. The marketing people really want me to emphasize that these aren’t exact quotes. It depends on which region you’re using it in, and these things change over time. But you can compare the relative prices of the different types of VM options there. One thing I’ll warn you against, by the way: there is a free option that’s available to developers in Azure, and that’s the Standard B1s instance. Just don’t even try those. They just don’t have enough capacity, RAM mainly, to do the kinds of things we want to do. So save yourself the pain and don’t even try.

Once you have launched one of the Data Science Virtual Machines, there are several ways to connect to it. And this is a problem, in a sense, until you figure things out: how do you use your graphical interfaces if you’re using GUI-type applications in these environments? Windows is actually the easiest. If you’re familiar with using Windows, you can just use Remote Desktop.
Azure makes it easy. There’s a file that you can use on your desktop. Double-click it and up pops a window that is the same as your desktop, and you can use it like it was any other machine. But like I said, I tend not to do that. I tend to use the Linux-type ones. Of course, for a command-line interface, we can just SSH over to the remote server and log in and connect to it that way. Just a little tip: I only recently discovered ssh-copy-id, like, within the last couple of months, and I was so angry with myself over the years I spent manually cutting and pasting keys over to remote servers until I discovered this. Why did nobody tell me about this before? So I’m sharing that with you now. Or you can use the X2Go client, where you get an entire X Windows interface or even individual applications. I’ll show you a quick demo of that, but I tend not to use that too much myself. The main things that I use when I interface with these virtual machines are the application-specific interfaces, mainly JupyterHub for Python and TensorFlow-type things. And when I’m working in R, I can interface with RStudio Server. This gives me a really nice, rich interface that’s running in my local browser. It’s nice and responsive, not laggy when the internet’s a bit crappy, as it can be with a remote desktop. I’ll show you how to do both of those right now.

So let me drop into a demo here. I’m going to move out to my browser. Right now I’m at portal.azure.com; this is the main user interface that you’re familiar with. What I want to show you here is the virtual machine I launched earlier with a script. It’s running on a Standard DS4 v2 with 28 gigs of RAM, and I also want to grab this IP address because I want to show you how I SSH into it. This is the Windows Subsystem for Linux. If you haven’t tried it out, it’s super easy. It looks like an app, but it’s actually an entire Linux environment sitting within Windows, and it works exactly like Linux. So right now, I have Bash, and what I can do is SSH over to my machine. As you can see, I’ve already set up my keys. And now I just have the standard Bash interface over on the remote machine. All right. Let’s actually run htop so that I can have a look at what’s running on this machine right now. It’s pretty quiet, but you can see it does have the 28 gigs of RAM available to me and eight CPUs, which are mostly idle. So let’s see what we can actually do with those CPUs at this point.

I mentioned you can use X Windows to connect. There’s a nice open source client available for Windows to do that, and it’s called X2Go. I’ve already set up the connection with the IP address, and I’ve already shared my SSH keys, so I can log directly into the client at this point. It’s making the connection. In a minute, it will pop up a window with a default interface. I know those fonts are quite small there.
But I really wanted to show you, if you are playing around with this, that preinstalled on the Data Science Virtual Machine is an Applications menu. We can go through and see the various things available to you for development, for example, Python, and this is how you would launch RStudio. There are lots of interesting ways to explore the Data Science Virtual Machine’s capabilities through that interface. But personally, as I mentioned, this is something I don’t use very often. If you like working in there, I just wanted to show you how that works.

Probably more relevant to me, though, is that going back to my browser, I can log into JupyterLab. So let’s have a look at what’s happening up here, first of all. What I have logged into via my local browser is port 9999 at the IP address of this remote server that’s running in Azure, and that is running Jupyter for me. The default installation on the Data Science Virtual Machine has a bunch of preinstalled tutorials, and I will run through these in a moment. But if you have your own notebooks that you have either developed locally and moved over to the Data Science Virtual Machine, or developed on the virtual machine, you would access them in exactly the same way. There are lots of things to play around with in here, by the way. There are examples using PyTorch and examples using CNTK. There’s a really good Julia tutorial here. Again, I recommend checking that out if you haven’t seen Julia before.

Just for those of you that might not be familiar with Jupyter notebooks, I can jump into this Intro to Jupyter one, for example. One of the things I can do is press Ctrl+Enter to run the Python in a cell, returning the result back into the notebook. And I can run that cell as many times as I want, because I’m getting random numbers every time, and getting a little chart. This is a great way for me to develop code that I’m going to share with others.
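The contents of that notebook cell are not captured in the transcript, so the following is only a minimal sketch of the kind of cell described: each run draws fresh random numbers and produces a small chart. The function names `random_histogram` and `render` are hypothetical, and a plain-text bar chart stands in for whatever plotting library the real notebook uses.

```python
import random


def random_histogram(n=500, bins=10, seed=None):
    """Draw n uniform random numbers in [0, 1) and count them per bin."""
    rng = random.Random(seed)  # seed=None gives different numbers on every run
    counts = [0] * bins
    for _ in range(n):
        counts[int(rng.random() * bins)] += 1
    return counts


def render(counts, width=40):
    """Return a text bar chart with one line per bin."""
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{i / len(counts):.1f} | {bar} {count}")
    return "\n".join(lines)


# Re-running this (like pressing Ctrl+Enter on the cell) gives a new chart each time.
print(render(random_histogram()))
```

Because the code and its output live together in the notebook, whoever opens it sees both the source and the result of the last run.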
I did this in Python, but if I want any other document that includes code and output to share with others, it’s a really great tool to do that. I mentioned earlier on that all of the packages, or libraries in this case for Python, that you need are already preinstalled on the Data Science Virtual Machine. So if you have a script that uses pandas, for example, here a famous dataset is imported from a website, and then describe is called to have a look at the summaries. All of that is built into the Data Science Virtual Machine. You don’t need to go ahead and install it yourself. So that’s Python notebooks.

The other interface I wanted to show you for the Data Science Virtual Machine is RStudio. It looks like I’ve been logged out. Let me log myself back in again. I don’t want to save that password. If you’re familiar with RStudio, you may have used it within your local environment. This is RStudio Server, which is exactly the same. It’s surprising how much the same it is as the desktop app, given that it’s actually running just here in my browser. The experience is entirely the same, with one exception, and it’s a really handy exception: if I quit my browser, close my laptop, and come back to this later on, everything is as I left it when I went away. So it’s actually a really convenient way of keeping your work persistent, as opposed to on a local laptop, where when you restart, everything typically goes away.

By the way, this is a nice little example of doing the birthday problem calculation. For the statisticians in the audience, this will be very familiar to you from your stats courses: calculating, for the number of people in a given room, the chances that two of us share the same birthday. There are probably 18 people here in this room, and we can calculate it using a simulation.
I’m using this function here in R. Again, the interface is happening in my browser, but that computation just happened over on the Data Science Virtual Machine in Azure. With these 18 people here in this room, there’s about a 35% chance that two of us share the same birthday. You can calculate this really easily using statistical theory, unless you take into account February 29th. Then it becomes a lot trickier. So the way I actually did this calculation was to do a simulation of 100,000 rooms with n people in each. For each of those 100,000 rooms, I simulate 18 birthdays and then check: are there any duplicates in the birthdays? Average over the 100,000 simulations, and it actually works pretty well.

I did it as a simulation to demonstrate that it’s a nice thing to use with a multicore virtual machine, because I can parallelize the simulation over room sizes of one, two, three, and so on, to get a sense of how that probability changes with the number of people in the room. I don’t want to go into the details here, but I’m using a library called doMC to parallelize iterations of the loop, running them simultaneously across the machine. So if I go back here and look at htop, you can see up here that all of my cores are being maxed out as R is running one of those 100 simulations on each of those cores and then collecting the results back into the local R environment. That should be finished in just a moment, which is a good chance for me to take a glass of water.

Anybody got a guess for how many people need to be in a room before there’s a 50/50 chance that two of them share the same birthday? Some of you know the problem from statistics? These are the 100 simulations I just ran by dividing them across the eight cores. If you look at the 50% mark, how many people in the room does it take before there’s a 50/50 chance of two sharing the same birthday?
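The R code from the demo is not captured in the transcript, so here is a minimal Python sketch of the same idea rather than the script that was shown. It ignores February 29th, uses fewer trials than the 100,000 in the talk to keep the run short, and swaps the R parallel loop for Python’s `ProcessPoolExecutor`; the function names are hypothetical. It also includes the standard closed-form result for comparison, which gives roughly 0.347 for 18 people and 0.507 for 23.

```python
import random
from concurrent.futures import ProcessPoolExecutor

DAYS = 365  # assumption: February 29th is ignored


def exact_probability(n):
    """Closed-form chance that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (DAYS - k) / DAYS
    return 1.0 - p_all_distinct


def simulate_room_size(job):
    """Estimate the shared-birthday chance for one room size by simulation."""
    n, trials, seed = job
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(DAYS) for _ in range(n)]
        if len(set(birthdays)) < n:  # a duplicate birthday shrinks the set
            hits += 1
    return n, hits / trials


def simulate_all(max_people=30, trials=5_000, workers=None):
    """Run one simulation per room size, spread across the available cores."""
    jobs = [(n, trials, 1000 + n) for n in range(1, max_people + 1)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(simulate_room_size, jobs))


if __name__ == "__main__":
    estimates = simulate_all()
    for n in (18, 23):
        print(n, round(estimates[n], 3), round(exact_probability(n), 3))
```

Each room size is an independent job, which is why this kind of simulation spreads so cleanly over the cores of a larger virtual machine: no job needs results from any other.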
It’s about 23 people. So if you’re in a room with 22 other people, chances are that somebody in that room shares a birthday with somebody else. All right. So that was an example of using the data