How to Do Predictive Analytics in BigQuery (Cloud Next '18)

[MUSIC PLAYING] ABHISHEK KASHYAP: Good afternoon, everyone I hope I’m not going to put anyone to sleep It’s right after lunch, so we’ll try to make it very, very exciting– exciting, Google BigQuery style I’m Abhishek, I’m a Product Manager on the Google BigQuery team And we’ll be talking about BigQuery ML, which we announced at the keynote this morning And in this session, I’ll be joined by Naveed Ahmad, who has been one of our alpha customers And Naveed will share his experiences on the use case [INAUDIBLE] for Hearst Newspapers and how they implemented it Before we start talking, I just wanted to point out that you can actually ask a question through the Q&A link in your Next app And we’ll get to those questions during the Q&A part at the end of the presentation So let’s start with, how many of you use BigQuery? That’s amazing Thank you And I see a lot of hands did not go up, so I’ll spend just a minute introducing what it is So BigQuery is Google Cloud’s enterprise-class data warehouse We have customers who have petabytes of data, and you can actually write a single query to analyze all of that at once You can use standard SQL It’s Google, so data is encrypted, data is durable, highly available The key part of BigQuery is that it’s serverless and fully managed, which means all you care about is what data you have and how you want to analyze it, and we take care of how many nodes you need in the back So the concept of a node does not exist, which means your life is much, much easier And finally, you can actually use it for real-time analytics as well So what does that mean? 
A lot of you who have used BigQuery can attest to the fact that having it be serverless has really changed how data warehousing is done, as your life has become much, much easier by focusing just on the analysis and not worrying about how it will scale when you go to petabytes of data With BigQuery ML, we are bringing the same to machine learning People say AI is changing the world, but in reality, even though you can think of use cases which can become predictive, it’s very, very hard to actually use machine learning So that’s what we are doing with BigQuery ML And to illustrate it, let’s look at what machine learning may look like in a couple of forms Let’s say you have a few terabytes of data in your data warehouse– it could be BigQuery or a different one How do you do machine learning? The first step in almost every instance we have seen is to export the data out So you go discover the data you want to use for training your model, you export it If you’re lucky enough to have a data science team, they will format it for your favorite Python framework, and come back to you in a few weeks saying, we need more data And it just takes months to build that model If you don’t have a data science team, you just take a very, very small sample, you move it to Excel, and you just run basic regression And I have talked to customers who say even the Excel-based approach takes months to build a model of the data So the two underlying problems for ML are, one, doing even the smallest ML project needs data scientists, and there are just too few of them It requires so much understanding of math and so much understanding of hyperparameter tuning that that becomes a bottleneck The second is it takes multiple tools and it takes moving data out to even start a machine learning project And all these are just productivity hassles that prevent you from trying out the cool ML stuff you have for your analysis So that’s what BigQuery ML enables We are introducing a way to do machine learning using 
SQL, staying within BigQuery Right where you write your select statement,

you just start doing machine learning on the data you already have So what do you get from this? You can build your own machine learning models in standard SQL You can predict in standard SQL, as well The models stay within BigQuery You can train them on whatever data you want to train them on within BigQuery, or data that BigQuery federates from other sources And finally, you only have to worry about the data and SQL We take care of the underlying hyperparameter tuning And we take care of feature transformation, standardization, and other things that usually are part of a machine learning workflow To give you a more practical example, let’s say you have marketing data coming into BigQuery So you get data from your website visitors from Google Analytics, and you have first-party data from your customer relationship management platform You move it to BigQuery because that’s where it’s so easy to [INAUDIBLE] and analyze it Now, rather than just looking at historical trends, if you wanted to start making predictions for your marketing counterparts, you would write a model using BigQuery ML You would take those predictions, report them out to your favorite BI platform, and you would productionize it by connecting it with your marketing automation systems to take action based on these predictions And in all of this, since you do it in BigQuery, it is all SQL It’s very, very easy to set this workflow up And when you want to retrain this model with new data, you just use Airflow or Cloud Composer So you can actually get to a fully production model– and we’ll have Naveed talk about it later in this session– staying within BigQuery I’ll give a few more examples to get you thinking about what kind of use cases it can empower you for So these are some of our public alpha customers We had quite a successful alpha, and these are the use cases that some of them pursued So Hearst, we’ll learn more about it later They used it for customer churn prediction 20th Century Fox presented in the keynote 
today, where they talked about how they are now optimizing their media plans for new movies using BigQuery ML Geotab just gave a demo at the Spark [? late ?] session just before this one, where they have an app for smart cities And one of those plug-ins is to predict if there’s going to be bad driving in a particular neighborhood based on the weather So again, they used a public dataset on weather that’s within BigQuery, combined with their telematics data And they can now say, OK, based on the weather forecast for the next seven days, these are the places where you may see aggressive driving, or too much braking, or accidents And city planners can then plan things much, much better And again, they did that staying within BigQuery, just using SQL We have News UK, who are using this for customer subscription prediction, again, using data that they get from a variety of sources in BigQuery Smart Parking– they are an IoT company out of Australia and New Zealand They are using it for traffic prediction And Reblaze, who is one of our partners– Cloud Armor is actually using them in the background And they are using it to automate IP address threat prediction, something they used to do using heuristics before So all these use cases have been enabled across these customers, who just did not have enough data scientists to do machine learning for each of these business decisions I will show you a quick demo of how this works To set it up, let’s say you are a marketer at a retail e-commerce company And one of the very common problems is that if you have a million visitors a month to your website, how do you plan your marketing campaigns to them? We actually have data from the Google merchandise store that shows only 0.7% of website visitors actually come back and make a purchase With that low return on investment from your marketing, you’re not going to target all of them So in this example I’ll show you how, using machine learning,

you can do a much, much better job at planning your marketing campaigns And if you can please switch to the demo screen So this is the new BigQuery UI that we launched recently And I’m sure a lot of you are familiar with this data So what it shows is we have this public dataset from Google Analytics from the Google merchandise store We are selecting a few columns from this to use as features And if you think about marketing data, some of these features are very, very obvious that you believe are going to have predictive influence on whether a person will come back and purchase or not And you just select these, like how many page views happened in a session? Did the person buy? Where did the person come from? Was it from a desktop? Was it from a cell phone? And what country was it? So you select these You do not select the person’s ID because that’s not useful And all you do is you say, I want to create a model And you give it a name The only option you need to give is what type of a model it is We support linear and logistic regression for forecasting or classification based on your use case So in this case, you are going to predict a label that says, will this visitor come back and buy? 
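The feature-selection step described here can be sketched as a plain query first The column names below follow the public Google Analytics sample dataset schema, but treat the exact field choices as an illustrative assumption rather than the precise query used in the demo:

```sql
-- Candidate features for "will this visitor come back and buy?"
-- drawn from the public Google Analytics sample dataset
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,       -- did the session convert?
  IFNULL(totals.pageviews, 0) AS pageviews,             -- page views in the session
  IFNULL(device.deviceCategory, '') AS device_category, -- desktop / mobile / tablet
  IFNULL(trafficSource.medium, '') AS traffic_medium,   -- where the visitor came from
  IFNULL(geoNetwork.country, '') AS country
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
LIMIT 10;
```

Note that, as mentioned, no visitor ID is selected– identifiers carry no predictive signal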
And just telling us what the label is on top of your regular SQL query– in this example, we took the first nine months of data to train this model You just simply let it go And this will take a few minutes to finish It’s working on a very large set of complex data But what you’ll get at the end of it– once you have this model, you use it for predictions, you can build a report, which is what any marketer does In this case, it’s on Data Studio that shows that if you were to just target the top 6% of the scored leads, you actually get nine times the people who come back and purchase And this was on the last three months of test data And furthermore, you capture 50% of the potential buyers So both the precision and the recall of this model are really high And if you look back at the create model statement we had, you actually did not have to give any hyperparameter tuning options You did not have to do anything, except just select the right features And if you selected more that don’t matter, those will be ignored with a low weight anyway So in this example, with a couple of lines of SQL code, you actually can manage to get nine times the ROI on your marketing campaigns, still reaching 50% of the population And if you were to think about the typical machine learning metrics, this model looks really good The ROC curve here is actually the best you can get It’s very hard to get a better ROC curve And similarly, the precision and recall curves are really good So through BigQuery ML, you got a model you could predict with And those of you who actually want to dive into the performance metrics, it’s really easy to do that, as well Let’s see if it’s still running If we can go back to the presentation, please Thank you So what happened behind the scenes? You go to Create Model You did not have to think about the infrastructure at all
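Put together, the demo boils down to two statements like the following This is a sketch modeled on the public BigQuery ML tutorial for this dataset; the model name and the exact date split are assumptions:

```sql
-- Train a logistic regression model on roughly the first nine
-- months of the Google Analytics sample data
CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(device.deviceCategory, '') AS device_category,
  IFNULL(trafficSource.medium, '') AS traffic_medium,
  IFNULL(geoNetwork.country, '') AS country
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170430';

-- Score the later months; ML.PREDICT appends a predicted_label
-- column and a per-class probability array to each input row
SELECT *
FROM ML.PREDICT(MODEL `mydataset.sample_model`,
  (SELECT
     IFNULL(totals.pageviews, 0) AS pageviews,
     IFNULL(device.deviceCategory, '') AS device_category,
     IFNULL(trafficSource.medium, '') AS traffic_medium,
     IFNULL(geoNetwork.country, '') AS country
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170801'));
```

The held-out later months play the role of the "last three months of test data" mentioned in the demo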

It created a bunch of queries on BigQuery, highly parallelized tasks, and gives you the answer So determining the infrastructure, thinking about how many compute hours I would need, do I have two VMs, do I have six VMs, all of that is a distraction With BigQuery, you build a model without having to think anything about it Second, there’s a hyperparameter called learning rate, which is how fast you want to change your parameters We autotune that one You never want to create your model on all of the training data So to prevent overfitting, we auto-split data into training and evaluation sets You did not have to do that in that example either When you think about numeric features, we automatically standardize those Again, it’s a very basic task– has to be done Why leave it to the user? We just do it And finally, for string features, we one-hot encode them So anything that needs to be a category, we see it and we one-hot encode it And this is just a start As you think about automation of all these basic transformations and feature engineering, we’ll just keep on adding to it And this should suffice for a lot of the use cases For the advanced user who may want to do more, we actually do give you the knobs 
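For the advanced user, overriding those automatic defaults looks something like this A sketch only: the model, table, and column names are hypothetical, and the specific option values are arbitrary examples, though the option names follow the BigQuery ML `CREATE MODEL` syntax:

```sql
-- Explicitly set the knobs instead of relying on the defaults
CREATE OR REPLACE MODEL `mydataset.tuned_model`
OPTIONS(
  model_type = 'logistic_reg',
  l1_reg = 1.0,                   -- L1 regularization: push uninformative weights toward zero
  learn_rate = 0.1,               -- fix the learning rate instead of letting it autotune
  data_split_method = 'seq',      -- sequential split instead of the default random one
  data_split_eval_fraction = 0.2, -- hold out the last 20% of rows for evaluation
  data_split_col = 'session_date' -- order rows by this column before splitting
) AS
SELECT label, session_date, pageviews, country
FROM `mydataset.training_data`;

-- Standard evaluation metrics for the trained model
SELECT * FROM ML.EVALUATE(MODEL `mydataset.tuned_model`);
```

With a sequential split, the split column must appear in the training query, and it is used for ordering rather than as a feature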
So if you want to use, say, L1 regularization so you get weights close to zero for different features, you have that option You say a random training/test data split does not work, I want to do a sequential one Or, I want to flag and split them myself You can do that, as well You want to set your own learning rate, you can do that, as well So the philosophy is you get a great model out of the box And if you want to further fine-tune it, we have the options for you Thinking about what we support– so you can use standard SQL, you can use UDFs within the queries If you have a forecasting use case– let’s say you want to forecast the sales of a SKU, or you want to forecast the number of users on your website, you can do it using linear regression If you want to classify, say, will a user come back and make a purchase, as in this example, or if you wanted to say, will a user subscribe, will a user churn, all those things you can do using logistic regression For evaluation of models, we have table-valued functions that you can use to get all the standard metrics that you may be used to if you already know some machine learning And for logistic regression, you can get the ROC curve and precision and recall numbers, as well If you want to look at the model weights to explain to someone which features mattered more and which mattered less, we have an ML.WEIGHTS [INAUDIBLE] for that one And finally, to do your feature engineering, there’s a function called ML.FEATURE_INFO And you can use that function to see standard stats, like min/max and standard deviation, on each of the features that went into your model As we announced it, we realized that the BigQuery UI is not the only place that our users use us from So we have integrations with Data Studio, Looker, and Tableau And if you go through our documentation, you can find out more But in this example, on the left, Tableau has built a UI-based housing price prediction tool, which is a very hot topic But in this case, using a model built in BigQuery 
ML, you can get predictions out of the box On the right, you see the Data Studio performance report I just showed And at the bottom, you see the performance report on Looker And with Looker, we actually have the deepest integration So one, they have Blocks so you can explore the data in BigQuery Second, using LookML, which is their markup language, you can actually create models staying within Looker And finally, you can operationalize your ML workflow through Looker, as well If you take a step back and think where it fits–

because everyone talks a lot about ML, Google announces things in TensorFlow, things about AutoML– where does BigQuery ML fit? We see this as a centerpiece in making ML accessible for all audiences TensorFlow and Cloud ML Engine are for the data scientist to build the state-of-the-art neural network-based models required for very high performance applications AutoML leverages the Google IP on standard problem sets, like object detection in images, object detection in videos, and semantic understanding of text, and gives a developer the ability to use that as an API to build those models But what about all these business use cases for which speed is more important, and for which it’s more important to have someone who understands the data build a model using the tools they know? That’s what BigQuery ML is for It’s for our BigQuery data analysts who work with data day in and day out and love SQL But it’s probably too much to ask to have them go learn TensorFlow, go set up five different tools, move it out of BigQuery just to be able to build a model So that’s where it fits within our portfolio We have AutoML for developers, BigQuery ML for analysts, and TensorFlow and CMLE for data scientists And if you think about the use cases in different industries, any business use case you see– from prediction of lifetime value to, if you are in a gaming company, what action will this player take? Will this player come back the next day? 
If you think about IoT, downtime prediction So for things which require a lot of data that is or can be in BigQuery, which is structured, you can use BigQuery ML to start approaching those use cases It’s really easy to get started with And once you get going, then you can improve your models and put them into production With that, I’m going to call Naveed from Hearst to talk about their experience with BigQuery ML and how they have actually implemented it in production Naveed [APPLAUSE] NAVEED AHMAD: Hello My name is Naveed Ahmad I’m a Senior Director for Data Engineering and Machine Learning at Hearst Newspapers So you might not have heard about Hearst Hearst is a very large media company You might have heard about names such as the “San Francisco Chronicle” or SFGate Or magazines like “Esquire” or “Cosmopolitan” And there are many, many more companies under the Hearst umbrella You can look it up on the Hearst website or the Hearst Newspapers website I’m here talking about Hearst Newspapers and its application of BigQuery ML Hearst Newspapers employs about 4,000 people across the nation and it focuses on local newspapers It has about 40-plus websites and it’s continuously growing by acquiring more websites And this is a picture of the Hearst building in New York, very close to Central Park So why BigQuery ML for Hearst Newspapers? 
We are already a big BigQuery shop We have a whole bunch of datasets already sitting in BigQuery, such as Google Analytics, our subscription data, newsletter usage, demographics So since this data was already there in BigQuery, it just made sense to– when I heard about this feature a few months ago– let’s give BigQuery ML a shot So another thing is, anybody who is familiar with SQL syntax can really get on board doing data science You need to learn basic concepts, like precision, recall, or ROC, which somebody can pick up very quickly But you don’t have to be familiar with Python or something like scikit-learn or TensorFlow to be able to use this Another benefit is that you don’t

have to ETL out data, do machine learning, do all that massaging in a script like Python and ingest it back, and put it into reports That whole cycle just gets cut down dramatically because you do everything in BigQuery And some of the other goodies with BQML are that it does normalization and one-hot encoding, as Abhishek already mentioned And then, after training the model, you can also get the machine learning metrics using SQL syntax, as well So the use case that we developed is churn prediction Churn prediction is to predict, given your existing active subscribers, how likely they are to cancel their subscription in the future And this is a very relevant use case for media, because you probably have heard in the news that people have lots of choices People tend to go from print to digital So if we could save some of our subscribers– I looked up a proverb for this Money saved is money made Then I thought, maybe there are other proverbs that apply to this scenario For example, prevention is better than cure And I have another one that I’ll use as an Easter egg for my talk tomorrow I’m giving a talk about machine learning in the context of publishing– it talks about features of BQML, AutoML, and how to apply natural language processing and TensorFlow in the space of media So if you find that relevant, please attend that And this gives insights into future cancellation of subscriptions And this is a binary classification problem People who cancelled and who didn’t So that fits very nicely into the logistic regression facilities offered by BQML And we took a year’s worth of data for our subscriptions, newsletter, demographics, web browsing behavior of a customer, and customer service data sitting in BigQuery So when I put this diagram– the architecture is really two nodes I just tried to fill this diagram out with all these different steps, really It’s BigQuery and Looker So everything you do is in BigQuery And our procedure really is, we’re running our standard ETL 
process into BigQuery And then we do some preprocessing We preprocess some of the more complicated SQL and pre-create those tables, especially with Google Analytics The linkage of subscription data to Google Analytics, we want to compute that beforehand And then once we have all our tables, it’s really a matter of running this Create Model SQL, which basically has one big table with a whole bunch of features from all these different datasets and a label column which says, is this person going to cancel or not? And we’re using the last year of data, splitting it into two halves– six months of data where we’re extracting features And from that six-month point onward, we know which subscribers cancelled and which didn’t And then you run the SQL, go for a cup of tea or coffee, and then come back and your model is ready for usage And then we take the existing subscribers with the same feature set and run them through the predict SQL And then it produces a probability score for each subscriber, how likely they are to cancel And then on top of it, we built Looker dashboards, which display the end data: how likely a person is to cancel or not And also the machine learning metrics– we can see how good or bad this machine learning model did, like show the weights that it learned, just to get more of an idea of what’s happening So this is a snapshot of our dashboard It’s probably hard to see, but over here this Look shows you, for each subscription– I have erased out the subscription IDs– the top people who are going to cancel their subscription So this is something that a marketer or a person dealing with subscription retention can look at and take action on Right beside it, we can see a Look with the score This is the area under the curve, which

essentially is, in this graph, it shows the area under this curve And all of these metrics you can get from the ROC SQL support in BQML And this other graph– the next question is, once you’ve done this modeling, what is your cutoff threshold? Like, are you going to take 50% and above and call them churners, or 30% and above? So using the same ROC, we’re plotting the false positive rate and the true positive rate And then each threshold shows what that is So our objective is to have a high true positive rate and a low false positive rate So at about 0.3, we have about a 20% false positive rate and an 80% true positive rate, which I feel is a good trade-off, because the worst case scenario is that you’re going to send some extra emails or messaging Doesn’t hurt if those people didn’t actually turn out to be churners And we’re plotting out the weights So on the left side, these are the positive weights And this is another SQL from BQML And these are the negative weights So the positive weights mean that those features have a positive correlation to churning, while the negative ones– which are good for us, good for retaining subscribers– are negatively correlated And there are two types of features And you have to tweak your SQL If they’re categorical features, then you have to unnest and use a slightly separate syntax to extract them So we’ve plotted both of them I’ve deliberately erased the features because that gives out some of the key findings that we made from this So these are the benefits that we see from BigQuery ML When I heard about it– and I had previous experience building churn models– I actually built out that SQL, a very simple, one-feature churn model, in a day Since we had that data sitting there, it was a matter of coming up with the SQL As you’ve seen in the demo, it’s not very difficult And then we spent a few sprints refining that model And it really was like getting more features, testing it out, looking at the metrics, does it make sense? 
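The churn workflow Naveed describes can be sketched end to end like this Every table, column, and model name below is hypothetical; the function names and output shapes follow BigQuery ML:

```sql
-- Train: one row per subscriber, with features from the first six
-- months and a 0/1 label saying whether they cancelled afterwards
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT churned, tenure_days, pageviews_90d, newsletter_opens, plan_type
FROM `mydataset.subscriber_features`;

-- Predict: pull the probability of the positive class out of the
-- per-class probability array that ML.PREDICT returns
SELECT
  subscriber_id,
  (SELECT p.prob FROM UNNEST(predicted_churned_probs) AS p
   WHERE p.label = 1) AS churn_probability
FROM ML.PREDICT(MODEL `mydataset.churn_model`,
  (SELECT subscriber_id, tenure_days, pageviews_90d, newsletter_opens, plan_type
   FROM `mydataset.active_subscribers`));

-- Pick a cutoff: scan thresholds for a high true positive rate
-- (recall) at an acceptable false positive rate
SELECT threshold, recall, false_positive_rate
FROM ML.ROC_CURVE(MODEL `mydataset.churn_model`)
ORDER BY threshold;

-- Weights: numeric features carry a single weight; categorical
-- features nest per-category weights and need an UNNEST
SELECT processed_input, weight
FROM ML.WEIGHTS(MODEL `mydataset.churn_model`)
WHERE weight IS NOT NULL;

SELECT processed_input, cat.category, cat.weight
FROM ML.WEIGHTS(MODEL `mydataset.churn_model`),
     UNNEST(category_weights) AS cat;
```

The last query is the "slightly separate syntax" for categorical features: each category's weight lives inside the `category_weights` array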
So that cycle was what we did in sprints And also, of course, we built that dashboard in Looker If it was something else, like scikit-learn, you would have been pulling this data out into another system And these two sprints could have become like six sprints or a much longer time And we focus on the problem rather than the data science nuances And the feature engineering is clearly laid out in SQL So if a new person looks at it, anybody familiar with SQL can easily understand what’s going on, rather than looking at Python or R code along with SQL And it’s easy to onboard new members People who are familiar with SQL who are not actually developers but work in BI or other departments, or who are willing to learn SQL, can actually use this And then also, it utilizes the scale of BigQuery So since it’s a distributed [INAUDIBLE] based thing, you can run your models on larger datasets in less time All right Thank you [APPLAUSE] ABHISHEK KASHYAP: So before we wrap it up, let’s go back and look at what happens to the model Can you switch to the demo screen, please? So as you can see, the model was built and the model is stored It shows up as an icon in BigQuery These are the training stats So when you’re building the model, you want to look at what the training loss was And this shows the features and labels that went into building the model So when you start using it, you are going to start seeing models as a new type that’s showing up on your BigQuery console
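The training stats shown on that screen are also queryable in SQL A sketch with a hypothetical model name:

```sql
-- One row per training iteration: loss on the training data, loss
-- on the held-out evaluation data, learning rate, and duration
SELECT iteration, loss, eval_loss, learning_rate, duration_ms
FROM ML.TRAINING_INFO(MODEL `mydataset.churn_model`)
ORDER BY iteration;
```

Watching `loss` and `eval_loss` together is the quickest way to check that a model converged without overfitting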

And then you can start using it for predictions to get the dashboard reports that we saw So can we go back to the presentation? Thank you So just to finish it, what can you do going out of this? If you want to try machine learning, I think the first step is to identify the use case And as Naveed mentioned, look at what’s the most important for your company Then see, is it a forecasting problem or classification? And then you can use [INAUDIBLE] problem type You can go to the BigQuery console and start using it today If you have any feedback, please send it to us It’s a public beta now We will look at it and get back to you And finally, we have included it in the Coursera course on data analytics And using this QR code– it’s going to Next ’18– it’s actually free for the next month So you can go and learn more It will walk you through a few examples and actually get you started So we have some time for Q&A And just to remind again, you can ask them using your app or you can ask them at the mic here And the ones through the app which we do not cover today, we will get to them by tomorrow So you can come back and check for the answers Thank you [APPLAUSE] [MUSIC PLAYING]