Live Coding A Machine Learning Model from Scratch (Google I/O'19)



Hello everyone, welcome to Live Coding a Machine Learning Model from Scratch. My name is Sara Robinson, and I'm a developer advocate on the Google Cloud Platform team focused on machine learning. You can find me on Twitter at @SRobTweets, and most recently you can find my blog at sararobinson.dev. Let's dive right in with what we're going to cover today: I'll start with a quick overview of what machine learning is, then I'll talk about the model we'll be building, and finally we'll get to the live coding.

At a high level, what is machine learning? I really like this definition: using data to answer questions. The idea is that as we provide more and more data to our machine learning systems, they'll be able to improve and generalize to examples they haven't seen before. We can think of almost any supervised learning problem this way: we have our labeled training inputs, we feed them into our model, and our model outputs a prediction. Those training inputs could really be anything. They could be the text of a movie review, and our model could be doing sentiment analysis to tell us it's a positive review. They could be numerical or categorical fitness data, and maybe our model predicts the quality of sleep we'll get. They could be image data, so in this example our model is predicting that this is an image of a cat.

This model concept can seem really magical, but it's actually not magic under the hood; it all boils down to matrix multiplication. If any of you remember y = mx + b from your algebra classes, this might look familiar. The idea is that you have your features as matrices (those are your inputs), you have the thing you're trying to predict, and then you have these weight and bias matrices, which are initialized to random values. As you train your model, you find the optimal values for those weight and bias matrices so that you get high-accuracy predictions.

That sounds great, but you might be thinking: none of the examples I showed on the previous slides had matrices anywhere. Well, it turns out you can represent pretty much any type of data as a matrix. Take this image as an example: all an image really is is a bunch of pixels, and since this is a color image, each pixel has a red, green, and blue (RGB) value, so the image becomes three matrices of the RGB values for each pixel. Now let's say you have categorical data: a particular column in your dataset describes the industry a person works in, and there are three possible values. The way you encode categorical data is through what's called one-hot encoding. To do this, we create an array with the number of elements corresponding to the number of categories you have, in this case three, and all of the values are zero except for a single one; the index of that one corresponds to the value of the category. Maybe you're thinking: wouldn't it be easier if we just numerically encoded these, healthcare as 1, finance as 2, retail as 3? We could do that, but then our model would interpret these as continuous numerical values and would assign higher weight to retail; it would say retail is greater than finance, which is greater than healthcare. We want our model to treat all of these inputs equally, which is why we one-hot encode them.
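To make that concrete, here is a minimal sketch of one-hot encoding in Python; the helper function and the category ordering are just for illustration, not code from the talk:

```python
# One-hot encode the "industry" categorical column from the example above.
# The category list and its ordering are illustrative assumptions.
industries = ["healthcare", "finance", "retail"]

def one_hot(value, categories):
    """Return an array of zeros with a single 1 at the category's index."""
    encoding = [0] * len(categories)
    encoding[categories.index(value)] = 1
    return encoding

print(one_hot("finance", industries))  # [0, 1, 0]
```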
So I've shown you how we can transform our data for our machine learning models. In this talk I want to share with you some tools on Google Cloud that can help you build ML models, and I like to visualize them as a pyramid where you can choose your level of abstraction. If you want to get down to the details of matrix multiplication and build all the layers of your model from scratch, which is what we're going to do today, you can do that, but you also don't need ML expertise to get started. The tools at the bottom of the pyramid are targeted more toward data scientists and ML engineers, and as we move toward the top they're targeted toward application developers who may not have machine learning expertise. In this talk we're going to build a model with TensorFlow and deploy it on Cloud AI Platform; the idea is to turn you all into data scientists and ML engineers.

When I was thinking about the type of model I wanted to build for this, I wanted to choose something that would resonate with developers, and as a developer there's one tool I can think of that I use on a daily basis, and that is Stack Overflow. So I wanted to build a text classification model to see if we could predict the tags of a Stack Overflow question. To do this we need a lot of data, and luckily we have a public dataset available in Google BigQuery, which is our big data analytics warehouse on Google Cloud Platform. BigQuery has lots of really interesting public datasets for you to explore and play with, and they happen to have one of Stack Overflow questions. It doesn't have just a few Stack Overflow questions: it has over 26 gigabytes and over 17 million rows of questions, so this is a great place to get started.

Since we don't have a ton of time, I wanted to simplify the problem space, so in this example we're just going to classify questions that have one of five tags related to data science and machine learning. For this particular question, our model should categorize it as pandas, which is a Python library for data science. Our first step is to get the data from BigQuery. BigQuery has a great web UI where we can write SQL directly in the browser, get our data, and then download it as a CSV, which is what I did. I'm extracting the title and the body of each question concatenated into one field, getting a comma-separated string of the tags, and keeping only questions with these five tags.

I ran this query, and as I was looking at the results I noticed something: a lot of question posters will conveniently put the name of whatever framework they're using in the question, which is very helpful. That got me thinking: do we actually need machine learning for this? Could we just replace our entire ML system with an if statement: if "tensorflow" is in the question, tag equals tensorflow? The answer is no, because although lots of question posters do this, there are a lot of really good questions that just dive right into the code and may not mention the framework or tag name at all. We want to capture those questions too, and we don't want our model to just pick up on those signal words so that it only identifies questions containing the word "tensorflow" as TensorFlow questions; we want it to be able to generalize and find patterns across these tags. So when I was preprocessing the training data, what I wanted to do was take out these obvious keywords and replace them with a common word, and because everybody loves avocados, I replaced all of these words, including the abbreviations for the frameworks, with the word "avocado". The results in BigQuery look something like this: we have avocado predictive models, we have avocado datasets, lots of avocados everywhere.
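The exact query isn't shown in the talk, but here is a hedged sketch of what that kind of extraction might look like against the public Stack Overflow dataset. The table name is the real public table; the column choices, the assumed list of five tags, and the keyword-replacement regex are my assumptions:

```python
# Hypothetical sketch of extracting and "avocado-izing" the training data.
# bigquery-public-data.stackoverflow.posts_questions is the real public table;
# everything else (tag list, regex, column names) is an assumption.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  CONCAT(title, ' ', body) AS text,
  tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE REGEXP_CONTAINS(tags, r'(?i)tensorflow|keras|pandas|matplotlib|scikit-learn')
"""
data = client.query(query).to_dataframe()

# Replace the obvious framework keywords (and abbreviations) with a neutral token.
keywords = r'(?i)\b(tensorflow|tf|keras|pandas|pd|matplotlib|plt|scikit-?learn|sklearn)\b'
data['text'] = data['text'].str.replace(keywords, 'avocado', regex=True)
```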
What I didn't talk about before is how we encode freeform text data into matrices. There are a couple of different approaches for doing this; I'm going to use one called bag of words, which is a really simple one for getting started. You can think of each input to a bag-of-words model as a bag of Scrabble tiles, except that instead of a letter on each tile you have a word on each tile. This type of model cannot detect the order of words in a sentence, but it can detect the presence or absence of certain words.

To show you how this works, I want to walk through a really simple example. For this example we'll limit our problem space even further and say we're only tagging questions with three types of tags. Bag-of-words models have a concept called a vocabulary, so pretend for a moment that you're learning English for the first time and you only know these ten words; that's how our model is going to look at the problem. It might lead to some interesting conversations if we only knew these ten words. When we take this input question, "how to plot dataframe bar graph", we'll look at our vocabulary and say: okay, I recognize these three words; the rest of the words in the question are going to be gibberish to the model. When we feed all of our questions into our model, we want to feed them in as matrices of the same size, so each question becomes an array the size of our vocabulary, with ones and zeros indicating which words from the vocabulary are present. Because "dataframe" is at the first index in our vocabulary, the first element of the array becomes a 1, even though "dataframe" is not the first word in the question; the same goes for "graph" and "plot". To summarize, our question becomes a vocabulary-sized array of ones and zeros, which is called multi-hot encoding. And since this particular model is going to be able to identify questions that have multiple tags, not just one, our prediction will also be a multi-hot array.
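Here is a tiny sketch of that multi-hot bag-of-words idea in plain Python; the ten-word vocabulary is illustrative, standing in for the one on the slide:

```python
# Multi-hot encode a question against a tiny illustrative vocabulary.
vocab = ["dataframe", "plot", "graph", "keras", "layer",
         "tensor", "column", "csv", "matrix", "array"]

question = "how to plot dataframe bar graph"
words_in_question = set(question.split())

# 1 if the vocabulary word appears anywhere in the question, else 0.
bag_of_words = [1 if word in words_in_question else 0 for word in vocab]
print(bag_of_words)  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```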
So now we know how to encode our text data; what does our model actually look like? The input to our model is that vocabulary-sized bag-of-words matrix, and we feed it into what are called hidden layers. This is going to be a deep neural network, which means we have layers in between our input and our output layer. It takes the vocabulary-sized array and resizes it to whatever size we choose for our second layer and our third layer. Now, the output of these hidden layers doesn't mean too much to us; our model uses it to represent complex relationships, but what we really care about is the output of our final layer. There are many options for how to compute the output of that layer; we're going to choose one called sigmoid, and what it does is return a value between 0 and 1 for each tag, corresponding to the probability that the tag is associated with the question. For this particular example, it looks like the question has a high probability of being about Keras or TensorFlow.

So what are all the tools we're going to use to build this? I already showed you how we use BigQuery to gather our data and download the CSV. We're going to use three open source frameworks to do some preprocessing and transformations on our data to get it into the right format: pandas and scikit-learn, and then we'll use TensorFlow to build our model, specifically tf.keras. We'll run our training and evaluation in Colab; Colab is a cloud-hosted Python notebook that you can run in the browser, and it's totally free for anyone to use. Finally, we'll deploy our model on Cloud AI Platform.

The title says live coding, so let's go over to the demo. Can we switch to the demo? Awesome. Here we have our Colab notebook; it's connected. Again, anybody can access Colab in the browser at colab.research.google.com. As you can see, we've got no code in here, just a bunch of comments; what could go wrong? I'm going to start writing some code and run it cell by cell. Colab has this handy snippets tool, and I've saved a couple of snippets for this notebook that I'm just going to drop in here. We'll run the first one... all right, I had to reset the runtime, and now we're connected. Our first cell just imports all the libraries we're going to use: TensorFlow, pandas, a couple of scikit-learn utility functions, and Keras to build our model. So we've got all of our imports; let me make the text just a tad bigger so you can all see.

The next thing we want to do is authenticate. Colab has a handy authentication function we can run, and what it does is pop up a URL for us to authenticate to our cloud account, so I'll allow access, copy this code, and paste it in. Okay, now we're authenticated, and we can get to the fun stuff. First we want to download our CSV. Now that I'm authenticated, I can access the CSV with all this data that I've saved in Google Cloud Storage, which is our object storage tool on Google Cloud Platform. I'm going to download the CSV to my local Colab instance, and then I'm going to use pandas to read it in, which transforms our data into what's called a pandas DataFrame, as you'll see in a moment. The next thing we're going to do is shuffle our data. This is a really important concept in machine learning: in case your data was in any sort of order before, you want to make sure you shuffle it, and I'm using the scikit-learn shuffle function to do that. data.head() lets us preview our data, so this is what it looks like: we've got our tags as comma-separated strings, and we've got our question text with lots of avocados.

We can't feed this to our model in its current form, so we're going to need to do some encoding. First we'll take care of the tags. We want to encode these tags, as we saw on the slides, as multi-hot five-element arrays, because we have five possible tags. In the first line I'm splitting each tag string into an array of strings, and then I'm using the scikit-learn MultiLabelBinarizer, which takes all of those arrays of strings and transforms them into multi-hot arrays. For the first question we can see it's about TensorFlow and Keras, and here's the reference array that scikit-learn has created for us, so our input becomes this. We've encoded our labels, and we're almost ready to move on to the questions.

Before we do that, we need to split our data. Another important concept in machine learning is the train/test split: we take the majority of our data and use it for training our model, in this case 80%, and we set aside a smaller portion for testing, so that we can see how our model performs on data it has never seen before. In this case we have about 150,000 questions in our training set and 37,000 in our test set.
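As a rough sketch, the preprocessing steps just described might look like this in the notebook; the CSV filename, column names, and random seed are assumptions about how the export was saved:

```python
# Read, shuffle, and encode the exported Stack Overflow data.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils import shuffle

data = pd.read_csv('stack-overflow-data.csv')   # placeholder filename
data = shuffle(data, random_state=22)           # shuffle in case the export was ordered
data.head()

# Encode the comma-separated tag strings as multi-hot five-element arrays.
tags_split = data['tags'].str.split(',')
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)

# 80/20 train/test split.
train_size = int(len(data) * 0.8)
```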
So now we'll split our labels into training and test sets, setting train_tags and test_tags from the encoded tags using that train size. We'll run that, and our labels are ready to go. The next thing we want to do, now that our label data is ready, is encode our question data into bag-of-words matrices. I've written a class to do this, so I'll paste it in here. It looks like there's a lot going on, but I'll explain. What's happening is that we're using the Keras tokenizer utility: luckily we don't have to write all the code to convert our freeform text to bag of words by hand, because Keras has a utility that does it for us. All we do is pass it our vocabulary size. In the super simple example I showed before, we had a vocabulary size of ten, which would take the top ten words from our dataset; since this dataset is way bigger, I chose a vocab size of 400, which you'll see in a moment. You want to choose something that's not so small that it only captures common words across all of your questions, but you also don't want to choose something so big that your bag-of-words arrays become all ones.

We'll run this, and now we're actually going to use that class. We'll import it and we'll split our questions into train and test sets; I'll do that right now. They're still strings, we haven't encoded them yet. The next step is to instantiate a vocab_size variable, which is going to be 400, and now we'll actually tokenize the text. We'll create a variable called processor and instantiate that text preprocessor class, passing it our vocab size, and then we can call processor.create_tokenizer and pass it our training questions; that's this method right here. Finally, the last part is actually creating those bag-of-words matrices: the training set will be equal to processor.transform_text (that method right there) called on our training questions, and the test set will be the same with our test questions, which we defined up here. I'm going to run that; it's going to take a bit of time, because it's transforming all 180,000 of our questions from text to bag of words. While that's running we can keep writing the other cells, and they'll run when this one completes. What I want to do is print the length of the first instance, which should be 400, and then we'll just log that bag-of-words matrix so you can see it, and then we get to building our model; we'll come back to this when it finishes running. The next thing we want to do is save the tokenizer we created, because we're going to need it when we deploy our model: we're using a Python utility called pickle to save that tokenizer object to a file.
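Here is a minimal sketch of what that preprocessor class might look like, built on the Keras Tokenizer, plus the pickle step. The method names mirror how they're used above (create_tokenizer, transform_text), but the details, variable names, and file name are assumptions rather than the exact class from the talk:

```python
# Bag-of-words text preprocessor built on the Keras Tokenizer, plus the
# pickle step used to save it for deployment. File names are placeholders.
import pickle
from tensorflow.keras.preprocessing import text

class TextPreprocessor:
    def __init__(self, vocab_size):
        self._vocab_size = vocab_size
        self._tokenizer = None

    def create_tokenizer(self, text_list):
        # Learn the top `vocab_size` words from the training questions.
        tokenizer = text.Tokenizer(num_words=self._vocab_size)
        tokenizer.fit_on_texts(text_list)
        self._tokenizer = tokenizer

    def transform_text(self, text_list):
        # Convert each question into a vocab-sized multi-hot array.
        return self._tokenizer.texts_to_matrix(text_list, mode='binary')

VOCAB_SIZE = 400
processor = TextPreprocessor(VOCAB_SIZE)
processor.create_tokenizer(train_qs)             # train_qs: training question strings
body_train = processor.transform_text(train_qs)
body_test = processor.transform_text(test_qs)

with open('processor_state.pkl', 'wb') as f:     # save the fitted tokenizer
    pickle.dump(processor, f)
```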
So now it's time to create our model. We've transformed our data and saved our tokenizer; now we actually want to write the code for our model using Keras. I'm going to wrap this in a method called create_model, which takes our vocabulary size and our number of tags, and we're going to use the Keras Sequential model API, which I imported above. This is my favorite API for building models because it essentially lets you define your model as a stack of layers, so you'll see that the code we're about to write corresponds really nicely to the model diagram I just showed. Okay, it looks like that earlier cell finished running, so we can see what the bag-of-words input looks like for our first question: an array of ones and zeros. Let's continue building our model.

The first layer of our model is going to be a dense, fully connected layer, which just means that every neuron in the input is connected to every neuron in the output of that layer. We need to tell the first layer what the input shape is going to be, in this case our vocab size of 400, and finally we need to tell Keras how to compute the output of this layer. The great thing about Keras is that we don't need to know exactly how this activation function works under the hood; we just need to know the right activation function to use, and in this case that's ReLU. We'll add one more hidden layer, also with the ReLU activation function; we don't need an input shape here because it's inferred from the previous layer. So far these are our hidden layers, and we don't care too much about their output, but what we do care about is the output of our final layer. That layer is going to have five neurons, so it outputs a five-element array since we have five tags in our dataset, and the activation function here is sigmoid. That's really all the code for our model, just four lines of code to define it in Keras.

Next we need to run model.compile so that we can actually train and evaluate this model, and to do that we need to tell it a couple of things. We need to tell it our loss function; this is how Keras computes the error of our model, so every time it runs training it uses this function to measure the error between what the model predicted and the ground truth, what it should have predicted. Again, we don't need to know exactly how this particular loss function works under the hood; it's just the best one to use for this type of model. I also need to tell it my optimizer; the optimizer is how the model updates its weights after it goes through a batch of data. Finally, I want to tell it how to evaluate my model as it's training, and we're going to use accuracy as the metric here. Then I return my model.

Now we'll actually create our model: we call create_model, pass it our vocab size and our number of tags, and then we can use the model.summary() method to see what our model looks like layer by layer. We can run that, and we've got our model ready to go, but we haven't trained it yet. So the next thing we'll do is train our model, and we can do that with one method call, just model.fit. We pass it a couple of things: our bag-of-words arrays, which are our input features, and our labels, which are the tags we've encoded. We need to tell it how many epochs to run training for; this is how many times our model will iterate over our entire dataset, so we'll go through the entire dataset three times. Batch size is how many examples our model looks at at one time before it updates the weights; in this case we'll use 128. Then I'm going to pass an optional parameter called validation_split, which takes 10% of our training data and evaluates on it as our model is training.

So we're going to run training, and it should train pretty fast. What we ideally want to see is that our validation and training loss are both decreasing, and that's what we get, which is a good sign. If your validation loss is increasing while your training loss is decreasing, it might be a sign that your model is learning the training data too closely. We can see that our accuracy is 96%, which is pretty good.
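Putting those steps together, a sketch of the model code might look like this. The hidden layer sizes and the specific loss and optimizer names (binary cross-entropy and Adam, standard choices for a multi-label sigmoid output) are my assumptions, since the talk doesn't read them out:

```python
# Keras Sequential model: two ReLU hidden layers and a 5-unit sigmoid output.
from tensorflow import keras

def create_model(vocab_size, num_tags):
    model = keras.Sequential([
        keras.layers.Dense(50, input_shape=(vocab_size,), activation='relu'),
        keras.layers.Dense(25, activation='relu'),
        keras.layers.Dense(num_tags, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy',   # error measure for multi-label output
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

model = create_model(VOCAB_SIZE, num_tags=5)
model.summary()

# Train: 3 passes over the data, 128 questions per weight update,
# 10% of the training set held out for validation during training.
model.fit(body_train, train_tags, epochs=3, batch_size=128, validation_split=0.1)
```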
The next step is evaluating our model: on the 20% of questions that we set aside, we'll now see how our model performs on data it has never seen. We call model.evaluate, pass it our test bag-of-words matrices and our test tags, and tell it our batch size. Our evaluation accuracy is very close to our training accuracy, which is a good sign. The last thing we want to do is save our model to a file; Keras uses the HDF5 (.h5) file format, so we'll save our model there.

Now we're going to test it out locally. We have a custom prediction class that I've written, and all it does is preprocess our data: it instantiates our saved model from the file, instantiates our tokenizer, takes the question as text, transforms it, and returns a prediction, which is that sigmoid probability array. Let's save some questions to predict; this is just an array of two sample questions, and then we'll make a prediction with our local model. This should return... cool: for the first question it predicted Keras, which is accurate, and for the second one it predicted pandas. So we've got our model working locally with 96% accuracy.

The next thing we want to do is package it up and deploy it to AI Platform. I've written some code to do this, and one of the features we're going to make use of that's new to Cloud AI Platform is custom code, which lets us write custom server-side Python code that runs at prediction time. This is really useful because it lets us keep our client super simple: our client just passes the text to our model, and we don't have to do any transformations on the client. We pass the text to the model, and on the server we transform that text to bag of words and return the predicted tags. To do this we need to copy some files to Google Cloud Storage so that Cloud AI Platform can find our model and our tokenizer. I'm going to use the Google Cloud CLI, gcloud, to set my current project to the one I've configured for this demo, and finally we want to deploy this to AI Platform. Here's the deploy command we're going to run: we're setting our minimum nodes to one so that our model doesn't scale to zero, and I'm creating a new version called io19, and we're going to deploy that. It will take a couple of minutes to run, but if I look over here in the Cloud AI Platform model UI, we can see that my model is deploying, which is pretty cool. While that's deploying, let's go back to the slides.
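Before moving on, here is a rough sketch of the evaluation, export, and local sanity-check steps from the notebook above; the file name and the sample question are placeholders, and this local check calls the Keras model directly rather than through the custom prediction class:

```python
# Evaluate on the held-out test set, save the model, and sanity-check locally.
eval_loss, eval_acc = model.evaluate(body_test, test_tags, batch_size=128)
print('Eval accuracy:', eval_acc)

model.save('keras_saved_model.h5')   # Keras HDF5 export used for deployment

# Local check: run a new question through the same preprocessor and read the
# sigmoid outputs as per-tag probabilities.
test_requests = ['How do I plot a dataframe column as a bar chart?']
probs = model.predict(processor.transform_text(test_requests))[0]
print(dict(zip(tag_encoder.classes_, probs)))
```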
So we built something that looks pretty good: we've written code to preprocess our data and we've trained a model with pretty high accuracy. But can we do better? Right now our model is pretty much a black box: we don't know how it's making predictions. We know the predictions are accurate, but we don't know what's leading it to those conclusions. As Sundar mentioned in the keynote, we want to make sure our model is using data to make predictions in a way that's fair to all of our model's users. To do that we need to open up the black box, and as model builders we need to remember that we're responsible for the predictions our models generate. This may sound really obvious, but we need to take steps to avoid bias, for example by making sure that our training data is representative of the people using our model. Luckily there are a lot of great tools to help you do this. I'm going to look at one open source tool called SHAP, and what SHAP lets us do is interpret the output of any machine learning model, so we'll use it here for our tf.keras model, but we could use it for any type of model. The way SHAP works is that we create what's called an explainer object, passing in our model and a subset of our training data, and SHAP then returns attribution values. Attribution values are positive and negative numbers indicating how much a particular feature impacted our model's prediction. Say we had a model with three features: we'd get back a three-element array of attribution values, where a high positive value means that feature pushed the prediction up and a negative value means it pushed the prediction down.

How does this work for our bag-of-words model? What I configured SHAP to do is treat that 400-element vocabulary array as 400 features, so we get back 400 different attribution scores, and we can take the highest and lowest values to see which words had the most impact on our model. I wrote a little code to do this; I don't know if the colors are showing up well in there, but what it does is take the five highest and five lowest attribution scores so I can see which words our model is using to make predictions. This particular question is about pandas, and the model picked up on the words "column", "dataframe", "df", and "series", which is good because those are all words specific to pandas. The words that contributed least were fairly common words like "for", "each", and "you", which you might find in any Stack Overflow question. So this is a good sign that our model is working correctly.

Let's go back to the demo and take a look at SHAP. It looks like our model finished deploying, so I'm going to go ahead and set that as the default version, and before we get into SHAP we'll generate some predictions on our deployed model. I'm going to write a file with some test prediction instances, and now we'll use the Google Cloud CLI to get predictions from our trained model: we call gcloud ai-platform predict, we pass it our model name, which is our Stack Overflow model, and we pass it our text instances, which is the predictions.txt file we just saved. We'll run that and then print our deployed model's predictions. It looks good: we're calling the model deployed on AI Platform, and it has successfully predicted Keras for the first question and pandas for the second question.

Now let's use SHAP to explain our model. SHAP doesn't come preinstalled in Colab, so we'll install it via pip, and we'll also install a module called colored, which I use to color different words of the text. Once that's done installing, we'll import shap and create our explainer. We'll create what's called a DeepExplainer; SHAP has many different types of explainers, and we're going to use the DeepExplainer and pass it a subset of our training data. Then we'll create our array of attribution values: we call explainer.shap_values and pass it a subset of our test data, in this case just the first 25 examples from our test set. So we've got our attribution values; the next thing we want to do is print them in a nice way. To do that, we take the tokenizer we created in Keras, which is a dictionary, and convert it to an array, so that we can take our attribution values from SHAP and see which word each one corresponds to. We'll convert the tokenizer to a word list, and here we can see the most frequently occurring words in our model's vocabulary. The first thing we want to do with this is print a summary plot, which is a method SHAP provides to see a summary of the words that are most impactful on our model's predictions.
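A sketch of that SHAP setup might look like the following; DeepExplainer and shap_values are real SHAP APIs, while the background sample size and the word-list conversion details are my assumptions:

```python
# Explain the Keras model's predictions with SHAP.
import shap

attrib_data = body_train[:200]                         # background sample of training data
explainer = shap.DeepExplainer(model, attrib_data)
shap_values = explainer.shap_values(body_test[:25])    # attributions for 25 test questions

# Convert the tokenizer's word->index dict into an index-ordered word list so
# each of the 400 attribution scores can be mapped back to a vocabulary word.
word_index = processor._tokenizer.word_index
word_lookup = [word for word, i in sorted(word_index.items(), key=lambda kv: kv[1])]
print(word_lookup[:10])   # the most frequently occurring words
```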
So we'll pass the summary plot our SHAP values, along with our feature names, which is that word list we just created, and our class names, which go all the way back to our tag encoder. If I wrote that correctly, we should get this nice plot, and what it tells us is which words have the highest magnitude impact on our model. It looks like "dataframe" was the word our model used most to make predictions, probably because there were more pandas questions than anything else in our dataset, and we can see that "dataframe" was most impactful for pandas; it was also a signal word, probably a negative one, for the other libraries. Then we can see the other impactful words: "plot" was most important for matplotlib predictions, which makes sense, and we can see the other words down the list. That gives us a nice aggregate view of how our model is making predictions for this dataset.

Now, if we want to highlight individual words, I've written some code to do that, so that for a particular question we can see which words are signaling the prediction. Here's an example of a pandas question: it picked up on words like "column" and "NaN", since a lot of pandas DataFrames have columns with NaN (not a number) values, and common words like "use" and "does" are not being used by our model to make the prediction. For TensorFlow, it's looking at words like "session", "ops", and "tensor", and here's a Keras example where the most important words are "lstm", "layers", "dense", and so on. So now we can see how to use SHAP to make sure that our model is behaving correctly. One way you might want to use this in a fairness context is with a model doing sentiment analysis: you want to make sure it's using the right characteristics to predict positive and negative sentiment in whatever type of text you're working with.

If we could go back to the slides to wrap things up: we're going to put this all together and get predictions from our model in a web app. It would be really useful if, as we're typing a Stack Overflow question, we could see which tags are associated with it. So with the help of my teammate Jen Person (you should follow her on Twitter), we built a Chrome extension. What the Chrome extension does is take the HTML of our Stack Overflow question and pass it to a Cloud Function, which I've written using the Cloud Functions Python runtime. The function uses the AI Platform API to make a prediction with our model, and then the function returns just the names of the high-confidence tags back to our Chrome extension.

Let's see it in action; let's go back to the demo. Here is an example question about matplotlib and pandas, and I've just loaded up my Chrome extension locally... looks like it's not working at the moment, hold on one second. Live demos, you never know. Let's try it once more. All right, something's going on here; I'll try the question draft. All right, not working, but luckily I have a video of what it was supposed to do. Sorry about that, everyone; I always have a backup. So this is what it was supposed to do: it was supposed to predict matplotlib and pandas for this question, and then for a question I drafted based on the code that I wrote, it predicted Keras. I'm sorry the demo didn't work; let's give it one more shot... yeah, okay, looks like it's not working. Anyway, that's what the demo was supposed to do. If we could go back to the slides.
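The Cloud Function code isn't shown in the talk, but a hypothetical sketch of that server side might look like this. The project and model names, the tag ordering, the confidence threshold, and the response shape (assuming the custom prediction class returns a list of per-tag probabilities) are all placeholders rather than the talk's actual function:

```python
# Hypothetical HTTP Cloud Function: receive question text, call the deployed
# AI Platform model, and return only the high-confidence tag names.
import googleapiclient.discovery

TAGS = ['keras', 'matplotlib', 'pandas', 'scikit-learn', 'tensorflow']  # assumed order

def predict_tags(request):
    question_text = request.get_json()['text']
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/YOUR_PROJECT/models/YOUR_MODEL'    # placeholders
    response = service.projects().predict(
        name=name, body={'instances': [question_text]}).execute()
    probs = response['predictions'][0]
    high_confidence = [tag for tag, p in zip(TAGS, probs) if p > 0.7]
    return {'tags': high_confidence}
```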
I covered a lot of different tools today, so I just want to give a summary of everything we built. We built a custom text classification model using tf.keras, and we deployed it on Cloud AI Platform. I should also mention that I could have deployed the SHAP code to AI Platform as well, using the custom code feature, but in this case I decided to keep it simple and do that part separately. We explained our model's output using SHAP to make sure it was behaving the way we expected, and we called AI Platform from a Cloud Function; you saw a video of what that was supposed to do. These are links to all of the code I showed today and to all the different products: the first link goes to a GitHub repo with the code for everything I showed, and the other links go to the various products I covered. Finally, we're working on model interpretability support in Cloud AI Platform, so if you have a use case for it and want to stay in the loop and hear more about it, we'd love to hear from you; please fill out the form at the bit.ly link at the bottom. And finally, please fill out the feedback for this talk in your I/O application; I really do use it a lot to make the talk better the next time. Thanks, everyone.

45 thoughts on “Live Coding A Machine Learning Model from Scratch (Google I/O'19)”

  1. You can find the code from this demo here: https://github.com/GoogleCloudPlatform/ai-platform-text-classifier-shap

    And a blog post about it here: https://sararobinson.dev/2019/04/23/interpret-bag-of-words-models-shap.html

  2. This is amazing and almost instantly got me tuned in. Developed the code to reproduce here
    https://lalit7jain.wordpress.com/2019/05/14/deep-learning-classification-of-stackoverflow-questions-with-interpretability/

    Enjoy!

  3. Can anyone help me with why I'm getting different results in the 'code snippets' search option? E.g. when I type "auth", the "authenticate to your cloud account" option doesn't come up.

  4. Very nice hands-on coding. It would be great if you did this on a regular basis with different kinds of problems.

  5. Edit: I made a Jupyter notebook with the code of this session: https://github.com/kennycontreras/Jupyter-Notebooks/blob/master/ML_model_Google_IO19.ipynb
    Amazing session! Thanks for sharing! I'm an Anaconda Jupyter Notebook user and I liked all the features of Colab a lot. I'm going to give it a try.

  6. Yes, I have a suggestion: if you're going to represent Google and not even YOU can get a working live demo by the end, don't blame it on something fictional. It's a bad look; if the employees can't figure it out, lol, how are the rest of us mortals expected to? Other than that, it was a nice quick overview, and a very well set up notebook demo page too.

  7. Hi, I am from India, and I am working on a project which uses the Dialogflow API, and I realise that Dialogflow has many bugs.

  8. Very informative video. Your explanation of how ML is doing matrix multiplication under the hood was amazing! As were your descriptions of one-hot and bag-of-words encoding. Never heard of SHAP before; looks like something I would find very useful. Thank you!

  9. From model building to deployment in less than an hour! And handling the live demo gotcha gracefully #awesome
