3.4: Linear Regression with Gradient Descent – Intelligence and Learning



Hello! Okay, so this is another video in my series about linear regression. Now, why are you watching these videos? Not for my shirt, but for the topic: the skills here, I hope, are laying a foundation for what I'm going to get to in future videos, which is building a neural-network-based machine learning system.

So, at the top of this video: why am I making another video about linear regression? In the previous two videos I created a p5.js sketch which implements linear regression using the ordinary least squares method, a statistical approach. There are a whole bunch of data points in the space, and I try to fit a line to that data as well as possible, so that I can predict new data points in this space; you can see, as I start to click around, how the fitted line changes. I also discussed a little bit of whether linear regression even makes sense based on your data, and these are big, important questions when working in data science and machine learning, but right now we're just trying to focus on the techniques.

One thing you'll notice here is that if I refresh this page and click twice, I get a line instantly, because I'm actually calculating the perfect, exact best-fit line according to the least squares method. But someday we will have a dataset that's not two-dimensional; someday we will have a dataset with hundreds of dimensions, a big, many-dimensional dataset, and in that case there isn't going to be an easy statistical approach that can create a model fitting the data perfectly in a single calculation. This is the problem that machine learning, neural-network-based, deep-learning-based systems are here to solve: to figure out a way to create a model that fits a given dataset. And one technique for doing that, which is different from, say, ordinary least squares, is called gradient descent. What gradient descent essentially does is say: let me make a guess; I'm just going to put the line here, and then see, is that line good or not so good? Let me try shifting it a little closer to the data. Let me shift it again. So: making lots of little nudges and tweaks to what that line is doing.

I think I have a better way of explaining that, so I'm going to come over to the whiteboard. I'm going to do a little magic here, which is that I'm going to stand right over here, snap my fingers, and the moment I snap my fingers the whiteboard behind me will be erased. Wow, that worked! That's kind of interesting. Okay, so I'm going to make the same kind of diagram that I've made a few times now, and we're going to really simplify it. We have this two-dimensional space; there is some piece of data, we'll call it x, for example the temperature outside today, and we're trying to predict some outcome based on that temperature; we'll call that y, maybe the sales of ice cream. I actually saw an interesting dataset about the frequency at which crickets chirp according to the temperature outside; that's a dataset you can find online somewhere. So maybe there are some existing data points based on an ice cream store that we have studied, and I can graph that data.
The idea here is that we have our machine learning recipe (I know I'm out of the frame here): we're going to take one of our inputs, called x, feed it into the machine learning recipe, and the recipe is going to give us a prediction, y. We have known data, and if we had new input data we could make a guess. Yesterday my machine learning recipe was the ordinary least squares method, meaning I was able to do a statistical analysis of all this data and create the line of best fit; then, if I had a new x value of such-and-such, I could look up its corresponding spot on the line, and that would be the y output. This is a function: the machine learning recipe is essentially solving for m and b in the equation of a line, y = mx + b. So that's what I did yesterday.

Today I want to demonstrate the technique known as gradient descent. So, the idea of gradient descent... boy, there is so much to say; where should I start, where should I end? One thing I'll mention is that the math required for gradient descent typically involves calculus, and in particular two concepts: one called a partial derivative (and if you don't know calculus or what a derivative is, how can you be expected to know what a partial derivative is?), and the other called the chain rule. What I'm going to do is walk through this entire system and how it works and explain it without diving deeper into the math, but I will make a follow-up video where I discuss some of those pieces in a bit more detail.

Here's a way you can think about gradient descent that's related to stuff I've done in previous videos and in my book The Nature of Code, where I reference Craig Reynolds' work on steering behaviors. Think about this for a second. Let's say you have a two-dimensional space, and you have a vehicle, an agent, moving around this space, and the vehicle has a particular velocity, expressed as a vector, an arrow in this case. Now, what if the goal of this vehicle is to reach a target? We could say that this vehicle has a desired velocity: to move at maximum speed from its current location towards the target. So the desired velocity is also a vector, and it's useful to think about the difference. The vehicle's current velocity is like its guess: "I don't know where I should go; I'm going to try going this way. Oh, but really I should go that way. Well, if I'm going this way but I really should go that way, what if I just turn a little bit towards the target? What if I were to steer a little bit in that direction?" This is what gradient descent does. You can think of the desired velocity as the known output, the correct output. What if I feed in one of these data points? Take this particular (x, y) pair, feed the x in, and get a guess (sometimes written as y with a little hat or tick mark). The error is the difference between what the output should actually be and what I guessed it would be. And you'll notice, if you look at Craig Reynolds' steering behaviors and all of those animated systems that I implemented from that work, there's a formula in there: steering = desired - velocity. The guess plays the role of the velocity, and the known output plays the role of the desired. The point is the same: the difference between the way that I should go and the way that I am going, that's the error; the difference between what my machine learning recipe, my model, currently thinks the output should be and the known output, that is the error.
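To make that concrete in code, here's a minimal sketch of the "guess and measure the error" step. The variables m and b match the ones in the video's p5.js sketch, but the helper names predict and errorFor are my own, added just for illustration.

    // The model is the equation of a line: y = mx + b.
    // m (slope) and b (y-intercept) are the parameters we want to learn.
    let m = 1;
    let b = 0;

    // Make a prediction (a "guess") for a given input x.
    function predict(x) {
      return m * x + b;
    }

    // The error is the known output minus the guess,
    // analogous to steering = desired - velocity.
    function errorFor(x, y) {
      return y - predict(x);
    }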
And just as with steering: if I adjust my velocity, if I steer towards the desired, I'm going to move towards the target; if I use this error to tweak the parameters of the machine learning recipe, I'm going to have a better model, better m and b values, for the next time. And I can do this over and over and over again. This is what we've been calling supervised learning: I take the known data, send it in, get a guess, look at the error, tweak the knobs; send in the next data point, get a guess, look at the error, tweak the knobs; over and over again. I can just start with random values for m and b, put a line anywhere, and then start moving the line around according to the error as I go through all the data. So this is what we're trying to do.

Now, there's more to how the math behind this works and how we look at the overall error, and there's some stuff that involves the derivative and the slope of the graph of the error; I'm going to come back to some of that in a second video where I go a bit further into the math. But right now I'm going to show you how to set up gradient descent in the code itself. So let me come over here. As you saw before, this is the example from yesterday that uses the ordinary least squares method. I had this function, linearRegression, which calculates the slope of the line, m, and the y-intercept, b, according to ordinary least squares. I'm just going to completely get rid of that, so now nothing happens there when I click.

For the first guess of the line I just plugged in some values. Typically, I think, these values are initialized at 0; they're weights, so to speak, and ultimately you can see they're analogous to the weights of connections in a neural network. I could start m and b randomly, I could pick something and hard-code it, or I could let them both be 0. I'm actually going to stick with 1 and 0, just so I can at least see that the line is there. Now what I want to do is look at the error for the existing data points and adjust m and b in the direction of the error. So let's see how that goes. I'm going to rename this function gradientDescent, and in the draw function I'm going to call gradientDescent. (I went off on a little digression there that has been edited out; thanks for tuning in.) So where I am is that I've changed the name of the function to gradientDescent, and I'm going to loop through all of the data. For each data point at index i, I can get the x and the y, and then I can actually calculate a guess: my guess is m times x plus b. This is my machine learning recipe: I take the input data x, multiply it by m, add b, and that's my guess. So now my error equals y minus the guess... though, technically speaking, I think I should be saying guess minus y.
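Reconstructed from this walkthrough, the function at this point looks roughly like the sketch below. I'm assuming the data points live in an array called data with x and y properties; that detail isn't spelled out in the video, and it's the shape of the loop that matters. (The y minus guess direction turns out to be the right one, as the video confirms a couple of minutes later.)

    // One pass over all the data points: for each point,
    // make a guess with the current m and b and measure the error.
    function gradientDescent() {
      for (let i = 0; i < data.length; i++) {
        let x = data[i].x;
        let y = data[i].y;
        let guess = m * x + b;   // the machine learning recipe
        let error = y - guess;   // known output minus prediction
        // next step: use the error to nudge m and b (see below)
      }
    }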
Now, you may recall that in the ordinary least squares method I would always square the error, because I wanted to get rid of its positive or negative aspect. In this case (and again, I'm going to go a little further into this in the next video) I actually want the positive or negative direction of the error, because I want to know which way, in essence, to tune the m and b values to get a better result. Updating with every individual data point like this, by the way, is what's known as stochastic gradient descent. So I want to make a change to m and b; I need to calculate how I should change m and how I should change b. Really what I'm saying is: m equals m plus some amount of change, and b equals b plus some amount of change. Here's one way to think about and understand it. I have this error; who is responsible, who is to blame here? Is it m? Is it b? Who's in charge here? I've got to figure this out. In essence, if I adjust those values according to the error, maybe the next time around I'll get a better result. And b can be adjusted directly by the error, because it's just the y-intercept: should I move the line up or down? And m, which is the slope, can be adjusted by the error but also according to the input value itself. So this is how you can intuitively understand it: I want to adjust both values according to the error; the slope also relates to what the input actually was; the y-intercept, just the error itself.

Now, I'm missing a whole bunch of steps and a few pieces of explanation here, but let's just run this and see what happens. First I always have to click... okay, well, first of all I got an error: an uncaught ReferenceError in gradientDescent, because I typed n somewhere instead of m, and it doesn't know what n is. Let me fix that. Okay... now I don't know where that line went; it was there for a second and then it shot far away. So here's the thing. Coming back to my steering analogy: one of the things in the steering behavior examples from The Nature of Code and Craig Reynolds' work is a variable called maximum force. Because one thing you might think about here is: I know what the error is between the way I'm going and the way I want to go, but how powerful is my ability to turn? Maybe I'm able to turn with infinite power, and that might sound good, but it's not so good, because if I push too hard I might swing all the way past the target in one direction, oh no, now I'm going the wrong way, and then swing all the way back in the other direction. Maybe I just want to make little adjustments: if it's the wrong way, make a slight correction. I don't want to overshoot the target, the target here being the parameters, the weights, the m and b values that minimize the error; I don't want to overshoot that optimal value. And that is where a variable sometimes called alpha, but most commonly called the learning rate, comes in. I can have a variable called learningRate, usually a small number, something to really reduce the size of that adjustment. So I'll take this change in the value of the slope and multiply it by the learning rate, and take the change for b and multiply it by the learning rate. Okay, so now I'm going to try this again with a learning rate of 0.001.
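With the learning rate folded in, the update step ends up something like this sketch. A commenter below points out that the sketch also maps the x and y coordinates into the range 0 to 1, which is part of why updates of this size behave well; I'm assuming that mapping has already happened here.

    let learningRate = 0.001;  // small, so each nudge stays small

    function gradientDescent() {
      for (let i = 0; i < data.length; i++) {
        let x = data[i].x;
        let y = data[i].y;
        let guess = m * x + b;
        let error = y - guess;
        // The slope's adjustment also depends on the input x;
        // the y-intercept is adjusted by the error alone.
        m = m + error * x * learningRate;
        b = b + error * learningRate;
      }
    }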
Hey, that doesn't look right; come back to me! Okay, so let's think about what might be wrong here. I wrote guess minus y, but on the whiteboard I wrote y minus the guess; I knew it was y minus guess all along, so hopefully you weren't watching that part. Look at the steering case: if I want to move towards the target, the error is the desired, the known result, minus the velocity, so this should really be y minus guess; I want to move in that direction. Let me change that... wait, I changed it already? How did I do that? I must have fixed it before I went to explain it. Let's try this. That looks pretty good, right?

Now, here's the thing. Let me put m and b back to their starting values and hit refresh. Interestingly enough, this isn't quite the correct line, because the line should really go through those two points. I think I've got an issue here with the learning rate. You can see how it was moving towards the right spot, but it's still making very, very small changes, and with only two points there's not a lot of data, not a lot of time for it to change. I probably just need a larger, higher learning rate for this demonstration. Let's make it 0.05, and we can see now it's moving much more quickly, and it's starting to turn, albeit very slowly. You can see it slowly, slowly turning, approaching the correct, the optimal spot for this line, and if I were to click again and again, a lot up here and a lot down here, ultimately, eventually, I should start getting the line of best fit.

Now, there are strategies where, in theory, you adjust the learning rate over time. There's a technique you'll see in a lot of machine learning systems called annealing (I think that's the right word) where you start with a high learning rate and then slowly, over time, reduce it, so you get some big corrections at the beginning and then smaller corrections later on.
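As a rough illustration of that annealing idea: the video doesn't give a formula, so the decay schedule below is just one common choice, with made-up constants.

    // Start with a relatively large learning rate and shrink it over time,
    // so early nudges are big and later ones are fine corrections.
    let learningRate = 0.05;
    const decay = 0.001;  // hypothetical decay constant
    let iterations = 0;

    function annealedLearningRate() {
      iterations++;
      return learningRate / (1 + decay * iterations);
    }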
Okay, so some folks in the chat were asking about this: it performs sort of weirdly if I put a lot of points above and below each other, but if I put points to the left and right it fits the line very nicely. Here's the thing: a lot of vertically stacked points, where one x value maps to many different y values, isn't really good; that data doesn't make sense for linear regression, so if I try to make a prediction we're not necessarily going to get a good line. And part of what I'm doing, again, is not to demonstrate the optimal way to do linear regression but to demonstrate the technique of gradient descent: making small adjustments to weights, to parameters, to the slope and y-intercept, based on an error, based on the supervised learning process.

So this is a start. You could stop here, and I highly recommend you do, but what I'm going to do in the next video (I don't really know how it's going to go, to be honest) is look a little more closely at why this works out the way it does. How do I know exactly how to change m and b to minimize the error? I said, well, the error kind of gives us the direction in which to change, and this has to do with calculus; it has to do with comparing how changing one variable affects another. If I change m, how does that change the error? Can I look at the slope of a graph, perhaps, to see how to move along that graph to minimize that error? That's what I'm going to cover a bit more in the next video. I may also change the code: as I said, this is stochastic gradient descent, meaning I'm adjusting the weights, the m and b values, with every single data point. But I could also look at the error in totality and then adjust the weights all at once at the end of one cycle through all of the data, and that's known as batch gradient descent. So what I'm going to do is explain a bit more about the math and then change the code to batch gradient descent in the next video. It might be many parts, to be honest with you; I don't know how it's going to go. Maybe that next video won't even exist; you can check and see if it's there, because I don't know if I should really make it. Okay, see you soon; thanks for watching!
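For reference, here's a minimal sketch of the batch variant described above: it accumulates the error over all points and applies a single combined update per pass. Averaging the accumulated error is my assumption; the video only says the weights are adjusted once at the end of a cycle through the data.

    // Batch gradient descent: one combined update per pass through
    // the data, instead of one update per point.
    function batchGradientDescent() {
      let sumM = 0;
      let sumB = 0;
      for (let i = 0; i < data.length; i++) {
        let error = data[i].y - (m * data[i].x + b);
        sumM += error * data[i].x;
        sumB += error;
      }
      // Average the accumulated nudges, then apply them in one update.
      m = m + (sumM / data.length) * learningRate;
      b = b + (sumB / data.length) * learningRate;
    }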

22 thoughts on “3.4: Linear Regression with Gradient Descent – Intelligence and Learning”

  1. For a more complete and in-depth discussion of linear regression with gradient descent, check out Stanford Professor Andrew Ng's series of machine learning videos: https://www.youtube.com/watch?v=PPLop4L2eGk&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN

  2. I tried to build a gradient descent algorithm from scratch. Why isn't mine working? Here's my code:

    for i in range(4):
        ypred = m * x + b
        error = (ypred - y) ** 2
        m = m - (0.001 * error)
        b = b - (0.001 * error)
        m = m.sum()
        b = b.sum()

    #My 'm' and 'b' values decrease infinitely

  3. Awesome, cool... What a teaching style! I really love it. You made my day by helping me understand linear regression with a simple story. Really love you, man.

  4. You need separate learning rates for m and b. Then set the learning rate for b higher than the one for m so it would rotate faster, but move up and down slower.

  5. Nice video! You can also check out linear regression using TensorFlow here: https://www.youtube.com/watch?v=PGm8pLp7T40

  6. Would you have two separate learning rates for m and b? Seems like weighting the slope change higher could be beneficial.

  7. Could we have an explanation of why the x, y coordinates were mapped between 0 and 1? Without that, it doesn't work at all unless I change the learning rate to something really small like 10^-6. But then the b value changes way too slowly, so I had to use two learning rates, with m's learning rate being 10^-6 and b's learning rate being 0.05, to achieve the same results. I'm surprised how effective the mapping was at solving this problem. When I tried other similar algorithms that took the average error of all the points (which is meant to be better), they didn't work either unless I did the mapping trick.

  8. I'm sure this video could be condensed without losing any real info. I didn't have the patience to see it through even halfway.

  9. Can someone explain why this is correct?
    m = m + (error * x) * learning rate;
    I mean, how is it dimensionally correct? Shouldn't the error be divided by x, so that m is adjusted by something of the same type as m?
