3.5: Mathematics of Gradient Descent – Intelligence and Learning

Hello, it's me, coming to you again from the future. You might recognize me from my fails videos: "Hey, I made some videos with some calculus stuff, and they didn't turn out very well." You can find them if you want; they're kind of unlisted now, but I tried again. This video is a follow-up to my linear regression with gradient descent video. That video stands alone; it's a programming video where all I do is walk through the code for an example that demonstrates linear regression with gradient descent, and it's a puzzle piece in my machine learning series that will hopefully act as a foundation, the building blocks of your understanding, for some more creative or practical examples that will come later. This video is totally optional as part of the series, because over there you just applied the formula; what I try to do here is give some background. I kind of worked it all out; this is the end result, this is what's on the whiteboard. I thought that if I used multiple colored markers it would somehow make a better video; I don't think it really did. So I walk through and try to describe the math. I should say that this involves topics from calculus, and there's a great video series by 3Blue1Brown on YouTube that gives you great background and more depth in calculus, so I'll put links to those videos in this video's description. Honestly, if you're really interested in soaking up as much of this as you can, I would go watch those videos first and then come back here; they'll give you the background for understanding the pieces I've worked out. I look forward to your feedback, positive and negative (constructive!), on whether this was helpful and whether it made sense. And if you keep watching, there will be some future videos where we get back into the code; there's no code in this video, just math.

Okay, here we go. To recap: I have a bunch of data points in 2D space, and I have a line in that 2D space. The formula for that line is y = mx + b. When I try to make a prediction, I get a piece of input data, x, and from there I make a guess. In addition to the guess, I have the known y; this is the correct answer that goes with x. My machine learning system makes a guess, and the error is the difference between those two things: the error is y, the correct answer, minus the guess. This relates to the idea of a cost function, a loss function. If we want to evaluate how our machine learning algorithm is performing, we have this large data set, maybe with n elements, and from 1 to n, for all n elements, we want to minimize that error. The cost function is: cost = sum from i = 1 to n of (y_i − guess_i)^2. This is the total error for the particular model, meaning the current m and b values that describe this particular line. Perhaps we can agree that our goal is to minimize this cost, also known as a loss. We want the lowest error; we want the m and b values with the lowest error; we want to minimize this function. Now, what does it mean to minimize a function? This function, cost = (something)^2, is not that different from me just saying, for a moment, y = x^2. If I take a Cartesian coordinate system and graph y = x^2, it looks like a parabola. I'm drawing in purple now because I've stepped away from the notation of this particular scenario and I'm just talking about a function in general, y = x^2. You could also write this as f(x) = x^2, but I'm graphing y = x^2. So what does it mean to minimize this function?
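(A little sketch of that cost function in code, with made-up data points; the helper name `cost` and the sample data are just for illustration, not the code from the previous video:)

```python
# Sum-of-squared-errors cost for a line y = m*x + b over a data set.
# The data points here are made up for illustration.
def cost(m, b, points):
    total = 0.0
    for x, y in points:
        guess = m * x + b          # the line's prediction for this x
        total += (y - guess) ** 2  # squared difference from the known y
    return total

points = [(0, 1), (1, 3), (2, 5)]  # these happen to lie on y = 2x + 1
print(cost(2, 1, points))          # the perfect line gives zero cost
print(cost(1, 0, points))          # a worse line gives a larger cost
```

The perfect line gives a cost of zero and any other line gives something bigger; minimizing the cost means finding the m and b that push this number as low as it will go.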
I said I want to minimize the loss; I want the smallest error; I want whatever line has the smallest error. Well, what it means to minimize a function is to find the x value that produces the lowest y. For y = x^2, this is about the easiest thing in the world we could ever possibly do; you don't need any calculus or fancy math. The minimum of this function is at zero; I can see it, it's quite obvious. But eventually, in the machine learning systems I'm going to get further into, neural-network-based systems with many dimensions of data, there might be some much harder to describe, crazy function that we're trying to approximate. Of course we could eyeball this one, but part of the point is to mathematically compute exactly where the minimum is, especially if you imagine this not as a single curve but as a bowl, and then we can get into three dimensions and four dimensions and five dimensions, where things get kind of wonky. But if we know the formula for the function, there is another way to find that minimum (minimum, minima, whichever), and that's what I mean when I keep talking about gradient descent. So let's think about what gradient descent means. Let's say I'm looking at this point here, and I'm going to walk along this function. I'm right here, saying: hello, I'm looking for the minimum. Is it over there? Over there? Could somebody help me, please? Can I use my GPS, my Google Maps thing, to find the minimum? How would I find it? Well, if I'm right here, I've got two options: I could go this way, or I could go that way. And if I knew which direction to go, I could also decide whether to take a big step or a little step; there are all sorts of options. So I need to know which way to go and how big of a step to take.
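(A little sketch of that walk-downhill idea on y = x^2 itself, assuming a fixed step size; the variable names are just for illustration. The derivative of x^2 is 2x, and repeatedly stepping against it slides toward the minimum at x = 0:)

```python
# Gradient descent on the simplest possible function, y = x**2.
# The derivative of x**2 is 2*x; stepping opposite the derivative
# moves us downhill toward the minimum at x = 0.
x = 5.0            # an arbitrary starting point on the curve
step_size = 0.1    # how big a step to take each time
for _ in range(100):
    slope = 2 * x               # derivative of x**2 at the current point
    x = x - step_size * slope   # move against the slope (downhill)
print(x)  # ends up very close to 0
```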
There's a way to figure that out, and it's known as the derivative. The derivative is a term that comes from calculus, and I would refer you to 3Blue1Brown's calculus series for more background on what the meaning of a derivative is, how it works, and how to think about these concepts from calculus. But for us, right now, we can think of it as the slope of the graph at this particular point, and a way to describe that is as a tangent line to the graph. If I'm able to compute this line, then I can say: if I go in this direction, the function is going up and I'm moving away from the minimum; if I go in that direction, it's going down and I'm moving toward the minimum. So I want to go down. And you can see that over here the slope is less extreme, so if I'm right here, maybe I don't need to go very far anymore; but if I'm further up, the slope points much more steeply this way, so I should take a bigger step down. This idea of being able to compute this slope, this derivative of the function, tells me how to search for and find the bottom. Okay, so that's the landscape of the puzzle we're trying to solve, and the pieces of that puzzle. But what's the actual part of the code that I'm trying to give you more background on? It's right over here. This is the gradient descent algorithm that I programmed in the previous video, where we looked at every data point, made a guess, got the error (the difference between the known output and the guess), and then adjusted the m and b values. The idea is that as we're training the system (I don't know which color I'm using right now), I want to say m = m + Δm, some change in m, and b = b + Δb. I want to know: what is a way I could change the value of m in y = mx + b in order to make the error smaller?
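(A sketch of that slope idea: even without a formula for the derivative, you can estimate the slope of the tangent line by sampling two nearby points. The helper name `slope_at` is my own, just for illustration:)

```python
# Estimate the slope (derivative) of a function at a point by looking
# at two nearby points -- a numerical stand-in for the tangent line.
def slope_at(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(slope_at(f, 3.0))    # about 6: positive slope, downhill is to the left
print(slope_at(f, -2.0))   # about -4: negative slope, downhill is to the right
```

A positive slope says the function is going up as x increases, so step left to descend; a negative slope says step right.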
So the next step is to find the minimum cost: I want to minimize this function, to find the m and b values with the lowest error. To do that, we've established that gradient descent says: if I can find the derivative of a function, I know which way to move to minimize it. So somehow I need to find the derivative of this cost function to know which way to move. In order to do that, though, I'm going to rewrite the function in a different way. A couple of things. One: I think I made a mistake earlier, because this should actually be guess − y. We were squaring it, so in a way the positive or negative doesn't matter, but I think this is important for later. Technically, the error is guess − y, not y − guess. Okay. So I'm going to call the cost function J, and J is a function of m and b. (Sorry, the error function is something I'm about to call something else; the error function, the loss function, the cost function: J.) Then I'm going to simplify this guess − y and call it the error. I'm also going to take out the summation. The summation is kind of important, but this has to do with the stochastic versus batch gradient descent that I talked about in the previous video: do I want to get the error over everything, or do I want to look at each error one at a time? Let's simplify things and say we're looking at each error one at a time. So now I can say J = error^2. I have essentially rewritten and simplified this function: the cost J equals the error, the guess minus y, squared. What I want to do is find the derivative of J relative to m. I want to know: how do I minimize J? How does J change when m changes? dJ/dm.
Again, I recommend you go check out some of the 3Blue1Brown calculus videos, which will give you more background here, but what I'm actually going to need to do is use two rules from calculus. (I'm looking for another pen color for no reason.) I need to use the power rule, that's one, and I need to use the chain rule. Let me establish what the power rule is really quickly. If I have a function like f(x) = x^n, the power rule says that the derivative is n times x^(n−1). That's the power rule. So I'm going to apply that here (I don't know why I'm in purple now) and get 2 times the error to the first power; the power rule gives me 2 × error. Okay, but I also need the chain rule; I'm not done. Why do I need the chain rule? Well, the chain rule is a rule... I'm going to erase this over here and use another marker, because somehow all these multiple colored markers will make this make sense. The chain rule states: let's say I have a function (can you see this orange?) y = x^2, and I have another function x = z^2. So y depends on x, and x depends on z. What the chain rule says is that if I want the derivative of y relative to z, I can take the derivative of y relative to x, which is 2x, and multiply that by the derivative of x relative to z, which is 2z. I can chain derivatives: I can take the derivative of one thing relative to something, times the derivative of that something relative to something else. And that's actually, weirdly, what's going on here, even if it's not immediately apparent to you: J is a function of the error, and the error is a function of m and b, because I'm computing the error as the guess, mx + b, minus the known y. So here I can take this derivative, 2 × error, and multiply it by the derivative of the error function itself relative to m, because I'm trying to get Δm. (I could also do it relative to b when I want Δb.)
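(A sketch checking that chain rule example numerically, with z = 1.5 picked arbitrarily: y = x^2 with x = z^2 means dy/dz = 2x · 2z = 4z^3, and a finite difference agrees:)

```python
# The chain rule example from the whiteboard: y = x**2 where x = z**2.
# The chain rule says dy/dz = (dy/dx) * (dx/dz) = 2*x * 2*z = 4*z**3.
def y_of_z(z):
    x = z ** 2
    return x ** 2          # so overall y = z**4

z = 1.5
chain_rule = 4 * z ** 3    # 2*x * 2*z with x = z**2 substituted in

h = 1e-6                   # compare against a tiny numerical nudge
numeric = (y_of_z(z + h) - y_of_z(z - h)) / (2 * h)
print(chain_rule, numeric) # both about 13.5
```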
And this has to do with partial derivatives. You see, there are so many concepts baked into this; it's a lot. Again, I'm sitting here thinking maybe this was all just a bad idea. Okay, but this part is actually quite simple to work out, and I'm going to do that for you right now. I'm going to get the black marker, and now I want the derivative of the error relative to m. Well, what is this, actually? If I unpack this function: the guess is mx + b, so error = mx + b − y. When I say partial derivative, meaning the derivative relative to m, what I mean is that everything else is a constant: x is a constant, b is a constant, y is a constant. (x and y are actually already constants in a sense, because x is the input data and y is the known output result.) So really I should write this as x·m + b − y. The derivative of this? The power rule says 1 × x × m^0, which is just x, and the derivative of a constant is 0, because a constant doesn't change, and a derivative describes how something changes. So guess what: it's just x, meaning this whole thing turns out to be just 2 × error × x. And guess what happens to this 2: the whole point, if you watched the previous video, is that we're going to take this and multiply it by something called a learning rate. We know the direction to go (this is giving us the direction that minimizes the error, that cost), but do I want to take a big step or a little step? Well, if I'm going to multiply by a learning rate anyway, the 2 sort of has no point; I could just have a learning rate that's twice as big or half as big. So ultimately, this is all it is: error × x. All of this math and craziness with the power rule and chain rule and partial derivative bits, it all boils down to error × x. That's what goes in Δm.
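(A sketch checking that partial derivative numerically, with one made-up data point: nudging m a tiny bit and measuring how the cost changes should agree with 2 × error × x:)

```python
# Check that d/dm of (m*x + b - y)**2 equals 2 * error * x,
# holding x, b, and y fixed. The data point is made up.
x, y = 3.0, 7.0     # one (input, known output) pair
m, b = 0.5, 0.2     # current line parameters

error = (m * x + b) - y     # guess - y, as redefined above
analytic = 2 * error * x    # the chain-rule / power-rule result

h = 1e-6                    # nudge m and measure how J changes
J = lambda m_: ((m_ * x + b) - y) ** 2
numeric = (J(m + h) - J(m - h)) / (2 * h)
print(analytic, numeric)    # the two agree closely
```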
And guess what? Let's go back over to our code, and we can see: there it is, error × x. Error times x! There we go. That's it; that's why it says error times x. Okay, that was a lot, but... success! Even though it was not the best explanation and there are lots of confusing bits and pieces, I feel very happy to have arrived there. This was useful for me; just making this video makes me feel like something happened today. Okay, so a couple of things I want to mention here, a way I can make this make a little bit more sense. Just to clarify this chain rule thing a little better (thank you to a viewer in the Slack channel): what I'm looking for is the derivative of the cost function relative to m. What happens when I change the m value; what does that do to the cost? And the chain rule says that if I take the derivative of that function relative to the error, I can multiply it by the derivative of the error relative to m. So this is the chain rule: dJ/dm = (dJ/d error) × (d error/dm), which is 2 × error × x, and that's where I'm getting all this stuff. This is one way of looking at it, and you can see it's kind of like the numerator and denominator cancel each other out, so it makes sense. The other thing is: what if I did this whole thing again, but took the derivative down here relative to b, b instead of m? What do I get? Well, the mx term is now a constant, so it becomes zero; the y term is a constant, so it becomes zero. And what about b itself? I apply the power rule: 1 × b^0, which is just 1. So this becomes error × 1 rather than error × x. (Look at this mess that I wrote here; can we please end this video with this at least written in very nice handwriting?) So when it's relative to m, it was 2 × error × x.
But when it's relative to b, it's 2 × error × 1. And again, we can get rid of the 2, so it's really just error × x, or error × 1. And if I come back over to the code again: there you go. m changes by error × x; b changes by just the error. So hopefully that gives you some more background as to why these formulas exist the way they do. And as I go forward into session 4, where I'm going to build a neural network model for learning, you're going to see this formula over and over again: change the weight. Instead of saying m and b, I'm going to say the weight, and the weight changes based on the input multiplied by the error. There are going to be a lot of other pieces, but this formula is going to be everywhere. So I hope this was helpful. There are a lot of things I've glossed over here in terms of background: what really is a derivative, why does calculus exist, why does the chain rule work the way it works, why does the power rule work the way it works, and what was that partial derivative move I did? So again, take a look at this video's description; I'm going to point you toward resources and tutorials that dive into each of those components a bit more deeply, but hopefully this gives you some semblance of the overall picture. Okay, thanks for watching, and, I don't know, maybe you want to like or subscribe; if not, I honestly totally, totally understand. If you give it a thumbs down, I get it. Okay, I'll see you in a future video, maybe. Okay, goodbye.
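(And a final sketch putting it all together as a tiny training loop, like the one from the previous video but rewritten here with made-up data lying on y = 2x + 1. It uses the code's convention error = y − guess and adds the updates, which is the same as subtracting with error = guess − y; the variable names are just for illustration:)

```python
# One-variable linear regression trained by stochastic gradient descent:
# m changes by error * x, b changes by error, each scaled by a learning rate.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # on y = 2x + 1
m, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):           # many passes over the data
    for x, y in points:
        guess = m * x + b
        error = y - guess       # known answer minus the guess
        m += learning_rate * error * x   # the error-times-x update
        b += learning_rate * error       # the error-times-1 update

print(m, b)   # settles near m = 2, b = 1
```

With enough passes, m and b settle near 2 and 1, the line that makes the cost as small as it can be.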

42 thoughts on “3.5: Mathematics of Gradient Descent – Intelligence and Learning”

  1. If you want a clean version of what is drawn on the board, check this neat article :

  2. Daniel, I have been following you since you had 2000 subs. I have always enjoyed your videos, man. I started learning Deep Learning on my own and got stuck at understanding gradient descent, and since I know it is the backbone of ML and DL, I want to know it deeply. I watched around 3 videos before this one, and your video just explains it beautifully. Thanks for this video, it helped me a lot. Please keep making these kinds of videos that explain the math behind ML and DL algorithms, and again, thank you for your videos. 🙂 I am gonna follow you more and more from now on. If it's possible, try to make an awesome course on Udemy with the math and programming of ML and DL. Thank you again.

  3. This is a clearer explanation than Professor Ng's explanation in his machine learning video series. Ng denotes m and b as theta0 and theta1. He also reverses the terms in his line equation which confuses the Hell out of everybody. In addition, he doesn't take you through how the partial derivative is worked out and he doesn't show the code. A great explanation in only 22 minutes.

  4. 10:07: Mathematically, m = m + delta m does not make sense if delta m is not equal to zero. Updating a variable like this comes from programming languages. Rather, use indices to discriminate between the updated m and the prior m to make it mathematically correct and not mix up mathematics and programming languages. Great video though!

  5. Usually, math makes me cry but while watching this I am learning and laughing at the same time. How cool is that? Lol. All thanks to you, bro! Keep the good work on. Cheers!!

  6. Unexpected (,/'_'.)

  7. I have a question: why do you add the error to m, and not subtract the error from m? After all, the loss function uses prediction – y, so shouldn't you subtract the error from m?

  8. Thanks for helping me arrive there! I did not realize the partial derivative would end up just being x!

  9. Thank you for the nice explanation! Unfortunately, I was a little bit confused by the m = m + derivative*learning_rate equation. Why does it have to eventually lead us to the minimum? I mean, I'm pretty sure it does (given the learning rate is small enough), but it doesn't feel crystal clear to me unless I see some kind of example or proof.

  10. When calculating the minimum of the cost function, couldn't we just calculate the stationary points –> pick the one with the lowest value of y –> and ensure the second derivative at that point x is positive?

  11. Going through Andrew Ng's Coursera… got stuck on how the Cost Function derivatives/partial derivatives are obtained…. 11:00 and on… Oh… MY… GOSH… this is GOLD!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Thank you so much!

  12. Please explain, at 10:45, why you corrected the formula for the error. It was correctly specified: error = actual observed y – guess, and not guess – y.

  13. I have one question though.

    At exactly the 11:58 timestamp of the video, you set the summation of (guess – y)^2 to be equal to error^2.

    I don't really understand what happened there, or whether that is mathematically correct.

Leave a Reply

Your email address will not be published. Required fields are marked *