Lesson 8 (2019) – Deep Learning from the Foundations

Welcome back to part two of what was previously called Practical Deep Learning for Coders. Part two is not called that, as you will see; it's called Deep Learning from the Foundations. It's lesson 8 because it's lesson 8 of the full journey: lesson 1 of part 2, or lesson 8 mod 7 as we sometimes call it. I know a lot of you do every year's course and keep coming back. For those of you doing that, this will not look at all familiar to you. It's a very different kind of part 2. We're really excited about it and hope you like it as well.

The basic idea of Deep Learning from the Foundations is that we are going to implement much of the fastai library from the foundations. I'll talk about exactly what I mean by foundations in a moment, but it basically means from scratch. So we'll be looking at basic matrix calculus, creating a training loop from scratch, creating an optimizer from scratch, and lots of different layers and architectures and so forth. And not just to create some kind of dumbed-down library that's not useful for anything, but to actually build from scratch something you can train cutting-edge, world-class models with. So that's the goal. We've never done it before, and I don't think anybody's ever done this before, so I don't exactly know how far we'll get. But this is the journey that we're on; we'll see how we go.

In the process we will have to read and implement papers, because the fastai library is full of implemented papers. You're not going to be able to do this if you're not reading and implementing papers along the way. We'll be implementing much of PyTorch as well, as you'll see. We'll also be going deeper into solving some applications that are not fully baked into the fastai library yet, so they're going to require a lot of custom work: things like object detection, sequence to sequence with attention, Transformer and Transformer-XL, CycleGAN, audio, stuff like that. I'll
also be doing a deeper dive into some performance considerations, like distributed multi-GPU training, using the new just-in-time compiler, which we'll just call JIT from now on, CUDA and C++, stuff like that. So that's the first five lessons, and then the last two lessons will be implementing some subset of that in Swift.

So this is otherwise known as Impractical Deep Learning for Coders, because really none of this is stuff that you're going to go and use right away. It's kind of the opposite of part one. Part one was like: we've been spending 20 minutes on this, and you can now create a world-class vision classification model. This is not that, because you already know how to do that. Back in the earlier years, part 2 used to be more of the same thing, but with more advanced types of model and more advanced architectures. There are a couple of reasons we've changed this year. The first is that so many papers come out now, because this whole area has increased in scale so quickly, that I can't pick out for you the 12 papers to do in the next seven weeks that you really need to know. There are too many. It's also kind of pointless, because once you get into it you realize that all the papers pretty much say minor variations on the same thing. So instead, what I want to be able to do is show you the foundations that let you read the twelve papers you care about and realize: oh, that's just that thing with this minor tweak, and I now have all the tools I need to implement that, test it and experiment with it. So that's a really key reason why we want to go in this direction.

Also, we used to call part 2 Cutting Edge Deep Learning for Coders, but it's increasingly clear that the cutting edge of deep learning is really about engineering, not about papers. The difference between really effective people in deep learning and the rest is really
about who can make things in code that work properly, and there are very few of those people. So the goal of this part two is to deepen your practice, so you can understand the things that you care about, build the things you care about, and have them work and perform at a reasonable speed. That's where we're trying to head. And it's impractical in the sense that none of these are things that you're probably going to go straight away and say, here's this thing I built. Particularly the Swift part, because with Swift we're actually going to be learning a language and a library that, as you'll see, is far from ready for use. I'll describe why we're doing that in a moment.

Part 1 of this course was top-down, so that you got the context you needed to understand, the motivation you needed to keep going, and the results you needed to make it useful. But bottom-up is useful too, and we started doing some bottom-up at the end of part one. When you've built everything from the bottom yourself, you can see the connections between all the different things; you can see they're all variations of the same thing. Then you can customize: rather than picking algorithm A or algorithm B, you create your own algorithm to solve your own problem, doing just the things you need it to do. And then you can make sure that it performs well, and that you can debug it, profile it and maintain it, because you understand all of the pieces.

Normally when people say bottom-up in this field, they mean bottom-up with math. I don't mean that. I mean bottom-up with code. So today, step one will be to implement matrix multiplication from scratch in Python. That's because bottom-up with code means that you can experiment really deeply on every part of the system. You can see exactly what's going in and exactly what's coming out, and you can figure
out why your model's not training well, or why it's slow, or why it's giving the wrong answer, or whatever.

So why Swift? What are these two lessons about? To be clear, we are only talking about the last two lessons. Our focus, as I'll describe, is still very much Python and PyTorch. But there's something very exciting going on. The first exciting thing is this guy's face you see here: Chris Lattner. Chris is unique, as far as I know, in being somebody who has built what I think is the world's most widely used compiler framework, LLVM; he's built the default C and C++ compiler for Mac, Clang; and he's built what is probably the world's fastest-growing fairly new computer language, Swift. And he's now dedicating his life to deep learning. We haven't had somebody from that world come into our world before. When you look at the internals of something like TensorFlow, it looks like something that was built by a bunch of deep learning people, not by a bunch of compiler people.

I've been wanting for over 20 years for there to be a good numerical programming language built by somebody who really gets programming languages, and it's never happened. In the early days it was Lisp-Stat in Lisp, then it was R, and then it was Python. None of these languages were built to be good at data analysis. They weren't built by people who really deeply understood compilers, and they certainly weren't built for today's modern, highly parallel processor situation. But Swift was. Swift is. So we've got this unique situation where, for the first time, a really widely used language, a really well designed language from the ground up, is actually being targeted towards numeric programming and deep learning. There's no way I'm missing out on that boat, and I don't want you to miss out on it either. And I should mention
there's another language which you could possibly put in there, which is a language called Julia. It has maybe as much potential, but it's about ten times less used than Swift and doesn't have the same level of community. I would still say it's super exciting. So maybe there are two languages where you might want to seriously consider picking one and spending some time with it. Julia is actually further along; Swift is very early days in this world, but that's one of the things I'm excited about.

I spent some time over the Christmas break digging into numeric programming in Swift, and I was delighted to find that I could create code from scratch that was competitive with the fastest hand-tuned vendor linear algebra libraries. Even though I was, and remain, pretty incompetent at Swift, I found it was a really delightful language. It was expressive and concise, but also very performant, and I could write everything in Swift, rather than having to get to some layer where it's like, oh, that's cuDNN now, or that's MKL now, or whatever. So that got me pretty enthusiastic. And the really exciting news, as I'm sure you've heard, is that Chris Lattner himself is going to come and join us for the last two lessons, and we're going to teach Swift for deep learning together.

Swift for deep learning means Swift for TensorFlow. That's specifically the library that Chris and his team at Google are working on. I called it S4TF when I wrote it down, because I couldn't be bothered typing Swift for TensorFlow every time. Swift for TensorFlow has some pros and cons, PyTorch has some pros and cons, and interestingly they're the opposite of each other. PyTorch's and Python's pros are that you can get stuff done right now, with this amazing ecosystem and fantastic documentation and tutorials. It's just a really great practical system for solving problems. And to be clear,
Swift for TensorFlow is not any of those things right now. It's really early. Almost nothing works. You have to learn a whole new language if you don't know Swift already. There's very little ecosystem. I'm not really talking about Swift in particular here, but about Swift for TensorFlow, Swift for deep learning, and even Swift for numeric programming. I was kind of surprised when I got into it to find there was hardly any documentation about Swift for numeric programming, even though I was pretty delighted by the experience. People have had this view that Swift is for iPhone programming; I guess that's how it was marketed. But actually it's an incredibly well designed, incredibly powerful language.

And then TensorFlow. To be honest, I'm not a huge fan of TensorFlow in general; if I was, we wouldn't have switched away from it. But it's getting a lot better. TensorFlow 2 is certainly improving, and the bits of it I particularly don't like are largely the bits that Swift for TensorFlow will avoid. Long term, there's this fantastic new compiler project called MLIR, which Chris is also co-leading, and which I think has the potential to allow Swift to replace most of the yucky bits, or maybe even all of the yucky bits, of TensorFlow with stuff where Swift is actually talking directly to LLVM. You'll be hearing a lot more about LLVM in the last two lessons, but basically it's the compiler infrastructure that kind of everybody uses: Julia uses it, Clang uses it. Swift is almost a thin layer on top of it, so when you write stuff in Swift it's really easy for LLVM to compile it down to super fast, optimized code. That is the opposite of Python. With Python, as you'll see today, we almost never actually write Python code. We write code in Python that gets
turned into some other language or library, and that's what gets run. This impedance mismatch between what I'm trying to write and what actually gets run makes it very hard to do the kind of deep dives that we're going to do in this course, as you'll see. It's a frustrating experience.

So I'm excited about getting involved in these very early days of impractical deep learning in Swift for TensorFlow, because it means that I, and those of you who want to follow along, can be the pioneers in something that I think is going to take over this field. We'll be the first in there. We'll be the ones who understand it really well, and in your portfolio you can actually point at things and say: that library that everybody uses, I wrote that; or this piece of documentation on the Swift for TensorFlow website, I wrote that. That's the opportunity that you have.

So let's put that aside for the next five weeks, and let's try to create a really high bar for the Swift for TensorFlow team to have to re-implement in six weeks' time. We're going to try to implement as much of fastai and as many parts of PyTorch as we can, and then see if the Swift for TensorFlow team can help us build that in Swift in five weeks' time.

So the goal is to recreate fastai from the foundations, and much of PyTorch too: matrix multiplication, a lot of torch.nn, torch.optim, Dataset, DataLoader. And this is the game we're going to play: we're only allowed to use these bits. We're allowed to use pure Python, anything in the Python standard library, and any non-data-science modules, like a requests library for HTTP or whatever. We can use PyTorch, but only for creating arrays, random number generation, and indexing into arrays. We can use the fastai.datasets library, because that's the thing that has access to MNIST and so on, so we don't have to worry
about writing our own HTTP stuff. And we can use matplotlib, so we don't have to write our own plotting library. That's it; that's the game. We're going to try to recreate all of this from that. And the rule is that each time we have replicated some piece of fastai or PyTorch from the foundations, we can then use the real version if we want to.

What I discovered as I started doing that is that I started actually making things a lot better than fastai. So I'm now realizing that fastai version 1 is kind of a disappointment, because there was a whole lot of things I could have done better. You'll find the same thing as you go along this journey: you'll find decisions that I made, or the PyTorch team made, or whatever, where you think, what if they'd made a different decision there? And you can maybe come up with more examples of things that we could do differently.

So why would you do this? The main reason is so that you can really experiment, so you can really understand what's going on in your models and in your training. You'll actually find that in the experiments we're going to do in the next couple of classes, we're going to come up with some new insights. If you can create something from scratch yourself, you know that you understand it. And once you've created something from scratch and you really understand it, you can tweak everything. You suddenly realize that there's not this object detection system, and this convnet architecture, and that optimizer. They're all a semi-arbitrary bunch of particular knobs and choices, and it's likely that your particular problem would want a different set of knobs and choices. So you can change all of these things.

Those of you looking to contribute to open source, to fastai or to PyTorch, will be able to, because you'll understand how it's all
built up. You'll understand which bits are working well and which bits need help, and you'll know how to contribute tests or documentation or new features, or create your own libraries. For those of you interested in going deeper into research, you'll be implementing papers, which means you'll be able to correlate the code that you're writing with the paper that you're reading. If you're a poor mathematician like I am, you'll find that you get a much better understanding of papers you might otherwise have thought were beyond you, and you'll realize that all those Greek symbols actually just map to pieces of code that you're already very familiar with.

There were a lot of opportunities in part one to blog and to do interesting things, but the opportunities are much greater now. In part two you can be doing homework that's actually at the cutting edge: doing experiments people haven't done before and making observations people haven't made before. You're getting to the point where you're a more competent deep learning practitioner than the vast majority out there, and we're looking at stuff that other people haven't looked at before. So please try doing lots of experiments, particularly in your domain area, and consider writing things down, especially if it's not perfect. Write stuff down for the you of six months ago. That's your audience.

I am going to be assuming that you remember the contents of part one, which was these things here. In practice, it's very unlikely you remember all of these things, because nobody's perfect. So what I actually expect you to do is this: when I'm going on about something and you're thinking, I don't know what he's talking about, go back and watch the video about that thing. Don't just keep blasting forwards, because I'm assuming that you already know the content of part one,
particularly if you're less confident about the second half of part one, where we went a little bit deeper into what's an activation really, what's a parameter really, and exactly how does SGD work. In today's lesson in particular, I'm going to assume that you really get that stuff. So if you don't, go back and re-look at those videos. Go back to that SGD from scratch and take your time. I've designed this course to keep most people busy up until the next course, so feel free to take your time and dig deeply.

The most important thing, though, is that we're going to try to make sure that you can train really good models, and there are three steps to training a really good model. Step one is to create something with way more capacity than you need and basically no regularization, and overfit. So what does overfit mean? It means that your training loss is lower than your validation loss. No, no, it doesn't mean that. Remember, it doesn't mean that. A well-fit model will almost always have training loss lower than the validation loss. Overfit means you have actually, personally seen your validation error getting worse. Until you see that happening, you're not overfitting. So step one is overfit, and step two is reduce overfitting. And then step three... okay, there is no step three. Well, I guess step three is to visualize the inputs and outputs and so on, to experiment and see what's going on.

Step one is pretty easy, normally. Step two is the hard bit. It's not really that hard, but basically these are the five things that you can do, in order of priority. If you can get more data, you should. If you can do more data augmentation, you should. If you can use a more generalizable architecture, you should. If all those things are done, then you can start adding regularization like dropout or weight decay, but remember that at that point you're reducing the
effective capacity of your model, so it's less good than the first three things. And then, last of all, reduce the architecture complexity. Most people, beginners especially, start with reducing the complexity of the architecture, but that should be the last thing that you try, unless your architecture is so complex that it's too slow for your problem. So that's a summary of what we want to be able to do, which we learned about in part 1.

We're going to be reading papers, which we didn't really do in part 1, and papers look something like this. If you're anything like me, that's terrifying. I'm not going to lie: it's still the case that every single time I start looking at a new paper, I think I'm not smart enough to understand it. I just can't get past that immediate reaction, because I look at this stuff and I go, that's not something that I understand. But then I remember: this is the Adam paper, and you've all seen Adam implemented in one cell of Microsoft Excel. When it actually comes down to it, every time I get to the point where I understand and have implemented a paper, I go, oh my god, that's all it is. So a big part of reading papers, especially if you're less mathematically inclined, as I am, is just getting past the fear of the Greek letters.

I'll say something else about Greek letters. There are lots of them, and it's very hard to read something that you can't actually pronounce, because you just say to yourself: oh, squiggle bracket one plus squiggle one, G squiggle one minus squiggle. With all the squiggles, you just get lost. Believe it or not, it actually really helps to go and learn the Greek alphabet, so you can pronounce alpha times one plus beta one. Suddenly you can start talking to other people about it, and you can actually read it out loud. It makes a big difference. So learn to pronounce the Greek letters.

Note that the people who write these papers are generally not
selected for their outstanding clarity of communication. You will often find that there's a blog post or a tutorial that does a better job of explaining the concept than the paper does, so don't be afraid to go and look for those as well. But do go back to the paper, because in the end the paper is the one that's hopefully got it mainly right.

One of the tricky things about reading papers is that the equations have symbols, you don't know what they mean, and you can't google for them. So here are a couple of good resources if you see symbols you don't recognize. Wikipedia has an excellent "List of mathematical symbols" page that you can scroll through. Even better, Detexify is a website where you can draw a symbol you don't recognize, and it uses the power of machine learning to find similar symbols. There are lots of symbols that look a bit the same, so you'll have to use some judgement. But the thing that it shows is the LaTeX name, and you can then google for the LaTeX name to find out what that thing means.

So let's start. Here's what we're going to do over the next couple of lessons: we're going to try to create a pretty competent modern CNN model. We actually already have this bit, because we did that in the last course. We already have our layers for creating a ResNet, and we actually got a good result. So we just have to do all these things to get us from here to here. That's just the next couple of lessons; after that we're going to go a lot further.

Today we're going to try to get to at least the point where we've got the backward pass going. Remember, we're going to build a model that takes an input array, and we're going to try to create a simple fully connected network with one hidden layer. So we're going to start with some input, do a matrix multiply, do a ReLU, do a matrix multiply, and do a loss function. That's a forward pass, and that'll tell us our loss. And then we
will calculate the gradients of the loss with respect to the weights and biases, in order to multiply them by some learning rate, which we will then subtract from the parameters to get our new set of parameters. And we'll repeat that lots of times. To get to our fully connected backward pass, we will first need the fully connected forward pass. The fully connected forward pass means we will need some initialized parameters, we'll need a ReLU, and we will also need to be able to do matrix multiplication. So let's start there.

Let's start at the 00_exports notebook. What I'm showing you here is how I'm going to go about building up our library in Jupyter notebooks. A lot of very smart people have assured me that it is impossible to do effective library development in Jupyter notebooks, which is a shame, because I've built a library in Jupyter notebooks. People will often tell you things are impossible, but I will tell you my point of view, which is that I've been programming for over thirty years, and in the time I've been using Jupyter notebooks to do my development, I would guess I'm about two to three times more productive. I've built a lot more useful stuff in the last two or three years than I did beforehand. I'm not saying you have to do things this way, but this is how I develop, and hopefully you'll find some of this useful as well.

We need to do a couple of things. We can't just create one giant notebook with our whole library. Somehow we have to be able to pull out those little gems, those bits of code where we think, oh, this is good, let's keep this. We have to pull that out into a package that we reuse. So in order to tell our system that here is a cell I want to keep and reuse, I use this special comment, #export, at the top of the cell. And then I have a program called notebook2script, which goes through the notebook and
finds those cells and puts them into a Python module. Let me show you. If I run this cell, and then I head over here (and notice I don't have to type all of 00_exports, because I have tab completion even for file names in Jupyter notebooks, so 00 and tab is enough), I could either run this here or go back to my console and run it. Let's run it here. That says: converted 00_exports.ipynb to nb_00.py. I've made it so that these things go into a directory called exp, for exported modules, and here is that nb_00.py. You can see that, other than a standard header, it's got the contents of that one cell. So now I can import that at the top of my next notebook with from exp.nb_00 import *, and I can create a test that that variable equals that value. Let's see... it does.

Notice there are a lot of test frameworks around, but it's not always helpful to use them. Here we've created a test framework, or the start of one. I've created a function called test, which uses assert to check a and b against a comparison function that returns true or false. And then I've created something called test_eq, which calls test, passing a and b and operator.eq. So if they're wrong, we get an assertion error: test, test1. Whoops. So we've been able to write a test, which so far has basically tested that our little module exporter works correctly.

We probably want to be able to run these tests somewhere other than just inside a notebook, so we have a little program called run_notebook.py. You pass it the name of a notebook and it runs it. I should save this one with our failing test, so you can see it fail. The first time it passed; then I made the failing test, and you can see here it is, assertion error, and it tells you exactly where it happened. So we now have an automatable unit testing framework in our Jupyter notebook. I'll point out the contents of these two
Python scripts; let's look at them. The first one was run_notebook.py, which is our test runner. There is the entirety of it. There's a thing called nbformat, so if you conda install nbformat, it basically lets you execute a notebook, and it prints out any errors. So that's the entirety of that.

You'll notice that I'm using a library called fire. Fire is a really neat library that lets you take any function, like this one, and automatically converts it into a command-line interface. Here I've got a function called run_notebook, and then it says fire.Fire(run_notebook). So if I now run python run_notebook.py, it says: oh, this function received no value: path. Usage: run_notebook path. So you can see that it converted my function into a command-line interface. It's really great. It handles things like optional arguments and classes, and it's super useful, particularly for this kind of Jupyter-first development, because you can grab stuff that's in Jupyter and turn it into a script, often by just copying and pasting the function or exporting it, and then adding this one line of code.

The other one, notebook2script, is not much more complicated. It's one screen of code. Again, the main thing here is to call fire, which calls this one function. Basically it uses json.load, because notebooks are JSON. The reason I mention this is that Jupyter notebook comes with this whole ecosystem of libraries and APIs and so on, and on the whole I hate them. A notebook is just JSON, and I find that just doing json.load is the easiest way. Specifically, I build my Jupyter notebook infrastructure inside Jupyter notebooks. Here's how it looks: import json, json.load this file, and it gives you an array, and there's the contents of the source of my first row. So if you do want to play around with doing stuff in Jupyter notebook, it's a really great environment for automating stuff and running scripts on it and stuff
like that. So that's the entire contents of our development infrastructure. We now have a test; let's make it pass again. One of the great things about having unit tests in notebooks is that when one does fail, you open up a notebook which can have prose saying: this is what this test does, it's implementing this part of this paper. You can see all the stuff above it that sets up the context for it, and you can check each input and output. It's a really great way to fix those failing tests, because you've got the whole, truly literate programming experience all around it. So I think that works great.

Before we start doing matrix multiply, we need some matrices to multiply. These are some of the things that are allowed by our rules. We've got some stuff that's part of the standard library; this is the fastai.datasets library, to let us grab the datasets we need; some more standard library stuff; this one, which we're only allowed to use for indexing and array creation; and matplotlib. There you go.

So let's grab MNIST. We can use fastai.datasets to download it, then use the standard library gzip to open it, and then pickle.load it. In Python, the standard serialization format is called pickle, and this MNIST version on deeplearning.net is stored in that format. It basically gives us a tuple of tuples of datasets, like so: x_train, y_train, x_valid and y_valid. It actually contains NumPy arrays, but NumPy arrays are not allowed in our foundations, so we have to convert them into tensors. We can just use the Python map to map the tensor function over each of these four arrays to get back four tensors. A lot of you will be more familiar with NumPy arrays than PyTorch tensors, but everything you can do with NumPy arrays you can also do with PyTorch tensors, and you can also do it on the GPU and have all this nice deep learning infrastructure, so
It's a good idea to get used to using PyTorch tensors, in my opinion. So we can now grab the number of rows and number of columns in the training set, and we can take a look. So here's MNIST, hopefully pretty familiar to you already. It's 50,000 rows by 784 columns, and the y data looks something like this. The y shape is just 50,000 rows, and the minimum and maximum of the dependent variable is zero to nine. So hopefully that all looks pretty familiar. So let's add some tests. So n should be equal to the shape of the y, which should be equal to 50,000. The number of columns should be equal to 28 by 28, because that's how many pixels there are in MNIST, and so forth. And we're just using that test_eq function that we created just above. So now we can plot it. Okay, so we've got a float tensor, and we pass that to imshow after casting it to a 28 by 28. That .view is really important. I think we saw it a few times in part one, but get very familiar with it. This is how we reshape our 784-long vector into a 28 by 28 matrix that's suitable for plotting. Okay, so there's our data, and let's start by creating a simple linear model. So for a linear model, we're going to need to basically have something where y equals ax plus b. And so our a will be a bunch of weights, so it's going to be a 784 by 10 matrix, because we've got 784 coming in and 10 going out. All right, so that's going to allow us to take in our independent variable and map it to something which we compare to our dependent variable. And then for our bias, we'll just start with 10 zeros. Okay, so if we're going to do y equals ax plus b, then we're going to need a matrix multiplication. So almost everything we do in deep learning is basically matrix multiplication or a variant thereof, affine functions as we call them. So you want to be very comfortable with matrix multiplication. So this cool website, matrixmultiplication.xyz, shows us exactly what happens when we multiply these two matrices. So we take the
first column with the first row, and we multiply each of them element-wise, and then we add them up, and that gives us that one. And now you can see we've got two sets going on at the same time, so that gives us two more, and then two more, and then the final one, and that's our matrix multiplication. Okay, so we have to do that. All right, so we've got a few loops going on, right? We've got the loop of this thing scrolling down here, we've got the loop of these two rows (well, they're really columns, so we flip them around), and then we've got the loop of the multiply and add. So we're going to need three loops, and so here's our three loops. Now notice this is not going to work unless the number of rows here and the number of columns here, sorry, the number of columns here and the number of rows here, are the same. So let's grab the number of rows and columns of a, and the number of rows and columns of b, and make sure that ac equals br, just to double check. And then let's create something of size ar by bc, because the size of this is going to be ar by bc, with zeros in, and then have our three loops. Okay, and then right in the middle, let's do that. So right in the middle, the result in i,j is going to be a[i,k] times b[k,j]. And this is the vast majority of what we're going to be doing in deep learning, so get very, very comfortable with that equation, because we're going to be seeing it in three or four different variants of notation and style in the next few weeks, and in the next few minutes. Okay, and it's got kind of a few interesting things going on. This i here appears also over here, this j here appears also over here, and then the k in the loop appears twice, and look, it's going to be the same number in each place, because this is the bit where we're multiplying together the element-wise things. So there it is. So let's create a nice small version: grab the first five rows of the validation set, we'll call that m1, and grab our weight matrix, we'll call that m2. Grab our weight
matrix, call that m2, and then here's their sizes: five, because we just grabbed the first five rows, so 5 by 784, okay, multiplied by 784 by 10. So these match, as they should, and so now we can go ahead and do that matrix multiplication, and it's done. Okay, and it's given us, sorry, t1.shape is, as you'd expect, a five-row by ten-column output. Now that took about a second. So it took about a second for five rows. Our dataset, MNIST, is 50,000 rows, so it's going to take about 50,000 seconds to do a single matrix multiplication in Python. So imagine doing MNIST where every layer for every pass took about 10 hours. Not going to work, right? So that's why we don't really write things in Python. Like, when we say Python is too slow, we don't mean 20% too slow, we mean thousands of times too slow. So let's see if we can speed this up by 50,000 times, because if we could do that, it might just be fast enough. So the way we speed things up is we start in the innermost loop and we make each bit faster. So the way to make Python faster is to remove Python, and the way we remove Python is by passing our computation down to something that's written in something other than Python, like PyTorch, because PyTorch behind the scenes is using a library called ATen. Okay, and so we want to get this going down to the ATen library. So the way we do that is to take advantage of something called element-wise operations. So you've seen them before. For example, if I have two tensors a and b, both of length 3, I can add them together, and when I add them together, it simply adds together their corresponding items. So that's called element-wise addition. Or I could do less than, in which case it's going to do element-wise less than. So what percentage of a is less than the corresponding item of b? (a < b).float().mean(). We can do element-wise operations on things not just of rank one, but we could do it on a rank-two tensor, also known as a matrix. So here's our rank-two tensor m. Let's calculate the
Frobenius norm. How many people know about the Frobenius norm? Right, almost nobody. And it looks kind of terrifying, right? But actually it's just this: it's the matrix times itself, dot sum, dot square root. So here's the first time we're going to start trying to translate some equations into code to help us understand these equations. So this says, when you see something like A with two sets of double lines around it and an F underneath, that means we are calculating the Frobenius norm. So any time you see this, and you will, it actually pops up semi-regularly in deep learning literature, when you see this, what it actually means is this function. As you probably know, capital sigma means sum, and this says we're going to sum over two for loops. The first for loop will be called i, and we'll go from 1 to n, and the second for loop will also be called, well, sorry, it will be called j, and we'll also go from 1 to n. And in these nested for loops, we're going to grab something out of a matrix A, at position i,j, we're going to square it, and then we're going to add all of those together, and then we'll take the square root. Okay, which is that. Now I have something to admit to you: I can't write LaTeX. And yet I did create this Jupyter notebook, so it looks a lot like I created some LaTeX, which is certainly the impression I like to give people sometimes. But the way I actually write LaTeX is I find somebody else who wrote it, and then I copy it. And so the way you do this most of the time is you google for Frobenius norm, you find the wiki page for Frobenius norm, you click Edit next to the equation, and you copy and paste it. Okay, so that's a really good way to do it. And chuck dollar signs, or even two dollar signs, around it; two dollar signs make it a bit bigger. So that's way one to get equations. Method two is if it's in a paper on arXiv. Did you know on arXiv you can click on Download other formats in the top right, and then Download source, and that will actually give you the original TeX source.
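As a rough sketch of the equation-to-code translation described for the Frobenius norm, the nested-loop reading of the double sum can be set side by side with the element-wise version. The 3 by 3 matrix `m` here is an assumed example value, not necessarily the one in the notebook.

```python
import torch

# Assumed example matrix for illustration.
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

# Literal reading of the formula: sqrt( sum_i sum_j a_ij ** 2 ),
# written as two nested for loops over rows i and columns j.
rows, cols = m.shape
total = 0.0
for i in range(rows):
    for j in range(cols):
        total += m[i, j].item() ** 2
loop_norm = total ** 0.5

# Element-wise version: the matrix times itself, dot sum, dot square root.
vec_norm = (m * m).sum().sqrt()

print(loop_norm, vec_norm.item())
```

Both give the same value up to float rounding; the second form hands the squaring and summing to PyTorch instead of running them in Python loops.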
And then you can copy and paste their LaTeX. Right, so I'll be showing you a bunch of equations during these lessons, and I can promise you one thing: I wrote none of them by hand. So this one was stolen from Wikipedia. All right, so you now know how to implement the Frobenius norm from scratch in PyTorch. You could also have written it, of course, as m.pow(2), but that would be illegal under our rules, right? We're not allowed to use pow yet, so that's why we did it that way. Okay, so that's just doing the element-wise multiplication of a rank-two tensor with itself: 1 times 1, 2 times 2, 3 times 3, etc. Okay, so that is enough information to replace this loop, right? Because this loop is just going through the first row of a and the first column of b and doing an element-wise multiplication and sum. So our new version is going to have two loops, not three. Here it is. So this is all the same, right, but now we've replaced the inner loop, and you'll see that basically it looks exactly the same as before, but where it used to say k, it now says colon. So in PyTorch and NumPy, a colon means the entirety of that axis. Right, so Rachel helped me remember the order of rows and columns when we talk about matrices, which is the song: row by column, row by column. Yeah, so that's the song. So i is the row number, okay, so this is row number i, the whole row, and this is column number j, the whole column. So multiply all of column j by all of row i, and that gives us back a rank-one tensor, which we add up. Okay, that's exactly the same as what we had before, and so now that takes 1.45 milliseconds. We've removed one line of code and it's 178 times faster. Okay, so we successfully got rid of that inner loop, and so now this is running in C. We didn't really write Python here; we wrote kind of a Pythonic-ish thing that said, please call this C code for us, and that made it 178 times faster. Let's check that it's right. We can't really check that it's equal, because, you know, floats are sometimes
changed slightly depending on how you calculate them. So instead, let's create something called near, which calls torch.allclose to some tolerance, and then we'll create a test_near function that calls our test function using our near comparison. And let's see, yep, passes. Okay, so we've now got our matrix multiplication at sixty-five microseconds. Now we need to get rid of this loop, because now this is our innermost loop, and to do that we're going to have to use something called broadcasting. Who here is familiar with broadcasting? About half. Okay, that's what I figured. So broadcasting is about the most powerful tool we have in our toolbox for writing code in Python that runs at C speed, or in fact, with PyTorch, if you put it on the GPU, it's going to run at CUDA speed. It allows us to get rid of nearly all of our loops, as you'll see right now. The term broadcasting comes from NumPy, but the idea actually goes all the way back to APL from 1962, and it's a really, really powerful technique. A lot of people consider it a different way of programming, where we get rid of all of our for loops and replace them with these implicit broadcasted loops. In fact, you've seen broadcasting before. Remember our tensor a, which contains 10, 6, -4? If you say a greater than 0, then on the left-hand side you've got a rank-one tensor, on the right-hand side you've got a scalar, and yet somehow it works. And the reason why is that this value zero is broadcast three times to become zero comma zero comma zero, and then it does an element-wise comparison. So every time, for example, you've normalized a dataset by subtracting the mean and dividing by the standard deviation in a kind of one line like this, you've actually been broadcasting. You're broadcasting a scalar to a tensor. So a plus one also broadcasts a scalar to a tensor, and the tensor doesn't have to be rank one. Here we can multiply our rank-two tensor by two. Okay, so there's the simplest kind of broadcasting, and any time you do that, you're not operating at
Python speed, you're operating at C or CUDA speed. So that's good. We can also broadcast a vector to a matrix. So here's a rank-one tensor c, okay, and here's our previous rank-two tensor m. So m's shape is 3, 3, c's shape is 3, and yet m plus c does something. What did it do? 10, 20, 30 plus 1, 2, 3; 10, 20, 30 plus 4, 5, 6; 10, 20, 30 plus 7, 8, 9. Huh, it's broadcast this row across each row of the matrix, and it's doing that at C speed. Right, so there's no loop, but it sure looks as if there was a loop. c plus m does exactly the same thing. So we can write c.expand_as(m), and it shows us what c would look like when broadcast to m: 10, 20, 30, 10, 20, 30, 10, 20, 30. So you can see m plus t is the same as c plus m. Right, so basically it's creating, or acting as if it's creating, this bigger rank-two tensor. So this is pretty cool, because it now means that any time we need to do something between a vector and a matrix, we can do it at C speed with no loop. Now, you might be worrying that this looks pretty memory intensive if we're kind of turning all of our rows into big matrices. But fear not, because you can look inside the actual memory used by PyTorch. So here t is a 3 by 3 matrix, but t.storage() tells us that it's actually only storing one copy of that data. t.shape tells us that t knows it's meant to be a 3 by 3 matrix, and t.stride() tells us that it knows that when it's going from column to column it should take one step through the storage, but when it goes from row to row it should take zero steps. And so that's how come it repeats 10, 20, 30, 10, 20, 30, 10, 20, 30. Right, so this is a really powerful thing that appears in pretty much every linear algebra library you'll come across, this idea that you can actually create tensors that behave like higher-rank things than they're actually stored as. Right, so this is really neat. It basically means that this broadcasting
functionality gives us C-like speed with no additional memory overhead. Okay, what if we wanted to take a column instead of a row? So in other words, a rank-two tensor of shape 3,1. We can create a rank-two tensor of shape 3,1 from a rank-one tensor by using the unsqueeze method. unsqueeze adds an additional dimension of size one wherever we ask for it. So unsqueeze(0), let's check this out. unsqueeze(0) is of shape 1,3; it puts the new dimension in position 0. unsqueeze(1) is of shape 3,1; it creates the new axis in position 1. So unsqueeze(0) looks a lot like c, right, but now rather than being a rank-one tensor, it's a rank-two tensor. See how it's got two square brackets around it? See how its size is 1,3? All right, perhaps more interestingly, c.unsqueeze(1) now looks like a column. It's also a rank-two tensor, but it's three rows by one column. Why is this interesting? Because we can say, actually, before we do, I'll just mention that writing .unsqueeze is kind of clunky, so PyTorch and NumPy have a neat trick, which is that you can index into an array with a special value None, and None means squeeze a new axis in here, please. So you can see that c[None,:] is exactly the same shape, 1,3, as c.unsqueeze(0), and c[:,None] is exactly the same shape as c.unsqueeze(1). So I hardly ever use unsqueeze; unless I'm particularly trying to demonstrate something for teaching purposes, I pretty much always use None. Apart from anything else, I can add additional axes this way, or else with unsqueeze you have to go unsqueeze, unsqueeze, unsqueeze. So this is handy. So why did we do all that? The reason we did all that is because if we go c[:,None], so in other words, we turn it into a column, kind of a columnar shape, so it's now of shape 3,1, then .expand_as(m) doesn't now say 10, 20, 30, 10, 20, 30, 10, 20, 30, but it says 10, 10, 10, 20, 20, 20, 30, 30, 30. So in other
words, it's getting broadcast along columns instead of rows. So as you might expect, if I take that and add it to m, then I get the result of broadcasting the column, so it's now not 11, 22, 33, but 11, 12, 13. So everything makes more sense in Excel. Let's look. So here's broadcasting in Excel. Here is a 1,3 shape rank-two tensor, so we can use the ROWS and COLUMNS functions in Excel to get the rows and columns of this object. Here is a 3 by 1 rank-two tensor, again rows and columns. And here is a 2 by 2, sorry, 3 by 3 rank-two tensor, as you can see, rows by columns. So here's what happens if we broadcast this to be the shape of m, okay, and here is the result of that, c plus m. And here's what happens if we broadcast this to that shape, and here is the result of that addition, and there it is: 11, 12, 13, 24, 25, 26. Right, it's added it up. Okay, so basically what's happening is, when we broadcast, it's taking the thing which has a unit axis and is kind of effectively copying that unit axis so it is as long as the larger tensor on that axis. But it doesn't really copy it, it just pretends as if it's being copied. So we can use that to get rid of our loop. So this was the loop we were trying to get rid of, going through each of range(bc), and so here it is. So now we are not going through that loop any more. So now, rather than setting c[i,j], we can set the entire row, c[i]. This is the same as c[i,:]. Any time there's a trailing colon in NumPy or PyTorch, you can delete it, optionally. You don't have to. So before, we had a few of those, right? Let's see if we can find one. Here's one: comma colon. So I'm claiming we could have got rid of that. Let's see. Yep, still torch.Size([1, 3]). All right, and a similar thing: any time you see any number of colon-commas at the start, you can replace them with a single ellipsis, which in this case doesn't save us anything because there's only one of these, but if you've got like a really high-rank tensor, that can be super convenient.
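The indexing shortcuts and the row-versus-column broadcast just described can be sketched like this. The values of `c` and `m` are assumed to match the 10, 20, 30 and 1 to 9 examples mentioned in the lesson; treat the snippet as an illustration rather than the notebook's exact cells.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

row = c[None, :]   # shape (1, 3), same as c.unsqueeze(0)
col = c[:, None]   # shape (3, 1), same as c.unsqueeze(1)

# A trailing colon is optional, and an ellipsis can stand in
# for any number of leading colons.
print(c[None].shape, c[..., None].shape)

# Broadcasting the row adds c to every row of m;
# broadcasting the column adds c down every column of m.
print(m + row)
print(m + col)

# expand_as only pretends to copy: stepping to the next row of t
# moves 0 positions in storage, stepping to the next column moves 1.
t = c.expand_as(m)
print(t.shape, t.stride())
```

The zero stride on the first axis is what lets the expanded tensor behave like a 3 by 3 matrix while only three numbers are stored.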
Especially if you want to do something where the rank of the tensor could vary and you don't know how big it's going to be ahead of time. So we're going to set the whole of row i, and we don't need that colon, though it doesn't matter if it's there, and we're going to set it to the whole of row i of a. Okay, and then now that we've got row i of a, that is a rank-one tensor, so let's turn it into a rank-two tensor. Okay, so it's now got a new axis, and see how this is minus 1? Minus 1 always means the last dimension. So how else could we have written that? We could also have written it like that, with the special value None. Okay, so this is now of length whatever the size of a row of a is, which is ac, so it's of shape ac,1. All right, so that is a rank-two tensor, and b is also a rank-two tensor, that's the entirety of our matrix. And so this is going to get broadcast over this, which is exactly what we want; we wanted to get rid of that loop. And then, because it broadcast, it's actually going to return a rank-two tensor, and that rank-two tensor we want to sum up over the rows. And so with sum, you can give it a dimension argument to say which axis to sum over. So this one is kind of our most mind-bending broadcast of the lesson, so I'm going to leave this as a bit of homework for you to go back and convince yourself as to why this works. So maybe put it in Excel or do it on paper if it's not already clear to you why this works. But this is sure handy, because before we were broadcasting we were at 1.39 milliseconds, and after using that broadcasting we're down to 250 microseconds. So at this point we're now 3,200 times faster than Python. And it's not just speed. Once you get used to this style of coding, getting rid of these loops, I find, really reduces a lot of errors in my code. It takes a while to get used to, but once you're used to it, it's a really comfortable way of programming. Once you get to kind of higher-rank
tensors, this broadcasting can start getting a bit complicated. So what you need to do, instead of trying to keep it all in your head, is apply the simple broadcasting rules. Here are the rules. I've listed them here. In NumPy and PyTorch and TensorFlow, it's all the same rules. What we do is we compare the shapes element-wise. So let's look at a slightly interesting example. Here is our rank-one tensor c, and let's insert a leading unit axis. So this is of shape 1,3; see how there's two square brackets? And here's the version with a, sorry, this one's a preceding axis, this one's a trailing axis, so this is of shape 3,1. And we should take a look at that, so just to remind you, that looks like a column. What if we went c[None,:] * c[:,None]? What on earth is that? And so let's go back to Excel. Here's our row version, here's our column version. What happens is it says, okay, you want to multiply this by this, element-wise. Right, this is not the @ sign, this is asterisk, so element-wise multiplication. It broadcasts this to be the same number of rows as that, like so, and it broadcasts this to be the same number of columns as that, like so, and then it simply multiplies those together. That's it. All right, so the rule that it's using (you can do the same thing with greater than, right?), the rule that it's using is: let's look at the two shapes, 1,3 and 3,1, and see if they're compatible. They're compatible if, element-wise, they're either the same number or one of them is 1. So in this case, 1 is compatible with 3 because one of them is 1, and 3 is compatible with 1 because one of them is 1. And so what happens is, if it's 1, that dimension is broadcast to make it the same size as the bigger one. Okay, so 3,1 became 3,3. So this one was multiplied three times down the rows, and this one was multiplied three times down the columns. And then there's one more rule, which is that they don't even have to be the same rank. Right, so something that
we do a lot with image normalization is we normalize images by channel. Right, so you might have an image which is 256 by 256 by 3, and then you've got the per-channel mean, which is just a rank-one tensor of size 3. They're actually compatible, because what it does is, anywhere that there's a missing dimension, it inserts a 1 there; at the start, it inserts leading dimensions and sets them to 1. So that's why you can actually normalize by channel with no lines of code. Mind you, in PyTorch it's actually channel by height by width, so it's slightly different, but this is the basic idea. So this is super cool. We're going to take a break, but we're getting pretty close. My goal was to make our Python code 50,000 times faster; we're up to 4,000 times faster. And the reason this is really important is because if we're going to be, like, doing our own stuff, you know, building things that people haven't built before, we need to know how to write code that we can write quickly and concisely, but that operates fast enough that it's actually useful. Right, and so this broadcasting trick is perhaps the most important trick to know about. So let's have a 6 minute break, and I'll see you back here at 8 o'clock. So, broadcasting. You know, when I first started teaching deep learning here and I asked how many people are familiar with broadcasting, this is back when we used to do it in Theano, almost no hands went up. So I used to kind of say, this is like my secret magic trick. I think it's really cool that now half of you have already heard of it, and it's kind of sad because it's now not my secret magic trick. It's like, here's something half of you already knew. But for the other half of you, you know, there's a reason that people are learning this quickly, and it's because it's super cool. Here's another magic trick. How many people here know Einstein summation notation? Okay, good, good, almost nobody. So it's not as cool as broadcasting, but it is still very, very cool. Let me
show you. Right, and this is a technique which I don't think was invented by Einstein; I think it was popularized by Einstein as a way of dealing with these high-rank tensor kind of reductions that were used in general relativity, I think. Here's the trick. This is the innermost part of our original matrix multiplication for loop, remember that? And here's the version when we removed the innermost loop and replaced it with an element-wise product. And you'll notice that what happened was that the repeated k got replaced with a colon. Okay, so what's this? What if I move, okay, so first of all, let's get rid of the names of everything, and let's move this to the end and put it after an arrow, and let's keep getting rid of the names of everything, and get rid of the commas and replace spaces with commas. Okay, and now I've just created Einstein summation notation. So Einstein summation notation is like a mini-language. You put it inside a string, and what it says is, so there's an arrow, right, and on the left of the arrow is the input, and on the right of the arrow is the output. How many inputs do you have? Well, they're delimited by comma, so in this case there's two inputs. What's the rank of each input? It's however many letters there are. So this is a rank-two input, and this is another rank-two input, and this is a rank-two output. How big are the inputs? This one is of size i by k, this one is of size k by j, and the output is of size i by j. When you see the same letter appearing in different places, it's referring to the same size dimension. So this is of size i, and the output also has i rows. This has j columns, and the output also has j columns. All right, so we know how to go from the input shape to the output shape. What about the k? You look for any place that a letter is repeated, and you do a dot product over that dimension. In other words, it's just like the way we replaced k with a colon. Okay, so this is going to create something of
size i by j by doing dot products over these shared k's, which is matrix multiplication. Okay, so that's how you write matrix multiplication with Einstein summation notation, and then all you do is go torch.einsum. If you go to the PyTorch einsum docs, or the docs of most of the major libraries, you can find all kinds of cool examples of einsum. You can use it for transpose, diagonalization, tracing, all kinds of things, batch-wise versions of just about everything. So for example, if PyTorch didn't have batch-wise matrix multiplication, I just created it; there's batch-wise matrix multiplication. All right, so there's all kinds of things you can kind of invent, and often it's quite handy if you kind of need to put a transpose in somewhere, or, you know, tweak things to be a little bit different, you can use this. So that's Einstein summation notation. Here's matmul, and that's now taken us down to 57 microseconds, so we're now 16,000 times faster than Python. I will say something about einsum, though. It's a travesty that this exists, because we've got a little mini-language inside Python in a string. I mean, that's horrendous. You shouldn't be writing programming languages inside a string. This is as bad as a regex. You know, regular expressions are also mini-languages inside a string. You want your languages to be, like, typed and have IntelliSense and be things that you can, you know, extend. What this mini-language does is amazing, but there's so few things that it actually does, right? What I actually want to be able to do is create, like, any kind of arbitrary combination of any axes and any operations and any reductions I like, in any order, in the actual language I'm writing in. Right, so that's actually what APL does. That's actually what J and K do. J and K are the languages that kind of came out of APL. This is a kind of a series of languages that have been around for about 60 years, and everybody's pretty much failed to notice. My hope is that things like
Swift and Julia will give us this, like, the ability to actually write stuff in actual Swift and actual Julia that we can run in an actual debugger and use an actual profiler and do arbitrary stuff that's really fast. And actually, Swift seems like it might go even quite a bit faster than einsum, in an even more flexible way, thanks to this new compiler infrastructure called MLIR, which actually builds off, there's some really exciting new research in the compiler world that's kind of been coming over the last few years, particularly coming out of a system called Halide, which is H-A-L-I-D-E, which is this super cool language that basically showed it's possible to create a language that can create, like, totally optimized kind of linear algebra computations in a really flexible, convenient way. And since that came along, there's been all kinds of cool research using these techniques, like something called polyhedral compilation, which kind of have the promise that we're going to be able to, hopefully within the next couple of years, write Swift code that runs as fast as the next thing I'm about to show you. Because the next thing I'm about to show you is the PyTorch operation called matmul, and matmul takes 18 microseconds, which is 50,000 times faster than Python. Why is it so fast? Well, if you think about what you're doing when you do a matrix multiply of something that's like 50,000 by 784 by 784 by 10, you know, these are things that aren't going to fit in, like, the cache in your CPU. So if you do the kind of standard thing of going down all the rows and across all the columns, by the time you get to the end and you go back to exactly the same column again, it forgot the contents and has to go back to RAM and pull it in again. Right, so if you're smart, what you do is you break your matrix up into little smaller matrices and you do a little bit at a time, and that way everything is kind of in cache and it goes super fast. Now, normally to do that you have to write
kind of assembly language code, particularly if you want to kind of get it all running in your vector processor, and that's how you get these 18 microseconds. So currently, to get a fast matrix multiply, things like PyTorch don't even write it themselves; they basically push that off to something called a BLAS, B-L-A-S. A BLAS is a Basic Linear Algebra Subprograms library, where companies like Intel and AMD and NVIDIA write these things for you. Right, so you can look up cuBLAS, for example, and this is like NVIDIA's version of BLAS, or you could look up MKL, and this is Intel's version of BLAS, and so forth. Right, and this is kind of awful, because, you know, the programmer is limited to this, like, subset of things that your BLAS can handle, and to use it you don't really get to write it in Python; you kind of have to write the one thing that happens to be turned into that pre-existing BLAS call. So this is kind of why we need to do better, right? And there are people working on this, and there are people actually in Chris Lattner's team working on this. You know, there's some really cool stuff, like there's something called Tensor Comprehensions, which originally came from PyTorch, and I think they're now inside Chris's team at Google, where people are basically saying, hey, here are ways to, like, compile these much more general things. And this is what we want as more advanced practitioners. Anyway, for now, in PyTorch world, we're stuck at this level, which is to recognize there are some things, this is, you know, three times faster than the best we can do in an even vaguely flexible way. And if we compare it to the actually flexible way, which is broadcasting, we had 254, yeah, so still, you know, over ten times better. Right, so wherever possible today, we want to use operations that are predefined in our library, particularly for things that kind of operate over lots of rows and columns, the things where we're kind of dealing with this memory caching stuff that is going to be
complicated. So keep an eye out for that. Matrix multiplication is so common and useful that it's actually got its own operator, which is @. These are actually calling the exact same code, so they're the exact same speed. @ is not actually just matrix multiplication; @ covers a much broader array of kind of tensor reductions across different levels of axes, so it's worth checking out what matmul can do, because often it'll be able to handle things like batch-wise or matrix versus vector. Don't think of it as being only something that can do rank two by rank two, because it's a little bit more flexible. Okay, so that's that. We have matrix multiplication, and so now let's use it. And so we're going to use it to try to create a forward pass, which means we first need ReLU and matrix initialization, because remember, our model contains parameters which start out randomly initialized, and then we use the gradients to gradually update them with SGD. So let's do that. So here is 02. So let's start by importing nb_01, and I just copied and pasted the three lines we used to grab the data, and I'm just going to pop them into a function so we can use it to grab MNIST when we need it. And now that we know about broadcasting, let's create a normalization function that takes our tensor and subtracts the mean and divides by the standard deviation. So now let's grab our data, okay, and pop it into x, y, x, y. Let's grab the mean and standard deviation, and notice that they're not 0 and 1, and why would they be, right? But we want them to be 0 and 1, and we're going to be seeing a lot about why we want them to be 0 and 1 over the next couple of lessons, but for now let's just take my word for it: we want them to be 0 and 1. So that means that we need to subtract the mean and divide by the standard deviation. But not for the validation set: we don't subtract the validation set's mean and divide by the validation set's standard deviation, because if we did, those two data
sets would be on totally different scales. So if the training set was mainly green frogs and the validation set was mainly red frogs, then if we normalized with the validation set's mean and variance, we would end up with them both having the same average coloration, and we wouldn't be able to tell the two apart. So that's an important thing to remember when normalizing: always make sure your validation and training sets are normalized in the same way. So after doing that, our mean is pretty close to zero and our standard deviation is very close to one, and it would be nice to have something to easily check that these are true. So let's create a test_near_zero function, and then test that the mean is near zero and one minus the standard deviation is near zero, and that's all good. Let's define n and m and c the same as before, so the size of the training set, and the number of activations we're going to eventually need in our model being c, and let's try to create our model. OK, so the model is going to have one hidden layer, and normally we would want the final output to have ten activations, because we would use cross-entropy against those ten activations, but to simplify things for now we're not going to use cross-entropy, we're going to use mean squared error, which means we're going to have one activation. Which makes no sense from a modeling point of view; we'll fix that later, but just to simplify things for now. So let's create a simple neural net with a single hidden layer and a single output activation, on which we're going to use mean squared error. So let's pick a hidden size: the number of hidden we'll make 50. So for our two layers we're going to need two weight matrices and two bias vectors. So here are our two weight matrices, w1 and w2. They're normal random numbers of size m, which is the number of columns, 784, by nh, the number of hidden, and then this one is nh
by 1. Now, our inputs are mean 0, standard deviation 1: the inputs to the first layer. We want the inputs to the second layer to also be mean 0, standard deviation 1. Well, how are we going to do that? Because if we just grab some normal random numbers, and then we define a function called linear (this is our linear layer, which is x by w plus b), and then create t, which is the activation of that linear layer with our validation set and our weights and biases, we have a mean of -5 and a standard deviation of 27, which is terrible. So I'm going to let you work through this at home, but once you actually look at what happens when you multiply those things together and add them up, as you do in matrix multiplication, you'll see that you're not going to end up with 0, 1. But if instead you divide by square root m, so root 784, then it's actually damn good. OK, so this is a simplified version of something which PyTorch calls Kaiming initialization, named after Kaiming He, who was the lead writer of a paper that we're going to look at in a moment. So the weights: randn gives you random numbers with a mean of 0 and a standard deviation of 1, so if you divide by root m, it will have a mean of zero and a standard deviation of one on root m, and we can test this. So in general, normal random numbers of mean 0 and standard deviation of 1 over root of whatever this is (so here it's m and here it's nh) will give you an output of 0, 1. Now this may seem like a pretty minor issue, but as we're going to see in the next couple of lessons, it's like the thing that matters when it comes to training neural nets. Actually, in the last few months people have really been noticing how important this is. There are things like Fixup initialization, where these folks actually trained a 10,000-layer deep neural network with no normalization layers, just by basically doing careful initialization. So people are really spending a lot of time now
thinking: OK, how we initialize things is really important. And you know, we've had a lot of success with things like one-cycle training and super-convergence, which is all about what happens in those first few iterations, and it really turns out that it's all about initializations. So we're going to be spending a lot of time studying this in depth. So the first thing I'm going to point out is that this is actually not how our first layer is defined. Our first layer is actually defined like this: it's got a ReLU on it. So first let's define ReLU. ReLU is just: grab our data and replace any negatives with zeros; that's all clamp_min means. Now there's lots of ways I could have written this, but if you can do it with something that's a single function in PyTorch, it's almost always faster, because that thing's generally written in C for you. So try to find the thing that's as close to what you want as possible; there's a lot of functions in PyTorch. So that's a good way of implementing ReLU, and unfortunately that does not have a mean of 0 and standard deviation of 1. Why not? Well, we had some data that had a mean of 0 and a standard deviation of 1, and then we took everything that was smaller than 0 and removed it, so that obviously does not have a mean of 0, and it obviously now has about half the standard deviation that it used to have. OK, so this was one of the fantastic insights in one of the most extraordinary papers of the last few years. It was the paper from the 2015 ImageNet winners, led by the person we've mentioned, Kaiming He. Kaiming at that time was at Microsoft Research, and this is full of great ideas. Reading papers from competition winners is a very, very good idea, because normal papers will have one tiny tweak that they spend pages and pages trying to justify why they should be accepted into NeurIPS, whereas competition winners have 20 good ideas and only time to mention them in passing. This paper
introduced us to ResNets, PReLU layers and Kaiming initialization, amongst others. So here is section 2.2, initialization of filter weights for rectifiers. What's a rectifier? A rectifier is a rectified linear unit, and a rectifier network is any neural network with rectified linear units in it. This is only 2015, but already it reads like something from another age in so many ways, like even the words rectifier units and traditional sigmoid activation networks; no one uses sigmoid activations any more. So a lot's changed since 2015, and when you read these papers you kind of have to keep these things in mind. They describe what happens if you train very deep models with more than eight layers; so things have changed right there. Anyway, they said that in the old days people used to initialize these with random Gaussian distributions. A Gaussian distribution is just a fancy word for normal, or bell-shaped. And when you do that, they tend to not train very well, and the reason why, they point out, or actually Glorot and Bengio pointed out... let's look at that paper. So you'll see two initializations come up all the time. One is Kaiming or He initialization, which is this one, and the other you'll see a lot is Glorot or Xavier initialization, again named after Xavier Glorot. This is a really interesting paper to read. It's a slightly older one, from 2010; it's been massively influential, and one of the things you'll notice if you read it is it's very readable, it's very practical, and the actual final result they come up with is incredibly simple. We're actually going to be re-implementing much of the stuff in this paper over the next couple of lessons. But basically they describe one suggestion for how to initialize neural nets, and they suggest this particular approach, which is root six over the root of the number of input filters plus the number of output filters. And so what happened was Kaiming He and that team pointed out that that
does not account for the impact of a ReLU, the thing that we just noticed. So this is a big problem: if your variance halves each layer and you have a massive deep network with like eight layers, then you've got one over two to the eighth squished; by the end it's all gone. And if you want to be fancy like the Fixup people with ten thousand layers, forget it, your gradients have totally disappeared. So this is totally unacceptable. So they do something super genius smart: they replace the one on the top with a two on the top. Which is not to take anything away from this, it's a fantastic paper, but in the end the thing they do is to stick a two on the top. So we can do that by taking that exact equation we just used and sticking a two on the top, and if we do, the result is much closer. It's not perfect, and it actually varies quite a lot, it's really random: sometimes it's quite close, sometimes it's further away, but it's certainly a lot better than it was. So that's good, and it's really worth reading. So more homework for this week is to read section 2.2 of the ResNet paper, and what you'll see is that they describe what happens in the forward pass of a neural net, and they point out that for the conv layer, this is the response: y = Wx + b. Now if you're concentrating, that might be confusing, because a conv layer isn't quite y = Wx + b; a conv layer has a convolution. But you remember in part one I pointed out this neat article from Matthew Kleinsmith, where he showed that convolutions in CNNs actually are just matrix multiplications with a bunch of zeros and some tied weights. So this is basically all they're saying here. Sometimes there are these kind of throwaway lines in papers that are actually quite deep and worth thinking about. So they point out that you can just think of this as a linear layer, and then they basically take you through, step by step, what happens to the
variance of your network depending on the initialization. And so just try to get to this point here, get as far as the backward propagation case. So you've got about, I don't know, six paragraphs to read. None of the math notation is weird; maybe this one is, if you haven't seen it before: this is exactly the same as sigma, but instead of doing a sum you do a product. So this is a great way to warm up your paper-reading muscles: try and read this section, and then if that's going well you can keep going with the backward propagation case. Because the forward pass does a matrix multiply, and as we'll see in a moment, the backward pass does a matrix multiply with the transpose of the matrix, so the backward pass is slightly different but it's nearly the same. And so then at the end of that they will eventually come up with their suggestion. Let's see if we can find it... oh yeah, here it is: they suggest the square root of (2 / n_l), where n_l is the number of input activations. OK, so that's what we're using. That is called Kaiming initialization, and it gives us a pretty nice variance. It doesn't give us a very nice mean though, and the reason is because, as we saw, we deleted everything below the axis, so naturally our mean is now about half, not zero. I haven't seen anybody talk about this in the literature, but something I was just trying over the last week is something kind of obvious, which is to replace ReLU with not just x.clamp_min(0.) but x.clamp_min(0.) minus 0.5, and in my brief experiments that seems to help. So there's another thing that you could try out and see if it actually helps or if I'm just imagining things. It certainly returns you to the correct mean. OK, so now that we have this formula, we can replace it with init.kaiming_normal_, according to our rules, because it's the same thing. And let's check that it does the same thing, and it does. So again we've got this about half mean and a bit under one standard deviation.
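The effect of these scalings on the activation statistics can be checked with a rough sketch like the one below. This is not the lesson notebook: it assumes standard-normal random inputs as a stand-in for normalized MNIST, the row count n is arbitrary, and the helper names relu and shifted_relu are illustrative. It compares dividing by root m, multiplying by root (2 / m), and the "minus 0.5" variant, and also draws weights with init.kaiming_normal_ using mode='fan_out' for an (m, nh) weight.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)

# Assumed sizes: m = 784 input columns, nh = 50 hidden units; n is illustrative.
n, m, nh = 2000, 784, 50
x = torch.randn(n, m)  # stand-in for normalized inputs (mean 0, std 1)

def relu(x):
    # Replace negatives with zero.
    return x.clamp_min(0.)

def shifted_relu(x):
    # The experimental tweak from the lecture: ReLU minus 0.5.
    return x.clamp_min(0.) - 0.5

# Weights scaled by 1/sqrt(m) versus sqrt(2/m) (the "two on the top").
w_plain = torch.randn(m, nh) / math.sqrt(m)
w_kaiming = torch.randn(m, nh) * math.sqrt(2 / m)

# Same scaling via PyTorch; for an (m, nh) tensor, fan_out is m.
w_init = torch.empty(m, nh)
init.kaiming_normal_(w_init, mode='fan_out')

t_plain = relu(x @ w_plain)
t_kaiming = relu(x @ w_kaiming)
t_shifted = shifted_relu(x @ w_kaiming)

for name, t in [("1/sqrt(m) + relu", t_plain),
                ("sqrt(2/m) + relu", t_kaiming),
                ("sqrt(2/m) + relu - 0.5", t_shifted)]:
    print(f"{name:24s} mean={t.mean().item():.3f} std={t.std().item():.3f}")
print("kaiming_normal_ weight std:", w_init.std().item(),
      "target:", math.sqrt(2 / m))
```

With real data the exact numbers will differ, but the ordering should be similar: the sqrt(2/m) scaling brings the post-ReLU standard deviation closer to one, and subtracting 0.5 pulls the mean back toward zero.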
You'll notice here I had to add something extra, which is mode='fan_out'. What does that mean? What it means is explained here: fan_in or fan_out. fan_in preserves the magnitude of the variance in the forward pass; fan_out preserves the magnitudes in the backward pass. Basically all it's saying is: are you dividing by root m or root nh? Because if you divide by root m, as you'll see in that part of the paper I was suggesting you read, that will keep the variance at 1 during the forward pass, but if you use nh it will give you the right unit variance in the backward pass. So it's weird that I had to say fan_out, because according to the documentation that's for the backward pass to keep the unit variance. So why did I need that? Well, it's because our weight shape is 784 by 50, but if you actually create a linear layer with PyTorch of the same dimensions, it creates it as 50 by 784; it's the opposite. So how can that possibly work? These are the kind of things that it's useful to know how to dig into. So to find out how it's working, you have to look in the source code. You can either set up Visual Studio Code or something like that, and set it up so you can jump between things, which is a nice way to do it, or you can just do it here with question mark question mark (??). And you can see that this is the forward function, and it calls something called F.linear. In PyTorch, capital F always refers to the torch.nn.functional module, because it's used everywhere, so they decided that's worth a single letter. So torch.nn.functional.linear is what it calls, and let's look at how that's defined: input.matmul(weight.t()). t() means transpose. OK, so now we know: in PyTorch, a linear layer doesn't just do a matrix product, it does a matrix product with a transpose. So in other words, it's actually going to turn this into 784 by 50 and then do it, and so that's why we kind of had to give it the opposite information when we were
trying to do it with our linear layer, which doesn't have the transpose. OK, so the main reason I show you that is to show you how you can dig into the PyTorch source code and see exactly what's going on. When you come across these kinds of questions, you want to be able to answer them yourself. Which also then leads to the question: if this is how linear layers can be initialized, what about convolutional layers? What does PyTorch do for convolutional layers? So we could look inside torch.nn.Conv2d, and when I looked into it I noticed that it basically doesn't have any code, it just has documentation; all of the code actually gets passed down to something called _ConvNd. And so you need to know how to find these things: if you go to the very bottom you can find the file name it's in, and so you see this is actually torch.nn.modules.conv, so we can find torch.nn.modules.conv._ConvNd. And so here it is, and here's how it initializes things: it calls kaiming_uniform_, which is basically the same as kaiming_normal_ but it's uniform instead, but it has a special multiplier of math.sqrt(5). And that is not documented anywhere, I have no idea where it comes from, and in my experiments this seems to work pretty badly, as you'll see. So it's kind of useful to look inside the code. And when you're writing your own code: presumably somebody put this here for a reason; wouldn't it have been nice if they had a URL above it with a link to the paper that they're implementing, so we could see what's going on? So that's always a good idea: put some comments in your code to let the next person know what the hell you're doing. So that particular thing, I have a strong feeling, isn't great, as you'll see. OK, so we're going to try this thing of subtracting 0.5 from our ReLU. So this is pretty cool, right? We've already designed our own new activation function. Is it great? Is it terrible? I don't know. But like, it's
this kind of level of tweak which is, you know, when people write papers, this is the total level of, like, a minor change to one line of code. It'll be interesting to see how much it helps. But if I use it, then you can see here, yep, now I have a mean that's zero, or thereabouts, and interestingly I've also noticed it helps my variance a lot. Before, my variance, remember, was generally around 0.7 to 0.8, but now it's generally above 0.8. So it helps both, which makes sense as to why I think I'm seeing these better results. So now we have ReLU, we have linear, we have init, so we can do a forward pass. So we're now up to here, and so here it is. And remember, in PyTorch a model can just be a function, and so here's our model: it's just a function that does one linear layer, one ReLU layer and one more linear layer. And let's try running it, and OK, it takes 8 milliseconds to run the model on the validation set, so it's plenty fast enough to train. It's looking good. Add an assert to make sure the shape seems sensible. So the next thing we need for our forward pass is a loss function, and as I said, we're going to simplify things for now by using mean squared error, even though that's obviously a dumb idea. Our model is returning something of size 10,000 by 1, but for mean squared error you would expect it just to be a single vector of size 10,000, so I want to get rid of this unit axis. In PyTorch, the thing to add a unit axis, we've learned, is called unsqueeze; the thing to get rid of a unit axis therefore is called squeeze. So we just go output.squeeze() to get rid of that unit axis. But actually, now I think about it, this is lazy, because output.squeeze() gets rid of all unit axes, and we very commonly see on the fast.ai forums people saying that their code's broken, and it's when they've got squeeze, and it's that one case where maybe they had a batch size of one, and so that 1 by 1 would get squeezed down to a scalar and
things would break. So rather than just calling squeeze, it's actually better to say which dimension you want to squeeze, which we could write as either 1 or -1, it'd be the same thing, and this is going to be more resilient now to that weird edge case of a batch size of 1. OK, so output minus target, squared, mean: that's mean squared error. So remember, in PyTorch loss functions can just be functions. For mean squared error we're going to have to make sure these are floats, so let's convert them. So now we can calculate some predictions, that's the shape of our predictions, and we can calculate our mean squared error. So there we go, we've done a forward pass. So we're up to here. A forward pass is useless; what we need is a backward pass, because that's the thing that tells us how to update our parameters. So we need gradients. OK, how much do you want to know about matrix calculus? I don't know, it's up to you, but if you want to know everything about matrix calculus, I can point you to this excellent paper by Terence Parr and Jeremy Howard, which tells you everything about matrix calculus from scratch. So this is a few weeks' work to get through, but it absolutely assumes nothing at all. Basically Terence and I both felt like, well, we don't know any of this stuff, let's learn all of it and tell other people, and so we wrote it with that in mind. And so this will take you all the way up to knowing everything that you need for deep learning. You can actually get away with a lot less, but if you're here, maybe it's worth it. But I'll tell you what you do need to know: what you need to know is the chain rule. Because let me point something out. We start with some input, and we stick it through the first linear layer, and then we stick it through ReLU, and then we stick it through the second linear layer, and then we stick it through MSE, and that gives us our loss. All right, I'll just write that. Or to put it
another way: we start with x, and we put it through the function lin1, and then we take the output of that and put it through the function relu, and then we take the output of that and put it through the function lin2, and then we take the output of that and put it through the function mse. And strictly speaking, mse has a second argument, which is the actual target value. And we want the gradient of the output with respect to the input. So it's a function of a function of a function of a function. So if we simplify that down a bit, we could just say: what if it's just y = f(u) and u = g(x)? So that's a function of a function. Then the derivative is dy/dx = dy/du times du/dx; that's the chain rule. If that doesn't look familiar to you, or you've forgotten it, go to Khan Academy; Khan Academy has some great tutorials on the chain rule. But this is actually the thing we need to know, because once you know that, then all you need to know is the derivative of each bit on its own, and you just multiply them all together. And if you ever forget the chain rule, just cross-multiply: so that would be dy/du times du/dx, cross out the du's, and you get dy/dx. And if you went to a fancy school, they would have told you not to do that; they said you can't treat calculus like this, because these are special magic small things. Actually you can. There's a different way of treating calculus, called the calculus of infinitesimals, where all of this just makes sense, and you suddenly realize you actually can do this exact thing. So any time you see a derivative, just remember that all it's actually doing is taking some function and saying: as you go across a little bit, how much do you go up? And then it's dividing that change in y by that change in x. That's literally what it is, where the changes in y and x must be small numbers. So they behave very
sensibly when you just think of them as a small change in y over a small change in x, as I just did showing you the chain rule. So to do the chain rule, we're going to have to start with the very last function. The very last function on the outside was the loss function, mean squared error. So we just do each bit separately. So the gradient of the loss with respect to, what should I say, the output of the previous layer. The MSE is just input minus target, squared, and so the derivative of that is just 2 times input minus target, because the derivative of blah squared is 2 times blah. OK, so that's it. Now I need to store that gradient somewhere. The thing is that for the chain rule I'm going to need to multiply all these things together. So if I store it inside the .g attribute of the previous layer (because remember, the input of MSE is the same as the output of the previous layer), then if I store it away in here, I can quite comfortably refer to it later. So here, look, let's do ReLU. So ReLU is this. What's the gradient there? Zero. What's the gradient there? One. So therefore that's the gradient of the ReLU: it's just input greater than zero. But we need the chain rule, so we need to multiply this by the gradient of the next layer, which, remember, we stored away, so we can just grab it. So this is really cool. So the same thing for the linear layer: the gradient is simply, and this is where the matrix calculus comes in, the gradient of a matrix product is simply the matrix product with the transpose. So you can either read all that stuff I showed you, or you can take my word for it. So here's the cool thing: here's the function which does the forward pass that we've already seen, and then it goes backwards: it calls each of the gradients backwards, in reverse order, because we know we need that for the chain rule. And you can
notice that every time we're passing in the result of the forward pass, and it also has access, as we discussed, to the gradient of the next layer. This is called backpropagation. So when people say, as they love to do, that backpropagation is not just the chain rule, they're basically lying to you: backpropagation is the chain rule, where we just save away all the intermediate calculations so we don't have to calculate them again. OK, so this is a full forward and backward pass. One interesting thing here is this value here, loss: we never actually use it, because the loss never actually appears in the gradients. I mean, by the way, you still probably want it, to print it out or whatever, but it's actually not something that appears in the gradients. So that's it: w1.g, w2.g, etc. now contain all of our gradients, which we're going to use for the optimizer. And so let's cheat and use PyTorch autograd to check our results, because PyTorch can do this for us. So let's clone all of our weights and biases and input, and then turn on requires_grad for all of them. requires_grad_ is how you take a PyTorch tensor and turn it into a magical autograd-ified PyTorch tensor. So what it's now going to do is, for everything that gets calculated with that tensor, it's basically going to keep track of what happened. It keeps track of these steps so that then it can do these things. It's not actually that magical; you could totally write it yourself. You just need to make sure that each time you do an operation you remember what it is, and then you can just go back through them in reverse order. OK, so now that we've done the requires_grad_, we can just do the forward pass like so. That gives us loss. In PyTorch you say loss.backward(), and now we can test that. And remember, PyTorch doesn't store things in .g, it stores them in .grad. And we can test them, and all of our gradients were correct, or at least they're the same
as PyTorch's. OK, so that's pretty interesting. I mean, that's an actual neural network that contains all the main pieces that we're going to need, and we've written all these pieces from scratch, so there's nothing magical here. But let's do some cool refactoring. I really love this refactoring, and this is massively inspired by, and indeed very closely stolen from, the PyTorch API. But it's kind of interesting: I didn't have the PyTorch API in mind as I did this, but as I kept refactoring I kind of noticed, oh, I just recreated the PyTorch API; that makes perfect sense. So let's take each of our layers, ReLU and linear, and create classes. And for the forward, let's use dunder call (__call__). Now, do you remember that dunder call means that we can now treat this as if it was a function? So if you call an instance of this class just with parentheses, it calls this function. And let's save the input, let's save the output, and let's return the output. And then backward: do you remember, this was our backward pass? So it's exactly the same as we had before, but we're going to save it inside self.inp.g, the input's gradient. So this is exactly the same code as we had here, but I've just moved the forward and backward into the same class. So here's linear: forward is exactly the same, but each time I'm saving the input, I'm saving the output, I'm returning the output. And then here's our backward. One thing to notice in the backward pass here: for linear, we don't just want the gradient of the outputs with respect to the inputs, we also need the gradient of the outputs with respect to the weights and the outputs with respect to the biases. So that's why we've got three lots of .g's going on here. OK, so there's our linear layer's forward and backward, and then we've got our mean squared error. So there's our forward, and we'll save away both the input and the target for use later, and there's our gradient, again the same as before: two times input minus
target. So with this refactoring, we can now create our model. We can just say, let's create a Model class, and create something called .layers with a list of all of our layers. Notice I'm not using any PyTorch machinery; this is all from scratch. Let's define loss, and then let's define __call__, and it's going to go through each layer and say x = l(x). So this is how I do that function composition: we're just calling the function on the result of the previous thing. And then at the very end, call self.loss on that. And then for backward we do the exact opposite: we go self.loss.backward(), and then we go through the reversed layers and call backward on each one. And remember, the backward passes are going to save the gradient away inside the .g. So with that, let's just set all of our gradients to None, so that we know we're not cheating. We can then create our model, this class Model, and then we can call it as if it was a function, because we have dunder call. And then we can call backward, and then we can check that our gradients are correct. So that's nice. One thing that's not nice is: holy crap, that took a long time. Let's run it... there we go, 3.4 seconds. So that was really, really slow; we'll come back to that. I don't like duplicate code, and there's a lot of duplicate code here: self.inp = inp, return self.out. That's messy, so let's get rid of it. So what we could do is create a new class called Module, which basically does the self.inp = inp and return self.out for us. And so now we're not going to use dunder call to implement our forward; we're going to have it call something called self.forward, which we will initially set to raise an exception, not implemented. And backward is going to call self.bwd, passing in the thing that we just saved. And so now ReLU has something called forward, which just has that, so we're now basically
back to where we were, and backward just has that. So now look how neat that is. And we also realized that this thing we were doing to calculate the derivative of the output of the linear layer with respect to the weights, where we were doing an unsqueeze and an unsqueeze, which is basically a big outer product and a sum, we could actually re-express with einsum. And when we do that, our code is now neater, and our 3.4 seconds is down to 143 milliseconds. So thank you again to einsum. So you'll see this now: model equals Model, loss equals blah, blah.backward(), and now the gradients are all there. That looks almost exactly like PyTorch, and so we can see why it's done this way. Why do we have to inherit from nn.Module? Why do we have to define forward? This is why: it lets PyTorch factor out all this duplicate stuff, so all we have to do is the implementation. So I think that's pretty fun. And then once we thought more about it, about what we were doing with this einsum, we actually realized that it's exactly the same as just doing input transpose times output. So we replaced the einsum with a matrix product, and that's 140 milliseconds. And so now we've basically implemented nn.Linear and nn.Module, so let's now use nn.Linear and nn.Module, because we're allowed to; that's the rules. And their forward pass is almost exactly the same speed as our forward pass, and their backward pass is about twice as fast. I'm guessing that's because we're calculating all of the gradients and they're not calculating all of them, only the ones they need, but it's basically the same thing. OK, so at this point we're ready, in the next lesson, to do a training loop. We have a multi-layer fully connected neural network, what He's paper would call a rectifier network. We have matrix multiply organized, we have our forward and backward passes organized;
it's all nicely refactored out into classes and a Module class. So in the next lesson we will see how far we can get. Hopefully we will build a high-quality, fast ResNet, and we're also going to take a very deep dive into optimizers and callbacks and training loops and normalization methods. Any questions before we go? No? That's great. OK, thanks everybody, see you on the forums. [Applause]
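As a recap of the refactoring described in this lesson, here is a minimal sketch of a Module base class with from-scratch forward and backward passes, checked against PyTorch autograd. It is written under assumptions rather than copied from the notebook: the class names Module, Relu, Lin, Mse and Model, the .g attribute layout, and the small random data (used in place of MNIST) are all illustrative. The MSE gradient here includes the division by the batch size that comes from taking the mean.

```python
import torch

class Module:
    def __call__(self, *args):
        # Save inputs and output so backward can reuse them.
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self, *args):
        raise NotImplementedError
    def backward(self):
        self.bwd(self.out, *self.args)

class Relu(Module):
    def forward(self, inp):
        return inp.clamp_min(0.)
    def bwd(self, out, inp):
        # Gradient is 1 where inp > 0, else 0, times the next layer's gradient.
        inp.g = (inp > 0).float() * out.g

class Lin(Module):
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, inp):
        return inp @ self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()    # gradient w.r.t. the input
        self.w.g = inp.t() @ out.g    # input transpose times output gradient
        self.b.g = out.g.sum(0)       # gradient w.r.t. the bias

class Mse(Module):
    def forward(self, inp, targ):
        return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        # 2 * (input - target), divided by the batch size because of the mean.
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / targ.shape[0]

class Model:
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()

# Small random problem (illustrative sizes, not MNIST).
torch.manual_seed(0)
n, m, nh = 64, 20, 8
x, y = torch.randn(n, m), torch.randn(n)
w1, b1 = torch.randn(m, nh) * (2 / m) ** 0.5, torch.zeros(nh)
w2, b2 = torch.randn(nh, 1) / nh ** 0.5, torch.zeros(1)

model = Model(w1, b1, w2, b2)
loss = model(x, y)
model.backward()

# Reference gradients from autograd on cloned parameters.
params = [t.clone().requires_grad_(True) for t in (w1, b1, w2, b2)]
w1a, b1a, w2a, b2a = params
out = (x @ w1a + b1a).clamp_min(0.) @ w2a + b2a
loss_ref = (out.squeeze(-1) - y).pow(2).mean()
loss_ref.backward()

max_err = max((p.g - q.grad).abs().max().item()
              for p, q in zip((w1, b1, w2, b2), params))
print("max gradient difference vs autograd:", max_err)
```

The design point is the one made in the lecture: once the base class saves inputs and outputs in __call__, each layer only has to supply forward and bwd, which is roughly the division of labour that nn.Module provides.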
