MIT 6.S191 (2018): Introduction to Deep Learning

Good morning everyone, and thank you all for joining us. This is MIT 6.S191, and we'd like to welcome you to this course on introduction to deep learning. In this course you'll learn how to build remarkable, intelligent algorithms capable of solving very complex problems that just a decade ago were not even feasible to solve.

Let's start with this notion of intelligence. At a very high level, intelligence is the ability to process information so that it can be used to inform future predictions and decisions. When this intelligence is not engineered but biological, such as in humans, it's called human intelligence; when it's engineered, we refer to it as artificial intelligence. This is a course on deep learning, which is a subset of machine learning, which is in turn a subset of artificial intelligence; machine learning also involves more traditional methods that try to learn representations directly from data, and we'll talk about this in more detail later today.

But let me first start by talking about some of the amazing successes deep learning has had. In 2012 a competition called ImageNet tasked AI researchers with building a system capable of recognizing objects in images, with millions of examples in the dataset. The winner in 2012, for the first time ever, was a deep learning based system, and when it came out it absolutely shattered all other competitors and crushed the challenge. Today deep learning based systems have actually surpassed human-level accuracy on the ImageNet challenge and can recognize images even better than humans can. In this class you'll learn how to build complex vision systems, teaching a computer how to see. Just tomorrow you'll learn how to build an algorithm that takes x-ray images as input and detects whether that person has a pneumothorax, just from that single input image. You'll even make the network explain to you why it decided to diagnose the way it did, by looking inside the network and understanding exactly why it made that decision.

Deep neural networks can also be used to model sequences, where your data points are not single images but are temporally dependent. For this you can think of things like predicting the stock price, translating sentences from English to Spanish, or even generating new music. In fact, today you'll create an algorithm yourselves that first listens to hours of music, learns the underlying representation of the notes being played in those songs, and then learns to build brand new songs that have never been heard before. There are so many other incredible success stories of deep learning that I could talk about for hours, and we'll try to cover as many as possible in this course; I just wanted to give you an overview of some of the amazing ones you'll be implementing in the labs.

That's really the goal of what we want you to accomplish in this class. First, we want to provide you with the foundation to understand what these algorithms are doing underneath the hood: how they work and why they work. We'll provide you with some of the practical skills to implement these algorithms and deploy them on your own machines, and we'll talk about some of the state-of-the-art, cutting-edge research happening in deep learning industry and academia. Finally, the main purpose of this course is to build a community here at MIT devoted to advancing the state of artificial intelligence and deep learning. As part of this course we'll also cover some of the limitations of these algorithms; there are many, and we need to be mindful of them so that we as a community can move forward and create more intelligent systems.

Before we do that, let's start with some administrative details. This is a one-week course; today is the first lecture, and we meet every day this week from 10:30 a.m. to 1:30 p.m. Each three-hour time slot is broken into two halves: the first half consists of lectures, which is what you're in right now, and the second half is the labs, where you'll get practice implementing what you learn in lecture. We have an amazing set of lectures lined up for you. Today we'll cover an introduction to neural networks, which is really the backbone of deep learning, and modeling sequence data, the temporally dependent data I mentioned. Tomorrow we'll talk about computer vision and deep generative models; we have one of the inventors of generative adversarial networks coming to give that lecture, so it's going to be a great one. The day after that we'll touch on deep reinforcement learning and some of the open challenges in AI, and how we can move forward past this course. We'll spend the final two days hearing from some of the leading industry representatives doing deep learning at their respective companies; these are bound to be extremely interesting, so I highly recommend attending those as well.

For those of you taking this course for credit, you have two options to fulfill your grading requirement. The first option is a project proposal: a one-minute project pitch that will take place on Friday. For this you'll work in groups of three or four, and you'll be tasked with coming up with an interesting deep learning idea and trying to show some sort of results if possible. We understand that one week is extremely short to create any results, or even to come up with an interesting idea for that matter, but we'll be giving out some amazing prizes, including NVIDIA GPUs and Google Homes. On Friday, like I said, you'll give a one-minute pitch; there's somewhat of an art to pitching your idea in just one minute, even though it's extremely short, and we will be holding you to a strict deadline of that one minute. The second option is a little more boring: you can write a one-page review of any deep learning paper that you find interesting. If you can't do the project proposal, you can do that.

This class has a lot of online resources. You can find support on Piazza; please post if you have any questions about the lectures, the labs, installing any of the software, and so on. Also try to keep up to date with the course website, where we'll be posting all of the lectures, labs, and video recordings. We have an amazing team that you can reach out to at any time in case you have any problems with anything. And we want to give a huge thanks to all of our sponsors, without whose support this class simply would not be happening the way it is this year.

Now let's start with the fun stuff, by asking ourselves a question: why do we even care about deep learning, and why now? Traditional machine learning algorithms typically define sets of pre-programmed features in the data and work to extract those features as part of their pipeline. The key differentiating point of deep learning is that it recognizes that in many practical situations these hand-engineered features can be extremely brittle. What deep learning tries to do is learn those features directly from data, as opposed to having them hand-engineered by a human. That is, if we want to learn to detect faces, can we first learn
automatically, from data, that to detect faces we first need to detect edges in the image, then compose those edges to detect eyes and ears, then compose the eyes and ears to form higher-level facial structure? In this way deep learning represents a form of hierarchical model, capable of representing different levels of abstraction in the data.

Now, the fundamental building blocks of deep learning, neural networks, have actually existed for decades. So why are we studying this now? There are three key points. First, data has become much more pervasive: we're living in a big-data environment, these algorithms are hungry for more and more data, and accessing that data has become easier than ever before. Second, these algorithms are massively parallelizable and can benefit tremendously from modern GPU architectures that simply did not exist more than a decade ago. Finally, thanks to open-source toolboxes like TensorFlow, building and deploying these algorithms has become so streamlined, so simple, that we can teach it in a one-week course like this, and it has become extremely accessible to the general public.

So let's start by looking at the fundamental building block of deep learning: the perceptron, which is really just a single neuron in a neural network. The idea of a perceptron is extremely simple. Let's start by talking about the forward propagation of information through this unit. We define a set of inputs x1 through xm on the left, and we multiply each of these inputs by its corresponding weight theta1 through thetam, which are those arrows. We take this weighted combination of all of our inputs, sum them up, and pass the sum through a nonlinear activation function, and that produces our output y. It's that simple: m inputs, one output number, and you can see it summarized on the right-hand side as a single mathematical equation.

Actually, I left out one important detail that makes the previous slide not exactly correct: the notion of a bias. The bias is that green term you see on the left, and it represents a way to allow our activation function to shift to the left or right; it allows us, even when we have no input features, to still produce a non-zero output. Looking at the equation on the right, we can rewrite it using linear algebra and dot products to make it a lot cleaner. Let's say capital X is a vector containing all of our inputs x1 through xm, and capital theta is a vector containing all of our thetas, theta1 through thetam. We can then rewrite the equation from before as applying a dot product between X and theta, adding our bias theta0, and applying our nonlinearity g.

Now you might be wondering, since I've mentioned it a couple of times: what is this nonlinear function g? I said it's the activation function, but let's see what g could actually be in practice. One very popular activation function is the sigmoid function; you can see a plot of it on the bottom right. It takes as input any real number on the x-axis and transforms it to an output between 0 and 1. Because all of its outputs are between 0 and 1, it's a very popular choice in deep learning for representing probabilities. In fact there are many types of nonlinear activation functions in neural networks, and here are some of the common ones. Throughout this presentation you'll also see TensorFlow code snippets like the ones on the bottom here, since we'll be using TensorFlow for our labs; this is a way I can link the material in the lectures with what you'll be implementing in the labs.
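As a concrete sketch of the perceptron computation just described (dot product, bias, nonlinearity), here is a minimal NumPy version; the input and weight values are my own illustrative numbers, not anything from the slides or the lab code:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, theta, theta0):
    # forward propagation: dot product, add bias, apply nonlinearity
    z = np.dot(x, theta) + theta0
    return sigmoid(z)

x = np.array([1.0, 2.0])          # example inputs x1, x2 (hypothetical)
theta = np.array([0.5, -0.5])     # example weights (hypothetical)
theta0 = 0.1                      # bias
y = perceptron(x, theta, theta0)  # output lies between 0 and 1
```

Swapping `sigmoid` for any other nonlinearity changes nothing else about the forward pass, which is why activation functions are treated as interchangeable components.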
Each of these activation functions has its own advantages and disadvantages. The sigmoid function, on the left, is, like I said, commonly used to produce probability outputs. On the right is another very common activation function, the rectified linear unit, or ReLU. This function is very popular because it's extremely simple to compute: it's piecewise linear, zero for inputs less than zero and x for any input greater than zero, and its gradients are just zero or one, with a single nonlinearity at the origin.

Now, you might be wondering why we even need activation functions. Why can't we just take our dot product, add our bias, and call that our output? Activation functions introduce nonlinearities into the network; that's the whole point of the activations themselves being nonlinear. We want to model nonlinear data, because the world is extremely nonlinear. Suppose I gave you this plot of green and red points and asked you to draw a single line, not a curve, just a line, separating the green and red points perfectly. You'd find this really difficult, and probably the best you could do is something like this. If the activation functions in your deep neural network were linear, then since you'd just be composing linear functions with linear functions, your output would always be linear: the most complicated deep neural network, no matter how big or how deep, could only produce an output that looks like this. But once we introduce nonlinearities, the capacity of our network increases enormously. We're able to model much more complex functions, and to draw decision boundaries that were not possible with only linear activations.

Let's understand this with a very simple example. Imagine I gave you a trained perceptron, not a network yet, just a single node, with the weights shown on the top right: theta0 is 1 and the theta vector is (3, -2). The network has two inputs, x1 and x2, and to get the output all we have to do is apply the same story as before: take the dot product of X and theta, add the bias, and apply our nonlinearity. But let's take a look at what's inside, before we apply that nonlinearity. This looks a lot like the equation of a 2D line, because we have two inputs, and it is one. We can plot the line where this expression equals zero in feature space, that is, a space where x1, one of our features, is on the x-axis and x2, the other feature, is on the y-axis. That line is the decision boundary separating our entire space into two subspaces. Now if I give you a new point, (-1, 2), and plot it in this feature space, then depending on which side of the line it falls on, I can automatically determine whether the inner term is less than or greater than 0, since our line represents the decision boundary where it equals 0. Following the math on the bottom, the inside of the activation function comes out to 1 minus 3 minus 4, which is minus 6, and once we apply the sigmoid we get about 0.002. The value fed to the activation function is negative because the point fell on the negative side of the decision boundary. And remember that the sigmoid function divides our output space into two parts, greater than 0.5 and less than 0.5, since we're modeling probabilities and everything is between 0 and 1. So the decision boundary, where the input to our activation function equals 0, corresponds exactly to the output of our activation function being greater than or less than 0.5.
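To sanity-check the arithmetic in that worked example, here is the same computation with the numbers from the slide (theta0 = 1, theta = (3, -2), input point (-1, 2)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta0 = 1.0                     # bias from the slide
theta = np.array([3.0, -2.0])    # weight vector from the slide
x = np.array([-1.0, 2.0])        # the new point (-1, 2)

z = np.dot(x, theta) + theta0    # 1 + 3*(-1) + (-2)*2 = -6
y = sigmoid(z)                   # roughly 0.002, well below 0.5
```

Because z is negative, the point lies on the "less than 0.5" side of the sigmoid, matching which side of the decision boundary it fell on.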
So now that we have an idea of what a perceptron is, let's understand how we can compose these perceptrons together to actually build neural networks, and see how this all comes together. Let's revisit our previous diagram of the perceptron. If there are a few things you learn from this class, let this be one of them, and we'll keep repeating it over and over: in deep learning, you take a dot product, you add a bias, and you apply a nonlinearity, and you repeat that many, many times, once for each neuron in your neural network. That's a neural network.

Let's simplify the diagram a little. I'll remove the bias, since we're always going to have it and will take it for granted from now on, and I'll remove all the weight labels for simplicity. Note that z is just the input to our activation function, the dot product plus the bias. If we want the output of the network, y, we simply take z and apply our nonlinearity like before. If we want to define a multi-output perceptron, it's very simple: we just add another perceptron. Now we have two outputs, y1 and y2, and each one has its own weight vector theta corresponding to the weights on each of its inputs.

Now let's go one step deeper and create a single-layer neural network. Single-layer networks are not actually deep networks yet, they're still shallow, only one layer deep; but here we have one hidden layer between our inputs and outputs. We call it a hidden layer because its states are not directly observable: they're not directly enforced by the designer, who only enforces the inputs and outputs, and the states in the middle are typically hidden. And since we now have one transformation to go from our input space to our hidden-layer space, and another from our hidden-layer space to our output space, we need two weight matrices, theta1 and theta2, corresponding to the weights of each layer.

If we look at just a single unit in that hidden layer, it's the exact same story as before: it's one perceptron. We take the dot product of the x's that came before it with its weights from theta1, we add a bias, and we get z2. If we were to look at a different hidden unit, say z3, we would just use different weights: the dot product would change, the bias would change, which means z would change, which means the activation would also be different. From now on I'm going to use this symbol to denote what's called a fully connected layer, which is what we've been talking about so far: every node in one layer is connected to every node in the next layer by those weight matrices. This is really just for simplicity, so I don't have to keep redrawing all those lines.

Now, if we want to create a deep neural network, all we do is keep stacking these layers, with fully connected weights between them. It's that simple. The underlying building block is still that single perceptron: a single dot product, nonlinearity, and bias. That's it. And this is really incredible, because something so simple at the foundation is still able to create such powerful algorithms.

Now let's see an example of how we can apply neural networks to a very important question that I know you're all extremely worried about. Here's the question: will I pass this class? Yes or no, one or zero, is the output. To do this, let's start by defining a simple two-feature model. One feature is the number of lectures you attend; the second is the number of hours you spend on your final project. Let's plot this data in our feature space: green points are people who passed, red points are people who failed. Now we're given a new person, this one here, who spent five hours on their final project and went to four lectures.
We want to know: did that person pass or fail the class? And we want to build a neural network that will determine this. So let's do it. We have two inputs, one is 4 and the other is 5, we have one hidden layer with three units, and we want the final output, the probability of passing this class, which we compute as 0.1, or 10%. Well, that's really bad news, because this person actually did pass the class; they passed with probability 1. Now, can anyone tell me why the neural network got this so wrong? ... Exactly: this network has never been trained. It's never seen any data; it's basically like a baby that has never learned anything. We can't expect it to solve a problem it knows nothing about.

To tackle this problem of training a neural network, we first have to define a couple of things. First, the loss. The loss of a network basically tells our model how wrong its predictions are compared to the ground truth. You can think of it as a distance between our predicted output and the actual output: if we predict something very close to the true output, our loss is very low; if we predict something very far away, far in a high-level sense, like a distance, our loss is very high, and we want to minimize that as much as possible.

Now let's assume we're given not just one data point, one student, but a whole class of students; as previous data I'll use the entire class from last year. If we want to quantify what's called the empirical loss, we now care about how the model does on average over the entire dataset, not just on a single student. The way we do that is very simple: we take the average of the loss over each data point, so if we have n students, it's the average over n data points. The empirical loss goes by other names too; sometimes people call it the objective function, the cost function, and so on. All of these terms refer to exactly the same thing.

Now, if we look at the problem of binary classification, predicting whether you pass or fail this class, yes or no, one or zero, we can use what's called the softmax cross-entropy loss. For those of you who aren't familiar with cross-entropy: entropy is an extremely powerful notion introduced by Claude Shannon, who was a master's student here at MIT, back in the 1940s. It's huge in the fields of signal processing and thermodynamics, and really all over computer science, as seen in information theory.

Now suppose that instead of predicting a single one-or-zero output, yes or no, we want to predict a continuous-valued quantity: not "will I pass this class" but "what grade will I get", as a percentage, say 0 to 100. We're no longer limited to outputs between 0 and 1 but can output any real number on the number line. In that case, instead of using cross-entropy we might want a different loss, and for this we can think of something like the mean squared error loss, where as your predicted and true outputs diverge from each other, the loss increases as a quadratic function.

OK, great. Now let's put this loss information to the test and learn how we can actually train a neural network by quantifying its loss. Going back to what the loss is at a very high level: the loss tells us how the network is performing, the accuracy of the network on a set of examples, and what we want to do is minimize the loss over our entire training set. Really, we want to find the set of parameters theta such that the loss J(theta), our empirical loss, is minimized. Remember that J(theta) takes theta as input, and theta is just our weights, the things that actually define our network; the loss is just a function of those weights. If we want to think about the process of training, we can imagine a loss landscape.
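The two losses just mentioned, cross-entropy for yes/no outputs and mean squared error for continuous ones, can be sketched as minimal NumPy functions. These are my own simplified versions (binary rather than softmax cross-entropy, for a single output), not the TensorFlow ops used in the labs:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # empirical loss: average cross-entropy over the whole dataset
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # quadratic growth as prediction and truth diverge
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 1.0])   # three students: pass, fail, pass
y_pred = np.array([0.9, 0.2, 0.6])   # hypothetical model outputs
ce = binary_cross_entropy(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

Note that both functions average over the dataset, which is exactly the "empirical loss" idea: one number summarizing performance over all n students, not just one.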
If we only have two weights, we can plot a nice diagram like this: theta0 and theta1, our two weights, are on the planar axes on the bottom, and J(theta0, theta1) is plotted on the z-axis. What we want to do is find the minimum of this loss landscape. If we can find the minimum, that tells us where our loss is smallest, and therefore which values of theta0 and theta1 we can use to attain that minimum loss. So how do we do this? We start with a random guess: we pick a point (theta0, theta1) and start there. We compute the gradient of the loss at this point, dJ/dtheta, which is how the loss changes with respect to each of the weights. Now, this gradient tells us the direction of steepest ascent, not descent; it points toward the top of the mountain. So let's take a small step in the opposite direction: we negate the gradient and adjust our weights so that we step away from it, moving continually toward the lowest point in the landscape, until we finally converge at a local minimum, and then we just stop.

Let's summarize this with some pseudocode: we randomly initialize our weights, then loop until convergence, computing the gradient at the current point and applying an update rule that steps along the negative gradient. Now let's look at this term here: the gradient. Like I said, it explains how the loss changes with respect to each weight in the network, but I never actually told you how to compute it, and this is actually a big issue in neural networks; I just took it for granted. So let's talk about the process of actually computing this gradient, because without it you're kind of helpless: you have no idea which way down is, so you don't know where to go in your landscape.
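The gradient descent pseudocode above (initialize randomly, then repeatedly step against the gradient) looks like this for a toy one-weight loss J(theta) = (theta - 3)^2, chosen by me so the gradient has an obvious closed form:

```python
import numpy as np

def grad_J(theta):
    # gradient of the toy loss J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

learning_rate = 0.1
theta = np.random.default_rng(1).normal()  # random initialization

for _ in range(200):                       # loop "until convergence"
    theta = theta - learning_rate * grad_J(theta)  # step opposite the gradient

# theta has converged to the minimum at theta = 3.0
```

For a real network, `grad_J` is exactly what backpropagation computes; the update loop itself stays this simple.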
So let's consider a very simple neural network, probably the simplest neural network in the world: it contains one hidden unit, one hidden layer, and one output unit. We want to compute the gradient of our loss J(theta) with respect to theta2, just theta2 for now. This tells us how a small change in theta2 will impact our final loss at the output. Let's write this out as a derivative. We can start by applying the chain rule, because J(theta) depends on y: first we backpropagate through y, our output, all the way back to theta2. We can do this because y depends only on its input and theta2, so from the perceptron equation we wrote on the previous slide we can compute a closed-form derivative of that function. Now suppose I swap theta2 for theta1 and want to compute the same thing for the previous layer and the previous weight. All we need to do is apply the chain rule one more time, backpropagating the gradients we previously computed one layer further. It's the same story again, and for the same reason: z1, our hidden state, depends only on the previous input x and that single weight theta1. The process of backpropagation is basically repeating this over and over for every weight in your network, until you've computed the gradient dJ/dtheta, which you can then use as part of your optimization process to find your local minimum.

In theory that sounds pretty simple, I hope; we just talked about some basic chain rules. But let's touch on some insights about training these networks and computing backpropagation in practice. The picture I showed you before is not really accurate for modern deep neural network architectures, which are extremely non-convex. Here is a visualization of the loss landscape, like the one I plotted before, but of a real deep neural network, ResNet-50 to be precise.
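The two chain-rule steps just described can be written out explicitly for that one-hidden-unit network. This is my own sketch, assuming sigmoid activations and a squared-error loss so that every factor has a simple closed form; the weight values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one input -> one hidden unit -> one output, squared-error loss
x, y_true = 0.5, 1.0
theta1, theta2 = 0.8, -0.4        # arbitrary illustrative weights

# forward pass
z1 = theta1 * x
a1 = sigmoid(z1)                  # hidden activation
y = sigmoid(theta2 * a1)          # network output
J = (y - y_true) ** 2             # loss

# backward pass: one chain-rule factor per arrow in the diagram
dJ_dy = 2.0 * (y - y_true)
dJ_dtheta2 = dJ_dy * y * (1.0 - y) * a1        # J -> y -> theta2
dJ_dtheta1 = (dJ_dy * y * (1.0 - y) * theta2   # J -> y -> a1
              * a1 * (1.0 - a1) * x)           # a1 -> z1 -> theta1
```

Going one layer further back is literally just multiplying by two more factors, which is why the same recipe scales to networks with millions of weights.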
This visualization was actually taken from a paper published about a month ago, where the authors attempt to visualize the loss landscape to show just how difficult gradient descent can be. There's a possibility that you get lost in any one of these local minima; there's no guarantee that you'll actually find the true global minimum.

So let's recall the update equation we defined during gradient descent, and take a look at this term here: the learning rate. I didn't talk much about it, but it basically determines how large a step we take in the direction of our gradient. In practice the learning rate is just a number, but setting it can be very difficult. If we set it too low, the model may get stuck in a local minimum and never find its way out, because at the bottom of a local minimum the gradient is obviously zero, so the update just stops moving. If we set it too large, we can overshoot and actually diverge; our model could blow up. Ideally we want learning rates that are large enough to avoid local minima, but that still converge to the global minimum: they can overshoot just enough to escape shallow local minima, then settle into the global one.

Now, how can we actually set the learning rate? Well, one idea is to just try a lot of different values and see what works best, but I don't really like that solution; let's see if we can be a little smarter. How about we build an adaptive algorithm that changes its learning rate as training happens, a learning rate that adapts to the landscape it's in? The learning rate is then no longer a fixed number: it can go up and down, depending on the location the current update is at, the gradient in that location, how fast we are learning, and many other factors.
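One widely used adaptive scheme, the Adam update rule, rescales each step using running estimates of the gradient's mean and magnitude. This is my own minimal one-parameter sketch of the rule, reusing the toy loss J(theta) = (theta - 3)^2, and is not the full optimizer implementation from any library:

```python
import numpy as np

def grad_J(theta):
    # same toy loss as before: J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

# Adam-style update: running estimates of the gradient mean (m) and
# squared magnitude (v) rescale the effective learning rate per step
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta, m, v = 0.0, 0.0, 0.0

for t in range(1, 501):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g          # momentum-like average
    v = beta2 * v + (1 - beta2) * g ** 2     # average squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

# theta approaches the minimum at theta = 3.0
```

The effect is that early steps are roughly the same size regardless of how steep the landscape is, and steps shrink automatically as the gradient settles, which is exactly the "learning rate that adapts to the landscape" idea.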
In fact, optimization in deep neural networks and other non-convex settings has been explored extensively, and there are many, many algorithms for computing adaptive learning rates. Here are some examples that we encourage you to try out during your labs to see what works best for your problems. Especially on real-world problems, things can change a lot between what you learn in lecture and what really works in lab, so we encourage you to experiment, get some intuition about each of these optimizers, and really understand them at a higher level.

Now I want to continue with more of the practice of deep neural networks, and in particular this incredibly powerful notion of mini-batching. If we go back to the gradient descent algorithm, the same one we saw before, let's look at this term again: the gradient. We found out how to compute it using backpropagation, but what I didn't tell you is that the computation here is extremely expensive. We potentially have a lot of data points in our dataset, and this term is a summation over all of them. If our dataset is millions of examples large, which is not that large in the realm of today's deep neural networks, this can be extremely expensive for just one iteration, so we can't compute it on every iteration. Instead, there's a variant of this algorithm called stochastic gradient descent, where we compute the gradient using just a single training example. This is nice because it's really easy to compute the gradient for a single training example; it's nowhere near as intense as over the entire training set. But as the name might suggest, this is a stochastic estimate: it's much noisier, and it can make us jump around the landscape in ways we didn't anticipate, because a single point doesn't represent the true gradient of our dataset. So what's the middle ground? How about we define a mini-batch of B data points, compute the average gradient across those B data points, and use that as an estimate of our true gradient.
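The three variants above, full-batch gradient descent, single-example SGD, and mini-batching, differ only in how many points the gradient is averaged over. Here is a minimal mini-batch sketch on a toy linear model of my own construction (true weight 2.0, batch size B = 10), not the lab code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)         # 100 training examples
Y = 2.0 * X                      # targets generated by a true weight of 2.0

def grad(w, xb, yb):
    # gradient of mean squared error over just this mini-batch
    return np.mean(2.0 * (w * xb - yb) * xb)

w, lr, B = 0.0, 0.1, 10          # mini-batches of B = 10 points
for _ in range(200):
    idx = rng.choice(len(X), size=B, replace=False)  # sample a mini-batch
    w -= lr * grad(w, X[idx], Y[idx])                # noisy but cheap update

# w converges close to the true weight 2.0
```

Setting B = 1 here gives plain SGD and B = len(X) gives full-batch descent, so the trade-off between gradient noise and per-step cost is just this one constant.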
data points and actually use that as an estimate of our true gradient now this is much faster than computing the estimate over the entire batch because B is usually something like 10 to 100 and it's much more accurate than SGD because we're not taking a single example but we're learning over a smaller batch a larger batch sorry now the more accurate our gradient estimation is that means the more or the easier it will be for us to converge to the solution faster means will converge smoother because we'll actually follow the true landscape that exists it also means that we can increase our learning rate to trust each update more this also allows for massively parallel Liza become petition if we split up batches on different workers on different GPUs or different threads we can achieve even higher speed ups because each thread can handle its own batch then they can come back together and aggregate together to basically create that single learning rate or completely complete that single training iteration now finally the last topic I want to talk about is that of overfitting and regularization really this is a problem of generalization which is one of the most fundamental problems in all of artificial intelligence not just deep learning but all of artificial intelligence and for those of you who aren't familiar let me just go over in a high level what overfitting is what it means to generalize ideally in machine learning we want a model that accurately describes our test data not our training data but our test data said differently we want to build models that can learn representations from our training data still generalized well on unseen test data assume we want to build a line to describe these points under fitting describes the process on the left where the complexity of our model is simply not high enough to capture the nuances of our data if we go to overfitting on the right we're actually having to complex of a model and actually just memorizing our training 
data which means that if we introduce a new test data point it's not going to generalize well ideally what we want to something in the middle which is not too complex to memorize all the training data but still contains the capacity to learn some of these nuances in this in the test set so address to address this problem let's talk about this technique called regularization now regularization is just this way that you can discourage your models from becoming too complex and absolutely as we've seen before this is extremely critical because we don't want our data we don't want our models to just memorize data and only do well in our training set one of the most popular techniques for regularization in neural networks is dropout this is an extremely simple idea let's revisit this picture of a deep neural network and then drop out all we do during training on every iteration we randomly drop some proportion of the hidden neurons with some probability P so let's suppose P equals 0.5 that means we dropped 50% of those neurons like that those activations become zero and effectively they're no longer part of our network this forces the network to not rely on any single node but actually find alternative paths through the network and not put too much weight on any single example with any single single node so it discourages memorization essentially on every iteration we randomly drop another 50% of the node so on this iteration I may drop these on the next iteration I may drop those and since it's different on every iteration you're encouraging the network to find these different paths to its answer the second technique for regularization that we'll talk about is this notion of early stopping now we know that the definition of overfitting actually is just when our model starts to perform worse and worse on our test data set so let's use that to our advantage to create this early stopping algorithm if we set aside some of our training data and use it only as test data we 
don't train with that data we can use it to basically monitor the progress of our model on unseen data so we can plot this curve we're on the x axis we have the training iterations on the y axis we have the loss now they start off going down together this is great because it means that we're learning we're training right that's great there comes a point though where the testing data where the testing data set and the add the loss for that data set starts to Plateau now if we look a little further the training data set loss will always continue to go down as long as our model has the capacity to learn and memorize some of that data but that doesn't mean that it's actually generalizing well because we can see that the testing data set has actually started to increase this pattern continues for the rest of training but I want to focus on this point here this is the point where you need to stop training because after this point you are overfitting and your model is no longer performing well on unseen data if you stop before that point you're actually under fitting and you're not utilizing the full potential the full capacity of your network so I'll conclude this lecture by summarizing three key points that we've covered so far first we've learned about the fundamental building blocks of neural networks called the perceptron we've learned about stacking these units these perceptrons together to compose very complex hierarchical models and we've learned how to mathematically optimize these models using a process called back row back propagation and gradient descent finally we adjust some of the practical challenges of training these models in real life that you'll find useful for the labs today such as using adaptive learning rates batching and regularization to combat overfitting thank you and I'd be happy to answer any questions now otherwise we'll have Ferrini talk to us about some of the deep sequence models for modeling temporal data
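The mini-batch procedure described in the lecture can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the course: `grad_fn` is an assumed stand-in for whatever function returns the average gradient over a batch (computed via backpropagation in a real network).

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: average the gradient over B examples per update,
    a middle ground between full-batch gradient descent and single-example SGD."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)              # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]   # the B data points
            g = grad_fn(theta, X[batch], y[batch])  # avg gradient over the batch
            theta = theta - lr * g                  # standard descent step
    return theta
```

For example, with a linear model and a mean-squared-error gradient, this loop recovers the true weights; swapping `batch_size` between 1 and `len(X)` moves smoothly from pure SGD to full-batch gradient descent.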
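The dropout idea from the lecture can also be sketched directly. One practical detail not covered above: implementations commonly use "inverted" dropout, rescaling the surviving activations by 1/(1-p) during training so that no change is needed at test time; that rescaling is an addition here, not something stated in the lecture.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time (training=False) this is the identity."""
    if not training or p == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep each unit with prob. 1-p
    return activations * mask / (1.0 - p)
```

Applying this to a hidden layer's activations on every training iteration draws a fresh random mask each time, which is what forces the network to find alternative paths rather than relying on any single node.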
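Finally, the early-stopping rule can be expressed as a small helper. This is a hypothetical sketch using a common "patience" heuristic (stop once the held-out loss has failed to improve for a few consecutive epochs); the lecture describes the stopping point conceptually and does not prescribe this particular rule.

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch with the best held-out loss, scanning until the loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                    # held-out loss is rising: overfitting
    return best_epoch
```

In practice you would checkpoint the model weights at each new best epoch and restore the checkpoint from the epoch this function identifies.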
