MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task


Alright, welcome back everyone. Sound okay? Alright. We started to talk about neural networks yesterday. Today we’ll continue with neural networks that work with images, convolutional neural networks, and see how those types of networks can help us drive a car. If we have time, we’ll cover a simple illustrative case study of detecting traffic lights: the problem of detecting green, yellow, red. If we can’t teach our neural networks to do that, we’re in trouble, but it’s a good, clear, illustrative case study of a three-class classification problem. Okay, next there’s DeepTesla, here looped over and over in a very short GIF. This is actually running live on a website right now. We’ll show it towards the end of the lecture. Once again, just like DeepTraffic, this is a neural network that learns to steer a vehicle based on the video of the forward roadway. And once again, it’s doing all of that in the browser using JavaScript. So you’ll be able to train your very own network to drive using real-world data. I’ll explain how. We will also have a tutorial and code, briefly described today at the end of the lecture if there’s time, on how to do the same thing in TensorFlow. So if you want to build a network that’s bigger and deeper, and you want to utilize GPUs to train that network, you don’t want to do it in your browser; you want to do it offline using TensorFlow with a powerful GPU on your computer, and we’ll explain how to do that. Computer vision. Yesterday we talked about vanilla machine learning, where the size of the input is small for the most part. The number of neurons, in the case of neural networks, is on the order of 10, 100, 1,000. When you think of images, images are a collection of pixels. One of the most iconic images from computer vision, on the bottom left there, is Lenna. I encourage you to Google it and figure out the story behind that image.
It was quite shocking when I found out recently. So once again, computer vision is, these days, dominated by data-driven approaches, by machine learning, where all of the same methods that are used on other types of data are used on images. The input is just a collection of pixels, and pixels are numbers, discrete values from 0 to 255. So you can think of images in exactly the same way as what we’ve talked about previously. It’s just numbers, and so we can do the same kind of thing. We can do supervised learning, where you have an input image and an output label. The input image here is a picture of a woman; the label might be “woman”. Unsupervised learning, same thing. We’ll look at that briefly as well: clustering images into categories. Again, semi-supervised and reinforcement learning. In fact, the Atari games that we talked about yesterday do some pre-processing on the images. They’re doing computer vision; they’re using convolutional neural networks, as we’ll discuss today. And the pipeline for supervised learning is again the same: there’s raw data in the form of images, and there are labels on those images. We run a machine learning algorithm that performs feature extraction, trains given the inputs and outputs, the images and the labels of those images, and constructs the model, and then we test that model. And we get a metric, an accuracy. Accuracy is the term that’s often used to describe how well the model performs. It’s a percentage. I apologise for the constant presence of cats throughout this course. I assure you this course is about driving, not cats. But images are numbers. We take it for granted. As human beings, we’re really good at converting visual perception into semantics. We see this image and we know it’s a cat, but a computer only sees numbers: RGB values for a color image. There are three values for every single pixel, from 0 to 255.
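To make "images are numbers" concrete, here is a minimal sketch in Python with NumPy. The 32×32 size and the random pixel values are assumptions chosen purely for illustration; a real image would be loaded from a file.

```python
import numpy as np

# A stand-in "color image": random values, NOT real image data.
# Shape is (height, width, channels); each entry is an integer in 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(image.shape, image.dtype)  # the image is just a 3D array of numbers
print(image[0, 0])               # the three (R, G, B) values of the top-left pixel

# Flattened, the same image is one long vector of numbers.
flat = image.reshape(-1)
print(flat.size)
```

Everything a learning algorithm does with this image starts from that array of integers.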
And so given that image, we can think of two problems: one is regression and the other is classification. Regression is when, given an image, we want to produce a real-valued output. So if we have an image of the forward roadway, we want to produce a value for the steering wheel angle, and if you have an algorithm that’s really smart, it can take any image of the forward roadway and produce the perfectly correct steering angle that drives the car safely across the United States. We’ll talk about how to do that and where that fails. Classification is when the input again is an image and the output is a class label, a discrete class label. Underneath, though, it is often still a regression problem, and what’s produced is a probability that this particular image belongs to a particular category. And we use a threshold to chop off the outputs associated with low probabilities, take the labels associated with high probabilities, and convert it into a discrete classification. I mentioned this yesterday but it bears saying again: computer vision is hard. We, once again, take it for granted. As human beings, we’re really good at dealing with all these problems. There’s viewpoint variation: the object looks wholly different, in terms of the numbers behind the images, in terms of the pixels, when viewed from a different angle. Scale variation: objects when you’re standing far away from them or up close are a totally different size. As human beings, we’re good at detecting that they’re a different size but still the same object, but that’s still a really hard problem because those sizes can vary drastically. We talked about occlusions and deformations with cats; a well-understood problem. There’s background clutter. You have to separate the object of interest from the background, and given the three-dimensional structure of our world, there’s a lot of stuff often going on in the background, the clutter. There’s intra-class variation.
That variation within a class is often greater than inter-class variation, meaning objects of the same type often have more variation than the objects that you’re trying to separate them from. Then there is the hard one for driving: illumination. Light is the way we perceive things, the reflection of light off the surface, and the source of that light changes the way that object appears, and we have to be robust to all of that. So the image classification pipeline is the same as I mentioned. It’s a classification problem, so there are categories: cat, dog, mug, hat. You have a bunch of examples, image examples of each of those categories, and so the input is just those images paired with the category. And you train to estimate a function that maps from the images to the categories. For all of that you need data, a lot of it. There is, fortunately, a growing number of data sets, but they are still relatively small. We get excited. There are millions of images, but there are not billions or trillions of images. These are the data sets that you will see most often if you read the academic literature. MNIST, the one that’s been beaten to death and that we use as well in this course, is a data set of handwritten digits where the categories are 0 to 9. ImageNet, one of the largest fully labeled image data sets in the world, has images with a hierarchy of categories from WordNet. And what you see there is a labeling of which images are associated with which words present in the data set. CIFAR-10 and CIFAR-100 are tiny images that are used to prove, in a very efficient and quick way, that the algorithm you’re trying to publish, or trying to impress the world with, works well. It’s a small data set: CIFAR-10 means there are 10 categories. And Places is a data set of natural scenes: woods, nature, city, and so on. So let’s look at CIFAR-10. It’s a data set of 10 categories: airplane, automobile, bird, cat, and so on.
They’re shown there with sample images as the rows. And so let’s build a classifier that’s able to take images from one of these 10 categories and tell us what is shown in the image. So how do we do that? Once again, all the algorithm sees is numbers. So at the very core, we have to have an operator for comparing two images. Given an image, I want to say whether it’s a cat or a dog. I want to compare it to images of cats and compare it to images of dogs and see which one matches better. So there has to be a comparison operator. Okay, so one way to do that is to take the difference between the two images pixel by pixel, shown on the bottom of the slide for a 4×4 image, and then sum that pixel-wise absolute difference into a single number. So if the images are totally different pixel-wise, that will be a high number. If it’s the same image, the number will be 0. Oh, and it’s the absolute value of the difference. That’s called L1 distance. It doesn’t matter much; when we speak of distance, we usually mean L2 distance. So we can build a classifier that just uses this operator to compare the input to every single image in the data set and picks the category that’s the closest using this comparison operator. I have a picture of a cat, and I’m going to look through the data set, find the image that’s the closest to this picture, and say that is the category that this picture belongs to. Now, if we just flip a coin and randomly pick which category an image belongs to, the accuracy would be on average 10%. It’s random. The accuracy of our brilliant image-difference algorithm, which just goes through the data set and finds the closest one, is 38%, which is pretty good. It’s way above 10%.
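As a minimal sketch of the comparison operator and the closest-image classifier just described, the following Python uses NumPy. The function names, the toy 4×4 grayscale images, and the labels are hypothetical, made up for illustration; they are not CIFAR-10 data.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute pixel-wise differences between two same-sized images."""
    # Cast to a signed type first so that uint8 subtraction cannot wrap around.
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def nearest_neighbor_label(query, train_images, train_labels):
    """Return the label of the training image closest to `query` in L1 distance."""
    distances = [l1_distance(query, img) for img in train_images]
    return train_labels[int(np.argmin(distances))]

# Toy 4x4 grayscale "images" with made-up pixel values.
dark = np.full((4, 4), 10, dtype=np.uint8)
bright = np.full((4, 4), 200, dtype=np.uint8)
query = np.full((4, 4), 30, dtype=np.uint8)

print(l1_distance(dark, dark))   # identical images are at distance 0
print(l1_distance(query, dark))  # 16 pixels, each differing by 20
print(nearest_neighbor_label(query, [dark, bright], ["cat", "dog"]))
```

Note that there is no training step here at all: "learning" amounts to storing the labeled images, and all the work happens at query time.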
So you can think about this operation of looking through the database and finding the closest image as what’s called K-Nearest Neighbors, with K equal to 1 in that case. Meaning you find the one closest neighbor to the image that you’re asking questions about and accept the label from that image. You could do the same thing increasing K. Increasing K to 2 means you take the two nearest neighbors. You find the two closest, in terms of pixel-wise image difference, to this particular query image and find which categories those belong to. What’s shown up top on the left is the data set we’re working with: red, green, blue. What’s shown in the middle is the one-nearest-neighbor classifier, meaning this is how you segment the entire space of different things that you can compare. If a point falls into any of these regions, the nearest neighbor algorithm will immediately associate it with that region. With five nearest neighbors, there’s immediately an issue. The issue is that there are white regions. There are ties, where your five closest neighbors are from various categories, so it’s unclear where you belong. So this is a good example of parameter tuning. You have one parameter: K. And your task as a teacher of machine learning, since you have to teach this algorithm how to do your learning for you, is to figure out that parameter. That’s called “parameter tuning”, or “hyper-parameter tuning” as it’s called in neural networks. On the bottom right of the slide, the x-axis is K, as we increase it from 0 to 100, and the y-axis is classification accuracy. It turns out that the best K for this data set is 7, 7 nearest neighbors. With that we get a performance of 30%. And I should say that the way we get that number, as we do with a lot of the machine learning pipeline, is you separate the data into the part of the data that you use for training and another part that you use for testing.
You’re not allowed to touch the testing part. That’s cheating. You construct your model of the world on the training data set, and you use what’s called cross-validation, where you take a small part of the training data, shown as “fold 5” there in yellow, leave that part out of the training, and then use it as part of the hyper-parameter tuning. As you train, you figure out with that yellow part, fold 5, how well you’re doing, and then you choose a different fold and see how well you’re doing. And you keep playing with parameters, never touching the test part. And when you’re ready, you run the algorithm on the test data to see how well you really do, how well it really generalizes. Yes, question. (INAUDIBLE QUESTION) So, the question was: “Is there any good intuition behind what a good K is?” There are general rules for different data sets, but usually you just have to run through it. Grid search, brute force. Yes, question. (INAUDIBLE QUESTION) (CHUCKLING) Good question. Yes. (INAUDIBLE QUESTION) Yes, the question was: “Is each pixel 1 number or 3 numbers?” The majority of computer vision throughout its history used grayscale images, so it’s 1 number, but RGB is 3 numbers, and there’s sometimes a depth value too, so it’s 4 numbers. If you have a stereo vision camera that gives you the depth information of the pixels, that’s a fourth, and then if you stack two images together there could be 6. In general, everything we work with will be 3 numbers for a pixel. Yes, so the question was whether the absolute value is just one number. Exactly right. In that case, those are grayscale images, not RGB images. So, you know, this algorithm is pretty good if we optimize its hyper-parameters. Choosing a K of 7 seems to work well for this particular CIFAR-10 data set. Okay, we get 30% accuracy. It’s impressive, higher than 10%. Human beings perform at about 94%, slightly above 94% accuracy, for CIFAR-10.
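The hyper-parameter tuning procedure described earlier (hold out a test set, then use cross-validation on the training data to pick K) can be sketched as follows. This is an illustration under stated assumptions, not the code behind the slide: the data is a synthetic two-class stand-in rather than CIFAR-10, the candidate values of K are arbitrary, and the function names are hypothetical. The five folds mirror the "fold 5" picture from the lecture.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Predict labels by majority vote among the k nearest training points (L1 distance)."""
    preds = []
    for q in query_x:
        dists = np.abs(train_x - q).sum(axis=1)
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(train_y[nearest])
        preds.append(int(np.argmax(votes)))  # ties go to the lowest label
    return np.array(preds)

def cross_validate_k(x, y, candidate_ks, num_folds=5):
    """Return {k: mean validation accuracy} using num_folds-fold cross-validation."""
    folds_x = np.array_split(x, num_folds)
    folds_y = np.array_split(y, num_folds)
    scores = {}
    for k in candidate_ks:
        accs = []
        for i in range(num_folds):
            # Fold i is held out for validation; the rest is used for "training".
            val_x, val_y = folds_x[i], folds_y[i]
            tr_x = np.concatenate(folds_x[:i] + folds_x[i + 1:])
            tr_y = np.concatenate(folds_y[:i] + folds_y[i + 1:])
            accs.append(float(np.mean(knn_predict(tr_x, tr_y, val_x, k) == val_y)))
        scores[k] = float(np.mean(accs))
    return scores

# Synthetic stand-in data: two well-separated clusters in 8 dimensions.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, (50, 8)), rng.normal(3.0, 1.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
order = rng.permutation(len(x))
x, y = x[order], y[order]

# The test split is set aside and not used while choosing K.
train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

scores = cross_validate_k(train_x, train_y, candidate_ks=[1, 3, 5, 7])
best_k = max(scores, key=scores.get)

# Only now, with K fixed, is the test set used once.
test_accuracy = float(np.mean(knn_predict(train_x, train_y, test_x, best_k) == test_y))
print(scores, best_k, test_accuracy)
```

Trying every candidate K this way is the brute-force grid search mentioned in the answer above; with more hyper-parameters the grid simply gets more dimensions.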
I should clarify that a CIFAR-10 image is a tiny image, like a little icon. Given that image, human beings are able to accurately pick one of the 10 categories with 94% accuracy. And the current state-of-the-art convolutional neural network is at 95.4% accuracy and, believe it or not, it’s a heated battle, but the most important, the critical fact here, is that it has recently surpassed humans. And it has certainly surpassed the k-nearest neighbors algorithm. So, how does this work? Let’s briefly look back. It all still boils down to this little guy: the neuron, which sums its weighted inputs, adds a bias, and produces an output based on an activation, a smooth activation function. Yes, question. (INAUDIBLE QUESTION) So, to the question: you take a picture of Cassie, you know it’s a cat, but that’s not encoded anywhere; you have to write that down somewhere. So you have to write it as a caption: “This is my cat.” And then the unfortunate thing, given the internet and how witty it is, is that you can’t trust the captions on images, because maybe you’re just being clever and it’s not a cat at all, it’s a dog dressed as a cat. Yes, question. (INAUDIBLE QUESTION) Sorry, CNNs do better than what? Yes, so the question was: “Do convolutional neural networks generally do better than nearest neighbors?” There are very few problems on which neural networks don’t do better. Yes, they almost always do better, except when you have almost no data. So you need data. And convolutional neural networks aren’t some special magical thing. It’s just neural networks with some cheating up front that I’ll explain, some tricks to try to reduce the size and make them capable of dealing with images. So again, in the case that we looked at, classifying an image of a number, as opposed to doing some fancy convolutional tricks, we just take the entire 28×28 pixel image, that’s 784 pixels, as the input.
That’s 784 neurons in the input, 15 neurons in the hidden layer and 10 neurons in the output. Now everything we’ll talk about has the same exact structure. Nothing fancy. There is a forward pass through the network, where you take an input image and produce an output classification, and there’s a backward pass through the network, backpropagation, where you adjust the weights when your prediction doesn’t match the ground truth output. And learning just boils down to optimization; it’s just optimizing a smooth, differentiable function that’s defined as the loss function. That’s usually as simple as a squared error between the true output and the one you actually got. So what’s the difference? What are convolutional neural networks? Convolutional neural networks take inputs that have some spatial consistency, that have some spatial meaning in them, like images. There are other things too; you can think of the dimension of time, and you can input an audio signal into a convolutional neural network. And so for every single layer that’s a convolutional layer, the input is a 3D volume and the output is a 3D volume. I’m simplifying, because you can call it 4D too, but it’s 3D. There’s height, width and depth. So that’s an image. The height and the width are the height and the width of the image. And then the depth for a grayscale image is 1; for an RGB image it’s 3; for a ten-frame video of grayscale images the depth is 10. It’s just a volume, a three-dimensional matrix of numbers. The only thing that a convolutional layer does is take a 3D volume as input, produce a 3D volume as output, and apply some smooth function operating on the inputs, on the sum of the inputs, that may or may not have parameters that you tune, that you try to optimize. That’s it. So these are Lego pieces that you stack together in the same way as we talked about before. So what are the types of layers that a convolutional neural network has?
There are inputs. So for example a color image of 32×32 will be a volume of 32×32×3. A convolutional layer takes advantage of the spatial relationships of the input neurons. In a convolutional layer, it’s the same exact neuron as in a fully connected network, the regular network we talked about before, but it has a narrower receptive field; it’s more focused. The inputs to a neuron in the convolutional layer come from a specific region of the previous layer. And the parameters of each filter, and you can think of this as a filter because you slide it across the entire image, are shared. So if you think about two layers, as opposed to connecting every single pixel in the first layer to every single neuron in the following layer, you only connect the neurons in the input layer that are close to each other to the output layer, and then you enforce the weights to be tied together spatially. And what that results in is that every single layer on the output you can think of as a filter that gets excited, for example, for an edge, and when it sees this particular kind of edge in the image, it will get excited. And it’ll get excited in the top left of the image, on the top right, bottom left, bottom right. The assumption there is that a powerful feature for detecting a cat is just as important no matter where in the image it is. And this allows you to cut away a huge number of connections between neurons, but it still boils down, on the right, to a neuron that sums a collection of inputs and applies weights to them. The spatial arrangement of the output volume relative to the input volume is controlled by three things. The first is the number of filters. For every single filter you get an extra layer on the output. So let’s talk about the very first layer, where the input is 32×32×3. It’s an RGB image of 32×32. If the number of filters is 10, then the resulting depth, the resulting number of stacked channels in the output, will be 10.
Stride is the step size of the filter that you slide along the image. Oftentimes it’s just 1 or 3, and that directly reduces the spatial size, the width and the height, of the output image. And then there is a convenient thing that’s often done, which is padding the image on the outside with zeros, so that the input and the output have the same height and width. So this is a visualization of convolution. I encourage you to, maybe offline, think about what’s happening. It’s similar to the way human vision works, crudely so, if there are any experts in the audience. So the input here on the left is a collection of numbers: 0, 1, 2. And there are two filters, shown as W0 and W1. Shown in red are the different weights applied in those filters. And each of the filters has a certain depth; just like the input, a depth of 3. So there are three of them in each column. And so you slide that filter along the image, keeping the weights the same. This is the sharing of the weights. And so for your first filter, you pick the weights, and this is an optimization problem, in such a way that it fires, it gets excited, for useful features and doesn’t fire for features that aren’t useful. And then there’s a second filter that fires for other useful features and not for the rest. And it produces a signal on the output: a positive number, meaning there’s a strong feature in that region, and a negative number if there isn’t, but the filter is the same. This allows for a drastic reduction in the parameters, and so you can deal with inputs that are a thousand by thousand pixel image, for example, or video. There’s a really powerful concept there: the spatial sharing of weights. That means there’s a spatial invariance to the features you’re detecting. It allows you to learn from arbitrary images, so you don’t have to be concerned about pre-processing the images in some clever way; you just give it the raw image. There is another operation: pooling.
Pooling is a way to reduce the size of the layers. For example, in this case it’s max pooling: taking a collection of outputs, choosing the max one, and summarizing that collection of pixels such that the output of the pooling operation is much smaller than the input. The justification there is that you don’t need a high-resolution localization of which pixel is important in the image. You don’t need to know exactly which pixel is associated with the cat ear or a cat face, as long as you kind of know it’s around that part, and that reduces a lot of complexity in the operations. Yes, question. The question was: “When is there too much pooling? When do you stop pooling?” So pooling is a very crude operation, and one thing you need to know is that it doesn’t have any parameters that are learnable. So you can’t learn anything clever about pooling. In this case, max pooling, you’re just picking the largest number. So you’re reducing the resolution; you’re losing a lot of information. There’s an argument that you’re not losing that much information, as long as you’re not pooling the entire image into a single value, and you’re gaining training efficiency, you’re gaining on memory, you’re reducing the size of the network. So it’s definitely a thing that people debate, and it’s a parameter that you play with to see what works for you. Okay, so what does this thing look like as a whole, a convolutional neural network? The input is an image. There’s usually a convolutional layer, then a pooling operation, another convolutional layer, another pooling operation, and so on. At the very end, if the task is classification, after the stack of convolutional layers and pooling layers there are several fully connected layers. So you go from those spatial convolutional operations to fully connecting every single neuron in a layer to the following layer.
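The convolution and max-pooling operations described above can be sketched in a few lines of NumPy. This is a deliberately slow, loop-based illustration, not how a real framework implements it. The 32×32×3 input and the 10 filters come from the lecture; the 3×3 filter size, the random weights, the 2×2 pooling window, and all function names are assumptions made for the example. The output-size rule used is the standard one: (input + 2·padding − filter) / stride + 1, rounded down.

```python
import numpy as np

def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - filter_size) // stride + 1

def conv2d(volume, filters, biases, stride=1, padding=0):
    """volume: (H, W, D); filters: (N, F, F, D); returns (H_out, W_out, N)."""
    h, w, _ = volume.shape
    n, f, _, _ = filters.shape
    # Zero-pad height and width only, not depth.
    padded = np.pad(volume, ((padding, padding), (padding, padding), (0, 0)))
    h_out = conv_output_size(h, f, stride, padding)
    w_out = conv_output_size(w, f, stride, padding)
    out = np.zeros((h_out, w_out, n))
    for i in range(h_out):
        for j in range(w_out):
            patch = padded[i * stride:i * stride + f, j * stride:j * stride + f, :]
            for k in range(n):
                # The same filter weights are reused at every (i, j): weight sharing.
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return out

def max_pool(volume, size=2):
    """Non-overlapping max pooling over size x size windows, per channel."""
    h, w, d = volume.shape
    h_out, w_out = h // size, w // size
    trimmed = volume[:h_out * size, :w_out * size, :]
    return trimmed.reshape(h_out, size, w_out, size, d).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # stand-in for a 32x32 RGB image
filters = rng.normal(size=(10, 3, 3, 3))  # 10 filters; 3x3 spatial size is an assumption
biases = np.zeros(10)

features = conv2d(image, filters, biases, stride=1, padding=1)
pooled = max_pool(features, size=2)
print(features.shape, pooled.shape)

# Weight sharing in numbers: parameters of this conv layer versus a fully
# connected layer mapping the same input to the same output size (biases ignored).
conv_params = filters.size + biases.size
dense_params = image.size * features.size
print(conv_params, dense_params)
```

With padding of 1 and stride of 1, the 3×3 filters keep the 32×32 height and width, the depth becomes the number of filters, and the pooling step halves the height and width without adding any learnable parameters.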
You use those fully connected layers so that by the end you have a collection of neurons, each one associated with a particular class. So in what we looked at yesterday, the input is an image of a number, 0 through 9. The output here would be 10 neurons. So you boil down that image with a collection of convolutional layers, with 1 or 2 or 3 fully connected layers at the end that all lead to 10 neurons, and each of those neurons’ job is to get fired up when it sees a particular number, and for the other ones to produce a low probability. And so this kind of process is how you get the 95 percent accuracy on the CIFAR-10 problem. This here is the ImageNet data set that I mentioned. It’s how you take this image of a leopard, or of a container ship, and produce a probability that it is a container ship or a leopard. Also shown there are the outputs of the other nearest neurons in terms of their confidence. Now you can use the same exact operation by chopping off the fully connected layers at the end, and as opposed to mapping from an image to a prediction of what’s contained in the image, you map from the image to another image. And you can train that output image to be one that gets excited spatially, meaning it gives you a high value, close to one, for areas of the image that contain the object of interest and a low number for areas of the image that are unlikely to contain that object. And so from this you can go from, on the left, an original image of a woman on a horse to a segmented image, knowing where the woman is and where the horse is and where the background is. The same process can be done for detecting the object. So you can segment the scene into a bunch of candidates for interesting objects and then go through those candidates one by one and perform the same kind of classification as in the previous step, where the input is just an image and the output is a classification.
And through this process of hopping around an image, you can figure out exactly where is the best way to segment the cow out of the image. That’s called object detection. Okay, so how can these magical convolutional neural networks help us in driving? This is a video of the forward roadway from a data set that we’ll look at, that we’ve collected from a Tesla. But first let me look at driving, briefly: the general driving task from the human perspective. On average, an American driver in the United States drives 10,000 miles a year. A little more for rural, a little less for urban. There are about 30,000 fatal crashes and more than 32,000, sometimes as high as 38,000, fatalities a year. This includes car occupants, pedestrians, bicyclists and motorcycle riders. This may be a surprising fact, but in a class on self-driving cars we should remember it. So ignore the 59.9%, that’s “other”. The most popular cars in the United States are pickup trucks: Ford F-Series, Chevy Silverado, Ram. It’s an important point that we’re still married to wanting to be in control. And so one of the interesting cars that we look at, and the car that the data we provide to the class is collected from, is a Tesla. It’s the one that comes at the intersection of the Ford F-150 and the cute, little Google self-driving car on the right. It’s fast, it allows you to have a feeling of control, but it can also drive itself for hundreds of miles on the highway, if need be. It allows you to press a button and the car takes over. It’s a fascinating trade-off, of transferring control from the human to the car. It’s a transfer of trust, and it’s a chance for us to study the psychology of human beings as they relate to machines at more than 60 miles an hour. In case you’re not aware, a little summary of human beings: we’re distracted beings. We like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink. Texting: 169 billion texts were sent in the US every single month in 2014.
On average, our eyes spend 5 seconds off the road while texting. 5 seconds. That’s the opportunity for automation to step in. More than that, there’s what NHTSA refers to as the 4 D’s: drunk, drugged, distracted and drowsy. Each one of those is an opportunity for automation to step in. Drunk driving stands to benefit significantly from automation, perhaps. So let’s look at the miles. The data. There are 3 trillion, 3 million million, miles driven every year, and Tesla Autopilot, our case study for this class, has been driven by human beings in full Autopilot mode, so it’s driving by itself, for 300 million miles as of December 2016. The fatality rate for human-controlled vehicles is 1 in 90 million miles. It’s more than 30,000 fatalities a year, and currently under Tesla Autopilot there’s one fatality. There are a lot of ways you could tear that statistic apart, but it’s one to think about. Already, perhaps, automation results in safer driving. The thing is, we don’t understand automation, because we don’t have the data: we don’t have the data on the forward roadway video, we don’t have the data on the driver, and we just don’t have that many cars on the road today that drive themselves. So we need a lot of data. We’ll provide some of it to you in the class, and as part of our research at MIT we’re collecting huge amounts of it, of cars driving themselves, and collecting that data is how we get to understanding. So, talking about the data and what we’ll be training our algorithms on: here is a Tesla Model S and Model X. We’ve instrumented 17 of them and have collected over 5,000 hours and 70,000 miles. And I’ll talk about the cameras that we’ve put in them. We’re collecting video of the forward roadway. This is a highlight of a trip from Boston to Florida by one of the people driving a Tesla. What’s also shown in blue is the amount of time that Autopilot was engaged: currently 0 minutes, and then it grows and grows.
For prolonged periods of time, so hundreds of miles, people engage Autopilot. Out of 1.3 billion miles driven in Teslas, 300 million are on Autopilot. You do the math, whatever that is, roughly 25%. So we are collecting data of the forward roadway and of the driver. We have 2 cameras on the driver. What we’re providing to the class is epochs of time of the forward roadway only, for privacy considerations. The cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920, and we have some special lenses on top of it. Now what’s special about these webcams? Nothing that costs $70 can be that good, right? What’s special about them is that they do onboard compression, which allows you to collect huge amounts of data and use reasonably sized storage capacity to store that data and train your algorithms on it. So what, on the self-driving side, do we have to work with? How do we build a self-driving car? There are these sensors: radar, lidar, vision, audio, all looking outside, helping you detect the objects in the external environment, localize yourself, and so on. And there are the sensors facing inside: a visible light camera, audio again, and an infrared camera to help detect people. So we can decompose the self-driving car task into 4 steps. Localization, answering “where am I?”. Scene understanding, using the texture of the information of the scene around you to interpret the identity of the different objects in the scene and the semantic meaning of those objects, of their movement. There’s movement planning: once you’ve figured all that out, found all the pedestrians, found all the other cars, how do I navigate through this maze, this clutter of objects, in a safe and legal way? And there’s driver state: how do I use video of the driver, or other information, to detect their emotional state or their distraction level? Yes, question. (INAUDIBLE QUESTION) Yes, that’s the real-time figure from lidar.
Lidar is a sensor that provides you with the 3D point cloud of the external scene. So lidar is the technology used by most folks working with self-driving cars to give you a strong ground truth of the objects. It’s probably the best sensor we have for getting 3D information, the least noisy 3D information, about the external environment. Question. So Autopilot is always changing. One of the most amazing things about this vehicle is that the updates to Autopilot come in the form of software. So the amount of time it’s available changes; it has become more conservative with time. But this is one of the earlier versions, and the second line, in yellow, shows how often Autopilot was available but not turned on. So the total driving time was 10 hours, Autopilot was available for 7 hours and was engaged for an hour. This particular person is a responsible driver, or rather a more cautious driver, because what you see is it’s raining, Autopilot is still available, but- (INAUDIBLE QUESTION) The comment was that you shouldn’t trust that one-fatality number as an indication of safety, because the drivers elect to only engage the system when it’s safe to do so. It’s totally open; there are a lot bigger arguments against that number than just that one. The question is whether that’s a bad thing. So maybe we can trust human beings to engage. You know, despite the poorly filmed YouTube videos, despite the hype in the media, you’re still a human being riding at 60 miles an hour in a metal box with your life on the line. You won’t engage the system unless you know it’s completely safe, unless you’ve built up a relationship with it. It’s not all the stuff you see where a person gets in the back of a Tesla and starts sleeping, or is playing chess, or whatever.
That’s all for YouTube. The reality is when it’s just you in the car, it’s still your life on the line, and so you’re going to do the responsible thing, unless perhaps you’re a teenager and so on, but that never changes no matter what you’re in. Question. (INAUDIBLE QUESTION) The question was: “What do you need to see or sense about the external environment to be able to successfully drive? Do you need lane markings? Do you need other- what are the landmarks based on which you do the localization and navigation?” And that depends on the sensors. So the Google self-driving car in sunny California depends on lidar to map the environment in a high-resolution way, in order to be able to localize itself based on lidar. And lidar, now I don’t know the details of exactly where lidar fails, but it’s not good with rain, it’s not good with snow, it’s not good when the environment is changing. So what snow does is it changes the visual appearance, the reflective texture of the surfaces around. Us human beings are still able to figure stuff out, but a car that’s relying heavily on lidar won’t be able to localize itself using the landmarks it has previously detected, because they look different now with the snow. Computer vision can help us with lanes or with following a car. The two things we rely on in a lane are following the car in front of you and staying between the two lane markings. That’s the nice thing about our roadways: they’re designed for human eyes. So you can use computer vision for lanes and for cars in front, to follow them. And there is radar. That’s a crude but reliable source of distance information that allows you to not collide with metal objects. So all that together, depending on what you want to rely on more, gives you a lot of information. The question is, when the messy complexity of real life occurs, how reliable it will be in the urban environment and so on. So localization. How can deep learning help? So first, just a quick summary of visual odometry.
It’s using a monocular or stereo input of video images to determine your orientation in the world. The orientation, in this case, of a vehicle in the frame of the world, and all you have to work with is a video of the forward roadway, and with stereo you get a little extra information about how far away different objects are. And so this is where one of our speakers on Friday will talk about his expertise: SLAM, Simultaneous Localization and Mapping. This is a very well-studied and understood problem of detecting unique features in the external scene and localizing yourself based on the trajectory of those unique features. When the number of features is high enough it becomes an optimization problem. You know this particular lane moved a little bit from frame to frame; you can track that information and fuse everything together in order to be able to estimate your trajectory through three-dimensional space. You also have other sensors to help you out. You have GPS, which is pretty accurate, not perfect but pretty accurate. It’s another signal to help you localize yourself. You also have the IMU. The accelerometer tells you your acceleration, and from the gyroscope and the accelerometer you have the six-degree-of-freedom movement information about how the moving object, the car, is navigating through space. So you can do that using the old school way of optimization, given a unique set of features, like SIFT features. With stereo input, that step involves undistorting and rectifying the images. You have two images, and from the two images you compute the depth map: for every single pixel you compute the best estimate of the depth of that pixel, its three-dimensional position relative to the camera. That’s where you compute the disparity map, that’s what that’s called, from which you get the distance. Then you detect unique, interesting features in the scene. SIFT is a popular one.
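As a rough sketch of the disparity-to-distance step just described: for a rectified stereo pair, the standard pinhole relation is depth = focal length × baseline / disparity. The function name and the camera numbers below are made up for illustration and are not taken from the lecture.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters of one pixel in a rectified stereo pair.

    Uses the pinhole relation Z = f * B / d, where d is the horizontal
    shift (in pixels) of the same point between the two images.
    """
    if disparity_px <= 0:
        # Zero disparity means the point is effectively at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700-pixel focal length, cameras 0.5 m apart.
disparity_row = [70.0, 35.0, 7.0, 0.0]
depth_row = [depth_from_disparity(d, 700.0, 0.5) for d in disparity_row]
print(depth_row)
```

Nearby points shift a lot between the two images (large disparity, small depth); distant points barely shift at all, which is why stereo depth gets noisy at long range.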
SIFT is a popular algorithm for detecting unique features, and you track those features over time. And that tracking is what allows you, through vision alone, to get information about your trajectory through three-dimensional space. You estimate that trajectory. There are a lot of assumptions, assumptions that bodies are rigid. So if a large object passes right in front of you, you have to figure out what that was. You have to figure out which objects in the scene are mobile and which are stationary. Or you can cheat, as we’ll talk about, and do it using neural networks end-to-end. Now what does end-to-end mean? This will come up a bunch of times throughout this class and today. I refer to end-to-end as cheating because it takes away a lot of the hard work of hand-engineering features. You take the raw input of whatever sensors you have. In this case, it’s taking stereo input from stereo vision cameras, so a sequence of image pairs coming from a stereo vision camera, and the output is an estimate of your trajectory through space. So it’s supposed to be doing the hard work of SLAM, of detecting unique features, of localizing yourself, of tracking those features and figuring out what your trajectory is. You simply train the network. With some Ground Truth that you have from a more accurate sensor like lidar, you train it on a set of inputs, the stereo vision inputs, and the output is the trajectory through space. You have separate convolutional neural networks for the velocity and for the orientation. And this works pretty well. Unfortunately, not quite well enough, and John Leonard will talk about that. SLAM is one of the places where deep learning has not been able to outperform the previous approaches. Where deep learning really helps is the scene understanding part. It’s interpreting the objects in the scene. It’s detecting the various parts of the scene, segmenting them, and with optical flow determining their movement.
So previous approaches for detecting objects like the traffic signal, the classification and detection that we have the TensorFlow tutorial for, were to use Haar-like features or other types of features that are hand-engineered from the images. Now we can use convolutional neural networks to replace the extraction of those features. And there’s a TensorFlow implementation of SegNet, which is taking the exact same kind of neural network that I talked about. It’s the same thing. The beauty is you just apply similar types of networks to different problems and, depending on the complexity of the problem, can get quite amazing performance. In this case, we have a fully convolutional network, meaning the input is an image, a single monocular image, and the output is also an image. The output is a segmented image where the colors indicate your best pixel-by-pixel estimate of what object is in that part. This is not using any spatial information, it’s not using any temporal information. So it’s processing every single frame separately, and it’s able to separate the road from the trees, from the pedestrians, other cars, and so on. This is intended to lie on top of a radar / lidar type of technology, or stereo vision, that’s giving you the three-dimensional information about the scene. You’re, sort of, painting that scene with the identity of the objects that are in it, your best estimate of it. Something I’ll talk about tomorrow is recurrent neural networks. We can use recurrent neural networks, which work with temporal data, to process video and also to process audio. In this case, what’s shown on the bottom is a spectrogram of audio for a wet road and a dry road. You can look at that spectrogram as an image and process it in a temporal way using recurrent neural networks. Just slide a window across it and keep feeding it to a network. And it does incredibly well on the simple tasks, certainly on dry road versus wet road.
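The "slide it across and keep feeding it to a network" idea can be sketched in a few lines. This is only an illustration of the windowing, with a toy spectrogram; the window width, step size, and function name are assumptions, and the network that would consume each window is left out.

```python
def sliding_windows(spectrogram, window, step):
    """Yield fixed-width slices of a spectrogram along the time axis.

    spectrogram is a list of time frames; each frame is a list of
    per-frequency magnitudes.
    """
    for start in range(0, len(spectrogram) - window + 1, step):
        yield spectrogram[start:start + window]


# Toy spectrogram: 6 time frames, 3 frequency bins each.
toy = [[t + 0.1 * f for f in range(3)] for t in range(6)]
windows = list(sliding_windows(toy, window=4, step=1))
print(len(windows), "windows of", len(windows[0]), "frames each")
```

Each window would be handed to the network in order, so the network sees the audio as a sequence of overlapping image-like patches.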
This is a subtle but very important task, and there are many like it: to know the road, the texture, the quality, the characteristics of the road, wetness being a critical one. When it’s not raining but the road is still wet, that information is very important. Okay, so for movement planning, the same kind of approach. On the right is work from one of our other speakers, Sertac Karaman. The same approach we’re using to solve traffic through friendly competition is the same one that we can use for what Chris Gerdes does with his race cars, for planning trajectories in high speed movement along complex curves. So we can solve that problem using optimization, solve the control problem using optimization, or we can use reinforcement learning, by running tens of millions, hundreds of millions of times through that simulation of taking that curve and learning which trajectory optimizes both the speed at which you take the turn and the safety of the vehicle. Exactly the same thing that you’re using for traffic. And for driver state, this is what we’ll talk about next week. It’s all the fun face stuff: eyes, face, emotion. This is with video of the driver, video of the driver’s body, video of the driver’s face. On the left is one of the TAs in his younger days. Still looks the same. There he is. So in that particular case, you’re doing one of the easier problems, which is detecting where the head and the eyes are positioned, the head and eye pose. From that you determine what’s called the gaze of the driver, where the driver’s looking, the glance. And so, we’ll talk about these problems. From the left to the right: on the left in green are the easier problems; in red are the harder ones from the computer vision aspect. So on the left is body pose, head pose. The larger the object, the easier it is to detect, and the easier its orientation is to detect. And then there is pupil diameter.
Detecting the pupil, the characteristics, the position, the size of the pupil. And there are microsaccades, things that happen at one millisecond frequency, the tremors of the eye. All important information to determine the state of the driver. Some are possible with computer vision, some are not. This is something that we’ll talk about, I think, on Thursday: the detection of where the driver’s looking. So, this is a bunch of the cameras that we have in the Tesla. This is Dan driving a Tesla, and we’re detecting exactly which one of six regions he is looking at. We’ve converted it into a classification problem of left, right, rear view mirror, instrument cluster, center stack, or forward roadway. So we have to determine, out of those six categories, which direction the driver is looking in. This is important for driving. We don’t care about the exact X, Y, Z position of where the driver is looking. We care whether they’re looking at the road or not. Are they looking at their cell phone in their lap or are they looking at the forward roadway? And we’ll be able to answer that pretty effectively using convolutional neural networks. You can also look at emotion using CNNs, again converting emotion, the complex world of emotion, into a binary problem of frustrated versus satisfied. This is video of drivers interacting with a voice navigation system. If you’ve ever used one, you know that may be a source of frustration for folks. And so this is self-reported. This is one of the hard parts of driver emotion. If you’re in what’s called “Affective Computing,” the field of studying emotion from the computational side, you know that the annotation side of emotion is a really challenging one. So getting the Ground Truth: well okay, this guy’s smiling, so can I label that as happy? Or he’s frowning, does that mean he’s sad? Most affective computing folks do just that.
In this case we use self-report: we ask people how frustrated they were on a scale of 1 to 10. Dan, up top, reported a “1” for not frustrated, he was satisfied with the interaction, and the other driver reported a “9,” he was very frustrated with the interaction. And what you notice is there is a very cold, stoic look on Dan’s face, which is an indication of happiness. And in the case of frustration, the driver is smiling. So this is sort of a good reminder that we can’t trust our own human instincts. It’s an engineering feature. Engineering the ground truth. We have to trust the data, trust the Ground Truth that we believe is the closest reflection of the actual semantics of what’s going on in the scene. Okay, so end-to-end driving. Getting to the project and the tutorial. So if driving is like a conversation, and thank you to someone for clarifying that this is the Arch of Triumph in Paris in this video. If driving is like a natural language conversation, then we can think of end-to-end driving as skipping the entire Turing Test components and treating it as end-to-end natural language generation. So what we do is we take as input the external sensors, and as output the control of the vehicle. And the magic happens in the middle. We replace that entire step with a neural network. The TAs told me not to include this image because it’s the cheesiest they’ve ever seen. I apologize. Thank you, thank you. I regret nothing. So this is to show our path to self-driving cars, but it’s still there to explain a point: that we have a large data set of Ground Truth. If we were to formulate the driving task as simply taking external images and producing steering commands, acceleration and braking commands, then we have a lot of Ground Truth.
We have a large number of drivers on the road every day driving and, therefore, collecting our Ground Truth for us, because they’re an interested party in producing the steering commands that keep them alive and, therefore, if we were to record that data it becomes Ground Truth. So if it’s possible to learn this, what we can do is collect data from the manually controlled vehicles and use that data to train an algorithm to control a self-driving vehicle. Okay, so one of the first folks who did this is Nvidia, where they actually train on an external image, the image of the forward roadway, with a neural network, a convolutional network, a simple vanilla convolutional neural network that I’ll briefly outline: take an image in, produce a steering command out. And they’re able to successfully, to some degree, learn to navigate basic turns, curves, and even stop or make sharp turns at a T-intersection. So this network is simple. There is input on the bottom, output up top. The input is a 66×200 pixel image, RGB. Shown on the left is the raw input, and then you crop it a little bit and resize it down to 66×200. That’s what we have in the code as well, in the two versions of the code we’ll provide to you, both the one that runs in the browser and the one in TensorFlow. It has a few layers. A few convolutional layers, a few fully connected layers, and an output. This is a regression network. It’s producing not a classification of cat versus dog, it’s producing a steering command. How do I turn the steering wheel? That’s it. The rest is magic, and we train it on human input. What we have here as a project is an implementation of the system in ConvNetJS that runs in your browser. This is the tutorial to follow and the project to take on. So unlike the DeepTraffic game, this is reality. This is real input from real vehicles. So you can go to this link. The demo went wonderfully yesterday so let’s see, maybe two for two.
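To make the "regression, not classification" point above concrete, here is a minimal sketch of how a predicted steering angle could be scored against the recorded one. Mean squared error is assumed here because it is a common choice for regression; the lecture does not say which loss the provided code uses, and the angle values are invented.

```python
def mean_squared_error(predicted_deg, recorded_deg):
    """Average squared difference between predicted and recorded angles."""
    n = len(predicted_deg)
    return sum((p - r) ** 2 for p, r in zip(predicted_deg, recorded_deg)) / n


# Invented steering angles in degrees for four frames.
recorded = [0.0, 2.0, -4.0, 1.0]    # what the human driver did
predicted = [1.0, 2.0, -2.0, 0.0]   # what the network output
loss = mean_squared_error(predicted, recorded)
print(loss)
```

A classifier would be scored on whether it picked the right label; here the score depends on how far off each predicted angle is, so being wrong by 2 degrees costs more than being wrong by 1.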
There’s the tutorial, and then the actual game, the actual simulation, is on DeepTesla.JS, I apologize. Everyone is going there now, aren’t they? Does it work on a phone? It does, great. Again, a similar structure: up top is the visualization of the loss function as the network is learning, and it’s always training. Next is the input for the layout of the network. There’s the specification of the input, 200×66. There’s a convolutional layer. There’s a pooling layer, and the output is a regression layer. A single neuron. This is a tiny version, DeepTiny, right? It’s a tiny version of the Nvidia architecture, and then you can visualize the operation of this network on real video. The actual wheel value that’s produced by the driver, or by the autopilot system, is in blue and the output of the network is in white. And what’s indicated by green is the cropping of the image that is then resized to produce the 66×200 input to the network. So once again, amazingly, this is running in your browser, training on real world video. So you can get in your car today, record video, input it, and maybe teach a neural network to drive like you. We have the code in ConvNetJS and TensorFlow to do that, and the tutorial. Well, let me briefly describe some of the work here. So the input to the network is a single image. This is for DeepTesla.JS: a single image, and the output is a steering wheel value between -20 and 20. That’s in degrees. We record, like I said, thousands of hours, but we provide publicly 10 video clips of highway driving from a Tesla. Half are driven by autopilot, half are driven by a human. The wheel value is extracted from a perfectly synchronized CAN bus. We are collecting all of the messages from CAN, which contain the steering wheel value, and that’s synchronized to the video. We crop, we extract the window, the green one I mentioned, and then provide that as input to the network.
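The crop-then-resize step can be sketched in plain Python. The frame size, the position of the crop window, and the nearest-neighbor resizing are all assumptions for illustration; only the 66×200 target size comes from the lecture, and the provided code may do this differently.

```python
def crop(image, top, left, height, width):
    """Return the height x width window whose corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image, out_height, out_width):
    """Resize by picking the nearest source pixel for each output pixel."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


# Hypothetical 240x400 frame; each "pixel" stores its own (row, col)
# so it is easy to see where the output pixels came from.
frame = [[(r, c) for c in range(400)] for r in range(240)]

window = crop(frame, top=100, left=0, height=132, width=400)
net_input = resize_nearest(window, out_height=66, out_width=200)
print(len(net_input), len(net_input[0]))
```

Cropping first throws away sky and hood, so the pixels that survive the resize are the ones showing the lane.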
So this is a slight difference from DeepTraffic, with the red car weaving through traffic, because there is the messy reality of real world lighting conditions. And your task for the most part, in this simple steering task, is to stay inside the lane, inside the lane markings. In an end-to-end way, learn to do just that. So ConvNetJS is a javascript implementation of CNNs, of convolutional neural networks. It supports really arbitrary networks. I mean, all neural networks are simple, but because it runs in javascript it’s not utilizing the GPU. The larger the network, the more it’s going to be weighed down computationally. Now unlike DeepTraffic, this isn’t a competition, but if you are a student registered for the course you still do have to submit the code, you still have to submit your own car as part of the class. Question. So the question was about the amount of data that’s needed. Are there general rules of thumb for the amount of data needed for a particular task, in driving for example? It’s a good question. Like I said, neural networks are good memorizers, so you generally have to have every case that you’re interested in represented in the training set, as much as possible. So that means, in general, if you want to classify the difference between cats and dogs, you want to have at least a thousand cats and a thousand dogs, and then they do really well. The problem with driving is twofold. One is that most of the time driving looks the same, and the stuff you really care about is when driving looks different. It’s all the edge cases. What we’re not good at with neural networks is generalizing from the common case to the edge cases, to the outliers. So just because you can stay on the highway for thousands of hours successfully doesn’t mean you can avoid a crash when somebody runs in front of you on the road. And the other part with driving is that the accuracy you have to achieve is really high.
So for cat versus dog, no life depends on your error. For your ability to steer a car inside of the lane, you’d better be very close to 100% accurate. There’s a box for designing the network. There’s a visualization of the metrics measuring the performance of the network as it trains. There is a layer visualization of what features the network is extracting at every convolutional layer and every fully connected layer. There is the ability to restart the training and to visualize the network performing on real video. There is the input layer, the convolutional layers. In the video visualization, an interesting tidbit on the bottom right is a barcode that Will has ingeniously designed. How do I clearly explain why this is so cool? It’s a way to synchronize multiple streams of data together through video. Those who have worked with multi-modal data, where there are several streams of data, know it’s very easy for them to become unsynchronized, especially when a big component of training a neural network is shuffling the data. So you have to shuffle the data in clever ways so you’re not overfitting to any one little aspect of the video, and yet keep the data perfectly synchronized. So what he did, instead of doing the hard work of connecting the steering wheel and the video, is actually putting the steering value on top of the video as a barcode. The final result is you can watch the network operate, and over time it learns more and more to steer correctly. I’ll fly through this a little bit in the interest of time, just kind of summarize some of the things that you can play with in terms of tutorials, and let you guys go. This is the same kind of process, end-to-end driving. So we have code available on GitHub. It was just put up on my GitHub, under DeepTesla. It takes in a single video or an arbitrary number of videos, trains on them, and produces a visualization that compares the actual steering wheel and the predicted steering wheel.
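One way to picture the barcode idea above is to pack the steering value into a fixed number of black/white cells that travel with each frame, so shuffling frames can never separate a frame from its label. The lecture does not describe the actual barcode layout, so the 12-bit, tenth-of-a-degree encoding below is entirely hypothetical.

```python
BITS = 12  # hypothetical code width: 12 black/white cells per frame


def angle_to_bits(angle_deg, scale=10, bits=BITS):
    """Encode a steering angle as a fixed-width list of 0/1 cells.

    The angle is stored in tenths of a degree, offset so that the
    stored integer is never negative.
    """
    offset = 1 << (bits - 1)
    value = round(angle_deg * scale) + offset
    if not 0 <= value < (1 << bits):
        raise ValueError("angle out of range for this code width")
    return [(value >> i) & 1 for i in reversed(range(bits))]


def bits_to_angle(cells, scale=10):
    """Decode the cells written by angle_to_bits back into degrees."""
    value = 0
    for cell in cells:
        value = (value << 1) | cell
    return (value - (1 << (len(cells) - 1))) / scale


code = angle_to_bits(-3.7)
decoded = bits_to_angle(code)
print(code, decoded)
```

In a real pipeline the cells would be drawn as pixel blocks in a corner of the frame and read back by thresholding those pixels before training.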
The steering wheel lights up green when the network agrees with the human driver or the autopilot system, and lights up red when it disagrees. Hopefully not too often. Again, this is some of the details of how exactly that’s done in TensorFlow. This is a vanilla convolutional neural network. You specify a bunch of layers, convolutional layers, a fully connected layer, and train the model, so you iterate over the batches of images. Then you run the model over a test set of images and get this result. We also have a tutorial in an IPython Notebook, and that tutorial is up as well. This is perhaps the best way to get started with convolutional neural networks in terms of our class. It’s looking at the simplest image classification problem, traffic light classification. So we have these images of traffic lights. We did the hard work of detecting them for you. So now you have to build the convolutional network that figures out the concept of color and gets excited when it sees red, yellow or green. If anyone has questions, I’ll welcome those. You can stay after class if you have any concerns with Docker, with TensorFlow, with how to win DeepTraffic. Just stay after class or come by Friday, 5 to 7. See you guys tomorrow.

29 thoughts on “MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task”

  1. The point I can't understand intuitively is why every "filter" in a convolution layer is a single neuron. Why is it not a network (of several neurons) in the general case?

  2. 1_python_perceptron.ipynb has an xrange() method call. Stackoverflow says this method isn't available in Python 3 which is necessary for tensorflow. Use range() instead.

  3. We live in incredible times that these lectures are available online. This information is crucial for my work and I have no idea how I'd be able to educate myself if I didn't have access to these lectures. I really appreciate your work.

  4. How can a conv. neural network perform better at classification than humans, if humans label the images? Or is the labeling done in a different way?

  5. Are there good examples that explain step by step Image Segmentation and Object Detection. Its easy to find examples of image classification.

  6. Thank you for the excellent lectures! They are fantastic! And are the guest talks available on youtube (can't find links on the course page) ?

  7. If I remember correctly, there is also a rich kid got killed in Tesla's autopilot mode while driving in China. So that's 2 fatalities per 3million miles

  8. A short question: when training a CNN with backpropagation, how do the parameters change when passing through the pooling layer?

  9. Horrible lectures with no real content from which one can benefit. Feels like a summary presentation for a set of journalists who don't want to get technical!

  10. This guy seems way more like an engineer than a com sci background. That's a compliment (and a poke at CS majors) 🙂
