Teaching convolutional neural networks to give me friends

Hey guys, so this is—no joke—a list of all my real-life friends, but the problem is just that they’re kind of ugly. [ Illuminati Music ] Since I have such ugly friends, and since computers are so darn powerful, why don’t I just have my computer generate prettier friends? That means the goal is to have my computer automatically generate a wide range of human face images without any human work required. One option is to load up a video game like The Sims or Nintendo Miis and randomize the settings of their avatar creators. But that’s not “ma-chine learn-y” enough, and I know, I just know, that machine learning is what you guys want. So let’s take a look at convolutional neural networks. Some of you viewers will already know what this is, but I want my videos to be as beginner-friendly as possible, so I’ll assume you know nothing. Say you have an image, meaning a two-dimensional array of pixels that are all either black or white. You want to find out where all the donut shapes are. How would you go about doing that? Well, let’s make a donut filter: it will specify the requirements for something to be a donut. Then we’ll center our filter around the upper-left pixel and ask, “Is every condition of the filter satisfied?” No? Then it’s NOT a donut. We can move the filter over each pixel, asking the same question. Most pixels, like this one, will say “no donut,” since not every condition is satisfied… …but a select few will say “yes donut.” After that’s all said and done, we now have markers at the center of where all the donuts are. Voila! Our goal is complete! However, most images aren’t as simple as black or white. For most images, each pixel’s brightness exists on a spectrum from 0 to 1, so it could be 0.5 or 0.1 (ignore color for now). So when we’re searching for donuts, we can’t use a filter so simplistic that it only asks yes-or-no questions.
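That slide-the-filter-and-check routine can be sketched in a few lines of Python. Everything below (the tiny 5x5 image, the 3x3 donut pattern, the function name) is invented for illustration; only the mechanics match the description above:

```python
# A binary "donut filter": 1 means the pixel must be white, 0 means it
# must be black. The donut here is a white ring with a black hole inside.
FILTER = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]

IMAGE = [  # 0 = black, 1 = white
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

def find_donuts(image, filt):
    """Center the 3x3 filter on every pixel (edges skipped for simplicity)
    and collect the centers where EVERY condition is satisfied."""
    hits = []
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            ok = all(
                image[y + dy - 1][x + dx - 1] == filt[dy][dx]
                for dy in range(3)
                for dx in range(3)
            )
            if ok:
                hits.append((y, x))
    return hits

print(find_donuts(IMAGE, FILTER))  # one donut, centered at (2, 2)
```

Every pixel either passes all nine conditions or it doesn't; there is no middle ground, which is exactly the limitation the video points out next.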
Rather than asking “is this a donut, or is this not a donut?”, our new, improved filter should instead ask, “How donutty is this pixel?” on a continuous scale from negative infinity to positive infinity, with higher values meaning “this is more like a donut” and lower values meaning “this is less like a donut.” How can we engineer a filter that does this? Well, let’s imagine the filter is a set of multipliers, like this. Some multipliers are higher than others, some are positive and some are negative, but let’s see what they do. We can center the filter around a single pixel, multiply the underlying image pixel values by those multipliers, add up all those products, and we get an overall score of how “donutty” that pixel is. You can think of the positive multipliers as saying, “If you want to be considered donutty, you’d better have a high value for this pixel,” and the negative multipliers as saying, “Ooh, donutty pixels don’t typically have high values here.” In the end, we can apply this continuous “donutty” filter to every pixel of the image. So this pixel, with a score of 3.64, is the most donutty, which makes sense because it’s a dark pixel surrounded by quite a few light pixels. A few other contenders get pretty close. Now this pixel has the worst donut score, which kind of makes sense because it looks like an inverted donut. By the way, if you’re curious, there are quite a few methods to handle the literal edge cases: you can cut them off, fill the exterior with zeros, extend the borders to infinity, or just loop the image. For our example, we’ll fill the exterior with zeros because it’s the easiest to understand. Also, each application of a filter is called a convolution, which gives the convolutional neural network its name. But hold on! The donutty score of each pixel is a scalar, meaning a number on a one-dimensional number line. And guess what? The original brightness of each pixel was also a scalar.
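Here’s a minimal sketch of that multiply-add-and-score step, using the zero-padding rule from above. The kernel values and the tiny 2x2 test image are made up; only the procedure follows the video:

```python
def convolve(image, kernel):
    """Score every pixel: center the kernel on it, multiply each
    underlying pixel by its multiplier, and add up the products.
    Pixels past the border are treated as zeros (zero padding)."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2  # kernel radius (kernel is square, odd-sized)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # outside counts as 0
                        total += image[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = total
    return out

# A "donutty" kernel: a negative multiplier in the middle (the hole
# should be dark), positive multipliers around it (the ring should
# be bright). These exact numbers are arbitrary.
donut_kernel = [
    [0.5,  0.5, 0.5],
    [0.5, -1.0, 0.5],
    [0.5,  0.5, 0.5],
]
scores = convolve([[0.0, 1.0], [1.0, 0.1]], donut_kernel)
```

Note that the output has the same shape as the input, which is the property the next paragraph leans on: a filter turns a grayscale image into another grayscale image.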
What does this mean? It means that applying this continuous donutty filter converts data of one type into data of the same type. In other words, it converts a grayscale image into another grayscale image. So if we wanted to, we could apply this filter to the image once, and then again, and again, and again. Forever. To be honest, that’s actually not very interesting. What is interesting is if you apply a different filter in the second layer, and a different filter in the third layer, and so on. And also: if you apply multiple filters to each image, you create this giant web of filters, each looking for different things. Since each filter can be different, you don’t have to be searching for just donuts. You can have one filter that’s good at finding vertical lines, and maybe another that’s good at finding horizontal lines. At the second layer, you can combine the two to create a filter that finds cross shapes. Think of it this way: perhaps the first layer can find edges, then the second layer takes those edges as input. That means the second layer can find edges of edges, meaning corners. The third layer can find edges of edges of edges. Here, interpretation gets a little fuzzy, because we humans don’t really know how a computer effectively uses its filters. But I’d guess that edges of edges of edges could be used to detect arrangements of corners; in other words, simple shapes, like equilateral triangles. Perhaps further layers could see arrangements of triangles, and layers beyond that can detect whole objects: from pencils, to apples, to chihuahuas, to humans. With more layers, and more convolutions per layer, you can find more and more advanced features in your original image. Got three or four filters that can find ridges of darkness at just the right angles? Boom. You’ve got a nose detector! Use a few other filters to find pairs of dark ellipses that are twice as far apart as their width, and there’s your eye detector.
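Because each filter maps a grayscale image to another grayscale image, stacking layers is just feeding outputs back in as inputs. A rough sketch: the two layer-one kernels are the classic vertical/horizontal edge detectors, but the layer-two "combiner" filter and the test image are arbitrary placeholders:

```python
def convolve(img, k):
    """Tiny 3x3 zero-padded convolution (one filter application)."""
    h, w = len(img), len(img[0])
    def px(y, x):  # pixels outside the image count as zero
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0.0
    return [[sum(px(y + dy, x + dx) * k[dy + 1][dx + 1]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]

# Layer 1: two different filters, each producing its own feature map.
vertical   = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

image = [[0.0, 1.0, 0.0],
         [1.0, 1.0, 1.0],
         [0.0, 1.0, 0.0]]

edges_v = convolve(image, vertical)
edges_h = convolve(image, horizontal)

# Layer 2: a feature map is itself a grayscale image, so we can feed it
# straight back in. "Edges of edges" is where corner detectors come from;
# this particular combiner kernel is just a made-up example.
combiner = [[0.1, 0.1, 0.1], [0.1, 0.2, 0.1], [0.1, 0.1, 0.1]]
corners_ish = convolve(edges_v, combiner)
```

A real network would run many filters per layer and learn all the kernel values rather than hand-picking them, but the composition mechanic is exactly this.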
Add in the rest of the body parts somewhere else, then combine them in a final convolution that makes sure they’re all in the right place, and you’ve got a web of convolutions that tells you—exactly—where there are human faces in the image. Hmm… doesn’t that look familiar…? Okay, it doesn’t tell you exactly where the human faces are. Since neural networks behave somewhat unpredictably, they’ll never achieve 100% accuracy, but they can get into the high 90s pretty easily now. Hmm… I brushed over this topic, but usually, interspersed throughout the webs of convolutions, you have points where you just downscale the image by a factor of two, and this is called pooling. If you downscale enough, you can slowly convert your image of thousands of pixels into an image of just one pixel, which can be light or dark, or anywhere in between. Essentially, this can be used as a marker to look at a whole image, not just one location, and answer, “Is there a human in this picture?” or “Was this image taken indoors or outdoors?” If the final pixel’s brightness is one, that means yes; if it’s at zero, that means no; and anything in between means maybe. I also bet you’re asking how to deal with colored images. Simple. Almost all photos have three color channels: red, green, and blue. So you can just interpret that as three different grayscale images overlaid on top of each other. That means you can set up your convolutional neural network to have three images in the earliest layer instead of one. Pretty simple, actually. Each color of RGB is called a color channel, and the convolutions in further layers are also called channels. More advanced CNNs can have 40, 60, or even 100 channels in a single layer, because that’s how many features they’re simultaneously trying to search for. So yes, this is a convolutional neural network.
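That downscale-by-two pooling step is easy to sketch as well. This version averages each 2x2 block into one pixel (many real CNNs use max pooling instead, which keeps the brightest pixel of each block); the image values are made up:

```python
def avg_pool2(img):
    """Downscale by a factor of two: average each 2x2 block of pixels
    into a single output pixel."""
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

img = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 1.0],
       [0.0, 0.0, 1.0, 1.0]]

once  = avg_pool2(img)   # 4x4 -> 2x2
final = avg_pool2(once)  # 2x2 -> 1x1: one whole-image "answer" pixel
```

Pool enough times and the whole image collapses to that single pixel, whose brightness can serve as the yes/maybe/no answer described above.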
It takes in an image of 𝘯 channels as input and outputs a scalar, or an 𝘯-dimensional vector if you’re looking for multiple things, or just whatever you want it to output. That’s great and all, but even if you were to program this whole structure perfectly, you still wouldn’t have a working convolutional neural network, because you’d have no idea what to set the filters to. I mean, the filters determine what the network is even searching for, so they’re pretty darn important. Maybe you could set them up manually, using your own common sense to figure out what elements each filter should specifically be designated for? That would be the hardest math puzzle of all time; please don’t do that. Instead, we want to use a ton of training data with labels of how we would want our network to respond to that data, and gradient descent, and calculus, and math. BUT UH-OH!!! This video is already getting long, so I guess that’ll have to wait for part two. Besides, you guys are getting impatient and probably just want to see what my new, prettier friends look like. Okay, I can introduce you to them. At the beginning, all the filters I mentioned earlier are set to random values, so you’ll see nonsensical images, but then it’ll train to get better. The training data is 15,000 images of celebrities from FamousBirthdays[dot]com; I’ll explain why I chose that source in part two. The machine learning program I’m using is called “HyperGAN” by 255bits, which is Martin and Michael, and I’ll also explain why I chose this in part two. Also, the timer at the top shows how long my computer has been training for, in hours:minutes:seconds format. { HH:MM:SS } Anyway, enough talking. Let’s go! Yep! Yep! These are my new friends, all right! So much prettier than my old, real-life friends. I am so excited to hang out with this beautiful new crowd. We can watch movies, go bowling, rip out my brain cells and replace them with neural networks, go shopping, eat dinner. It’ll just be a blast!
Let me answer some questions while an irrelevant time-lapse plays. “What was that music during the training time-lapse?” It’s “Skyline” by JujuMas, who you should really go subscribe to. “What happens when you train it for more than 7 hours?” Not much. I actually trained it for a day, and the results didn’t get significantly better. Which brings me to… “Shouldn’t you remove the non-photographs from the training data?” Yeah, I should, but it takes too much work to sift through 15,000 images, and if the non-photographs are a small enough proportion, they shouldn’t affect the end result much anyway. “What was your actual procedure for setting this up?” Again, I’ll talk about the details in part two. Before I end this video, I want to point out that many other, actually smart, researchers have gotten much better results than I have. For example, the HyperGAN GitHub page itself shows much larger, more realistic-looking generated faces. I mean, look. Can you even tell these aren’t real? And I keep seeing better and better results as time goes on, on the r/MachineLearning subreddit. That might lead you to ask, “Cary, why would you spend so long showing your own mediocre work when other people have literally done exactly the same thing as you, but ten times better?” And that’s a valid question. I’d like to think all my projects in the past were unique in some way, but this one really isn’t. But one, I want to make it more visible to more people, because I feel like not that many people read the academic papers, but a lot of people are on YouTube. And two, this whole journey has really been to prove to myself that the code used to generate these images can indeed work successfully on my computer alone. No more relying on what other people say the results could be; I want to see my computer reach those results myself.
Anyway, I got a juicy NVIDIA GTX 1080 GPU for this, so I want to make sure I can use it to its full potential. But don’t worry, more original stuff is coming in the future. Like this! What’s this image?! I’m so confused! This is unlike anything I’ve ever seen before~ Hm, I’d better subscribe to “carykh” to find out what all those interesting lines are! I can’t believe I stooped that low… Okay, end of the video, but I want to make a promise to all the people who’ve been requesting: I am going to make a ton of tutorial videos from here on out. For example, showing you how to program a neural net completely from scratch, assuming you know nothing, or how to replicate the results I got in my Baroque music video… It’s all coming, just be patient. Goodbye.

100 thoughts on “Teaching convolutional neural networks to give me friends”

  1. “And since computers are so powerful.” Computer is struggling to carry a 0.042 kilogram weight.

  2. You should make a robot to animate a whole video for you, so you don't have to keep animating. Also, where's part 2?

  3. I really like your video, and it gives me more inspiration about what I can do to learn AI. Please keep updating, so that people like me who are new to this area can get more information and joy. Thanks…

  4. When you showed the Mii’s at 0:32, the one in the middle with dark blue hat and top looked scarily similar to the Mii I use.

  5. I don't think it's possible to code a neural network in Scratch… Coding in Scratch is way easier than coding a website. (Without Wix or GoDaddy)
