Large Scale Machine Learning


[ MUSIC ] [ APPLAUSE ] BENGIO: Thank you. All right. Thank you for being here and
participating in this colloquium. So, I’ll tell you about some of the things
that are happening in deep learning, but I only have 30 minutes so I’ll be kind
of quickly going through some subjects and some challenges for scaling
up deep learning towards AI. Hopefully you’ll have chances to ask me some
questions during the panel that follows. One thing I want to mention
is I’m writing a book. It’s called Deep Learning, and you can
already download most of the chapters. These are draft versions of
the chapters from my web page. It’s going to be an MIT Press
book hopefully next year. So, what is deep learning and why
is everybody excited about it? First of all, deep learning is just
an approach to machine learning. And what’s particular about it, as Terry
was saying, it's inspired by brains. Inspired in the sense that we're trying to understand some of the computational and mathematical principles that could
explain the kind of intelligence based on learning that we see in brains. But from a computer science perspective, the idea is that these algorithms
learn representations. So, representation is a central concept
in deep learning, and, of course, the idea of learning representations is not new. It was part of the deal of
the original neural nets, like the Boltzmann machine and
the back prop from the ’80s. But what’s new here and what happened about
ten years ago is a breakthrough that allowed us to train deeper neural networks, meaning
that have multiple levels of representation. And why is that interesting? So already I mentioned that there
are some theoretical results showing that you can represent some complicated
functions that are the result of the many levels of compositions efficiently with these
deep networks, whereas you might — or in general, you won’t be able to
represent these kinds of functions with a shallow network that
doesn’t have enough levels. What does it mean to have more depth? It means that you’re able to
represent more abstract concepts, and these more abstract concepts allow
these machines to generalize better. So, that’s the essence of what’s going on here. All right. So, the breakthrough happened in
2006 where, for the first time, we were able to train these deeper networks
and we used unsupervised learning for that, but it took a few years before
these advances made their way to industry and to large scale applications. So, it started around 2010
with speech recognition. By 2012, if you had an Android
phone, like this one, well, you had neural nets doing
speech recognition in them. And now, of course, it’s everywhere. For speech, it’s changed the
field of speech recognition. Everything uses it, essentially. Then about two years later, 2012, there was
another breakthrough, using convolutional networks, which are a particular kind of deep network that had been around for a long time but that had been improved using some of the techniques we discovered in recent years. This really allowed us to make a big impact in the field of computer vision, and object recognition in particular. So, I'm sure Fei-Fei will say a few words later about that event and the role of the ImageNet dataset in this. But what's going on now is that neural nets
are going beyond their traditional realm of perception and people are exploring how
to use them for understanding language. Of course, we haven’t yet solved that problem. This is where a lot of the
action is now, and, of course, a lot of research and R&D continues in computer vision, now expanding, for example,
to video and many other areas. But I’m particularly interested in the
extension of this field in natural language. There are other areas. You’ve heard about reinforcement learning. There is a lot of action
there, robotics, control. So, many areas of AI are now more and
more seeing the potential gain coming from using these more abstract systems. So, today, I’m going to go through
three of the main challenges that I see for bringing deep learning, as
we know it today, closer to AI. One of them is computational. Of course, for a company
like IBM and other companies that build machines, this
is an important challenge. It’s an important challenge
because what we’ve observed is that the bigger the models we are able to train, given the amount of data we
currently have, the better they are. So, you know, we just keep
building bigger models and hopefully we’re going to continue improving. Now, that being said, I think it’s not going
to be enough so there are other challenges. One of them I mentioned has to
do with understanding language. But understanding language
actually requires something more. It requires a form of reasoning. So, people are starting to use these recurrent
nets you heard about, recurrent networks that can be very deep, in some sense,
when you consider time in order to combine different pieces of evidence,
in order to provide answers to questions and, essentially, to display
different forms of reasoning. So, I’ll say a few words about that challenge. And finally, maybe one of the most important
challenges that’s maybe more fundamental even is the unsupervised learning challenge. Up to now, all of the industrial applications of
deep learning have exploited supervised learning, where we have labeled the data: we've said, in that image there's a cat; in that image there's a desk; and so on. But there's a lot more data we could take advantage of that's unlabeled, and that's going to be important, because all of the information we need to build these AIs has to come from somewhere, and we need enough
data, and most of it is not going to be labeled. Right. So, as I mentioned,
and I guess as my colleague, Ilya Sutskever from Google
keeps saying, bigger is better. At least up to now, we haven’t
seen the limitations. I do believe that there are obstacles,
and bigger is not going to be enough. But clearly, there’s an easy path
forward with the current algorithms just by making our neural nets a
hundred times faster and bigger. So, why is that? Basically, what I see in many experiments
with neural nets right now is that they — I’m going to use some jargon here. They under fit, meaning that they’re not big
enough or we don’t train them long enough for them to exploit all of the
information that there is in the data. And so they’re not even able to
learn the data by heart, right, which is the thing we usually
want to avoid in machine learning. But that comes almost for free with these
networks, and so we just have to press on the pedal of more capacity and we’re
almost sure to get an improvement here. All right. To just illustrate graphically that we have
some room to approach the size of human brains, this picture was made up by my former student,
Ian Goodfellow, where we see the sizes of different organisms and neural nets over
the years. The DBN here was from 2006, the AlexNet is the breakthrough network of 2012 for computer vision, and GoogLeNet is maybe a couple of years old. So, we see that the current technology is
maybe between a bee and a frog in terms of the size of the networks, with about the same number of synapses per neuron. So, we've almost reached the kind of average number of synapses per neuron you see in natural brains, between a thousand and ten thousand. In terms of the number of neurons, we're several orders of magnitude away. So, I'm going to tell you a little bit about a
stream of research we’ve been pushing in my lab, which is more connected to the
computing challenge and potentially to hardware implementations: can we train neural nets that have very low precision? So, we had a first paper at ICLR. By the way, ICLR is the deep learning conference, and it happens every year now. Yann LeCun and I started it in 2013
and it’s been an amazing success that year and every year since then. We’re going to have a third version next May. And so we wanted to know how many
bits do you actually require. Of course, people have been asking
these kinds of questions for decades. But using the current state-of-the-art neural nets, we found 12 bits, and I can show you some pictures of how we got these numbers on different data sets, comparing different ways of representing numbers, with fixed point or dynamic fixed point. Also, depending on where you use those bits, you actually need fewer bits in the activations than in the weights; you need more precision in the weights. So, that was the first investigation.
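To make the fixed-point question concrete, here is a tiny illustrative helper (the 12-bit total and the particular split between integer and fractional bits are assumptions for the example, not the exact format studied in the paper):

```python
import numpy as np

def to_fixed_point(x, total_bits=12, frac_bits=8):
    """Round x onto a signed fixed-point grid with frac_bits fractional bits,
    clipped to the range representable with total_bits bits overall."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

print(to_fixed_point(np.array([0.12345, -1.5, 3.14159])))
# values snapped to multiples of 1/256, e.g. [0.125, -1.5, 3.140625]
```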
But then we thought: for the weights, that's the number of bits you actually need to keep the information that you are accumulating from many examples. But when you actually run your system, during training especially, maybe you don't need all those bits. Maybe you can get the same effect by introducing noise and randomly discretizing those weights to plus one or minus one. So, that's exactly what we did. The cute idea here is that we can replace a real number by a binary number that has the same expected value, by sampling those two values with probabilities such that the expected value is the correct one. And now, instead of having a real number to multiply, we have a bit to multiply, which is easy; it's basically just an addition. And why would we do that? Because we want to get rid of multiplications. Multiplications are what take up most of the surface area on chips for doing neural nets.
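To make that concrete, here is a minimal numpy sketch of expectation-preserving stochastic binarization (illustrative only, not the actual training code from the paper): a weight w in [-1, 1] is mapped to +1 with probability (1 + w) / 2 and to -1 otherwise, so the binary weight has expected value w.

```python
import numpy as np

def binarize_stochastic(w, rng):
    """Map each weight in [-1, 1] to +1 or -1 so that E[w_bin] = w."""
    p_plus = (np.clip(w, -1.0, 1.0) + 1.0) / 2.0      # P(w_bin = +1)
    return np.where(rng.random(w.shape) < p_plus, 1.0, -1.0)

w = np.array([0.3, -0.8, 0.0])
samples = np.stack([binarize_stochastic(w, np.random.default_rng(i)) for i in range(10_000)])
print(samples.mean(axis=0))   # close to [0.3, -0.8, 0.0]: the noise averages out
```

In work along these lines, a real-valued copy of the weights is typically kept for accumulating the small gradient updates; the binarized values are what enter the multiply-accumulates.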
So, we had a first try at this, and it's going to be presented at the next NIPS, in a few weeks in Montreal. It allows us to get rid of the multiplications in the feed-forward computation and in the backward computation where we compute gradients. But one multiplication remained: even if you discretize the weights, there is another multiplication at the end of the back prop, where you don't multiply weights; you multiply activations and gradients. So, if those two quantities are real-valued, you still need a regular multiplication. That part is going to be in the NIPS paper. But the new thing we did is to get rid of that last multiplication, the one we need for the update of the weights. The update looks like delta W proportional to (dC/dA) times H, where delta W is the change in the weights, dC/dA is the gradient that's propagated back, and H is the activations. It's some jargon, but anyway, we have to do this multiplication, and so the only thing we need to do is take one of those two numbers and replace it again by a stochastic quantity that does not require a multiplication. So, instead of binarizing it, we quantize it stochastically to a power of two: we keep the exponent and get rid of the mantissa. In other words, we represent it on a log scale. If you do that, you can map the activations to values that are just powers of two, and multiplication becomes just an addition of exponents. The trick of using powers of two is an old trick. The new trick is to do it stochastically, so that you actually get the right thing on average, and stochastic gradient descent works perfectly fine.
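Here is a minimal sketch of the stochastic power-of-two quantization idea (again illustrative, not the exact scheme used in the experiments): a value is rounded to one of its two neighbouring powers of two, with probabilities chosen so that the expected value is unchanged; multiplying by the result then amounts to adding exponents, i.e. a shift.

```python
import numpy as np

def quantize_pow2_stochastic(x, rng):
    """Round |x| to a neighbouring power of two, keeping the sign, with E[q] = x."""
    sign, mag = np.sign(x), np.abs(x)
    safe = np.where(mag > 0, mag, 1.0)      # avoid log2(0); zeros handled at the end
    lo = 2.0 ** np.floor(np.log2(safe))     # power of two just below |x|
    hi = 2.0 * lo                           # power of two just above |x|
    p_hi = (safe - lo) / (hi - lo)          # chosen so that the expectation equals |x|
    q = np.where(rng.random(x.shape) < p_hi, hi, lo)
    return np.where(mag > 0, sign * q, 0.0)

x = np.array([0.3, -5.0, 0.75])
samples = np.stack([quantize_pow2_stochastic(x, np.random.default_rng(i)) for i in range(20_000)])
print(samples.mean(axis=0))   # close to [0.3, -5.0, 0.75]
```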
And so we're running some experiments on a few data sets, showing that you get a bit of a slowdown because of the extra noise. The green and yellow curves here are where we use this trick, with binarized weights and stochastically quantized activations. And the good news is, well, it learns even better, actually, because this noise acts as a regularizer. So, this is pretty good news. Now, why is this interesting? It's interesting for two reasons. One is that it could be useful for hardware implementations. The other reason is that it connects with what the brain does with spikes. You can think of it this way: if I go back here, when you replace activations by stochastic binary values that have the
right expected value, you’re introducing noise. But you’re actually not changing that
much the computation of the gradient. And so it would be reasonable
for brains to use the same trick if they could save on the hardware side. Okay. So now let me move on to my second
challenge, which has to do with language and, in particular, language understanding. There’s a lot of work to do in this direction, but the progress in the last
few years is pretty impressive. Actually, I was part of the beginning
of that process of extending the realm of application of neural networks to language. So, in 2000, we had a NIPS paper where
we introduced the idea of learning to represent probability
distributions over sequences of words. In other words, being able to generate
sequences of words that look like English, by decomposing the problem into two parts. That's a central element that you find in neural nets and especially in deep learning: think of the problem not as going directly from inputs to outputs, but break it into two parts. One is the representation part: learning to represent words, here by mapping each word to a fixed-size, real-valued vector. The other is taking those representations and mapping them to the answers you care about; here, that's predicting the next word.
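To make that two-part decomposition concrete, here is a tiny numpy sketch of the forward pass of such a model (all sizes and parameters below are made up for illustration; it conditions on a single previous word for brevity, whereas the original model uses a window of previous words, and no training is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10_000, 64, 128

# Part 1: representation. One learned real-valued vector per word.
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

# Part 2: prediction. Map the representation to a distribution over the next word.
W_h = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))

def next_word_distribution(word_id):
    h = np.tanh(E[word_id] @ W_h)        # hidden representation of the context word
    logits = h @ W_out
    p = np.exp(logits - logits.max())    # softmax over the whole vocabulary
    return p / p.sum()

p = next_word_distribution(42)
print(p.shape, round(float(p.sum()), 6))   # (10000,) 1.0
```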
It turned out that those representations of words that we learned have incredibly nice
properties and they capture a lot of the semantic aspects of words. And there’s been tons and tons of papers to analyze these things, to
use them in applications. So, these are called word vectors, word
embeddings, and they’re used all over the place and becoming like commonplace
in natural language processing. In the last couple of years, there’s
been a kind of an exciting observation about these word embeddings, which
is that they capture analogies, even though they were not programmed for that. So, what do I mean? What I mean is that if you take the vectors for individual words and do operations on them, like subtracting and adding them, interesting things come up. So, for example, if you take the vector for queen and you subtract the vector for king, you get a new vector, and that vector is pretty much aligned with the vector that you get from subtracting the representation for man from the representation for woman. So, that means that you could do something like woman minus man, plus king, and get queen, right. It can answer the question: what is to king as woman is to man? And it would find queen.
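As a toy illustration of that vector arithmetic (the four vectors below are made up by hand just to show the operations; real word embeddings are learned, live in hundreds of dimensions, and are searched by cosine similarity over a large vocabulary):

```python
import numpy as np

# Hand-made toy embeddings: dimension 0 is roughly "royalty", dimension 1 roughly "gender".
emb = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
}

def most_similar(vec, exclude):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

query = emb["woman"] - emb["man"] + emb["king"]
print(most_similar(query, exclude={"woman", "man", "king"}))   # queen
```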
So, that's interesting, and there are some nice explanations that we're starting to
representations correspond to attributes that have been discovered by the machine. So, here, the difference between man and woman,
they have all the same attributes somehow, in some semantic space, except for gender. The same is true for queen and king. They have lots of different attributes, but they
essentially have all the same except for gender. So, when you subtract them, the only thing you
get in your hand is the direction for gender. Okay. So the progress with representing the
meaning of words has been really amazing. But, of course, this is by no means
sufficient to understand language. So, the next stage has been, well, can we
represent the meaning of sentences or phrases. And in my group, we worked on machine
translation as a case study to see if we could bring up that power
of representation that we’ve seen in those language models to a task that was a bit more challenging
from a semantic point of view. And I guess the thing we’re doing now,
and many other groups are also doing, is pushing that to an even harder
semantic task, which is question answering. In other words, read a sentence or read a
paragraph or a document, then read a question, and then generate a natural language answer. So, it's a bit more challenging, but you can
see that it’s a kind of translation as well. You have a sequence in input and
you produce a sequence in output. In fact, we used very similar techniques. So, now let me tell you about
that machine translation approach that we created about a year and a half ago. And it uses these recurrent
networks that you’ve heard about, because as soon as you start dealing with
sequences, it’s kind of the natural thing to do. It uses something fairly new that has
been incredibly successful in the field in the last year, which is the idea of introducing attention
mechanisms within the computation. So, sometimes we think of attention as, like,
visual attention, so deciding where to look. But here we’re talking about
a different kind of attention. It’s a kind of internal attention. So, choosing which parts of your neural network
are you going to be paying attention to. And here, let me go through
this architecture a little bit. What’s going on is — do I have a pointer? All right. You have an input sentence in English, say,
and there’s a recurrent net that reads it, meaning that it sees one word at a time. As it goes through it, it builds a
representation of the words that it has seen. Actually, there are two recurrent
nets, one reading from left to right and the other from right to left. Then at each position, you have a representation
of what’s going on around that word. So, that’s the reading network. Then there is a writing — an output network,
which is going to produce a sequence of words. More precisely, at each stage it's going to produce a probability distribution over the words in the vocabulary, and then we're going to pick the next word according to that distribution. The choice of that word is going to
condition the computation for the next stage. The state of the network
is going to be different, depending on what words you’ve said before. And that whole output sequence is going to be
influenced by what we have read, of course, because we want to translate the input sequence. Now, the way that that input sequence and
that output sequence are related is important. That’s where the attention mechanism comes in. Because when you’re doing
translation, for example, the input sequence has a different
length from the output sequence. So, which word or which part of the
sequence here corresponds to which part in the output sequence, that’s the question that the attention mechanism
is helping us figure out. And we found a way to do that, using a mechanism that allows us to train with normal techniques, with back prop; we can compute exact gradients through this process. The idea is that, for each position in the output sequence, our network looks at all possible positions in the input sequence and computes a weight for each. It then multiplies the representation it gets at each position by that weight to form a linear combination, a context vector, which drives the update at the next stage. So, in a sense, you're choosing where to look at each stage to decide what the next word is going to be.
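Here is a schematic numpy sketch of that scoring-and-averaging step for a single output position (the shapes and the particular one-layer scoring function are assumptions for illustration; in a real system these matrices are learned jointly with the encoder and decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 7, 32, 32, 16

H = rng.normal(size=(src_len, enc_dim))   # per-position encoder states (read left-to-right and right-to-left)
s = rng.normal(size=(dec_dim,))           # current decoder state (what has been produced so far)

# A small scoring function; here the parameters are random, in practice they are learned.
W_h = rng.normal(scale=0.1, size=(enc_dim, att_dim))
W_s = rng.normal(scale=0.1, size=(dec_dim, att_dim))
v = rng.normal(scale=0.1, size=(att_dim,))

scores = np.tanh(H @ W_h + s @ W_s) @ v    # one score per source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # softmax: how much to look at each position
context = weights @ H                      # weighted average of the encoder states

print(weights.round(3))   # attention weights, they sum to 1
print(context.shape)      # (32,): the context vector that drives the next output step
```

Because the weights are a smooth function of the network's parameters, the whole thing stays differentiable, which is what lets us train it with ordinary back prop.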
So, this has actually worked incredibly well. In the space of one year, we went from dismal performance to state-of-the-art performance. And at the last WMT, in 2015, we got first place on two of the language pairs, English to German and English to Czech. And now there's like a bunch
of groups around the world that are pushing these kinds of systems. So, this is kind of a new way of doing
machine translation, which is very, very different in nature from the state of
the art that’s been around for 20 years. So, the next thing we did is use the same,
almost the same code, for translating not from French to English but from image to English. So, the idea is, it's almost the same
architecture, except that instead of having a recurrent network
that reads the French sentence, we have what’s called a convolutional net
that we’ve heard about that looks at the image and computes for each location or for each block of pixels a feature vector,
again, a representation. Just as we had representations for words, now we have representations
for parts of the image. And then the attention mechanism, as
it generates the words in the sentence that it’s producing, at each stage
chooses where to look in the image. So, Terry showed you some pictures from my lab. You’ve seen this. And what we see with each
pair of images is, on the left, the image that the system sees as input. On the right, we see where it's putting
its attention for a particular word. That’s the word that’s underlined. So, when it says little girl, when it says girl, we see that it’s putting attention
around the face of the girl. The other one, on top, for example, a
woman is throwing a frisbee in the park. So, the underlined word is frisbee, and
we show the second image in the pair where it’s putting its attention in the image. So, these are cases where it works quite well. But it wouldn’t be fair if I
only showed you those cases. I need to show you those where it fails. So, here are examples where it fails. That’s where we learn the most. First of all, you realize immediately
that we haven't solved AI, and that it's making mistakes both on
the visual side and on the language side. So, on the visual side, you see things like
on the top left, it thinks that it’s a bird. It’s two giraffes. Maybe if you squint you can think it’s a bird. On the second one, it thinks that the round
shape on the shirt is a clock, which, you know, again, if you squint, you
might think it’s a clock. Now, the third one is totally crazy. A man wearing a hat and a hat on a skateboard. So, it’s wrong visually. It’s wrong, you know, linguistically. You wouldn’t do a hat on a hat, and so on. So, it’s fun and instructive to
use these attention mechanisms to understand what’s going
on inside the machine. To see, you know, at each step of the
computation, what was it paying attention to. So, it’s pretty interesting. Now, it turns out that this attention mechanism
is at the heart of another revolution that's going on right now in deep learning that
has to do with the notion of memory that Terry also mentioned during the panel. And neural nets up to recently
have been considered as purely sort of pattern recognition devices
that go from input to output. As soon as you start thinking about dealing
with reasoning and sequential processing, comes the idea that it would be nice to have
a short-term memory or even a long-term memory that is different from the straight sort of
kind of representation building computation that we have in those feed forward neural nets. So, the idea is that in addition
to the recurrent net that does the usual computation,
we have a memory. So, here, each of the cells,
think of it as a memory cell. A memory needs simple concepts like where
are you going to be reading and writing and what are you going to
be reading and writing. So, we can generalize these concepts to neural nets that you can train by back prop, by saying that at each time step you basically have a different probability of choosing where to read and where to write, and then you're going to put something there with a weight that's proportional to that probability.
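Here is a minimal sketch of that kind of soft, differentiable addressing (a content-based variant is assumed here, where the distribution over slots comes from the similarity between a query and per-slot keys; the details differ between memory networks and neural Turing machines, but the point is that reads and writes are weighted by probabilities rather than made at one hard-chosen slot):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_slots, width = 8, 16
memory = np.zeros((n_slots, width))

# Addressing: a probability distribution over slots, from similarity to a query.
keys = rng.normal(size=(n_slots, width))
query = rng.normal(size=(width,))
address = softmax(keys @ query)          # P(use slot i)

# Soft write: every slot is updated a little, in proportion to its probability.
value = rng.normal(size=(width,))
memory += np.outer(address, value)

# Soft read: an expected value over slots, weighted the same way.
read = address @ memory

print(address.round(3))   # weights over the memory slots, they sum to 1
print(read.shape)         # (16,)
```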
So, these kinds of systems started less than a year ago, at about the same time, from a group at Facebook and a group at DeepMind, using the same kind of attention mechanism that we had proposed just a few months earlier. And so they're able to do things like this, like
read sentences like this and answer questions. So, Joe went to the garden
and Fred picked up the milk. Joe moved to the bathroom and Fred dropped the
milk and then Dan moved to the living room. Where is Dan? You’re not supposed to read the answer. Or other things like — I have other examples
down there, like: Sam walks into the kitchen. Sam picks up an apple. Sam walks to the bedroom. Sam drops the apple. Where is the apple? So, these are the kinds of things we're able to do now. Of course, these are toy problems. But it's not something we would have imagined, just a few years ago, that neural nets would be able to do. So, by using recurrence and by using new
architectures that allow these recurrent nets to keep information for a longer time, so dealing with this challenge
that’s called long-term dependencies, we’re able to push the scope of applications of deep learning well beyond what was
thought possible just a few years ago. So, in my lab, we’re working on using
these ideas for knowledge extraction. So, the idea is to be able to read
pages in Wikipedia and fill that memory with representations, semantic representations of nuggets of fact, that can then be used to answer questions. Of course, if we can do that,
that would be extremely useful. Yes. I’m going to skip that and just use a
little bit of time for the last challenge, which is maybe the most difficult
one and has to do with how computers could form these abstractions
without being told ahead of time a lot of the details of what they
should be in the first place. So, that’s what unsupervised learning is about. And I mentioned that unsupervised learning is
important because we can take advantage of all of the knowledge implicitly
stored in lots and lots of data that hasn't been tagged
and labeled by humans. But there are also reasons
why it could be interesting for other applications in machine learning. For example, in the case of structured outputs
where you want the machine to produce something that is not a yes or a no,
or it’s not a category, but it’s something more complicated,
like an image. Maybe you want to transform an image or you want
to produce a sentence like you’ve seen before. It’s also interesting because
if you start thinking about how machines could eventually reach
the kind of level of performance of humans, we have to admit that in terms of learning
ability, we're very, very far from humans. Humans are able to learn new tasks from very few examples. Right now, if you take a machine learning system out of the box, it's going to need, depending on the task, maybe tens of thousands or hundreds of thousands or millions of examples before you get decent performance. Humans can learn a new task with just a handful of examples, sometimes even a single example, or even zero examples: you don't even give them an example, you give them a linguistic
description of the task, right. So, we’re thinking, you know, what are
plausible ways that we could address this, and it all has to do with the notion
of representation that’s been central to what I’ve been telling you about. And now, we’re thinking about how
those representations become meaningful as explanations for the data. In other words, what are the explanatory factors
that explain the variations we see in the data. And that’s what unsupervised learning is after. It’s trying to discover representations where
each element of the representation you can think of as a factor or a cause that could
explain the things we’re seeing. So, in 2011, we participated in a couple of
scientific challenges on transfer learning, where the idea is you’re seeing
examples from some tasks. Maybe they’re labeled. But the end goal is to actually use
the representation that you’ve learned to do a good job on new tasks for which
you have very few labeled examples. And basically, what we found is that when
you use these unsupervised learning methods, you’re able to generalize much faster
with very few labeled examples. So, all these curves have on the X axis
the log of the number of labeled examples. And on the Y axis, accuracy. As you build deeper systems that learn actually
in an unsupervised way from all the other tasks, but just looking at the input
distribution, you’re able on the new tasks to extract information from the very
few examples you have much faster, faster meaning that you need fewer examples to get high accuracy. That's what these curves tell us. Now, there is a big question as to why unsupervised learning hasn't been as successful as supervised learning, at least as we look at the current
industrial applications of deep learning. I think it’s because there are really hard
fundamental challenges because you’re trying to model something that’s
much higher dimensional. When you’re doing supervised learning,
usually the output is a small object. It’s in one category or something like that. In unsupervised learning, you’re trying
to characterize a number of configurations of these variables that’s exponentially large. And for a number of mathematical reasons, that
makes the sort of more natural approaches based on probabilities automatically
intractable for reasons that I won’t have time to explain in detail. But there has been a lot
of research recently to try to bypass these limitations,
these intractabilities. And what’s amazing about the research
currently in unsupervised learning is there’s like ten different ways of
doing unsupervised learning. There’s not one way. It’s not like the supervised learning where we
have basically back prop with small variations. Here we have totally different learning
principles that try to bypass, in different ways, the problems with maximum likelihood and probabilistic modeling. So, it's moving pretty fast. Just a few years ago, we were not able to
generate, for example, images of anything but digits, images of digits, black and white. So, just last year we were able to
move to sort of more realistic digits. These are images of street view
house numbers that were generated by some of these recent algorithms. And these are more natural
images that were generated by a model from a paper presented just a few months ago, where the scientists who did this, at Facebook and NYU, asked humans whether
the images were natural or not. So, is this coming from the machine
or is this coming from real world? And it turned out that 40
percent of the images generated by the computer were fooling the humans. So, you're kind of almost passing the Turing test here. Now, these are, you know, a
particular class of images. But still, that’s, you know, there’s a lot
of progress and so it’s very encouraging. One thing I’m interested in, as a last bit here, as we’re exploring all these different
approaches to unsupervised learning, some of these look like they might also explain how brains do it, and that is a very interesting source of inspiration for this research. All right. So, why is it interesting
to do unsupervised learning? As I mentioned, because it goes at the
heart of what deep learning is about, which is to allow the computer to discover good
representations, more abstract representations. So, what does it mean to be more abstract? It means that we essentially go
to the heart of the explanations of what’s going on behind the data. Of course, that’s the dream, right? And we can measure that. We can do experiments where we can see
that the computer automatically discovers, in its hidden units, features that we haven't explicitly programmed in but that perfectly capture some of the factors that we know are present. So, yes, I'm going to close there and
show you pictures of the current state of my lab, which is growing too fast. Thank you. [ APPLAUSE ]
