How to Learn from Little Data – Intro to Deep Learning #17

Hello, world. It’s Siraj. And if you only have
a little bit of data, can you still learn from it? We’re going to
build a model that learns to classify
images using a very small dataset for training. Deep learning is still
very much a dark art. It’s an emerging practice in
the world of machine learning that isn’t well understood,
even by those pushing the state of the art, which is exciting. Because it means that there’s
so much potential for discovery. And it’s not just one algorithm. It’s a collection of them. Recurrent nets,
convolutional nets, restricted Boltzmann
machines, Barbra Streisand– These are all networks
that can recognize patterns in the world. And they themselves
have shared patterns. One shared pattern is that they
all use a hierarchy of layers; another is the use of
differentiable layers, so that we can use
gradient-based optimization to improve their predictions. Design patterns are nothing
new in computer science. There are many books
on design patterns for topics like object-oriented
programming and user interfaces. But when it comes
to deep learning, there’s not really a definitive
design pattern guide. We’re all kind of figuring it
out collectively right now. You might glance at
a deep learning paper and assume that there is some
solid mathematical foundation behind what the
researchers attempted. Because you see all sorts
of equations for things like Hilbert spaces
and measure theory. But the reality is that our
collective understanding is still pretty minimal. Theories are often formulated
because they are mathematically convenient. For example, the
Gaussian distribution is ubiquitous, not
because it’s some divine construct that the universe
has bestowed on us, but because it’s
mathematically convenient. So defining standard
design patterns for pattern recognition networks
is a field ripe for discovery. In the context of
machine learning, we can call it meta-learning,
or learning to learn. Can we design a system that
learns how best to learn? One that learns how
to perform well at an immediate task
in the short term. And in the long term, it
learns a common structure across many tasks. We see meta level constructs
in nature all the time. DNA is a great example. It carries the
instructions, the blueprint to create learning systems
that can expire: our brains. But it acts as long-term
memory by transcending death, just like Oracle. As long as there is a
mechanism for memory and one to alter behavior based
on that memory, then that mechanism can serve
as a meta level construct. In the past few
months, several papers
on meta learning have been published. But I want to talk
about one that uses meta learning
as a tool to solve another task, one-shot learning,
the goal of learning from one or only a few data points. This is what we should be
aiming for, since GPU costs are too damn high. They used a modified
version of a model called a Neural Turing
Machine, to learn to classify character images
with just a few examples. DeepMind first dropped
the idea of NTMs in 2014. An NTM contains two components. The first is a neural network
that we call the controller, and the other is a memory bank. The controller takes
vectors as inputs, and outputs vectors as well,
just like all neural nets. But what makes it
special is that it also interacts with a memory
matrix, using read and write operations. This is where the Turing Machine
analogy comes from, not just because it sounds dope, but
because a Turing machine manipulates symbols
on a strip of tape, according to a table of rules. It’s like having a working
memory for a brain. The network learns how best to
use its memory when learning a solution to a given problem. For the controller, they use
an LSTM recurrent network, since its internal state is a
function of the current state and the input to the system. It can perform
context-dependent computation. So, a signal at a
current time step can influence the network’s
behavior later on. And we need all the components,
including the memory store, to be differentiable so that
we can incrementally update their values during training. To achieve this, they added
an attention mechanism, so that each read and
write operation interacts to a tunable degree with
all the elements in memory, rather than addressing a single
element like a normal Turing machine would. Each row in the memory matrix
represents a memory location. Read and write heads
use a weighting vector with a component
for each location. So if there are 10
memory locations, then a weighting vector with
all of its weight at index 3 would focus the attention of the
memory operation on location 3. But a weighting
vector, like this, spreads its attention
to the memory across multiple locations. A read operation is just a
weighted combination of the memory matrix's rows. A write operation, though, has
two parts: an erase operation followed by an add operation. The way the read and
write heads are produced is by combining two memory
addressing mechanisms. The first is content based. We focus on locations
based on the similarity between their current
values and the controller’s emitted values. The second is location based. It facilitates iterations
across locations of the memory and random access jumps. Controller and memory
bank, read and writes are so, so dank. So, so dank. So the authors of our one-shot
learning paper knew that NTMs were a subset of memory
augmented neural networks. And they saw the potential
to improve on them, so that they could learn
from just a little data. They discovered that
using a pure content based memory writer, instead
of content plus location, let them do just this. That’s because there’s a
trade-off when training MANNs. The more complex the
memory mechanism, the more training the
controller requires. The dataset, Omniglot, has 1,600
separate classes, and only a few
examples per class, perfect for one-shot learning. They randomly
selected five classes, and randomly assigned each
class a label between 1 and 5. So the model gets shown
an instance of a class, tries to classify
it, then it gets informed of what the
correct label is. We'll only need TensorFlow
and NumPy for our model. We'll first define
our memory bank, initializing each of the
variables that make it up. Then we can define
our controller, a feed-forward neural network. We’ll define each set of
weights and biases layer by layer, until we’ve
reached the output layer. We can define the
interaction that happens between both components
under the step function, which is called every time
step during training. Just like with a regular
NTM, we read a vector from memory that is a linear
combination of its rows, scaled by a normalized
weight vector. For a given input x, the
controller will produce a key. We compare the key
against each row in memory, using the cosine
similarity as a measure. This produces the
read weight vector, which tells us how much
each row should contribute to the linear combination. The difference here is that
there is no extra parameter to control the read weight
vector’s concentration. To write to memory,
the controller interpolates between writing
to the most recently read memory rows and writing to
the least used memory rows. Using the read weight vector
from the previous time step, and the weight vector that
captures the least used memory location, the controller
combines the two, using a scalar parameter
and the sigmoid function to create a write_weight_vector. Each row in memory
is then updated using the write_weight_vector
and the key issued by the controller. The model eventually
returns the probabilities for each class as a vector. After we’ve initialized
our TensorFlow session, we’ll use gradient
descent via Adam to optimize our network for
every image-label pair we feed in via a feed dictionary. We’ll print out our
results iteratively. After training,
we can test it out on some different
recognizable characters. And notice how the accuracy
is surprisingly good. Normally, training
would take a lot longer for similar results. These results are very
promising for one-shot learning. And that’s all it
takes to train, folks. Let’s get down to brass tacks. A meta learning
system learns how to perform well at
an immediate task, and also learns a common
structure across many tasks. Memory augmented neural networks
like a Neural Turing Machine, use a controller and an
external memory store to perform meta learning. And meta learning can
be a way to achieve one-shot learning,
which means learning from one or a few examples. This week’s coding challenge
is to use a memory augmented network to learn to classify
two classes of animals. Details are in the README. GitHub links go in the comments. And winners will be
announced in one week. And although this is the
last video for this course, I’m still just getting started. Please subscribe for
more programming videos, and for now I’ve
got to go celebrate. So, thanks for watching.
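For anyone who wants to poke at the mechanics after the video, the content-based read and least-used write described above can be sketched in plain NumPy. All names, shapes, and the tiny stand-in feed-forward controller here are illustrative assumptions, not the code from the video or the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10 memory rows of width 4, 8-dim inputs.
M = rng.standard_normal((10, 4))
W, b = rng.standard_normal((4, 8)) * 0.1, np.zeros(4)

def controller(x):
    # A tiny stand-in controller: vector in, key vector out.
    return np.tanh(W @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine_similarity(key, memory):
    # Similarity between the key and every row of the memory matrix.
    return (memory @ key) / (np.linalg.norm(memory, axis=1)
                             * np.linalg.norm(key) + 1e-8)

def read(memory, key):
    # Content-based read: weight each row by its similarity to the key,
    # then return the weighted combination of rows.
    w_r = softmax(cosine_similarity(key, memory))
    return w_r @ memory, w_r

def write(memory, key, w_r_prev, w_lu, alpha):
    # Least-used-style write: interpolate, via a sigmoid-gated scalar,
    # between the most recently read rows and the least used rows,
    # then add the key into memory at those locations.
    gate = 1.0 / (1.0 + np.exp(-alpha))
    w_w = gate * w_r_prev + (1.0 - gate) * w_lu
    return memory + np.outer(w_w, key), w_w

x = rng.standard_normal(8)
key = controller(x)
r, w_r = read(M, key)

# One-hot vector picking the least-used row (usage tracking simplified).
w_lu = np.eye(10)[np.argmin(w_r)]
M2, w_w = write(M, key, w_r, w_lu, alpha=0.0)
```

In the real model, every one of these operations is differentiable, so gradients can flow from the classification loss back through the read and write weights into the controller during training.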

97 thoughts on “How to Learn from Little Data – Intro to Deep Learning #17”

  1. hey Siraj, I was assigned to be the representative of Programming 101 in my college, any tips?

  2. This is what i was waiting for!!! BTW, mathematically speaking, any kind of memory augmented neural networks actually can be used even to solve a problem with a moderate amount of data (not too big, not too small), right? Any thought about the performance comparison between other deep learning algorithms and MANN in task with a moderate amount of data?

  3. ▒█░░░ ▀█▀ ▒█░▄▀ ▒█▀▀▀   ░█▀▀█   ▒█▀▀█ ▒█▀▀▀█ ▒█▀▀▀ ▒█▀▀▀
    ▒█░░░ ▒█░ ▒█▀▄░ ▒█▀▀▀   ▒█▄▄█   ▒█▀▀▄ ▒█░░▒█ ░▀▀▀▄ ░▀▀▀▄
    ▒█▄▄█ ▄█▄ ▒█░▒█ ▒█▄▄▄   ▒█░▒█   ▒█▄▄█ ▒█▄▄▄█ ▒█▄▄▄█▒█▄▄▄█

  4. Siraj, I would like to hear a bit more about the life cycle of a deep learning paper. Can you make a video or point me in the direction where I can learn more?

  5. Thanks a lot for your videos! They are so many! I even don't have enough time to watch them all!

  6. Congrats on the completion of the course. I'm looking forward to taking your Udacity Nanodegree!

  7. Could you make a video about ZERO DATA LEARNING too? A very interesting topic in the realm of meta-learning.

  8. Siraj: great tutorial as always. After this course, what is inline ?
    Please give detail of your upcoming videos.

  9. It sadly doesn't take one shot for me to understand this, but great video, I've been wanting to train models on tiny datasets!

  10. Has anyone tried the Hierarchical Temporal Memory (HTM) algorithms from Numenta, I'm curious about their tech but I don't know if they work better than the more common neural networks solutions.

  11. Yes, the gaussian is mathematically convenient, but it is also special in the sense it maximises Entropy for a given (mean, std-err) pair, among so many special properties…

  12. SIRAJ! Not sure if you remember me commenting a few months back about starting my Data Science degree and a bunch of Math being involved but I need some more guidance.

    We have a project where we're exploring simple regression and multivariable regression models on a dataset of our choosing. I want to try take it a step further. I have a bunch of individual player data for a sports league (seasons 2012 – 2016), and I have a source of the data so I could probably grab the 2017 results with a little wget magic.

    ANYWAY, my goal is to predict a future win or loss when one team is playing another (sounds like a 'simple' classification problem – win or loss) taking as many individual player statistics into account as possible. Is there any easy ways I can just throw the data into a pile, tell the computer a WIN is good and a LOSS is bad (a DRAW is okay); and then get it to spit back out some correlations that it found with a win or a loss when specific teams are up against one another?

    I need a little bit of a starting point – what should I be looking at?

    Would love to shock my class and lecturer with a semi-intelligent model!

  13. Doesn't the central limit theorem kinda indicate that using a Gaussian distribution can often times be a good choice of distribution when trying to model a lot of phenomenon?

  14. Wow!!! Thanks a lot, this code is going to help me much with a web scraping project I am working on! 🙂

  15. Great video as always! Is there a discord channel for Siraj-related things? If not I would really like to see one created, so that ppl can go there and discuss the challenges.

    Discord is really good at bringing communities together in my experience.

  16. Note that you also need pytest, matplotlib, Pillow and scipy to run the code. And also make sure to add a in both MANN/ and MANN/Utils/ folder. And also make sure you create a file ~/.matplotlib/matplotlibrc there and add the following code (if you're using virtualenv):
    backend: TkAgg

    And even though when you're trying to sample the self.character_folders it raises an error, because the variable is empty.

  17. I have to say: I'm not a programmer, I'm a doctor with programming as some sort of a hobby. But your videos actually make me rethink my profession! The way you teach, the quality of your videos! Man! Thank you for sharing your knowledge. Hope for more videos. Success!!

  18. Dude, Siraj, you helped me with one of my kaggle projects that contains little data (like 8-10 examples/label). This is my first time being exposed to concepts like neural turing machines 🙂

    You have earned a subscriber

  19. Very powerful concepts here. It's a lot to package into a 9 minute video, so I've been going over the arXiv paper to get a better grasp of the fundamentals. If you have a moment, can you comment on the difference(s) between one-shot learning and transfer learning?

  20. So who taught you machine learning in the first place? Are you self learnt ?

    P.S : Thanks for all these videos, they are amazing.

  21. Great video, Siraj. I'd like to suggest you to explain Generative ladder networks (REGEL) for problems with small labeled datasets available, but also with big unlabeled datasets.

  22. Love your video! I have no background in computer science or coding, but managed to train my own image classification through your tutorial! Can you do a tutorial on semantic segmentation? something like this is really fun:

  23. @Siraj: please do make videos about robotics, you may have great knowledge as you had been in a startup

  24. I have three questions:
    1. So there is really nothing more than the RNN (controller) that uses its memory database to achieve the best results?
    2. Is this implementable in keras?
    3. Should I be more focused on keras (for efficiency) or tensorflow (for flexibility or accessibility)? Are there things that can be done only in tensorflow?

  25. Can you please share links in the description to the journal articles you highlight in your videos?

  26. Hi Siraj, this is a great one! Could you do one on information leakage? e.g. processing the testing dataset using training set statistics.

  27. How much is "little data"? 20 samples per class? Also, do you have the code in the video on github?

  28. Siraj nice work. I would like to ask for learning from a small data set. I am using Corell5k data set that contains 50 classes each class contains 100 images and its small data for CNN to achieve a high score of accuracy. How can I increase accuracy?

    Thank you very much

  29. I missed watching your videos :') Recently I decided to finally read the paper (which I learned about because of your video back then) after a year long of putting it off… thankfully Siraj comes to the rescue breaking it down in more digestible pieces 🙂 keep up the good work

  30. siraj please let us know if we can use , one shot learning or mann for fraud or loan defaulter prediction ??? . if so please make a comprehensive video ?

  31. Hey Siraj,
    I just want to know how I can learn Machine learning for autonomous cars from scratch and what language should I choose.

    Thanks! Your videos are really interesting.

  32. You should link the arXiv/ Research Papers in the Video in the Description, It'll be a lot easier to get to them…

  33. Great video Siraj… You're like my role model in my Deep Learning journey and ML in general. I've learnt so much within such a short time.

    Keep up the great work and hope you reach far. Watched all your videos by the way.

  34. Could you use this to train a network to make a network that is supposed to make networks using the original network as the data? Basically, Make 'A' be able to make 'B' that can make 'C', by using 'A' as the data for 'A'. Then using all subsequent programs use that for more data to become better? Or is that way too complicated for one shot learning?

  35. Hi Sir Siraj,
    I have downloaded and run the code on GitHub. It seems to take forever. Is it supposed to be like that? Please help….. I am running this in a vn in an i7 laptop.

  36. Misleading title alert! This is about meta learning and one shot learning and it's definitely not an "introduction to deep learning" topic
