Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 3 – Optimization-Based Meta-Learning


Hi, everyone. Let's get started. Today we'll be covering the things that we didn't get to last time with regard to meta-learning and black-box adaptation approaches to meta-learning, and then we'll cover topics in optimization-based approaches. Before we get started, a couple of reminders. First, Homework 1 is due on Wednesday next week, and that assignment is now out, so I encourage you to get started on it early. Second, the first paper presentations and discussions will happen on Wednesday this week, so please show up so we can discuss those papers and be part of the discussion for the students presenting that day. Okay, so as I mentioned, today we'll first recap the probabilistic formulation of meta-learning that I introduced at the end of last lecture, then cover a general recipe for meta-learning algorithms and black-box adaptation approaches. These two things are the topic of Homework 1, where you'll be implementing a black-box approach to meta-learning. Then we'll talk about optimization-based meta-learning, which will be part of Homework 2; the rest of Homework 2 will be covered in the next lecture on Monday.

Okay, so first, let's recap from last time. We were talking about a more intuitive, probabilistic view of these meta-learning algorithms. In particular, we can view meta-learning as the process of learning a set of meta-parameters theta that summarizes your meta-training data such that you can solve new tasks quickly. The meta-training data consists of a range of tasks, 1 through n, and for each task you have a train dataset and a test set, each with k data points. What meta-learning is trying to do is optimize for a set of meta-parameters that maximizes the likelihood of those parameters given the meta-training data. In particular, you can view the meta-training process as optimizing for these meta-parameters, and the adaptation process as adapting those parameters to compute a set of parameters phi that can solve a new task, given a train dataset for that task and the meta-parameters you learned. You can essentially view this adaptation process as a function f that takes in a train dataset and produces a new set of parameters phi-star. Under this view of the adaptation process, you can view meta-learning as optimizing for the meta-parameters such that the task-specific parameters do well on held-out data, your test set, where the task-specific parameters are computed from your training dataset for that task. So this is essentially the probabilistic view of meta-learning, where the meta-training process is trying to optimize for these prior parameters such that adaptation leads to good performance. Okay, so now I'd like to talk about how we actually design algorithms that perform this optimization at a more mechanistic level, and uncover how you actually go about implementing some of these things.
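To make that recap concrete, here is the formulation described above written out in notation; this is my own transcription of the spoken description, following the course slides as closely as I can reconstruct them.

```latex
% Meta-training: find meta-parameters that best explain the meta-training data
\theta^{\star} = \arg\max_{\theta} \; \log p(\theta \mid \mathcal{D}_{\text{meta-train}})

% Adaptation: task-specific parameters from a new task's train set and the learned prior
\phi^{\star} = f_{\theta^{\star}}(\mathcal{D}^{\text{tr}})

% Combined view: choose theta so that the adapted parameters do well on held-out data
\theta^{\star} = \arg\max_{\theta} \sum_{i=1}^{n} \log p\big(\phi_i \mid \mathcal{D}_i^{\text{ts}}\big),
\qquad \phi_i = f_{\theta}(\mathcal{D}_i^{\text{tr}})
```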
So, in particular, can we think about a general recipe for meta-learning algorithms? Before we cover a general recipe for the algorithms themselves, we need a sense for how we're actually going to evaluate these meta-learning algorithms. So I want to first talk about how to evaluate a meta-learning algorithm, and the first thing worth mentioning here is the Omniglot dataset. This is a dataset that was proposed by Brenden Lake et al. in 2015, and it really exemplifies some of the weak points of neural networks when they have to learn from small amounts of data. This dataset has around 1,600 characters from 50 different alphabets. Here are some examples from the dataset: there are different alphabets, like Hebrew, Bengali, et cetera, and each character has only 20 instances. So unlike something like MNIST, which has a small number of classes and a huge number of data points per class, this is in many ways the transpose: many classes and few examples per class. One of the things I find quite appealing about a dataset like this is that its statistics are in many ways more reflective of the kinds of things we see in the real world. For example, if you're trying to learn to recognize forks, you're not going to see thousands of different types of forks. You may see a wide range of objects, but per object you'll only see a small number of instances throughout your lifetime. Okay, so this dataset has that breadth of classes and a small number of examples per class, and the authors propose a few different ways you could use it: both few-shot discriminative learning and few-shot generative learning problems. The few-shot discriminative problem is: given a few examples of new characters, can you learn to classify between those characters? The generative problem is likewise: given a few examples of some characters, can you actually generate new instances of those characters? They essentially show that deep neural networks struggle at this sort of problem if you train them from scratch, because you're only training on a few examples, and we know deep networks do best with a large number of examples. Initial approaches to this kind of problem, actually predating the Omniglot dataset itself, instead used things like Bayesian models and non-parametrics. Great, so this is one canonical example of a meta-learning dataset, and a wide range of others have also been used more recently, including the MiniImageNet dataset, the CIFAR dataset, CUB, CelebA, and a number of others. In all of these datasets, the goal is, given a small number of examples, to be able to learn something from that small dataset. Okay, so that's the dataset side: these are the kinds of datasets you can use to evaluate a few-shot learning algorithm.
Now, how do we actually go about evaluating an algorithm on these datasets? This is going to look a lot like the test I gave you on the first day, where your goal is to classify new examples given a small dataset. In particular, let's say we have a 5-way, 1-shot image classification problem, where we have one example of each of five different classes, shown here. "Way" means the number of classes, "shot" means the number of examples per class, and your goal is, given these five examples, to classify new examples as belonging to one of the five classes on the left. Okay, so this is the few-shot learning problem, and in meta-learning our goal is to leverage data from other image classes in order to solve this problem, just like being able to leverage the meta-training dataset I was mentioning before in order to learn a few-shot classifier that can learn from the data points on the left. The way we can do that is to structure the data into training sets and test sets, just as before, where these will mimic what you're going to see at test time, matching meta-training time and meta-testing time. So you can take five other image classes and break them into a train set and a test set, and do this for a wide range of other image classes that you've seen in the past. These will be your training classes, and you'll perform meta-training across them, meta-training the classifier such that after it sees the images on the left, it can successfully classify the images on the right. Then, critically, after you do this, you'll test it on held-out image classes, as shown at the top, and it should be able to perform this few-shot learning problem. This isn't specific to image classification: you can replace image classification with a regression problem, language generation problems, skill learning problems. As I alluded to in previous lectures, each of these tasks, shown as rows, is essentially a machine learning problem. Okay, any questions on this setup? Yeah. [inaudible] Yes. The nuance here is that in multi-task learning your goal would be to solve all of the training tasks shown in the gray box, whereas in meta-learning your goal is to use these training tasks in order to solve new tasks with small amounts of data. So being able to evaluate on new tasks and quickly learn new tasks is the critical difference between the two problems. Okay, so more broadly and more generally, we can view the meta-learning problem from a more mechanistic standpoint. In particular, if we say supervised learning is trying to learn a mapping from x to y given input-output pairs, we can view meta-supervised learning as trying to learn from a dataset, where this dataset contains k input-output pairs for a k-shot learning problem, to make predictions about a new test data point x test. So our goal is to produce a function that takes as input a training dataset and a test input and produces the label corresponding to the test input. The more mechanistic view of meta-learning is essentially that we want to learn this function f: the function that takes in the training dataset and the test input and produces the label.
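As an illustration of the N-way, K-shot episode structure described above, here is a minimal Python sketch of sampling one such task from a pool of classes; the function and variable names (sample_episode, images_by_class) are hypothetical and not from the course code.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, k_query=1):
    """Sample one N-way, K-shot task: a small train (support) set and a
    disjoint test (query) set drawn from n_way randomly chosen classes."""
    classes = random.sample(list(images_by_class.keys()), n_way)
    train_set, test_set = [], []
    for label, cls in enumerate(classes):
        # Draw k_shot + k_query distinct examples of this class.
        examples = random.sample(images_by_class[cls], k_shot + k_query)
        train_set += [(x, label) for x in examples[:k_shot]]
        test_set += [(x, label) for x in examples[k_shot:]]
    return train_set, test_set
```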
Now, the way that we learn this function is through a meta-training dataset, which contains a set of tasks, or a set of datasets, where each dataset consists of x-y pairs: at least k to be used for the training dataset and at least one additional data point used to measure generalization, so we can train the function such that it does well on new data points. Now, why is this view useful? We saw the probabilistic viewpoint before; one of the nice things about this particular problem statement is that it reduces the problem of meta-learning to that of designing and optimizing this function f. Once you design a function f and decide how you want to optimize it, you've created a meta-learning algorithm. Okay, how does this connect to the probabilistic viewpoint? Well, you can view supervised learning as doing inference over parameters given a dataset. Similarly, you can view the adaptation process of meta-learning as doing inference over your task-specific parameters phi i given a training dataset and a set of meta-parameters, and the meta-learning optimization as doing maximum likelihood inference over the meta-parameters across all of your training tasks. Okay, any questions on the problem setup before we get into algorithms? Yeah. Is it important to use the proper value for k? Yeah, that's a good question. Typically, algorithms assume that you know something about the k that you'll be evaluated on at test time. So if you're going to evaluate on 10-shot or 100-shot learning, then you'll train for those values, and depending on the algorithm you can train for exactly the value you expect at test time or for a range of values, so that it can adapt to a range of dataset sizes. Yeah. What happens when you don't know beforehand all of the tasks that you're going to see? So your question is, what if you don't know the test task you're going to be evaluated on? Generally, the assumption here is that the test task is drawn from the same distribution as the training tasks, and some algorithms do better than others when you break that assumption; I'll talk about that a bit more in the second half of this lecture. There's also an online setting where we're incrementally adding tasks; that's a setting that has been explored a little bit, and I'll talk about it when we cover lifelong learning in the course. We'll also talk later in the course about settings where you know nothing, you just have an unlabeled dataset, and how you might be able to construct tasks automatically. Was there another question? Yeah. Is meta-learning required to use the same network structure? So you're asking whether it's required to use the same network structure as supervised learning? We'll get into what architectures you can use for different algorithms later in the lecture, and if you still have a question you can ask it toward the end. Okay, great.
So, the general recipe for a meta-learning algorithm is basically what I alluded to before: choose some form of this function, which could be probabilistic or deterministic as I mentioned, that outputs a set of task-specific parameters given a training dataset and your meta-parameters, and then, once you choose the form of this function, figure out how you want to optimize your meta-parameters theta with respect to your meta-training dataset. The second choice is usually relatively straightforward, using standard neural network optimizers. Okay. So this is the general form, and most meta-learning algorithms vary based on the first choice: how do you actually design this function that's going to infer task-specific parameters? The first class of approaches we'll look at considers whether we can treat this distribution as an inference problem. Neural networks are pretty good at doing things like inference, so can we just treat this function as a neural network? This is where what I'm going to refer to as black-box approaches come in. What these black-box adaptation approaches do is essentially just train a neural network to represent this function right here: a neural network that outputs parameters given a training dataset and a set of meta-parameters. For now, we're going to use a deterministic, or point, estimate of this distribution, and we'll get back to Bayesian approaches in a couple of lectures. The way this looks is that you have some neural network with parameters theta that takes as input the training data, either in a sequential fashion or all as one batch, and outputs a set of task-specific parameters phi i. Then you have a separate neural network, parameterized by phi i, that makes predictions about test data points. And you can basically train everything using your test dataset, D i test. So this is really simple, and one of the nice things is that we can just train it with standard supervised learning: we want to maximize the probability of the labels under the distribution that g is producing, for all of the test data points and for all of the tasks in your meta-training dataset. Essentially, you're training this neural network such that it outputs parameters that represent an accurate classifier. If you denote the right-hand part as the loss for a set of parameters phi given a test data point, then you can view this optimization as the loss function applied to the parameters output by f theta, evaluated on your test dataset, averaged over all of your tasks. Okay, any questions on this? Yeah. So when you evaluate your model, which phi i do you use?
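Here is a minimal sketch of what such a black-box meta-learner might look like, assuming an LSTM for f theta and a simple linear classifier whose weights and bias make up phi; the class and argument names are my own and not from the homework code.

```python
import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """f_theta reads the support set and emits task parameters phi;
    g uses phi to classify query (test) inputs."""
    def __init__(self, x_dim, n_way, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(x_dim + n_way, hidden, batch_first=True)  # f_theta
        self.to_phi = nn.Linear(hidden, n_way * x_dim + n_way)           # emits weights + bias
        self.n_way, self.x_dim = n_way, x_dim

    def forward(self, support_x, support_y, query_x):
        # support_x: (k, x_dim), support_y: (k, n_way) one-hot, query_x: (m, x_dim)
        seq = torch.cat([support_x, support_y], dim=-1).unsqueeze(0)
        _, (h, _) = self.encoder(seq)                 # summarize the train set
        phi = self.to_phi(h[-1]).squeeze(0)           # task-specific parameters
        W = phi[: self.n_way * self.x_dim].view(self.n_way, self.x_dim)
        b = phi[self.n_way * self.x_dim:]
        return query_x @ W.t() + b                    # logits from the classifier g_phi
```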
Great, so when you evaluate your model, you're given a new task: you're given a training dataset for that new task, and what you do is pass that training dataset into your network f theta and produce your parameters for that task. Yeah. Are theta and phi i both learned in this neural network? So in this case, during the meta-training process, the parameters theta are learned, and the parameters phi i are dynamically computed per task. In that sense, phi i is treated more as activations, or a tensor, rather than as actual parameters, which is somewhat of an interesting concept. Basically, you can back-propagate the loss with respect to phi into the meta-parameters theta. Yeah. [inaudible] Yeah, that's a good question. The question relates to the homework, where you're also passing in y test as input on the right-hand side and zeroing it out. The reason for that is that if you have this type of architecture, an LSTM, and you want to share weights across time for each of these units, then you want the shape of the inputs at each data point to be the same. So you want to pass in the same shape, but of course you don't want to give it the ground-truth label, because the label is what it's supposed to be predicting. [inaudible] So the question was whether to zero out the label value or the embedding value of y. I think that as long as you're not passing in y test as input in any way, you're in good shape. Yeah. [inaudible] Can you maybe repeat your question? [inaudible] Oh, right. So you're asking, for this top function here, is theta an input or is it parameters? In this case it is the parameters of that model, and maybe a more standard notation would be to either put theta as a subscript to the p or put a semicolon there to indicate that it is parameters rather than an input. [inaudible] During meta-training, we're optimizing over theta. [inaudible] Yeah, in the inner loop you're producing phi. [inaudible] Right, exactly: phi is computed at test time given the training dataset as input. Yeah. And let's go through a couple more of the details here before we answer more questions. So I've just covered what the objective is; now let's look at this as an algorithm. If we want to actually optimize this, we first sample a task i from the meta-training tasks, or a mini-batch of tasks. Then we sample disjoint datasets from that task's dataset, which we'll refer to as D train and D test. So if this is all of the data that we have for task i, then we want to partition it into a training dataset and a test set for that task. In particular, we can randomly select half of the data points to be used for the training dataset and half to be used for the test set at this iteration of the algorithm.
Then we take the training dataset, what's in the green box, and use it to compute the task-specific parameters phi i. And then we update our meta-parameters using the gradient of the objective with respect to the meta-parameters, using those computed task-specific parameters. Then we repeat this iteratively using your favorite gradient-descent optimizer: Adam, SGD, momentum, et cetera. Yeah. [inaudible] So the question was whether we're computing gradients using the training dataset. What we're doing is computing gradients using the meta-training dataset of tasks. The task-specific parameters are computed using D train, and then we evaluate those parameters using the test dataset for that meta-training task. So we've lifted training sets and test sets up to meta-training tasks and meta-test tasks. Yeah. You're asking if theta is all the meta-parameters? Yes, all of theta are the meta-parameters, and phi are the task-specific parameters; phi is essentially not considered part of the meta-parameters. We'll get into what phi might be in a second: it could be the parameters of an entire neural network, or it could be something more compact, and I'll talk about that shortly. Yeah. [inaudible] Yes, exactly. We haven't touched any of the meta-test tasks, which are held out from the task distribution. Yeah. [inaudible] Right, so in this case, for this particular network architecture, the order of the data points matters. That isn't necessarily a good property, because in many cases you have data sets, not data lists, for which the order doesn't matter. We'll later see some architectures that are permutation-invariant. Yeah. [inaudible] So in this case we compute phi in step 3, and then we update the meta-parameters theta. We do not update phi itself; it is dynamically computed at every iteration of the meta-training process, and then at test time we also compute phi given our meta-parameters theta. Yeah. Do we compute phi, then do a maximization step on phi, and feed that updated phi back into the network? It's similar to that. You can view the computation of the gradient with respect to theta as back-propagating the loss from y into phi and then all the way back into theta. We never use that gradient to update phi; we only use it to update theta, but it has to go through phi in order to compute that gradient. Yeah. [inaudible] I'll talk a bit about architectures in a minute, but one of the nice things about LSTMs and RNNs is that they can process variable amounts of data relatively easily, so you don't have to assume any particular dataset size, although you should probably train for the largest dataset size you expect. There are other neural network architectures that you can use for this as well, which I'll talk about. Okay.
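Putting the steps above together, a black-box meta-training loop might look roughly like the following sketch; it reuses the hypothetical sample_episode and BlackBoxMetaLearner pieces from the earlier sketches and assumes the raw examples are already tensors.

```python
import torch
import torch.nn.functional as F

def meta_train(model, images_by_class, n_steps=10000, lr=1e-3, n_way=5, k_shot=1):
    """Black-box meta-training loop: sample a task, split it into disjoint
    train/test sets, compute phi from the train set, update theta on the test loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(n_steps):
        # 1. Sample a task and disjoint D_train / D_test for it.
        train_set, test_set = sample_episode(images_by_class, n_way, k_shot)
        xs = torch.stack([x for x, _ in train_set])
        ys = F.one_hot(torch.tensor([y for _, y in train_set]), n_way).float()
        xq = torch.stack([x for x, _ in test_set])
        yq = torch.tensor([y for _, y in test_set])
        # 2. Compute phi from D_train (inside the forward pass) and predict on D_test.
        logits = model(xs, ys, xq)
        # 3. Update the meta-parameters theta on the test-set loss.
        loss = F.cross_entropy(logits, yq)
        opt.zero_grad()
        loss.backward()
        opt.step()
```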
Now, one of the challenges with this approach is that if phi is literally representing all the parameters of another neural network, it may not be very scalable to actually output all of those parameters, because neural networks can be very large. There are a couple of approaches for dealing with this, but the main way to think about it is that you don't need to output all of the parameters of a neural network. You could instead output just the sufficient statistics of that task, such that you can effectively make predictions for it. What this looks like is, instead of having a neural network that outputs all of the parameters phi, it outputs some set of sufficient statistics h, some lower-dimensional vector h. Then your neural network on the right uses those sufficient statistics, along with other parameters in theta, to make predictions. What this lower-dimensional vector h might represent is something like contextual task information, and then your new parameters phi i correspond to h i as well as the part of theta that parameterizes g. Essentially, you can view this as a single LSTM that's taking in data points; one of the reasons I named this h is that h is often used for the hidden state of an LSTM. If you share all the parameters between both f on the left and g, for example as an LSTM, then the task-specific parameters phi are represented by the hidden state of that LSTM plus the parameters of the function on the right that are shared with the LSTM parameters. One interesting connection here: recall multi-task learning, where we were concatenating task information z into the network. You can view h essentially as a summarization of the task that is used to make predictions for that task, so h and z are very similar. Unlike z in the multi-task learning setting, though, here we're actually learning the task representation h, and we're learning how to produce that task representation given a small dataset for the task. Okay. So the fully general form of a black-box neural network is a function that takes as input a training dataset and a test input and produces a test output, where phi is somewhere in the middle of this network and may not actually represent parameters per se. Yeah. [inaudible] Right, so the question is, can you explain what theta g means here? Here, theta g represents all of the other parameters that this network g uses other than h. So this neural network right here, which makes predictions about test inputs, takes as input h and also has other parameters theta g that it uses to make predictions. Theta g is part of the full parameter vector theta; it may also share parameters with this part of the network right here. Yeah. [inaudible] Yeah, so in this case you might be sharing more parameters between test time and training time. Okay. So this is the overview of black-box approaches.
Now let's talk about what sorts of architectures we could use for this function f. One of the earlier approaches to these black-box methods, though it's hard to say what comes first in research in general, is to use LSTMs or Neural Turing Machines that take as input the dataset and the test inputs and use that to make predictions about new data points. LSTMs are probably familiar to you. Neural Turing Machines have more of an external memory mechanism, in which they can store information about the training data points and then access that information when making predictions about new data points, and they do this in a differentiable way. As was noted before, this is not permutation-invariant, because you're taking the data points sequentially. You could also use an architecture that is permutation-invariant by having a feed-forward function that takes as input each of your training data points, x1, y1, x2, y2, et cetera, and then aggregates that information using something like an average operation to compute something that, in this case, is denoted as a or r. That is then passed into another feed-forward network to make predictions about new data points. Beyond these two types of architectures, a wide range of others have been proposed that use other memory mechanisms, as well as ideas around having slower weights and faster weights. Often, when people use the terms slow weights and fast weights, they refer to the task-specific parameters as fast weights and the meta-parameters as slow weights, because one of them is updated much more quickly than the other. This is a concept that was actually developed by folks in neuroscience, who have looked at how weights change, how synapses change, in the brain. There's also an architecture that uses a combination of attention mechanisms and convolutions. In this case, the convolutions are not permutation-invariant, although attention-based architectures can be. As a representative of black-box approaches in general, this type of method that uses convolutions and attention is able to do quite well on things like Omniglot, getting around 97-99% accuracy on settings ranging from 5-way 1-shot to 20-way 5-shot Omniglot, and it also does well on the MiniImageNet dataset, performing 5-way classification on real images from the ImageNet dataset. Yeah. [inaudible] So the question is whether there are any mechanisms here relating to neuroscience? What do you mean by h i? [inaudible] Right. Okay. I guess I'm not familiar enough with the neuroscience literature to be able to comment on that in a competent way.
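As a sketch of the permutation-invariant alternative described above, a shared feed-forward embedding of each (x, y) pair averaged into a task vector that conditions the predictor, here is one possible implementation; the layer sizes and names are illustrative and not taken from any specific paper's code.

```python
import torch
import torch.nn as nn

class SetAggregationMetaLearner(nn.Module):
    """Embed each (x, y) pair with a shared network, average the embeddings into a
    task vector r (so the order of the data points doesn't matter), and feed
    [x_test, r] to a second network that produces the prediction."""
    def __init__(self, x_dim, n_way, embed=64):
        super().__init__()
        self.embed_net = nn.Sequential(nn.Linear(x_dim + n_way, 128), nn.ReLU(),
                                       nn.Linear(128, embed))
        self.pred_net = nn.Sequential(nn.Linear(x_dim + embed, 128), nn.ReLU(),
                                      nn.Linear(128, n_way))

    def forward(self, support_x, support_y, query_x):
        pairs = torch.cat([support_x, support_y], dim=-1)        # (k, x_dim + n_way)
        r = self.embed_net(pairs).mean(dim=0, keepdim=True)      # average => permutation-invariant
        r = r.expand(query_x.shape[0], -1)                       # broadcast r to each query point
        return self.pred_net(torch.cat([query_x, r], dim=-1))    # logits
```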
I guess one thing I will say that has been somewhat inspired by the neuroscience literature is that people have looked at things that look like LSTMs but use more of a Hebbian update rule on h i, in order to update the sufficient statistics for a given task from your training dataset. One other thing perhaps worth noting, which we will actually cover in one of the reading sessions, is that there are a number of neuroscience researchers at DeepMind who have looked at these types of meta-learning methods, and they have focused on these black-box methods, using things like LSTMs, more so than on optimization-based or non-parametric approaches. Yeah. [inaudible] Yeah, so in general I'm of the opinion that Omniglot performance has saturated, for the most part. One of the algorithms we'll be talking about later in this lecture gets something like 99.9% accuracy on 5-way 5-shot Omniglot. Things that aren't solved include generation of Omniglot characters; that's certainly a lot harder, and it was actually proposed in the original paper. Also, and this is a bit of a nuanced point, the meta-train/meta-test split proposed in the original Omniglot paper is actually not the one used in all the machine learning papers, because that split doesn't have quite enough training data for these models to avoid overfitting a lot. So if you're interested in very efficient learning, I think performance isn't quite as saturated when you move toward the original meta-train/meta-test split; but then it's a matter of putting inductive biases into your network. Okay. So in Homework 1, you'll be implementing the data-processing pipeline for these meta-training algorithms, which involves taking the Omniglot dataset, actually loading images, and plugging them into a neural network. This is a pretty fundamental part of these algorithms. You'll also be implementing a very simple black-box meta-learner and training a few-shot Omniglot classifier, and you can somewhat compare it to some of the numbers in these papers. Okay. So to wrap up black-box adaptation, the pros and cons of this approach: first, it's very expressive. Given that neural networks are universal function approximators, these methods can represent any function of your training dataset. They're also very easy to combine with a variety of learning problems, for example supervised learning or reinforcement learning, and later in this course we'll talk about how to combine these methods with reinforcement learning; the spoiler is that it's very straightforward, as you might imagine with these types of models. The downside is that, in general, these neural networks are fairly complex, because they need to take in datasets and make predictions about new data points. They essentially need to figure out how to learn from data, and they need to do this basically completely from scratch.
At initialization, these LSTMs were not built as optimization procedures, and they need to learn those optimization procedures from scratch from the meta-training data. As a result, they're often fairly data-inefficient. By this I mean not data-inefficient at meta-test time, but that they often require a large amount of meta-training data, a large number of tasks, in order to perform well. Okay. Any questions on black-box approaches before we move on? Yeah. [inaudible] So the question was about other algorithms that take x test as input. You could certainly integrate x test as much as possible into the left-hand side of this diagram; it's still part of the input. If you look at the general form of these algorithms, it's something that takes in the train dataset and the test input, and you can really design whatever architecture you want to integrate those pieces of information. Whether they're treated somewhat separately or integrated in the same part of the network is up to you. Okay, great. So let's talk about optimization-based approaches. The motivation here is that, as we discussed a bit before, if we want to infer all of the parameters of a neural network, having another neural network output them isn't a very scalable way to do that. Instead, rather than treating this function as an inference problem, we can treat it as an optimization procedure. This is similar to what we do in supervised learning: we treat inference of our parameters as an optimization problem, not necessarily as an inference problem. This is where optimization-based meta-learning approaches come in. The key idea behind these methods is that we're going to acquire our task-specific parameters phi i through optimization, and then we'll differentiate through that optimization procedure to the meta-parameters, optimizing for a set of meta-parameters such that the optimization procedure for phi i leads to good performance. So how do we get started? You can essentially break down the meta-training problem into these two terms: one that maximizes the likelihood of your training data given your task-specific parameters, and one that maximizes the likelihood of your task-specific parameters under your meta-parameters. You can view this equation as the optimization procedure that you want to be able to do at test time, and also the optimization procedure that you're going to integrate into your meta-learning problem during meta-training: one that takes into account the training dataset, your accuracy on the training dataset, as well as your prior, which is given by p of phi given theta, where your meta-parameters parameterize your prior. So your meta-parameters are serving as your prior. Now we need to think about what form of prior we should impose using our meta-parameters. Well, one very successful form of prior knowledge that we've used in deep learning optimization is the initialization.
In particular, one of the things that's been quite successful in deep learning is what's called fine-tuning, where we take some set of pre-trained initial parameters and then run gradient descent on training data for some new task. Typically this is not just a single gradient step as written here, but many gradient steps, and it has worked really well. For example, if you look at something that pre-trains on ImageNet versus training from scratch, those are the two rows shown here, fine-tuning either on the PASCAL dataset or on the SUN dataset, we see a huge difference in performance using pre-training, which is labeled as "original" in this paper, versus using a random initialization. Great. So this is in many ways a valid approach to the meta-learning problem, where you first pre-train a set of parameters on your meta-training data and then fine-tune on your dataset at test time. Now, some questions that might come up: where do you get your pre-trained parameters? For vision problems, the typical way is to pre-train on ImageNet classification using supervised learning. In language, one very popular approach is to use models trained on large language corpora, models like BERT or other language models, or other unsupervised learning techniques. Pre-training of neural networks actually has a very long history: well before ImageNet, people were pre-training their models using unsupervised learning techniques and then fine-tuning them, although outside of language it's not clear that those approaches have really been that popular recently. But really, if you have some domain, in many ways the thing to do is just to train on some very large and diverse dataset and then fine-tune those parameters on whatever dataset you actually want to perform inference on. It's also worth mentioning that pre-trained models are often available online, so you don't necessarily even need to do the ImageNet training yourself; you can just download the parameters and fine-tune from there. The other thing worth mentioning is that fine-tuning is a bit of an art, like other aspects of deep learning, unfortunately, and there is a range of common practices for performing fine-tuning successfully. These include fine-tuning with a smaller learning rate; using a lower learning rate for lower layers of the network, since for many fine-tuning problems the low-level features are the things that need to change the least and the higher-level concepts are the things that need to change the most for a new task; freezing earlier layers of the network, potentially even setting a learning rate of zero for those layers; and possibly re-initializing the last layer. Typically, people search over these hyperparameters using cross-validation. The last thing worth mentioning is that architecture choices tend to matter a lot when choosing how to fine-tune.
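To make those common practices concrete, here is a small PyTorch sketch, assuming a torchvision ResNet-18 pre-trained on ImageNet and a new 5-class target task; the specific learning rates and the choice of which block to freeze are illustrative, not prescriptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load parameters pre-trained on ImageNet and adapt the head to a new 5-class task.
net = models.resnet18(pretrained=True)
net.fc = nn.Linear(net.fc.in_features, 5)        # re-initialize the last layer

# Freeze the earliest residual block; fine-tune the rest with layer-dependent learning rates.
for p in net.layer1.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD([
    {"params": net.layer2.parameters(), "lr": 1e-4},   # lower LR for lower layers
    {"params": net.layer3.parameters(), "lr": 1e-4},
    {"params": net.layer4.parameters(), "lr": 1e-3},
    {"params": net.fc.parameters(),     "lr": 1e-2},   # largest LR for the new head
], momentum=0.9)
```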
For example, residual networks tend to be quite good for fine-tuning, because the gradients flow relatively easily through various parts of the network when you have residual connections. Yeah. [inaudible] So you're asking: when you're fine-tuning, you're not actually using any information about the target task? We never explicitly encode which task we're solving as an argument to the procedure? Yeah, so you're saying that we never pass in any information about the task as input to this approach, or to the black-box approaches. Yes, that's correct. There are actually some nuanced reasons for not doing that, which is kind of interesting. It seems like, in many ways, you should pass in as much information as you have about a task so that the model can use it. But in fine-tuning, for example, if you passed in a one-hot vector for ImageNet and then passed in a different one-hot vector for your test task, then, since they're just two separate one-hot vectors, the test task and the training task are completely distinct things to the network. That information isn't something that can actually be used during the fine-tuning process to help it, because the network has only ever seen one task, so looking at another task's indicator won't tell it anything about the new task. Sorry, go ahead. [The student asks, roughly: the only one who knows the tasks is the person training the network; if someone handed me a network trained this way and a bunch of new images, there's no task index telling me which parameters to use for a specific task, the way the earlier multi-task approaches had.] Yes, so you're asking whether you could tell the network which task it's doing, for instance, pre-train it with multi-task learning, say recognizing animals and recognizing plants and recognizing cars, tell it which of those it's doing, and then fine-tune on that. In this case, the pre-training is just a single task, and the test task is also a single task, so you don't actually tell it any information about what task it's solving; the assumption is that you're going to fine-tune it on a new task, and pre-training and testing are separate tasks. I'll get to the point about why meta-learning doesn't pass in task information a bit later. Yeah. So in this case, we're fine-tuning the task-specific parameters, but can we also do fine-tuning on the shared meta-parameters?
So in this case, there isn't really a distinction between task-specific parameters and meta-parameters. What we're doing is just pre-training parameters theta, which you could call your meta-parameters, and then the optimization process produces your task-specific parameters phi. So it includes both of them, basically? I would actually say that the pre-trained parameters theta are the meta-parameters, and that initialization is basically serving as a prior on your optimization, in a somewhat implicit way, because the pre-trained parameters affect the solution that the fine-tuning process will give you. Yeah. Is the fine-tuning procedure something like pruning the network, reducing some weights and increasing others? In this case, it's just changing the weights; it's not removing or adding weights to the network. But generally, in your experience, what do fine-tuning procedures end up doing? I think the accepted wisdom is that they reuse features, and often change how those features are used for the new task, but don't necessarily change the features themselves that much. So the later layers of the network change a lot, and the features themselves don't change much, although I don't know if anyone has actually proven anything related to that. Yeah. Two quick questions: what architecture do you use for the new task? Typically you'll use the same architecture, or the same architecture with the last layer chopped off. Okay. And when you're doing these updates and fine-tuning, you said you only provide one task at a time? Right; we'll get to how we can integrate this into a meta-learning approach on the next slide, but yes, typically you pre-train parameters on a single task and then fine-tune on your test task. Okay, great. So one other example of where this has been used is pre-training with language models and then fine-tuning on text classification tasks. The plots here are pretty interesting. They show, on the x-axis, the number of training examples you have for the test task, and how performance on that task varies with it. The first thing we see is that there's a big difference between training from scratch versus using pre-trained parameters from universal language models, or ULM; that's the gap between the blue lines and the orange and green lines. The second thing we see is that as you have fewer examples in your new task's dataset, performance gets worse, the error goes up. Essentially, when you only have, for example, 100 data points for your test task, your performance isn't very good on that task.
And you can expect that as you decrease that even further, you would do even worse. So, essentially, fine-tuning is much less effective when you have smaller datasets. Motivated by this, how about we design a meta-learning algorithm with the explicit goal of being able to fine-tune with small amounts of data at test time? In particular, what we could try to do is take our fine-tuning procedure, evaluate how well those task-specific parameters do on a test dataset, on new data points, and then optimize the pre-trained parameters such that fine-tuning gives you a set of parameters that does well on the test data points. You can do this optimization across all of the tasks in your meta-training dataset, such that fine-tuning with small amounts of data leads to good generalization. Essentially, we'll be training for a set of parameters theta across many different tasks such that it can transfer effectively via fine-tuning. Okay, so at a more intuitive level, here's what this might look like. Say theta is the parameter vector that you're meta-learning, your meta-parameters, and phi i star is the optimal parameter vector for task i. Then you can view the meta-training process as the thick black line, where if you're at this point early in meta-training and take a gradient step with respect to task three, you're quite far from the optimum for task three, whereas at the end of meta-training, if you take a gradient step with respect to task three, you're quite close to the optimum, and likewise for a range of other tasks. We refer to this as the Model-Agnostic Meta-Learning algorithm, in the sense that it embeds this optimization procedure in a way that's agnostic to the model and the loss function that are used, as long as both are amenable to gradient-based optimization. One other thing worth noting is that this diagram can be helpful for getting across the intuition of the method, but it can also be a bit misleading. First, neural network parameter vectors typically do not live in two dimensions, and second, there often isn't a single optimum but rather a whole space of optima for neural network parameters. So in many ways it's less about reaching a center point between the different tasks' optima and more about reaching a point such that fine-tuning with small amounts of data will get you to a good part of the parameter space. Okay. So that was the objective; what does this look like as an algorithm? We can take the black-box adaptation approach that we outlined before and adapt it to the optimization-based meta-learning case. Essentially, you first sample a task and sample your datasets; then, instead of computing your task-specific parameters using a neural network, you compute them using one or a few steps of fine-tuning; and then you update your meta-parameters by differentiating through those fine-tuning steps into your initial parameter vector theta. Okay.
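Here is a minimal sketch of one MAML meta-update for a simple linear model with explicit parameters, just to show where differentiating through the inner gradient step happens; create_graph=True in the inner call is what lets the outer gradient flow back to theta. The function name and task format are my own, not the official MAML code.

```python
import torch

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One MAML meta-update for a linear classifier with explicit parameters
    theta = (W, b). Each task is a tuple (x_tr, y_tr, x_ts, y_ts)."""
    loss_fn = torch.nn.functional.cross_entropy
    meta_grads = [torch.zeros_like(p) for p in theta]
    for x_tr, y_tr, x_ts, y_ts in tasks:
        W, b = theta
        # Inner loop: one gradient step on the task's training set.
        train_loss = loss_fn(x_tr @ W.t() + b, y_tr)
        gW, gb = torch.autograd.grad(train_loss, (W, b), create_graph=True)
        phi = (W - inner_lr * gW, b - inner_lr * gb)
        # Evaluate the adapted parameters phi on the task's held-out test set.
        test_loss = loss_fn(x_ts @ phi[0].t() + phi[1], y_ts)
        # Differentiate through the inner gradient step, all the way back to theta.
        grads = torch.autograd.grad(test_loss, theta)
        meta_grads = [mg + g / len(tasks) for mg, g in zip(meta_grads, grads)]
    # Outer loop: gradient step on the meta-parameters theta.
    with torch.no_grad():
        for p, g in zip(theta, meta_grads):
            p -= outer_lr * g
```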
Any questions on this before I get into a few of the details? Yeah. If you were to initialize a multi-task network with whatever these learned weights turn out to be, versus training the multi-task network from scratch, do you think the multi-task network could do better with this prior? So you're asking, what if you use this as an initialization for multi-task learning instead? First you run this meta-training method to get some theta, and then you use those as your initial weights for multi-task learning. I see, so you're saying you could run this meta-training process to get an initial set of weights and then use that as an initialization for multi-task learning? Yeah. You could certainly do that. What this is doing, the meta-training process I described on the previous slide, is optimizing for fast adaptation to individual tasks given an individual dataset. You could also do the same thing for pairs of tasks or triplets of tasks, if you wanted to explicitly pre-train for, not three-shot learning, but three-task learning. You could also consider optimizing it for single-task adaptation and then fine-tuning it with multi-task learning. It's hard to say how well that would do, because it's not explicitly training for what it will be doing at test time, but it could conceivably do something effective. Yeah. When you talked about initializing the weights theta from a pre-trained network, was that just motivation for this algorithm, or do you actually do that, or could you start with a random initialization? Right, so in this case we start with a random initialization before this meta-training algorithm, and the pre-training discussion was mostly serving as motivation for how well pre-training can work in a range of settings. Yeah. [LAUGHTER] So does this work well even with one-shot learning? Because it seems like even with this approach you could risk overfitting. Yeah, this approach actually works really well even for one-shot learning, two-shot learning, et cetera. It's competitive with the black-box approaches that I mentioned previously. How long do you typically run the inner optimization? Right, so for the one-shot setting, the inner optimization is typically somewhere between one and five gradient steps, and even with only a few gradient steps you can get quite far. Okay. So one thing worth mentioning about this algorithm is that it brings up second-order derivatives, because we're optimizing for a set of meta-parameters: we have this outer gradient, and inside of the objective we also have this inner gradient right here. So you might be a little bit worried about this. For example, if we needed to compute the full Hessian of the neural network, we would be in a bit of trouble. And what if you want more than one gradient step: does that give us higher-order derivatives? So I want to go through on the whiteboard what the meta-gradient update actually looks like, so that we can figure out the answers to these questions. Great.
So let's set up some notation. In this case, I was writing out a gradient step as the update procedure, and I'm going to use u to denote that update rule, which is a function of theta and your training data points D train. So this is one, or a few, steps of gradient descent: theta minus alpha times the gradient with respect to theta of the loss on D train. I'm going to use d to denote total derivatives and the nabla symbol to denote partial derivatives, and we'll see why this distinction matters in a second. This is just for the purposes of the whiteboard; in the slides we'll use the gradient symbol for both. Okay. So, as you can see at the bottom of this slide, the optimization procedure we have looks like an optimization over meta-parameters theta of our loss function evaluated at our task-specific parameters phi and our test data points. I'm going to drop the i from the notation here just for simplicity. And this is the same as an optimization over the meta-parameters theta of the loss of our update rule, applied to theta and our training dataset, evaluated on our test dataset. Okay, this should all be clear from the board. In order to optimize this objective with gradient-based optimization, things like Adam for example, we need to be able to get the derivative of this objective with respect to our meta-parameters theta. So let's try to actually write out what this meta-gradient looks like. Using the chain rule, we first compute the derivative of the outer function, and then the derivative of the inside with respect to theta. What this looks like is: we take the derivative with respect to a placeholder variable phi bar of the loss of phi bar on D test, evaluated at phi bar equals u of theta comma D train, times the derivative of the update rule with respect to theta. Yeah, can you write a little larger? Yes, I can try to write larger from here on. So this first factor is the derivative of the outer loss, and the second factor is d phi d theta. This is why partial derivatives matter: if we just wrote this as a full derivative, it would basically be the same as what was originally written. Okay, great. So notice that this first factor is a row vector, I need to write bigger, and this second factor is a matrix, so the result is a row vector. The first factor can be computed with a single backward pass through the neural network: you just set the parameters of your network to phi bar and compute the derivative of the loss with respect to those parameters. So that's just one backward pass. The second factor is differentiating through the update process itself.
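For reference, here is my transcription of the whiteboard derivation so far, using the nabla symbol for partial derivatives and d/d-theta for total derivatives, as in the lecture:

```latex
\min_{\theta} \; \mathcal{L}\big(u(\theta, \mathcal{D}^{\text{tr}}), \mathcal{D}^{\text{ts}}\big),
\qquad
u(\theta, \mathcal{D}^{\text{tr}}) = \theta - \alpha \,\nabla_{\theta}\, \mathcal{L}(\theta, \mathcal{D}^{\text{tr}})

\frac{d}{d\theta}\, \mathcal{L}\big(u(\theta, \mathcal{D}^{\text{tr}}), \mathcal{D}^{\text{ts}}\big)
= \underbrace{\nabla_{\bar\phi}\, \mathcal{L}(\bar\phi, \mathcal{D}^{\text{ts}})\Big|_{\bar\phi = u(\theta, \mathcal{D}^{\text{tr}})}}_{\text{row vector: one backward pass}}
\;\; \underbrace{\frac{d\, u(\theta, \mathcal{D}^{\text{tr}})}{d\theta}}_{\text{matrix: } d\phi / d\theta}
```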
So this second factor is the part that is a little bit trickier to deal with. Okay, so let's try to actually compute what it looks like. Let's just start with the case where u of theta comma D train is a single gradient step. Then we can take the derivative of the update rule with respect to theta, and this is going to equal the identity matrix minus alpha times the second derivative with respect to theta of L of theta comma D train, and this second-derivative term is the Hessian of the training loss of the neural network. Any questions with this? Okay. So if we then plug this term in here, what we get is: we have this row vector, and we simply need to do a vector-matrix multiplication. And fortunately this means we don't actually have to compute the full Hessian of the neural network, because we have this vector on the left; all we need to compute is a Hessian-vector product. And there are much more efficient ways to compute Hessian-vector products via backpropagation for neural networks that don't require you to construct the entire Hessian. It also turns out that standard automatic differentiation libraries like TensorFlow and PyTorch will actually perform this Hessian-vector computation for you, in an efficient way that amounts to essentially performing additional backward passes, so you don't actually have to worry about coding this up yourself, which is very convenient. Okay. So that's the case if we had just a single gradient step in the inner loop. What if we have multiple gradient steps in the inner loop? In particular, what if u of theta comma D train is two gradient steps? So this is gonna equal theta minus alpha times the gradient with respect to theta of L of theta, D train, let's call this intermediate set of parameters theta prime, and then minus alpha times a second gradient, taken with respect to theta prime, of L of theta prime, D train. Is that behind the thing? I think that's still there. Okay. So this is two gradient steps. If we then want to compute the derivative of this, what we get is: we first get the first two terms that we had before, which is the identity minus the Hessian. And then, what about the second term? We want to compute the derivative of this last term with respect to the parameters theta. What this looks like is: first you compute the derivative of the outside, so you get the second derivative with respect to, I'll use theta bar here, of L of theta bar, D train, evaluated at theta prime, times d theta prime d theta. And this last term, d theta prime d theta, is just equal to the first two terms. One of the nice things we get here is that we don't get third-order terms. We get the Hessian evaluated at theta prime, which is the parameters after the first gradient step, and we get the Hessian with respect to the original parameters, but we don't get any third-order derivatives. And again this is something we can efficiently compute, basically with additional backward passes, without having to construct any full Hessians or compute higher-order derivatives, which is nice.
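In symbols (again just transcribing the whiteboard, with theta prime denoting the intermediate parameters after the first step), the single-step and two-step cases look like this; the two-step expression collects the terms written on the board into a product:

```latex
% One inner gradient step:
\frac{d\,u(\theta, \mathcal{D}^{\mathrm{tr}})}{d\theta}
  = I - \alpha\,\nabla^{2}_{\theta}\,\mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}})

% Two inner gradient steps, with \theta' = \theta - \alpha\,\nabla_{\theta}\mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}}):
\frac{d\,u(\theta, \mathcal{D}^{\mathrm{tr}})}{d\theta}
  = \Big(I - \alpha\,\nabla^{2}_{\bar{\theta}}\,\mathcal{L}(\bar{\theta}, \mathcal{D}^{\mathrm{tr}})\Big|_{\bar{\theta} = \theta'}\Big)
    \Big(I - \alpha\,\nabla^{2}_{\theta}\,\mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}})\Big)
```

Only Hessians evaluated at theta and theta prime appear, and because they are always multiplied on the left by a row vector, everything reduces to Hessian-vector products. As a tiny, self-contained illustration of the trick the autodiff libraries use (the quartic loss here is a hypothetical toy, just to show the mechanics of two backward passes):

```python
import torch

# Hessian-vector product via double backprop: compute (d^2 L / d theta^2) v
# without ever forming the Hessian. The quartic loss is a toy stand-in.
theta = torch.randn(10, requires_grad=True)
v = torch.randn(10)

loss = (theta ** 4).sum()
(g,) = torch.autograd.grad(loss, theta, create_graph=True)  # first backward pass
(hvp,) = torch.autograd.grad(g @ v, theta)                  # second pass gives H v
```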
Okay, and as you might imagine, if you continue to compute this for even more gradient steps in the inner loop, you basically continue to get these types of terms popping up, without higher-order terms. Okay. Any questions on some of the math? Okay, so, yeah. In the computation of the second-derivative term, aren't you trying to take the derivative with respect to theta of the derivative of theta prime? But then you wrote the second derivative with respect to theta bar, evaluated at theta prime. So how does that happen? So if you're trying to differentiate this third term with respect to theta, you first take the derivative of the outer function with respect to its arguments, times this term, d theta prime d theta, from the chain rule. Okay, and sorry that this slide likes to float upward, but okay. Cool. So now we've talked about optimization-based approaches, or at least kind of the basics of them. Let's think about how they compare to black-box approaches. So you can view black-box adaptation as having this general form, where it takes as input a training dataset and a test input, for example using something like a recurrent neural network. Now, you could also view MAML, or model-agnostic meta-learning, as taking a training dataset and a test input, where you have this function f phi that takes as input the test input, and the parameters phi are computed with gradient descent. So essentially you can view MAML as a computation graph with this funny embedded gradient operator inside that computation graph. And if you take this view, that means you can potentially mix and match components of these approaches. For example, one paper looks at: can you learn an initialization, but replace the gradient update that MAML does with a learned neural network that produces that update? So instead of learning the initialization and then running gradient descent, you could learn the initialization and have a neural network output your gradient update (a rough sketch of this idea appears just below). This was done in Ravi and Larochelle in 2017, and that paper actually precedes the MAML paper, but I mention it here just for the purpose of understanding these different combinations. Okay. And this computation graph view of meta-learning will come back again later. Okay. Now, one other thing to think about is how these approaches compare not just conceptually, but also in practice and in theory. So one question to think about, which was actually mentioned a bit before, is: what if your test task is different from the meta-training tasks that you were optimizing on? This is a question that we studied empirically to some degree, and we were aiming to compare MAML to black-box-type approaches, such as SNAIL, the architecture that used attention and convolutions, as well as Meta Networks, which is also one of the architectures that I showed before. We looked at Omniglot classification, where we tried to vary the tasks and see how the performance did as you vary the tasks away from the meta-training distribution. So in this case the x-axis will show the task variability, and the y-axis is going to show performance. And in the first study we looked at, we skewed the characters in the Omniglot dataset, so it was trained on characters that were kind of in the center.
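Picking up the mix-and-match idea mentioned above, here is a hedged sketch of replacing the plain gradient step with a learned update. This is only in the spirit of that idea, not Ravi and Larochelle's actual LSTM meta-learner; `update_net` and `learned_inner_update` are made-up names.

```python
import torch
import torch.nn as nn

# Sketch: keep a learned initialization theta, but let a small learned
# network produce the parameter update from (parameter, gradient) pairs,
# applied elementwise for simplicity, instead of taking a plain gradient step.
update_net = nn.Linear(2, 1)

def learned_inner_update(theta_flat, grad_flat):
    # theta_flat, grad_flat: flattened parameter and gradient vectors.
    inp = torch.stack([theta_flat, grad_flat], dim=-1)   # shape (num_params, 2)
    return theta_flat + update_net(inp).squeeze(-1)      # phi = theta + f(theta, grad)

# Compare with the plain MAML step: phi = theta - alpha * grad.
# The parameters of update_net would be meta-learned jointly with theta.
```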
So, going back to that Omniglot experiment: we then moved away from the meta-training task distribution, testing the model's ability to adapt to tasks that involve skewed characters. What we saw first is that all of the approaches' performance deteriorated as you moved away from the meta-training distribution. But we also saw that algorithms like MAML are better able to perform these out-of-distribution tasks as you move away from the meta-training distribution, because they're performing an optimization procedure at test time. Because you're running gradient descent at test time, you can still expect it to give you some reasonable answer, at least an answer that achieves good accuracy on the training dataset, for example. Whereas with black-box approaches, which are just taking in a dataset as input and producing an answer, when you move away from the training distribution there's really nothing you can say about what those algorithms are doing. Yeah. And then if you look at something like the scale of the characters, we also see this sharp drop-off as you move away from the meta-training dataset, but we consistently saw this pattern that optimization-based approaches were better at extrapolating, because they were still giving you a procedure at test time that looked like an optimization procedure. So this is one empirical trend that we noticed. And then you might ask, well, we're embedding the structure of optimization into the meta-learning process; does this come at a cost? In particular, one very natural question, which was actually brought up a bit before, is: how far can you actually get with a single gradient step or a few gradient steps? Are these methods actually as expressive as the black-box approaches that I mentioned before? It turns out that you can show that, for a sufficiently deep function f, the MAML function that I mentioned before can approximate any function of the training dataset and the test input. It can basically represent anything that the black-box approaches can represent, under a few fairly mild assumptions. The assumptions are that the inner learning rate is non-zero, that the loss function gradient doesn't lose information about the label (the standard mean squared error and cross-entropy loss functions fall under this category), and also that the data points in your training dataset are unique. And the reason why this is interesting is that it means MAML has the benefit of the inductive bias of gradient descent without losing expressive power. Yeah. What do you mean by inductive bias? What I mean by that is that at initialization, even before you do any meta-training, MAML still gives you an optimization procedure that's going to point you roughly in the right direction: you're still running gradient descent, and you'll still be able to improve on your training data. Yeah. Are these assumptions [inaudible] for any number of gradient steps, or just assuming [inaudible]? This is actually only for a single gradient step. What is sufficiently [inaudible]? Very deep. [LAUGHTER] Do you know, is there like an order of the [inaudible]? Exponential. Yeah, so I guess the assumptions that I listed here are very mild; the sufficiently deep function is not mild. It does need to be very deep.
And you could probably relax this depth assumption if you made other assumptions, for example about the gradient pointing in the right direction, or other things about the optimization. It sounds kind of like the sufficiently wide single hidden layer result. Yeah, yeah. Okay, so we're running out of time. Let's see. I guess we can just pick up where I left off on Monday next week. We've basically covered the basics of optimization-based meta-learning, and I'll cover the rest of it and go into a bit more of the advanced topics on Monday next week. On Wednesday this week we have applications of meta-learning and multi-task learning to things like imitation learning, generative models, drug discovery, and machine translation. I think it will actually be pretty exciting to see some of the real-world use cases of these algorithms. These will be student presentations and discussions. And then on Monday I'll wrap up optimization-based meta-learning, cover non-parametric methods, and talk about how all of these different approaches compare. Great. I'll see you on Wednesday.
