Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 6 – Reinforcement Learning Primer


So far we've covered multi-task learning and meta-learning in the context of supervised learning, as well as in the context of hierarchical Bayesian models. Today we're going to talk about what these types of algorithms start to look like when we move into sequential decision-making domains, in reinforcement learning. We'll get started with a primer on reinforcement learning and the multi-task reinforcement learning problem, including goal-conditioned variants of it, and then in future lectures over the next couple of weeks we'll cover additional topics in reinforcement learning when you have multiple goals, multiple tasks, etc. First, some logistical items. Homework 2 is due on Wednesday this week. Homework 3 will be out on Wednesday this week, and it will cover topics in goal-conditioned reinforcement learning, including some of the things we're talking about today. And the project proposal is due next Wednesday. Okay, so first, why should we actually care about reinforcement learning? We've talked a lot about supervised learning, which is used in a wide variety of places. To answer this question, let's think about when you do not need sequential decision-making, and everywhere else is where you do need sequential decision-making systems. You don't need a sequential decision-making system when your system is making a single, isolated decision, such as a classification or regression decision, and when that decision does not affect future inputs to the system and does not affect future decisions. From this point of view, we don't need sequential decisions whenever we're in a very isolated, black-box world, but in the real world our decisions are in many cases affecting the future, or affecting future aspects of the world. So there are many different applications of sequential decision-making problems. In some applications people choose to ignore the dependence of the future on the current decision, which is a simplifying assumption, but in many real-world cases this effect on the future is real. Some very common applications of reinforcement learning where you can't afford to ignore this effect include robotics; language and dialogue systems, where you're interacting with another agent or with a human; autonomous driving, where the decisions you make affect the future observations you receive; business operations; and finance. These are all in a sequential decision-making problem setting. Really, most machine learning systems that are deployed in the real world and are interacting with humans are faced with a sequential decision-making problem. Okay. So in practice, that's why this topic is important. And if you're interested in how humans act in the world and how humans are intelligent in the world, these sorts of problems are also a key aspect of our own intelligence: we can reason about how our actions affect the future. Okay.
So reinforcement learning, or in general sequential decision-making, is pretty important. In this lecture we're going to talk about, first, what multi-task learning looks like in the reinforcement learning context, when you're making sequential decisions. What does this look like in the formulation of policy gradients, which is one class of reinforcement learning algorithms? What does it look like in the context of Q-learning? We'll give a few slides of review on Q-learning; this should be a review for most of you because the topic is covered in a number of courses like CS 221, CS 229, etc., although policy gradients is not always covered in those courses, so we'll give a little bit more of an in-depth overview of that. And then finally we'll talk about approaches for multi-task Q-learning and goal-conditioned Q-learning, and algorithms that significantly improve upon the naive approach to combining multi-task learning with reinforcement learning. Okay. So first let's talk about the problem statement, and we can do this by looking at an example. So far we've been looking at things like object classification and regression, isolated problems where you make predictions. In contrast, you could think about something like object manipulation, which is very much a sequential decision-making problem. So you can view the problem of object classification as a supervised learning problem and the problem of object manipulation as a sequential decision-making problem. What are the differences between these two problems? In supervised learning, so far we've assumed that we have i.i.d. data, data that is independently and identically distributed according to some distribution, whereas in sequential decision-making, the action that you take affects the next state that you're in, so the data you're seeing is very much not i.i.d. Second, in supervised learning we typically assume some large dataset that's maybe curated by humans to ensure it has the distribution you care about, whereas in sequential decision-making you need to collect the data yourself in many cases, and it's also not clear what the labels are; you need to figure out what they might be in different applications. And lastly, in supervised learning you generally have a fairly well-defined notion of success, which corresponds to some error or prediction accuracy in relation to the labels, whereas in reinforcement learning, success is a little bit more vague. Okay. So these are some of the biggest differences between these two problem domains. Before we go into what the problem concretely looks like, let's go over some terminology and notation. So, as before, we have some neural network that's going to be making predictions, and in the classification setting you might be looking at an image and predicting the class corresponding to that image, corresponding to different types of animals, for example. Now, in reinforcement learning, we're no longer going to be making predictions like that.
Instead, we'll be using our policy. The policy will be taking actions, and the actions will affect the next state. So there will be a feedback loop that goes from the action back to the observation, and our outputs won't look like classes anymore; they might look more like this: we might need to figure out if we should run away, if we should ignore the tiger, if we should pet the tiger, etc. Okay. So we need to make decisions. In the notation, o denotes the observation that the agent or system receives as input, a denotes the action, and Pi denotes the policy, which is parameterized by Theta. Typically we assume that there is some underlying state of the world s. In the fully observed setting we get to observe s, and in the partially observed setting we get to observe o. What concretely is the difference between s and o? One example: you may be trying to chase a hyena or something, and if you're given an image, that would be an observation, whereas if you're given the pose of the respective animals, that would be the state. You'd basically be able to fully observe the underlying state of the system and the things that matter for making decisions in the world. In partially observed settings you might just receive an image; you may also have occlusions, and the image only actually shows part of the state. Okay. So this is the basic terminology for reinforcement learning. Now, one very basic approach to this sort of sequential decision-making problem is to treat it as a supervised learning problem. What you could do is say, "Okay, I just want to imitate some expert." So maybe you collect a bunch of driving data, collect the observations that the person sees and the actions that they took in those states, put this into some big training dataset, and then sample i.i.d. from this training dataset during supervised learning to train your policy to predict actions from observations. We've already seen a little bit of imitation learning: there was a paper presentation a week or two ago that looked at how we can apply meta-learning to imitation learning. These approaches generally work pretty well in some contexts. For example, if you have a lot of expert data of performing the right actions, then these systems can do something fairly reasonable. The places where these kinds of systems tend to fail are very long-horizon problems: you have compounding errors, because as you take actions you start to move off the manifold of the data, and then your errors compound until you're well off the manifold of the training data. Also, these systems don't reason about outcomes in any way; they're just trying to mimic what the data is doing rather than trying to accomplish some particular outcome. Okay. So this is where reinforcement learning comes in. For reinforcement learning, we need some notion of what's called a reward function, and this reward function should capture which states and actions are better or worse for the system.
This typically takes in both a state and an action, and it tells us which states and actions are better. For example, if we're driving, we might have a very high reward if things look like this and a low reward if we see something like this. Okay. So in aggregate, the states, the actions, and the rewards, as well as the dynamics of the system, define a Markov decision process, because this encapsulates the notion of a sequential decision-making problem. Okay. Any questions up to here? This should mostly be review for people. Okay. Cool. So the goal of reinforcement learning is typically to learn some policy. In this case we'll look at the fully observed setting: the policy takes as input some state and makes predictions about actions. The goal is to learn the parameters of that policy. In a deep reinforcement learning setting, your policy will probably be parameterized as a neural network, where the states are processed as input, you produce actions, the actions are fed into the world, and then the world gives you the next state, which is fed back into your policy. Okay. We can characterize this system with the graphical model here, where we have a policy that, in this case the partially observed setting, takes in the observation and produces an action, and the dynamics take in the current state and the current action and produce a distribution over the next state. One thing that's pretty important is that this dynamics function is independent of the previous state. This is what's known as the Markov property: essentially, the definition of a state in a Markov decision process is that you can fully define the reward function and the dynamics from the information in that state variable, independent of previous states. You can see this if you look at the dynamics distribution here: it only depends on s_t and a_t and doesn't depend on s_t minus 1. Okay. And then the goal of reinforcement learning, and typically the way we can formulate a concrete objective, is that we want to maximize the expected reward under our policy. In the infinite-horizon case, we can imagine the stationary distribution over states and actions arising from our policy and maximize the reward function under that stationary distribution, and in the finite-horizon case we have some horizon capital T and we want to maximize the rewards of the states and actions visited by our policy when rolling it out.
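Written out in notation, the objective just described looks roughly like this; treat it as a sketch reconstructed from the verbal description rather than the exact slide:

```latex
% Trajectory distribution induced by the policy and the dynamics:
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)

% Finite-horizon objective:
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \textstyle\sum_{t=1}^{T} r(s_t, a_t) \Big]

% Infinite-horizon objective, with p_\theta(s, a) the stationary state-action distribution:
\theta^\star = \arg\max_\theta \; \mathbb{E}_{(s, a) \sim p_\theta(s, a)} \big[ r(s, a) \big]
```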
Yeah? So are the actions here taken before the observation or after the observation? Like, is a_1 taken just before we observe Observation 1, or just after? It's taken just after. So you observe Observation 1, and then, as shown right here, your policy predicts an action from that observation, and that action is then actually executed in the real world, and that produces the next state, which produces the next observation. It seems like you'd want to use the state to make your action, but it sounds like you're saying the model doesn't try to convert the observation into a state first? So there are a couple of different versions of how you might handle the partial observability. Maybe one point of confusion here is that the arrowheads on these arrows are very hard to see: there is an arrow going from state to observation, and not an arrow going from observation to state. What your policy could do is try to form some estimate of the current state from your observation, do some sort of inference, and then pass that to your policy to predict the next action. And s here is the real state of the world rather than an embedded state? Yeah, exactly. The state is the real state. Okay. Cool. So now that we've talked about the reinforcement learning problem, what is a reinforcement learning task? We're going to define this for the sake of thinking about the multi-task learning setting. Remember, in supervised learning we defined a task as corresponding to the data-generating distributions, p of x and p of y given x, as well as some loss function. In reinforcement learning, a task will be defined as basically just a Markov decision process. So the task is defined by some state space S, some action space A, some initial state distribution p of s_1, the dynamics p of s prime given s and a, and the reward function. If you compare this to the supervised learning setting, the initial state distribution and the dynamics are basically the data-generating distributions, the reward function corresponds to the loss function, and the state and action spaces are just telling you what general set your states and actions lie within. All right. So a task is just a Markov decision process. One thing worth mentioning here is that if different MDPs are different tasks, then this is much more than just the semantic meaning of a task, because different tasks could have the exact same reward function but different action spaces, for example, or different dynamics. So we use the term task loosely to describe these different Markov decision processes. All right. So what are some examples of what different task distributions might look like, settings where we might actually want to apply multi-task learning in the reinforcement learning setting? One example we saw earlier, treated as a supervised learning problem but really a sequential decision-making problem, is a recommendation system, where you want to recommend videos, or treatments, or other things to a particular person. You could imagine different people being different tasks in this system: different people have different preferences and operate in different ways. If you view this personalized recommendation problem as a multi-task learning problem, then it's a setting where the dynamics and the reward function vary across tasks. The dynamics correspond to how that person will react to a particular action that you take, and the reward function corresponds to whether or not what you recommend to them results in a state that is good.
In some contexts, the initial state distribution may also vary for different people; it depends on how you formulate the problem. Okay. So this is one example. Another example where reinforcement learning has been applied is character animation: you can imagine trying to animate different characters in computer graphics across different maneuvers, and there has been some work applying reinforcement learning to learning maneuvers like this. In this case, if you treat it as a multi-task learning problem, different tasks would have different reward functions, but the dynamics would be the same, the initial state distribution would be the same, as well as the state and action spaces. Okay. Another setting where reinforcement learning has been applied is dressing, putting on clothes, which is actually a really challenging problem in computer graphics because of the deformable objects, and it also has applications in assistive robotics, for example. In this setting, things like the initial state distribution, meaning what state the garment is in before you put it on, as well as the dynamics, are going to vary across tasks, but the underlying reward function might be the same, such as getting the clothes onto the person. And one last example of a task distribution might be doing reinforcement learning across different robotic platforms. You may still want to do the same task across these platforms, like having them learn how to grasp things, but in this case the state space and the action space would vary across tasks. The initial state distribution and the dynamics would also vary across tasks, since the robots have different degrees of freedom and react to actions in different ways, but the underlying reward function could be the same. Of course, if you want the robots to do different things, then the reward function would be different too. Any questions on these examples, or any other examples? Okay. Cool. So this is a reinforcement learning task. Now, one alternative way to view multi-task reinforcement learning is as follows. We'll typically have some sort of task identifier that's part of the state; this is required to make it a fully observable setting, a fully observable MDP. The notation I'm using here is that s-bar denotes the original state, and z_i denotes the task identifier, as in previous lectures. If you take this view, then interestingly, you can fold looking at the task identifier, and determining the dynamics and the reward from it, into a single dynamics function and a single reward function, and then basically view your set of tasks as a single-task, standard Markov decision process. The state space and the action space are the union of the state spaces and action spaces of the original tasks, the initial state distribution corresponds to a mixture distribution over the initial state distributions for each of those tasks, and the dynamics and the reward function are a single dynamics function and a single reward function that take the task identifier as input and produce the next state or the reward.
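Writing these definitions out as a sketch of the notation (the mixture weights w_i over tasks, for instance uniform weights, are an assumption rather than something fixed on the slide): a task is an MDP, and the multi-task problem folds into a single MDP whose state carries the task identifier.

```latex
% A task is a Markov decision process:
\mathcal{T}_i \triangleq \big\{ \mathcal{S}_i,\ \mathcal{A}_i,\ p_i(s_1),\ p_i(s' \mid s, a),\ r_i(s, a) \big\}

% Augmented state with task identifier, and the corresponding single MDP:
s = (\bar{s}, z_i), \qquad
\mathcal{S} = \textstyle\bigcup_i \mathcal{S}_i, \qquad
\mathcal{A} = \textstyle\bigcup_i \mathcal{A}_i

p(s_1) = \textstyle\sum_i w_i\, p_i(\bar{s}_1), \qquad
p(s' \mid s, a) = p_{z}(\bar{s}' \mid \bar{s}, a), \qquad
r(s, a) = r_{z}(\bar{s}, a)
```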
So you can essentially still apply standard single-task reinforcement learning algorithms to the multi-task problem with this view on multi-task RL. Questions on this? So basically, multi-task RL is the same as the single-task reinforcement learning problem except that we have a task identifier that's part of the state. This task identifier could be something like a one-hot task ID, as we described in the supervised learning context; it could be a language description of the task; or it could be a desired goal state that you want to reach. This last case is what's known as goal-conditioned reinforcement learning, where you condition on a particular state that you want to be able to reach in the future. And what does the reward function look like? It could just be the same as before, where it takes in the task identifier and outputs the reward for that task in that state. Or, for goal-conditioned reinforcement learning, it can simply be the negative distance between your current original state and the goal state. Some examples of distance functions are Euclidean distance, Euclidean distance in some latent space, or a sparse 0/1 reward, a function that is 1 when s-bar equals s_g and 0 when they're not equal. Okay, so you might ask: if this is just a standard Markov decision process, why not just apply standard reinforcement learning algorithms? As I mentioned, you can, and it will work. It will be more challenging than the individual single tasks because you'll have a wider distribution of things in general, but you can apply the same types of algorithms. You can often do better, though, and we'll discuss that a bit in this lecture. Okay. Great. Any questions on how it can be formulated as a single-task RL problem? Yeah. [inaudible]. Yeah, so goal-conditioned RL is a special case of multi-task reinforcement learning where the task descriptor corresponds to the goal state and the tasks correspond to goal-reaching tasks.
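As a small sketch of the goal-conditioned reward choices just mentioned (Euclidean, latent-space, and sparse), in code; the encoder and the threshold here are illustrative placeholders, not anything specified in the lecture:

```python
import numpy as np

def euclidean_reward(s_bar, s_goal):
    # r(s_bar, s_g) = -||s_bar - s_g||_2
    return -np.linalg.norm(np.asarray(s_bar) - np.asarray(s_goal))

def latent_reward(s_bar, s_goal, encoder):
    # Same idea, but measure the distance in some learned latent space phi(.)
    return -np.linalg.norm(encoder(s_bar) - encoder(s_goal))

def sparse_reward(s_bar, s_goal, eps=1e-3):
    # Sparse 0/1 reward: 1 when the state (approximately) equals the goal, else 0
    return float(np.linalg.norm(np.asarray(s_bar) - np.asarray(s_goal)) <= eps)
```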
Okay. So let's get into some algorithms. I'll start by looking broadly at the anatomy of reinforcement learning algorithms and how the different approaches relate to each other, and then I'll talk about two classes of algorithms. We can generally view reinforcement learning algorithms with the following flow graph: we first generate samples in our environment, which is typically just running the policy forward; then we fit some model to estimate the return; and then we use that model to improve the policy. Different algorithms typically correspond to different choices in this green box and in this blue box. For example, one example of fitting a model might be just fitting something to the return, estimating the empirical return, such as the Monte Carlo return used by the policy gradient. Another example of estimating the return is to fit a Q-function, for example using dynamic programming algorithms, and another example of fitting a model is to estimate a function that models the dynamics. Once we have any of these models, we can, for example, apply the policy gradient to our policy parameters, we can improve the policy by taking the max over Q-values of our current Q-function, or, in the case of model-based algorithms, we can optimize a policy by, for example, backpropagating through the model into the policy. So this is a general outlook on reinforcement learning algorithms, where we have different choices for fitting a model to estimate the return and different choices for improving the policy. We also have different choices for how we generate samples, although that decision is generally orthogonal to the choice of algorithm. In this lecture we'll focus on model-free reinforcement learning methods, namely policy gradient methods and Q-learning methods, and in fact we'll stick with these algorithms for about the next two weeks. Then, in the lecture on November 6th, we'll focus on model-based RL methods and how they can be applied to multi-task learning. Okay. So let's start with policy gradients. This is our objective in reinforcement learning: we want to sample trajectories from our policy and estimate the return. We'll refer to this objective as J of Theta, and this is just rewriting J of Theta: you can estimate it by rolling out N trajectories, as in the example shown here, and computing the reward for each of those trajectories. Maybe the first trajectory has a high reward, the middle trajectory has a medium reward, and the last trajectory has a bad reward. The first sum is over the samples from our policy and the second sum is over time. So this is how we can estimate the expectation shown on the left. Now, what we could think about doing is: can we differentiate through this objective directly into our policy? If our objective is the expected reward, and we estimate it with the reward of a trajectory, where I'm using r of tau as shorthand for the sum over time of the reward of the individual states, then you can view this expectation as an integral over Pi Theta, because the expectation is with respect to Pi Theta of r of tau. All right. So this is our objective, and if we want to compute the gradient of this objective with respect to our policy parameters, we get something like this: we can move the gradient inside the integral because it's a linear operation, and then you basically have the integral over trajectories of the gradient of the policy times the reward function. Okay. So this is the gradient. Now, how do we actually go about evaluating it? We don't want to have to integrate over all possible trajectories, so we're going to use a very convenient identity, which is known as the likelihood ratio trick. What this identity shows is that if we look at the policy probability for a trajectory times the gradient of the log of the policy,
this is equal to (basically we just differentiate the log) the policy times the gradient of Pi divided by Pi, and of course the two Pis on the top and the bottom cancel, so this is just equal to the gradient of the policy with respect to the policy parameters. Okay. So we have this very convenient identity, and we can use it to expand out this equation: we can replace this term with the term on the left to get an integral that looks like this, and very conveniently, this integral now looks a lot like an expectation. It's an expectation under Pi Theta, so we can estimate the gradient by sampling trajectories from our policy and using those samples to evaluate the gradient of the log probability of our policy, weighted by the reward of that trajectory. Okay. So to recap what we did there: we don't want to have to integrate over all possible trajectories, and instead we're able to transform that integral into an expectation over trajectories drawn from our policy's distribution. Once we have this gradient, we can compute it and apply it to our policy parameters. Okay. So this is all with respect to trajectories; one thing that's important to do is to break this down into states and actions. We're denoting Pi of tau as the probability of the full trajectory, which can be broken down into the initial state density times a product over time of the policy probability and the dynamics probability. This is basically the probability of the trajectory under our policy. If we take the log of both sides of the equation, we get log Pi Theta of tau, and the log just changes the products into sums, and we can plug the right-hand side of that equation into our form for the gradient. Now, unfortunately, if we apply this naively, we have a term that corresponds to the probability of the next state given our state and action, and we don't know that probability value. But one thing we do know is that, because this is a gradient with respect to Theta, those terms don't depend on Theta; they're constant with respect to Theta, so their gradient is 0. We then get the final gradient, which corresponds to this term right here. This is basically grad log probability of a given s under Pi, which is something we can evaluate because our policy outputs a distribution over a conditioned on s, and the right term is just the reward function for the state and the action. All right. So this is the vanilla policy gradient, and it's something we can very clearly evaluate. As an algorithm, what this looks like is: we roll out our policy to get trajectories; we estimate the policy gradient by averaging, over those trajectories and over time, grad log Pi times the reward; and then we apply that gradient to our policy parameters.
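To make that loop concrete, here's a minimal sketch in PyTorch, assuming a discrete-action, Gymnasium-style environment; the class and function names are illustrative rather than the course's starter code, and no baseline is used:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    # pi_theta(a | s): a small network producing a categorical distribution over actions
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_step(policy, optimizer, env, num_trajectories=20, horizon=200):
    """One vanilla policy-gradient update:
    grad J ~= 1/N sum_i [ (sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t)) ]."""
    loss = 0.0
    for _ in range(num_trajectories):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            if terminated or truncated:
                break
        # Monte Carlo return of the whole trajectory, no baseline subtracted.
        traj_return = sum(rewards)
        loss = loss - torch.stack(log_probs).sum() * traj_return
    loss = loss / num_trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```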
So if we go back to our diagram, collecting data corresponds to the orange box, evaluating the return corresponds to the green box, and actually using that to improve the policy in the last step corresponds to the blue box. What this looks like is an algorithm called the REINFORCE algorithm: you explicitly sample trajectories from your policy, compute the gradient using those trajectories, and then use that estimated gradient to update your policy parameters. And then you can repeat this to iteratively improve your policy. Okay. So this is the algorithm. How does this compare to something like imitation learning, that is, maximum likelihood on expert actions? If you look at the policy gradient and you also look at the imitation learning approach, where you do supervised learning with respect to actions, the maximum likelihood objective looks pretty similar to the policy gradient form. The difference is just the reward term on the right: the policy gradient corresponds to maximizing the probability of actions that have high reward, and if they have low reward, you try to maximize them less, essentially. Okay. Now, one of the really nice things about this is that, because it's basically just a gradient descent algorithm, it's very easy to apply multi-task learning algorithms to it. It corresponds very closely to maximum likelihood problems, so all of the things we learned about in maximum likelihood supervised learning can be applied in the reinforcement learning context. Okay. So this is nice. Let's spend one more slide on what this algorithm is doing intuitively. If we look at the form of the gradient, which involves grad log Pi of a given s, and compare it to maximum likelihood: we have trajectories, and if we do maximum likelihood imitation learning, we're just trying to imitate the best trajectories, whereas in the policy gradient we have some distribution over these trajectories, and we're going to increase the probability of the actions that had high reward and place less probability mass on the actions that had low reward. As a result, we're basically making the good stuff more likely and the stuff that gets bad reward less likely, which formalizes this notion of trial and error: you try a few things, you do more of the good stuff and less of the bad stuff. Okay. So that's policy gradients. It's pretty easy to combine with multi-task learning, and it's also pretty easy to combine with meta-learning. The meta-learning algorithms that we learned, such as MAML and black-box meta-learning algorithms, just assume that you can get some gradient of your objective, and so we can readily apply them in combination with policy gradient algorithms.
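Writing the maximum-likelihood (imitation) estimator and the policy-gradient estimator from a moment ago side by side, as a sketch reconstructed from the verbal description:

```latex
\nabla_\theta J_{\mathrm{ML}}(\theta) \approx
\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)

\nabla_\theta J_{\mathrm{PG}}(\theta) \approx
\frac{1}{N} \sum_{i=1}^{N}
\Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big) \Big)
\Big( \sum_{t=1}^{T} r\big(s_t^{(i)}, a_t^{(i)}\big) \Big)
```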
So, for example, here's a very toy example of MAML with policy gradients where there are just two tasks: one task is running forward and one task is running backward. We're not evaluating generalization in any way; we're just going to look at whether it can learn to adapt its policy with a single gradient step for one of these two tasks. What we see first is that at the end of meta-learning, at this point right here, before taking a gradient step for one of the tasks, we get a policy that looks like this: it's running in place, essentially, ready to run in either of the two directions. If we then take one gradient step with respect to the reward function for running backward, we get a policy that looks like this, and if we take a single policy gradient step with respect to the reward function for running forward, we get a policy that looks like this. I guess one of the interesting things this shows is that there does exist a representation under which reinforcement learning is very fast and very efficient, at least in the context of a few tasks. One other thing worth mentioning is that the policy gradient here was evaluated with respect to 20 trajectories from Pi Theta, basically 20 trajectories similar to the video shown on the previous slide. Okay. So this is pretty straightforward. What about black-box methods? We can also apply policy gradients to black-box methods. This corresponds to using some LSTM policy, some policy with memory or recurrence, and training that policy with the policy gradient algorithm, or a variant of the policy gradient algorithm like the ones I mentioned on the previous slide. For example, in a previous paper that was actually presented a few weeks ago in class, one of the experiments was learning to visually navigate a maze. What they did is train the algorithm on 1,000 different small mazes and then evaluate its ability to learn how to solve new mazes, including both small mazes and large mazes. So we can look at what it does. This is first showing, after meta-learning, the beginning of rolling out the recurrent policy. In this case it doesn't know the task and it needs to navigate the maze. The left is showing the agent's point of view and the right is showing the maze. After it gets this experience, it is then able to learn how to solve the maze with basically just a single trajectory. So it first navigates around the maze to explore, and then at the end of that episode the memory of the architecture is not reset; you keep rolling forward that black-box architecture, and it can figure out from there, based on what's stored in memory, how to solve the task. They also looked at bigger mazes. Here's an example of it navigating through a bigger maze: at the beginning it's just exploring, since it needs to figure out how to solve the task, so it explores different parts of the maze. I guess both of these examples are successful examples; there are also failure cases. Here's one more: in this case, after it sees a single trajectory, it's able to very quickly navigate to the goal position. Okay. Yeah. [BACKGROUND]. Yeah.
So for MAML, the inner loop corresponded to 20 trajectories and one policy gradient step. In this example, the inner loop corresponds to basically two trajectories, where you can see the trajectories in the video: this is the first trajectory shown here, and the second trajectory is when it actually solves the task well. Does that answer your question? [BACKGROUND]. This is after meta-learning, yes. Yeah, so this is the inner loop. [BACKGROUND]. Yeah, and then the outer loop is trained a lot across the different tasks: this is trained across 1,000 mazes during the meta-training process, and it practices a lot on those mazes. Yeah. [BACKGROUND]. Yeah, so in this case it just gets this image as input; it doesn't get the layout of the maze. And in the case of the ant example, it just receives joint angles and other state information. Yeah. [BACKGROUND]. So in both of these examples, after the end of the first trajectory, it is reset to the initial position. Does that answer your question? [BACKGROUND]. So I think that after it reaches the end, it is then reset to this position again; we can watch it one more time if you want to verify. So it goes to the goal, and then it's respawned right there again. Yeah. [inaudible]. Yeah, that's a good question. I guess as of last year, these sorts of maze tasks are probably the most complicated tasks that I've seen these algorithms do. This year I've seen more complex tasks that these algorithms have been able to learn quickly, ranging from adapting to run on an entirely new agent or simulated robot, to settings where the tasks themselves are partially observable (this is also partially observable, but to a greater degree), and I've also seen settings with different robotic manipulation tasks where it can generalize to an entirely new manipulation task. Those are all very recent works, but yeah, there are more and better things to come. Yeah. [inaudible]. Yeah, so this is a good question. We'll cover meta-reinforcement learning in more detail next week on Wednesday, when there's going to be a guest lecture by Kate. But one thing I'll say here: in the supervised learning setting we talked about how MAML is very expressive. In the reinforcement learning setting, it's actually not very expressive, because of the policy gradient. Basically, if the reward is 0 for all of your trajectories, then your gradient will always be 0, and so even if the policy gets lots of rich experience about the environment with zero reward, it can't actually incorporate that experience to update the policy. That's just one example; there are other cases where the policy gradient isn't very informative. As a result, MAML with policy gradients isn't actually very expressive and is not as good. In general, applying these algorithms to the reinforcement learning setting is pretty easy when combining with policy gradients.
Combining them with methods like Q-learning and actor-critic algorithms is a lot more challenging, and Kate will certainly talk a lot during her lecture about that and some of the challenges that come up there. The biggest thing is that those algorithms aren't gradient-based algorithms; they're dynamic programming algorithms, so it's hard to combine these things. Yeah? Can these kinds of algorithms have things like curiosity added on top of the existing algorithms? Yeah. So curiosity-based approaches, and exploration methods in general, can certainly be combined with this: that's just a kind of objective, and you could use that objective as one of your tasks, you could augment all of your tasks with that objective, or you could imagine trying to learn exploration strategies, learning different forms of curiosity that are particularly effective for a class of tasks or a class of environments. Kate will talk about learning exploration strategies in her lecture next week. Yeah. [inaudible] can you use advantage estimation with value? Advantage estimation; you mean GAE, Generalized Advantage Estimation? [inaudible]. Yeah, so, to explain to other people what the question is: one of the challenges with policy gradients is that the gradient estimate it gives you is high variance, and one thing people typically do to reduce the variance is to use what's called a baseline, which corresponds to something that's subtracted from the reward term here. This gives you an unbiased estimate of the gradient, but one with lower variance. There are different techniques for estimating that baseline, and one of them corresponds to things like generalized advantage estimation. In the original implementation of the MAML algorithm, we used a Monte Carlo estimator for the baseline rather than a bootstrapped estimator. I think applying MAML to a bootstrapped estimator would be a bit tricky. You could of course always fit it from scratch on your batch of data, but applying MAML to it is a little bit tricky, just because bootstrapping isn't a gradient-based algorithm; it's a dynamic programming algorithm. [inaudible]. Yeah, it's hard. Okay. So, to recap, some of the pros of policy gradients: it's very simple, in that it just gives you a gradient of your policy, which is very nice, and it's very easy to combine with existing multi-task algorithms and meta-learning algorithms, as we saw in the last couple of slides. The downsides: first, it produces a high-variance gradient. This can be mitigated with baselines, and baselines are basically used by all of these algorithms in practice. I don't have time to cover them in this lecture, but feel free to come to office hours if you're interested in learning more. It can also be mitigated with trust regions, which people have also used; both MAML and the black-box methods were using baselines and trust regions in the optimization to make things more stable and more effective. The other downside of policy gradient algorithms is that they require on-policy data.
In particular, the way you can see this is that this expectation is with respect to Pi Theta, and Pi Theta is your current policy. So in order to improve your policy, you need data from your current policy. This is really important, because it means you can't reuse any data from your previous policies to try to improve your policy. It also means you can't reuse data from other tasks, or from other sources, basically. And this is really challenging: as a result, these algorithms tend to be less sample efficient than algorithms that are able to reuse data from previous policies, from other experience, etc. Things like importance weighting can help with this. You can add a weight that corresponds to the ratio between your current policy and the policy you collected the data with, but these importance weights also tend to give you high variance, especially when those two policies are very different. Okay. Cool. So now that we've talked about policy gradients, let's talk about value-based reinforcement learning. The benefit of value-based RL is that, first, these methods tend to be lower variance, by introducing some amount of bias, and second, they can use off-policy data, which is really important if you care about reusing data and being sample efficient. Okay. So, a very brief overview of these algorithms for those of you who are a little bit rusty. First, some definitions. A value function corresponds to the total reward that you will achieve starting from state s and following some policy Pi. It's a function of both the policy and your current state, and it captures, basically, how good or how valuable a state is. A Q-function corresponds to the same thing, but it's the total reward starting from state s, taking action a, and from there following Pi. So the a that's passed as input is a parameter and does not depend on the policy Pi; this is basically telling you how good a state-action pair is. These two things are very closely related: as I alluded to, the value function of our current policy corresponds to an expectation, under the policy, of Q of the state and of the action sampled from the policy. And one other thing that's really nice about the Q-function is that, if you know the Q-function for your current policy, you can use it to improve your policy. One very simple way to see this: if you set the probability of taking an action in your current state to one for the action that is the argmax of the Q-values, which just increases the probability of taking actions that have maximal Q-values, then the new policy resulting from this will be at least as good as the old policy, and typically better. Okay. So the goal of value-based RL is to learn at least the Q-function, and then use that Q-function to perform the task, that is, in order to get a policy.
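Written out as a sketch (the discount factor Gamma only shows up on the next slide; the undiscounted case just sets it to 1):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0 = s \Big]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0 = s,\ a_0 = a \Big]

V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q^{\pi}(s, a) \big]
\qquad
\pi'(a \mid s) = \mathbf{1}\big[ a = \arg\max_{a'} Q^{\pi}(s, a') \big]
\;\;\Rightarrow\;\; V^{\pi'}(s) \ge V^{\pi}(s)\ \ \text{for all } s
```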
One critical identity that's important for these types of algorithms is to note that, for the optimal policy, the following equality is satisfied: the Q-function for the optimal policy is equal to the expectation, under the dynamics, of the reward plus Gamma times the max over actions of the Q-function at the next state, where Gamma is a discount factor. The way you can see this is that if, at the current timestep, you observe some reward and then add the value of taking the best action from there on, that's going to equal the best value achievable from your current state and current action. This is what's known as the Bellman equation. Okay. We can use this Bellman equation to learn a Q-function. Here's one example of an algorithm that does this, called fitted Q-iteration. What it looks like is: you first collect a dataset using some policy, where the hyperparameters are the dataset size and the policy you use for data collection. You then set the reward plus the max over actions of the Q-value at the next state as a target label, and update your Q-function to try to match those target values. So you're essentially running a dynamic programming algorithm that tries to make the Bellman equation hold for your Q-function. For example, your Q-function, with parameters phi, might be some neural network that takes as input the state and action and outputs the Q-value, a scalar value for that state and action. Another way to parameterize this, in the discrete-action case, is to just pass in the state and output the Q-value for each of the actions in that state; that's often used in practice. The other hyperparameters you have in this algorithm are the number of gradient steps you take and the number of iterations you run it for. So in practice, you're going to be iterating between collecting your dataset, computing your target values, fitting your Q-function to those target values, and then recollecting your dataset. Okay. And the result of this procedure is that you can get a policy by simply taking the argmax of your Q-function at a given state: take the actions in the current state that maximize your future returns. Okay. So this is a Q-learning-style algorithm. Some important notes here. First, we can reuse data from previous policies. This procedure doesn't make any assumptions about which policy collected the data; there are no expectations with respect to Pi Theta in any of this. As a result, it's what's called an off-policy algorithm, because it can use off-policy data, and so you can use replay buffers: you can store data, aggregate data across all of your experience into a single replay buffer, and, when computing this update, load any sort of data from your replay buffer. This allows you, one, to keep aggregating and reusing data, and two, to get data that is decorrelated: if you just collect some data online and immediately make updates on it, you'll have very correlated data, which results in poor performance.
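A minimal sketch of one iteration of this procedure in PyTorch, assuming discrete actions, a Q-network that maps a state to one Q-value per action, and a replay buffer with a sample() method; these are all illustrative names rather than the course code:

```python
import copy
import torch
import torch.nn as nn

def fitted_q_iteration_step(q_net, optimizer, buffer, gamma=0.99,
                            num_grad_steps=100, batch_size=128):
    # Freeze a copy of the current Q-function to compute the dynamic-programming
    # targets y = r + gamma * max_a' Q(s', a'), then regress Q(s, a) onto them.
    target_net = copy.deepcopy(q_net)
    for _ in range(num_grad_steps):
        # Off-policy data is fine: transitions can come from any past policy.
        # `done` is assumed to be a 0/1 float mask marking terminal transitions.
        s, a, r, s_next, done = buffer.sample(batch_size)
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def greedy_action(q_net, s):
    # The resulting policy: pi(s) = argmax_a Q(s, a)
    return q_net(s).argmax(dim=1)
```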
Okay. Another thing to note, as I mentioned before, is that this is not a gradient descent algorithm; it's a dynamic programming algorithm. You can see that by the fact that this update affects the target values right here. As a result, it's tricky to combine this approach with things like MAML, and even with black-box methods in practice. But it is relatively easy to combine with multi-task learning algorithms and goal-conditioned learning algorithms, by simply conditioning your Q-function or your policy on the task identifier or the goal. Okay. So let's talk about multi-task Q-learning. Any questions on the setup for Q-learning before I move on? Okay. So, for multi-task RL, we can just take our policy and condition it on some task identifier, and likewise for our Q-function. In each of these cases I'm again using s-bar to denote the original state, where the new augmented state corresponds to the original state plus the task identifier. Analogously to multi-task supervised learning, we can use a lot of the things we've learned about before: stratified sampling, hard and soft weight sharing, other architectural changes, etc. So this is quite nice; we can reuse the things we've learned across supervised learning and reinforcement learning. Now, what's different? There are some things that are different about reinforcement learning, which we talked about at the very beginning of the lecture, that affect the algorithmic choices we make. The first is that the data distribution is controlled by the agent; it's no longer just given to us. So one thing we can think about is how we should explore in a way that's effective for multiple tasks, and we can also think about not just weight sharing but also data sharing across tasks: when we collect a batch of data, how should we choose to share that data across the tasks? Second, you may also know which aspects of the MDP are changing within your task distribution, and if you know this, you can leverage that knowledge in your algorithm choice by making assumptions about whether or not a particular aspect of the MDP is changing across tasks. Okay. So let's think about an example for how we might want to go about sharing data, or leveraging this sort of information. Say we're playing hockey, and we have some teammates and some opponents, and we want to practice different tasks. We may want to practice passing the puck from ourselves to a teammate, and we may also want to practice shooting goals. Now, if you're considering this multi-task learning problem: what if during practice you accidentally perform a very good pass to your teammate when you were trying to shoot a goal? Well, if this happens, it of course makes sense to store that experience as usual, but you can also take that experience and say, "Well, okay, even though I was trying to shoot a goal, I don't need to use this only for shooting a goal.
I could also say: in hindsight, if I had been doing task two, that would have been great. I would have gotten a very high reward for that task." So you can relabel that experience with the task two identifier and with the reward function for that task, and store that data for that task. Okay, so this is what's known as hindsight relabeling: in hindsight, you can take some experience that you collected with the intention of one task, relabel it for another task, and use it to learn the other task. It's also sometimes referred to as hindsight experience replay, or HER. Okay. So what does this formally look like? Imagine a goal-conditioned RL setting. First, we collect some data using some policy, as in the standard off-policy reinforcement learning setting, and we store the data in a replay buffer. Then we perform hindsight relabeling: we relabel the experience we just collected by taking the last state that we actually reached and imagining that it was the goal for that task. So we replace the goal that we were trying to achieve with the goal that we actually achieved, and replace the reward with the distance between the current state and that new hindsight goal. Once you have this relabeled experience, you store it in your replay buffer as well, then update your policy using the replay buffer, and repeat. Okay. Cool. So what about other relabeling strategies? This relabeling strategy used the last state as the thing we relabel as the goal. You could also use really any state from the trajectory, since those are also states that were reached. In general, you could choose any potential state to relabel with, although in practice one of the really nice things about relabeling with a state that you actually reached is that it can alleviate a lot of the exploration challenges. If you're exploring in the context of many tasks rather than one, and you accidentally solve one task while trying to perform another, then you've already solved the exploration problem for that task. This lets you bootstrap the learning process. Okay. Any questions on how this works? We can also generalize this to the multi-task RL setting, which is similar to the setting in the hockey example. Here, we collect data where the data carries a task identifier rather than a goal state. We store that data, and then we relabel it by selecting some task j, replacing the task identifier with the identifier for task j, and replacing the rewards with the reward function for task j evaluated at the corresponding states (the negative sign on the slide shouldn't be there). Then we store the relabeled data, update the policy, and repeat.
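Here's a sketch of the relabeling step for the goal-conditioned case, plus the multi-task variant; the transition format, the sparse reward, and the distance threshold are illustrative assumptions rather than the exact setup from the slide:

```python
import numpy as np

def goal_reward(s_bar, s_goal, eps=0.05):
    # Sparse goal-reaching reward: 1 when the reached state (approximately) matches the goal
    return float(np.linalg.norm(np.asarray(s_bar) - np.asarray(s_goal)) <= eps)

def relabel_with_last_state(trajectory, original_goal):
    """trajectory: list of (s_bar, a, s_bar_next) transitions collected while the
    policy was conditioned on original_goal. Returns the original transitions plus
    copies relabeled with the last state actually reached as the hindsight goal."""
    hindsight_goal = trajectory[-1][2]
    relabeled = []
    for (s_bar, a, s_next) in trajectory:
        relabeled.append((s_bar, original_goal, a, goal_reward(s_next, original_goal), s_next))
        relabeled.append((s_bar, hindsight_goal, a, goal_reward(s_next, hindsight_goal), s_next))
    return relabeled

def relabel_with_task(trajectory, task_j_id, reward_fn_j):
    # Multi-task variant: swap in task j's identifier and recompute task j's reward
    return [(s_bar, task_j_id, a, reward_fn_j(s_next), s_next)
            for (s_bar, a, s_next) in trajectory]
```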
Another question that comes up here, similar to the last slide, is: which tasks should we choose to relabel with? You could choose randomly, but one good choice in terms of exploration is to choose tasks on which the trajectory achieves high reward, since those are tasks for which the trajectory solves the exploration problem to some degree. Yep? Is there a special way to handle hindsight experience replay when the dimensionality of the state space is really high? Yeah, we'll talk about that in a few slides. Yeah? [inaudible question about whether the reward function depends on the task] Could you repeat the question? Does the reward function depend on which task we're doing, or is it independent of the task? Yeah. So if you initially collected the data for task i, you will get reward labels for task i, and if you then want to relabel for task j, you want to replace the rewards in that batch of experience with the rewards that would correspond to task j. Does that answer your question? [inaudible] Okay. Now, you can't always apply this trick. You can apply relabeling when the form of the reward function is known and can be evaluated. If you can't evaluate the reward function in all possible contexts, or it's expensive to evaluate, for example if it requires asking a human, it may be a bit trickier to do this. It also requires that the dynamics be consistent across the goals or tasks that you relabel for. If they're not consistent, then the (state, action, next state) tuples will no longer be dynamically consistent, and the resulting policy will be trained on data that corresponds to dynamics that aren't accurate. This is one example of exploiting the knowledge that the dynamics are the same across tasks. You also need to be using an off-policy algorithm, with an asterisk, because I believe some people have looked at applying this in on-policy settings. But basically, if we're relabeling experience for a task and storing it in a replay buffer, we don't necessarily have the policy that collected that experience when it was conditioned on a particular goal or task. Yeah? When you have some data and you choose a task to relabel it for, is there any reason to come up with multiple tasks; could you come up with a few different ones and relabel multiple times, or is it enough to pick just one? Yeah. Maybe one answer to your question, and you can tell me whether it answers it or not: when I was making this slide, I was assuming that you would have an initial set of tasks that you cared about.
There may also be a setting where you just have one task that you care about, and you want to leverage other tasks to improve learning for that task. In that setting, it is important to think about what auxiliary tasks we can construct to improve learning, and that is something that will actually be discussed a bit on Wednesday in the paper presentations. Yeah? I was thinking about something similar: why do we only take one task to relabel with; why not duplicate the data for each task? Yeah. So you can certainly choose tasks at random, or essentially choose all of the tasks. You could view the version where you relabel with the tasks that get high reward as a form of that, where you relabel everything and then, when you sample data from your replay buffer, you prioritize it to include data on which you get high reward, for example. So yes, you could definitely do that and then think about how to prioritize afterwards. One downside is that you probably do have to prioritize; it depends on the setting, but typically, if you did that, you'd want to prioritize so that you're not training on a ton of data with zero reward just because your policy was attempting something completely different (a small sketch of one such prioritized-sampling scheme appears below, after the empirical example). Okay, so let's look at one quick empirical example of what this looks like in practice. A paper from 2017 looked at goal-conditioned RL in the context of simulated robotic manipulation, with tasks such as pushing, shown in the top row, sliding, shown in the middle row, and pick-and-place, shown in the bottom row. Looking at it now, I'm not sure exactly what the difference between pushing and sliding is; maybe that's the kind of detail that's in the paper. Empirically, comparing without relabeling versus with relabeling, using a value-based RL method called DDPG: two of the curves show DDPG without relabeling, and the other curves show DDPG with relabeling, under two different relabeling strategies. You can see that in these settings, relabeling significantly improves performance, likely mostly because of an exploration challenge. In the pushing example and the pick-and-place example, DDPG without relabeling is getting essentially zero reward, which means it's having trouble finding any reward at all, and this approach helps it find rewards for certain goals by essentially amortizing exploration across the different tasks. Okay.
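Picking up the earlier point about relabeling for every task and then prioritizing: here is a minimal sketch of one way to sample a training batch from relabeled data so that it is not dominated by zero-reward transitions. The softmax weighting and temperature are illustrative assumptions, not a scheme prescribed in the lecture.

```python
import numpy as np

def sample_reward_prioritized(relabeled_transitions, batch_size, temperature=1.0):
    """Sample transitions with probability increasing in their (relabeled)
    reward, assuming each transition is a tuple whose last entry is the reward."""
    rewards = np.array([t[-1] for t in relabeled_transitions], dtype=np.float64)
    logits = rewards / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    indices = np.random.choice(len(relabeled_transitions), size=batch_size, p=probs)
    return [relabeled_transitions[i] for i in indices]
```

One would likely mix such prioritized samples with uniformly sampled ones so the learner still sees some unsuccessful experience; the right balance depends on the setting.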
Cool. And since we have a bit more time, we can talk a bit about image observations, which one of the earlier questions asked about. One of the things that's important in the goal-conditioned RL setting is that you need a distance function that tells you how far you are from your goal state, and this corresponds to your reward function. But when you have image observations, you don't have good distance functions for images in general; things like L2 distance don't work very well. So one thing you could imagine doing is to use a binary reward function that is simply 1 if the two images are identical and 0 otherwise. This will be accurate, but of course it will be very sparse. Even so, there are things we can do with it, ways we can use it for effective learning. In particular, one thing we can observe is that, under this sparse binary reward function, random unlabeled interaction is actually optimal if the objective is to reach the state it ended in at the last time step. So for example, if you have some agent randomly exploring in the world, you can say, "Okay, this is optimal if all we care about is reaching this final state at the last time step." We don't care about any of the other time steps or how we got there. And there are a couple of different algorithms that leverage this insight. First of all, it's easier to deal with image observations, because we can use a 0/1 reward function. The first thing you can do is use it for better learning: if we know the data is optimal with respect to that reward function, what if we just run supervised imitation learning on that data? In imitation learning, we typically assume that we have optimal demonstrations. Here, if that's our reward function, these random interactions correspond to optimal behavior for goal-conditioned RL. So what we can do is collect data from some policy, perform hindsight relabeling where we use the last state as the goal, store the relabeled data in a replay buffer, and then update the policy using supervised imitation learning, conditioned on the relabeled goal, on the replay buffer. It turns out you can do this, and it actually does decently well in a number of domains. In one paper that did this, the way they collected data was by using data from a human who was interacting with the environment, not in completely random ways but in more directed ways, though still far from optimal demonstrations of any particular task. They collected data from human play and performed goal-conditioned imitation learning on that data. Here is an example of the play data: this is just a human doing a bunch of random things in this environment in virtual reality, and there are no reward functions attached to this data at all. Then what you can do is take windows of this data, treat a state near the end of a window as the goal, and train a goal-conditioned policy that takes as input a goal image and the current image and regresses onto the actions the human took over that window. As a result, you get a policy, shown on the bottom right, that is able to reach goals including pressing buttons: for example, it presses the green button, then the blue button, and now it slides the door over to the left. It's able to achieve all of these different goals just by using that data. Okay.
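Here is a rough sketch of that goal-conditioned imitation idea: treat the last state of a window of logged interaction as the goal in hindsight, and regress the policy onto the actions that were actually taken. The deterministic MLP policy, the mean-squared-error loss, and the state-based goals are simplifying assumptions for this sketch; the actual learning-from-play setup is considerably richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): takes the current state and a goal state of the same
    dimensionality and outputs an action."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))


def goal_conditioned_bc_loss(policy, window_states, window_actions):
    """Hindsight imitation: use the last state of this window as the goal
    and regress onto the actions that were actually taken."""
    goal = window_states[-1].expand_as(window_states)  # broadcast the hindsight goal
    predicted_actions = policy(window_states, goal)
    return F.mse_loss(predicted_actions, window_actions)

# Example usage on one window of logged (state, action) play data.
policy = GoalConditionedPolicy(state_dim=8, action_dim=2)
states = torch.randn(50, 8)   # 50 consecutive states from the play log
actions = torch.randn(50, 2)  # the actions the human actually took
loss = goal_conditioned_bc_loss(policy, states, actions)
loss.backward()
```

Training would loop this loss over many randomly sampled windows of the play data; no reward labels are needed anywhere.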
Are there any other ways to use this insight? Another thing we could do with it is to try to learn a better goal representation. So if we have this 0/1 goal-reaching reward, it isn't very good for reinforcement learning directly, but we can use it to learn a better goal representation. In particular, we can ask the question, "Which representation, when used as a reward function, will cause a planner to choose the observed actions?" So we first collect random, unlabeled interaction data; in this case, data of a robot executing actions sampled from a random Gaussian distribution, as shown here. We then train a latent state representation and a latent-space model such that, if we plan a sequence of actions with respect to the last state, we recover the observed action sequence. Essentially, this corresponds to embedding a planner in latent space inside a goal-conditioned policy, and training that goal-conditioned policy with supervised learning to match the observed actions. We could use this policy directly, as in the previous paper, but we can also throw away the latent-space model and keep the goal representation that the planner was using inside that policy, and combine that goal representation with reinforcement learning. This is referred to as distributional planning networks, in the sense that it performs the planning procedure inside the neural network and outputs a distribution over action sequences. And the metric you get out of this is much more shaped, because the planner has to be able to use it as a reward function; if it only had the sparse reward, it wouldn't be able to succeed. So you can take this metric out, run reinforcement learning with respect to it on a variety of vision-based robotic learning tasks, and compare it to a variety of other metrics, such as pixel distance and distance in a VAE latent space. You can see that the metric that comes from this procedure, shown in green, leads to much more successful reinforcement learning, because it recovers a reward function that is both accurate and shaped. And so you can get behavior that looks like this, where the robot figures out how to reach an image of a goal, or how to push an object to reach an image of a goal. It can also be used in the real world, for example for reaching a certain goal image or for pushing an object. Okay.
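As a minimal sketch of the reuse step described above, running RL against a distance in a learned latent space instead of the sparse 0/1 image reward, here is what the reward computation might look like. The convolutional encoder below is just a stand-in; in the actual method, the representation would be the one recovered from the trained latent-space planner.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Stand-in for the image encoder recovered from the trained planner;
    here it is just a small convolutional network for illustration."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )

    def forward(self, image):
        return self.net(image)


def shaped_goal_reward(encoder, observation, goal_image):
    """Shaped goal-reaching reward: negative distance between the latent
    embeddings of the current observation and the goal image, used in place
    of the sparse image-matching reward."""
    with torch.no_grad():
        z_obs = encoder(observation)
        z_goal = encoder(goal_image)
        return -torch.norm(z_obs - z_goal, dim=-1)

# Example usage with batched image observations.
encoder = LatentEncoder()
obs = torch.rand(1, 3, 64, 64)
goal = torch.rand(1, 3, 64, 64)
reward = shaped_goal_reward(encoder, obs, goal)  # shape: (1,); higher means closer
```

The reinforcement learning algorithm itself is unchanged; only the reward signal is swapped from the sparse indicator to this shaped latent distance.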
So to summarize, what we talked about today is what the multi-task RL problem is, how we can apply policy gradients to it, and how we can think about weight sharing as well as data sharing in both policy gradient settings and Q-learning settings. There are a number of remaining questions, some of which you brought up today, that we'll cover in the next two weeks. For example: can we use auxiliary tasks to accelerate the learning process? That will be the focus on Wednesday. What about hierarchies of tasks, where we have subtasks and we want to learn higher-level policies that operate on those subtasks? Can we learn exploration strategies across tasks, rather than just using vanilla exploration approaches? And what do meta-RL algorithms actually learn when applied to various settings? We'll be covering each of these: the first will be covered on Wednesday in the paper presentations, and the second on Monday next week, also in paper presentations. Next Wednesday, we'll have a guest lecture by Kate Rakelly, who is the first author on a recent off-policy meta-RL paper that is, I think, currently the state-of-the-art method in meta-reinforcement learning. And then on Monday, we'll have paper presentations that study emergent phenomena in meta-reinforcement learning. For those of you who don't have quite as much experience in reinforcement learning, there are additional reinforcement learning resources, such as the Stanford course, the UCL course from David Silver, and the Berkeley course, and I believe all of these courses have lecture videos online. So if you're interested in learning more, those could be helpful, and they can also be useful for the homework. A couple of reminders: Homework 2 is due on Wednesday; Homework 3 covers hindsight experience replay and goal-conditioned RL, and it will be out this Wednesday and due a couple of weeks after that; and the project proposal is due next Wednesday. Okay. See you on Wednesday.
