Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 11 – Sergey Levine (UC Berkeley)


All right. Let's get started. It's my pleasure to introduce Sergey Levine, who will be giving a guest lecture today. Sergey is an assistant professor at UC Berkeley. He did his bachelor's and his PhD here at Stanford University, and his research is at the intersection between control and machine learning, with the aim of developing algorithms and techniques that can endow machines with the ability to autonomously acquire skills for executing complex tasks. Today, he'll be talking about information theoretic exploration and unsupervised reinforcement learning. Please join me in welcoming Sergey. [APPLAUSE] Thank you for the introduction. So I thought a little bit about what kind of topic to discuss today, and I figured I would talk about something that addresses some challenges in reinforcement learning, but that also ties pretty deeply to topics in multi-task learning and meta-learning, which are the core focus of this course. So let's start off with a little discussion of what's wrong with reinforcement learning algorithms. You've had a few lectures on RL, so hopefully you have a decent idea of what the basic setup looks like. But the basic setup has some issues. Here's an example of a task where you have a little bipedal robot learning to run. You might look at this and say, "Well, that's pretty cool, this little guy learned to run essentially from scratch." But what did it take for him to get there? The learning curve on the bottom shows the progression of the reward as a function of the number of time steps that he had to actually attempt running. And if we put a scale on this in terms of the hours of time it would take in the real world, if this were an actual robot trying to run, the little bit where he just starts making progress already takes 10 hours, and if you want to get to the end of this curve over here, that would be on the order of 100 hours. So it's pretty cool that something is working, but doing this from scratch seems to take a really, really long time. And if you run this algorithm twice, you get two pretty different solutions. They both look like reasonable running gaits, but something seems to be a little bit off if you get totally different solutions each time you do this. And then you might also think, "Well, how did we define the task for this little guy?" When we learn about reinforcement learning, classically we're told that we specify objectives or reward functions that tell the system what it's supposed to do, and the reinforcement learning algorithm is supposed to figure out how to do it. So you might think, "Well, if you want it to run, then clearly you just give it a reward function that says you get a 1 if you run and a 0 otherwise, go for it, and it can figure this out." But of course, that's not how it works in practice. In practice, if you actually use a reward function like this, you're not going to get a very good solution. What people do instead is they define some complicated monster of a reward function like this that says, "Well, have a large forward velocity. Keep the torso upright. Don't fall too much," and all this other stuff. So they essentially encode a little bit more guidance about how the task should be performed.
So: long training times, different outcomes when you run it multiple times, complex reward functions. There are a lot of these challenges. Some of them are local optima challenges, some of them are challenges with needing to craft very detailed reward functions, and some of them are challenges associated with long training times. And there are a variety of different ways we can start thinking about solving these challenges. I'm going to discuss one perhaps somewhat non-obvious direction, which is to think about how it is that humans and animals avoid having to deal with some of these issues. Well, maybe part of the key, and this very much goes to the core of this class, is that we're not really learning all these skills from scratch. In standard RL settings, you have a very detailed goal, you're given that goal, and you just try to achieve that goal. If you imagine a child interacting with its environment, maybe it doesn't have very specific goals, but it's interacting with the world and acquiring knowledge about the world through largely unstructured interaction. And now we might start thinking, "Well, can this kind of unstructured interaction help us address these issues?" Perhaps the issues associated with local optima can be mitigated by obtaining very good coverage over the space of possible behaviors, so that you know what's possible in the world. And then maybe once you have that more global view of the possibilities, you can avoid the local optima when you're provided with a specific task to accomplish. Perhaps the need for detailed rewards can also be alleviated by having some kind of diversity-seeking exploration, where you see what all the possible things you can do in the world are, and then maybe you can deal with objectives that are as simple as this, because you've already seen things that look a little bit like running and a little bit like not running. And maybe the long training times can also be alleviated by essentially utilizing unstructured experience like this as unsupervised pre-training, to build up representations and build up some initial skills that can later be utilized to address user-specified goals. So the particular problem that I'm going to talk about today is the problem of recovering diverse behavior without access to any reward function. And of course, I'll allude a little bit to how this can be used to then achieve specific goals later down the line, but I'll mostly focus on the unsupervised part, the pre-training part. So this part, essentially. Why do we want this? Well, you might want it because you can learn skills without supervision and then use them to accomplish goals later on; you can learn subskills and then use those subskills with hierarchical reinforcement learning methods; and you can explore the space of possible behaviors to help you later on get out of those local optima. So if you want a more concrete motivating example scenario, let's imagine that you buy your Mark 1 household robot, and this is your kitchen. You buy the robot in the morning, you put it in the kitchen, and then you say, "You know, do whatever it is that you need to do to prepare yourself for an unknown future goal," and then you're gonna go to work, and the robot is going to go at it in your kitchen, trying to figure out what it is that it can do.
And then you're gonna come back in the evening and you're gonna say, "Well, tonight your job is going to be to do the dishes for me." And the robot should have done whatever was necessary during this unsupervised pre-training phase to prepare it as well as it possibly could for whatever unknown task you might give it down the line, like doing the dishes. That's the example scenario that we can keep in our minds as I discuss some of the techniques that we can think about for this. And if you're willing to stretch the analogy a little bit, you could also imagine that the child that was playing around with toys on the previous slide was unintentionally doing some variant of this, essentially exploring the world with the unintentional aim of preparing itself for whatever its future might demand of it. Okay. So that's maybe the high-level motivation, but for the concrete things that I'd like to discuss in today's lecture, I'm going to start with some definitions and concepts from information theory that we're going to utilize repeatedly in the lecture to actually think of algorithms that can do this. I'm going to discuss a distribution matching formulation of reinforcement learning that deviates a little bit from the standard way of formulating the reinforcement learning problem, and provides us a way to connect it with some of these probabilistic concepts. Then I'll talk about our first fully unsupervised reinforcement learning algorithm: we're going to learn to reach diverse goals without a reward function. Then I'll discuss a way that we can extend this idea to match desired state distributions, talk about whether coverage of state distributions is actually a good exploration objective in some theoretical sense, and then generalize the idea further from covering the space of states to covering the space of skills. And I'll conclude by briefly tying these ideas back to meta-learning, with a short discussion about how some of these unsupervised skill extraction techniques can actually be utilized together with meta-learning to perform essentially unsupervised meta-learning. That'll be at the end of the lecture. I'm not sure if I'll get through all of this; it sort of depends on how many questions you guys ask. So feel free to ask questions. If we're running out of time, this is intentionally designed so that the most important stuff comes first, and the fuzzier stuff comes later. All right. Let's start with some definitions and concepts. Many of you might already know a lot of these, but I do want to go through them to make sure we're all on the same page. So first, some useful identities. Hopefully, all of you know what a distribution is. But we need to know what a distribution is because we're going to work a lot with distributions. So if this is your variable x, you can think of getting some samples, represented by those blue dots. Using those samples, you can fit a distribution, which just tells you how dense your points are in different regions. Now, a particular quantity that we can compute on these distributions, which is going to be very useful for us in this lecture, is the entropy of the distribution. I'm going to use this script H to denote entropy. Entropy can be defined like this, and this equation might be a little bit cryptic.
But to give you a little bit of intuition about what's going on here, it's the negative of the expectation of the log probabilities. So intuitively, if your distribution has very large log probabilities somewhere, then this expectation will be large, which means that its negative will be a very negative number. If the distribution has very small log probabilities, then this expectation will also be small, so its negative will be a less negative number. So larger entropies correspond to smaller log probabilities, which means that the entropy quantifies, in a sense, how broad the distribution is. If this distribution is very wide, it has large entropy, meaning it has small log probabilities. And if it's very peaked, then it has a small entropy, meaning large log probabilities. This will be very important in today's lecture. The third very useful concept is mutual information. Now again, mutual information can be defined in a somewhat cryptic way, but then we can discuss the intuition behind it, and that intuition will be very useful for us. Mutual information is a measure of, intuitively, how dependent two random variables are, and its definition is the KL divergence between the joint distribution of those variables, p of x, y, and the product of their marginals, p of x times p of y, which we can equivalently write like this using the definition of KL divergence. But to provide a little bit of intuition for this, let's say that you have two different distributions over x and y, this top one here and this bottom one. The green dots that I've plotted are samples from those distributions. So now, from looking at this definition, can you raise your hand if you think that the top plot shows a distribution with higher mutual information between x and y? Now raise your hand if you think that the lower one has higher mutual information. Okay. Right. So since mutual information is a measure of the degree to which the variables are dependent on each other, high mutual information means that x and y are highly dependent, and low mutual information means that they are largely independent. And very conveniently, we can rewrite mutual information as the difference of two entropies: the entropy of the marginal of one of the variables, so let's say the entropy of p of y, minus the conditional entropy of that variable conditioned on the other one. And this has, I think, a very appealing intuitive interpretation. The second term is very small, a very small entropy, if, when you know x, you can accurately guess y. So in this top plot, y is roughly equal to x, so if you know x, you can guess y pretty well. In the bottom plot, y is independent of x, so knowing x doesn't help you guess y any better. But it's not enough for you to be able to guess y from knowing x; it also has to actually be difficult to guess y a priori. If y is always equal to 0.5, of course you can guess y very accurately if you know x, but you can also guess it very accurately if you don't know x. So this first term basically quantifies how hard it is to guess y a priori. If it's hard to guess y a priori, but much easier to guess y if you know x, that means that x tells you a lot about y, and that means that x and y have high mutual information. And the reason I'm somewhat belaboring this definition is because this intuition will be very important later on for understanding what some of these unsupervised reinforcement learning algorithms are doing.
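For reference, here is a written summary of the two identities just described, in the notation used throughout the lecture:

```latex
% Entropy: the negative expected log-probability under p
\mathcal{H}(p(x)) = -\,\mathbb{E}_{x \sim p(x)}\big[\log p(x)\big]

% Mutual information: the KL divergence between the joint and the product of
% the marginals, which can be rewritten as a difference of entropies
\mathcal{I}(x; y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)
                  = \mathbb{E}_{(x, y) \sim p(x, y)}\!\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]
                  = \mathcal{H}(p(y)) - \mathcal{H}(p(y \mid x))
```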
So as a little warm-up, let's go through a short example of mutual information in reinforcement learning. Let's say that we have a distribution defined in terms of quantities that we actually care about in RL: it's going to be defined in terms of policies and states. We'll say that pi of s is the state marginal distribution of our policy pi. So you run your policy a whole bunch, count up which states it visits, average together those counts, and you get a distribution over states. The entropy of pi of s quantifies, in a sense, the coverage of your policy. If your policy has a very large entropy of pi of s, that means intuitively that it goes to a wide range of different states; it's doing a good job of exploring. Now let's think of an example of how we can use mutual information to understand something about MDPs. Here's an example of a mutual information quantity called empowerment. I'm not actually gonna go through empowerment in detail, I just wanna use it as a short example. Classically, empowerment is defined as the mutual information between the next state, s_t plus 1, and the current action a_t. On the surface, this might at first seem like a slightly cryptic concept, but let's go back to that difference-of-entropies definition from the previous slide. We know that we can write this mutual information as the entropy of s_t plus 1 minus the entropy of s_t plus 1 given a_t. So a large value means that if you know what action you took, you can guess your next state very accurately, and if you don't know what action you took, you cannot guess your next state very accurately. What that means is that you go to a wide range of different states and your actions are very predictive of those states. So now let's think about that a little bit. Why would we call something like this empowerment? It sounds like a very lofty term. Any guesses about why this is called empowerment? Yes? "It measures to what degree you can cover your environment." Exactly, yeah. It measures the amount of control authority you have over your world. If your actions are highly predictive of your states, that means your actions are influencing your states in a fairly predictable way, so you have a lot of control over what goes on. If these two entropies are equal, that means that your actions have no effect on the state; you're sort of enfeebled, you're not able to affect the world around you. So that's why it's called empowerment. Anyway, that's just a little exercise to help us think about mutual information in the context of control problems. Let's talk about the first kind of algorithmic component in today's lecture, which is how we can reformulate classic reinforcement learning as a distribution matching problem. In this case, it'll be a distribution matching problem on actions. I'll first motivate this a little bit with an actual challenge that we sometimes face in RL. This is the challenge that I showed on that first slide: you take your favorite RL algorithm, this particular one is TRPO, you run it twice, and you get two different answers, and you think, what the hell just happened? Well, you can set up a little toy problem that captures a microcosm of this issue, because it's a little hard to reason about these complicated humanoids. So let's set up something much simpler.
Let's say we have four legs instead of two, because that's clearly simpler, and we're navigating a little maze. Our goal is to reach this blue square. Now, when you're first starting out, you can explore the upper passage or the lower passage, and both at first seem like you're making progress towards the goal. So maybe you make it a little further along the upper passage and you think, oh, I'm getting closer, I'm getting closer, let's just go there, but then you get stuck. Intuitively, the right strategy for a problem like this ought to be to explore both passages in roughly equal proportion until you figure out that one of them is decisively better than the other. And I would actually posit that the issue here is very much like the issue with the running humanoid: reinforcement learning is self-reinforcing. Once you see that you're making a little bit more progress along the upper passage, classic reinforcement learning algorithms will want to commit all the probability mass there because they're getting larger rewards there, in the same way that as soon as this little running guy figures out one type of running gait, he'll want to go more to that place and ignore all other possible gaits, because the one running gait where he made a little more progress is the one that seems to be more rewarding. And that's a local optima problem. So how can we track both hypotheses, or in general all hypotheses that are valuable to us? We're going to talk a lot about Q-functions. Many of you might already know what Q-functions are, but here's a short primer. In RL, you have actions, you have states, and you have rewards, and your goal is to maximize your expected reward under your policy. So hopefully we're all good with that. The Q-function is defined as the expected value of the rewards that you will accumulate if you start in state s, take the action a, and then afterwards follow your policy pi. Q-functions are very useful because if you have a Q-function for a policy, you can extract a better policy simply by being greedy with respect to your Q-function. So if you take, with probability 1, the action that is the argmax of your Q-function, you know that this policy will be at least as good as the one that you used to train your Q-function. And of course from this you can get a reinforcement learning algorithm; the one that I outlined here is policy iteration. You can also directly enforce optimality on the Q-function by forcing your Q-function to obey this equation. If you can enforce this equation everywhere, then you'll just recover the optimal Q-function directly. So that's Q-learning. Okay. So this is basically the only equation you need to know for the Q-learning algorithm.
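To make the Q-learning machinery just described concrete, here is a minimal tabular sketch; the learning rate and discount factor are standard choices I'm assuming here, not values from the lecture:

```python
import numpy as np

# Q is an (n_states x n_actions) array of Q-values.

def greedy_action(Q, s):
    # Policy improvement: act greedily with respect to the current Q-function.
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Enforce the Bellman optimality equation on observed transitions:
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```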
We can do something like Q-learning for our little example with the two passages. Let's say that initially we make a little bit more progress along this upper passage. If we make a little bit more progress along the upper passage, then the height of this Q-function will be a little larger; it'll take on slightly larger values for going left here. And that's a problem, because if the way that we're going to improve our policy is by taking the argmax, then our policy will put all this probability mass on this seemingly better option. This option is not actually better, it just looks better because we got lucky. And as soon as we get lucky once, we'll go there even more, which means that we'll get even luckier, until eventually we're completely committing ourselves to exploring this upper passage without even having seen what's going on down here. So that's no good. What we can do instead is commit our probability mass in proportion to some positive transformation of our Q-function, so that if the Q values for two options look about equally good, we'll explore them about equally often. And it turns out that a very convenient positive transformation to use is the exponential. We'll see why in a second. But intuitively, if we choose our policy not to be the greedy policy but to commit probability mass in proportion to the exponential of the Q values, then two options that have roughly equal Q values will be explored to about equal degrees, and we won't necessarily have this problem of immediately committing to the first option that accidentally looks a little bit better. So why is it so convenient to use an exponential? Because it allows us to bring in some of those concepts from information theory. It turns out that if you write down the KL divergence between your policy pi and the exponential of the Q-function, which you can treat as a probability distribution if you normalize it, then by the definition of KL divergence, minimizing this is the same as maximizing the expectation under pi of Q minus log pi. And this is now starting to look an awful lot like regular reinforcement learning. The first term is basically your expected reward, because remember that Q is the expectation of future rewards, so this first term turns into the expectation under pi of your future rewards. The second term, that's the new guy. That's something that we didn't have in regular RL, but the negative expectation under pi of log pi is just the entropy. So all we've done, if we change our objective to this KL divergence, is we've taken our regular RL objective and added the entropy of pi to it. We call this the maximum entropy objective, because what it tells us to do is to maximize our rewards while at the same time being as random as possible. And this actually allows us to address this problem with the two passages. Why? Because if you're maximizing your rewards while being as random as possible, that means that when you have two roughly equally good options, you should take both options in equal proportion, because that will give you higher entropy than committing to one of the two. Once you're sure that one of the two options is much, much better, then it's fine to commit all your probability mass to that. But if they look about equally good, you should try both, and that will allow you to maximize your entropy. So it's a distribution matching objective in the sense that pi is matching this exponentiated Q distribution, and it allows us to begin addressing this local optima issue.
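Written out, the distribution matching objective just described looks like this; minimizing the KL divergence between the policy and the exponentiated Q-function is, up to the normalization constant, the same as maximizing expected reward plus policy entropy:

```latex
% exp(Q) treated as an unnormalized distribution over actions
-D_{\mathrm{KL}}\!\left(\pi(a \mid s) \,\Big\|\, \tfrac{1}{Z(s)} \exp Q(s, a)\right)
  = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a) - \log \pi(a \mid s)\big] - \log Z(s)

% Summed over time, this gives the maximum entropy RL objective:
\max_{\pi} \; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}\big[r(s_t, a_t)\big]
            + \mathbb{E}_{s_t \sim \pi}\Big[\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]
```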
Now, building on this idea, we can do lots of fancy math. For those of you who like concepts like variational inference, you can actually show that this kind of objective emerges from a particular kind of inference problem. In optimal control or reinforcement learning, we ask the question: which actions lead to an optimal future? Which action maximizes your reward? That's a very reasonable question to ask. In inference, you can imagine asking a slightly different question, but ending up with a very similar answer, which is: given that your future is optimal, which action would you have taken? It turns out that the answer to that second question is basically exactly that maximum entropy objective. If we're going to talk about inference, of course, we need to ask: inference in what model? So we're going to build up a probabilistic graphical model that we can use for maximum entropy reinforcement learning. In this graphical model, we're going to have states and actions; that should be familiar to all of you, that's basically how a sequential decision-making process works. But we also need to somehow introduce our rewards. And the way that we will introduce our rewards at first looks a little peculiar, but it's actually going to make a lot of sense. We're going to introduce additional auxiliary variables that I'm going to call O, for optimality. These variables are not equal to rewards; they're actually binary variables, and our evidence is that those variables are all true. That is to say, our evidence is that you are being optimal; given that you are being optimal, what action are you taking? And we're going to define the probability of those variables being true as the exponential of the reward. There's a slight regularity condition for this to make sense, which is that your reward needs to always be negative, because you can't have probabilities that are larger than 1. But except for that regularity condition, this is a well-defined probabilistic graphical model. So now the question that we want to answer is: what's the probability of an action given your current state, and given that you are always being optimal? Given that you are rational, how likely is a certain action? And if we set up this inference problem, we can do basically the usual thing, some form of variable elimination. I'll walk through this pretty quickly; there'll be lots of equations, but I promise you that they're actually pretty simple equations. We need to write down the distribution over which we're doing inference. It's a distribution over all of our states and actions, because we don't know what those are going to be, given our evidence, which is that these optimality variables are true. We have three types of conditional probability distributions in this model: the initial state, p of s_0; our optimality variable probabilities, the exponential of r; and the transition probabilities, p of s_t plus 1 given s_t, a_t. We're gonna take this joint, and then we're gonna compute the conditional that we want, which is p of a_t given s_t and our evidence. So that's the thing that we want. And the way that we're going to do this is going to be using variational inference, because the s's and a's are continuous or huge, so doing this inference exactly is intractable, but variational inference gives us a nice optimization-based way to solve this inference problem, and it requires us to define a variational distribution. Remember that a variational distribution is just a distribution over all of our hidden variables that has some convenient form that makes inference easy. Usually, we choose the variational distribution to be something really simple, like a product of independent marginals. Here, I'm going to make a very peculiar choice for my variational distribution.
I'm going to choose it such that it looks a lot like my original distribution. Usually, if you're doing variational inference, this is not a very good idea, because the whole point of doing variational inference is to use a simpler variational family than your original model. But it turns out that in this very, very special case, this will lead to a very convenient solution. The way that I'm going to define my variational distribution is by using the same initial state probabilities and the same transition probabilities, and the only thing that's actually different is pi of a given s. So up here I have this exponential of r, and here I have pi. That's my variational family. This is just a picture of the variational distribution: you have this additional edge from s to a, and you no longer have the O's, because those are your evidence, and we don't put evidence in our variational distributions. So now the variational lower bound, which is what we optimize when we do variational inference, is simply the negative KL divergence between q and p. If you don't believe me, you can do a little bit of algebra; I promise you this is in fact a correct way to write the variational lower bound. And being a negative KL divergence, you can express it as simply the expectation under q of log p minus log q. And this is where the cleverness is gonna come in. The reason that we chose our q in this very peculiar way is that now we're gonna get lots and lots of cancellation. If we actually substitute our definition of q and our definition of p in here, we get a huge, huge equation, almost all of which cancels. If you're wondering what I did here, I just took the logs of these products, and the log of a product is a sum of logs, so that's where all this stuff comes from. And of course, because I chose the p of s_0 here to match the one here, that cancels out, the transition probabilities cancel out, and the only terms that remain are the log pi's and the rewards. This is all in expectation under q, which is this thing. So it's just equal to the sum over all my time steps of the expectation of r, plus the entropy of pi, which is exactly what we had on the previous slide when we talked about maximum entropy reinforcement learning. So the maximum entropy reinforcement learning objective is exactly the variational lower bound for this particular variational approximation, which means that we are in fact doing approximate inference in this graphical model. That's pretty neat. It means that our reinforcement learning procedure now actually corresponds to probabilistic inference.
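Here is the derivation that was just sketched verbally, written out in one place (q denotes the variational distribution, and the bound is stated up to the normalization constant of p):

```latex
% The model, with p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t)):
p(s_{0:T}, a_{0:T} \mid O_{0:T} = 1) \;\propto\;
  p(s_0) \prod_{t} p(s_{t+1} \mid s_t, a_t)\, \exp\!\big(r(s_t, a_t)\big)

% The variational distribution reuses the same initial state and dynamics:
q(s_{0:T}, a_{0:T}) = p(s_0) \prod_{t} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)

% Variational lower bound: the dynamics and initial state terms cancel, leaving
\mathbb{E}_{q}\big[\log p - \log q\big]
  = \sum_{t} \mathbb{E}_{q}\big[r(s_t, a_t)\big]
  + \sum_{t} \mathbb{E}_{q}\Big[\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]
```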
Now, how do we actually optimize that variational lower bound? Well, we have a few choices. One choice is to just use policy gradient. Policy gradient is great because you can just write down your objective, calculate its derivatives, and then do gradient ascent, and it'll work. It'll take a huge number of samples, but it will work. If you want to be a little more clever, you can actually modify Q-learning to optimize this objective, and it turns out to look like a procedure that greatly resembles message passing. So how do you modify Q-learning to optimize this objective? I'm not gonna prove this, I'm just gonna state the result, but you can look up the explanation for why this is true in this tutorial. Regular Q-learning performs updates like this: reward plus the max over the next Q. That's the Q-learning algorithm that we all learned from Richard Sutton's textbook or whatever first reinforcement learning class you took. If you want to modify Q-learning to solve this inference problem, all you do is replace this max with a seemingly cryptic log of the integral of an exponential. At first this seems like a really bizarre thing to do: what's a log of an integral of an exponential? But imagine for a second that your Q values are very, very big. If your Q values are very, very big, when you exponentiate them, the biggest of them dominates all the rest. When you sum them all together, that sum is entirely dominated by the exponential of the biggest value. So when you take the log, you essentially get back the biggest value. So if your Q values are very big, then the log of the integral of the exponential is very close to a max. And that makes a lot of sense, because if we're matching distributions, if we're doing this MaxEnt thing, and we have one choice that is extremely good and all other choices are extremely bad, then we'll just put all our mass there, and then it behaves like a max. But if all our choices are about equally good, then we wanna put probability mass on all of them, and then we have a very soft max here, which looks more like this one. So all you do is your regular Q-learning, except that instead of a max, you use the log of the integral of exponentials, and then you will actually optimize this maximum entropy RL problem. Okay. And then you'll do this nice distribution match.
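Here is a minimal sketch of that soft Bellman backup for a discrete-action problem; the temperature parameter is an assumption on my part (the lecture leaves it implicit), and this is not the exact implementation from any particular paper:

```python
import numpy as np
from scipy.special import logsumexp

# Q is an (n_states x n_actions) array of soft Q-values.

def soft_q_target(Q, r, s_next, gamma=0.99, temperature=1.0):
    # The only change from standard Q-learning: the hard max over next-state
    # Q-values becomes a "soft max", i.e. a log-sum-exp.
    v_soft = temperature * logsumexp(Q[s_next] / temperature)
    return r + gamma * v_soft

def maxent_policy(Q, s, temperature=1.0):
    # MaxEnt policy: probability mass proportional to exp(Q / temperature).
    logits = Q[s] / temperature
    return np.exp(logits - logsumexp(logits))
```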
A modern algorithm that instantiates this basic idea in more of a policy iteration framework, rather than a Q-learning framework, is called soft actor-critic. Some of you probably heard about soft actor-critic in Kate Rakelly's lecture. Soft actor-critic in fact solves that same inference problem. It consists of primarily two steps. There's a Q-function update, which updates the Q-function for a given policy. So this is not a Q-star update, this is a Q-pi update, and the only modification there is to incorporate that entropy term, the log pi. So this will converge to Q-pi for the given policy under the maximum entropy objective. And then there's a policy update step, which is basically the MaxEnt RL KL divergence objective: the KL divergence between pi and this exponentiated Q-function. And of course, in practice you do the usual actor-critic thing, which means you take one gradient step on this, one gradient step on that, and then repeat. If it were a proper policy iteration procedure, you would actually run this thing to convergence, and then run that thing to convergence, and then you interact with the world and collect more data, the usual RL thing. It turns out that the Q-function update is essentially doing message passing in your graphical model, and the policy update is fitting that variational distribution. So they actually correspond to parts of a well-defined inference procedure. This algorithm works decently well; it gets good results on a bunch of benchmark tasks. But maybe more interestingly, it actually exhibits some interesting properties that might suggest that it could alleviate some of those issues that I outlined on the first slide of the lecture. For example, you can use soft actor-critic, or in general soft optimality algorithms like soft Q-learning, to pre-train policies that can then be fine-tuned on more narrow objectives. This is not fully unsupervised pre-training yet, that will come later, but it's starting to get at that idea. So one of the things you can do is train a policy for this little quadrupedal ant robot with a reward function that says run as fast as possible. Now, let's think about this for a minute. If your reward function is just the norm of your velocity and you run regular reinforcement learning, what will it do? Well, it'll pick a direction arbitrarily and it will run in that direction. If you're doing maximum entropy reinforcement learning, your objective is to maximize your reward while being as random as possible. So let's imagine for a second what that would do with an objective that just says maximize the norm of your velocity. Okay. So if you've all thought about it, let's see what it actually does. This is the soft Q-learning method. There it goes. It's a Halloween special: this frightening explosion of spiders. But once you get over the initial scare, you'll notice that it's actually running in essentially a distribution of possible directions. It tries to run in as many directions as possible, because by doing that it can maximize entropy. Now, why is this useful, other than scaring people? You can take this thing and you can fine-tune it with a specific direction. So you can, for example, put it into a hallway, and then we're going to fine-tune it to run down this hallway. You'll see this one is initialized randomly, this one is initialized with deterministic RL, and this one is initialized with maximum entropy RL. The initialization from the deterministic RL knows how to run in one specific arbitrary direction, which is almost certainly not the direction of this hallway, because it's extremely unlikely that was the one it picked during pre-training. The maximum entropy one knows how to run in all possible directions, so all it has to do is unlearn all the wrong directions and just keep going in the right direction. So that allows the MaxEnt initialization to learn a lot more quickly in practice. And of course, if we look at the plots, that's pretty much what we see. The blue lines show fine-tuning for the MaxEnt initialization, and the green lines show fine-tuning for the deterministic reinforcement learning algorithm. Okay, I'll probably skip over the hierarchical policies, since I do wanna get to the fun unsupervised stuff, but maybe I'll briefly show the robot videos as a little intermission. So, besides generating videos with exploding spiders, we can use this to actually train robots to do interesting tasks in the real world. This is a soft Q-learning experiment where we used soft Q-learning to train this Sawyer robot to stack a Lego block. We weren't so much interested in whether or not it could stack the Lego block; I mean, we know how to do this part, this is not that difficult. What we wanted to see is whether the injection of that entropy maximization resulted in interesting behavior at test time. So what we did at test time, of course, is that we went in and we messed with it. We went in and we perturbed the robot to try to see how well it could react to our perturbations. And the reason that it can react to those perturbations is because during training you're maximizing entropy, which means that you are essentially teaching the policy to be resilient to as much injected noise as possible.
You can do this on other kinds of robots too. This is actually soft actor-critic running on a quadrupedal robot that is trying to learn how to walk. Initially, the robot doesn't know anything, so it sort of wiggles around randomly. Then, after some time passes, it begins making forward progress, and it basically learns in about two hours how to walk forward. Once learning is concluded, you can run the robot and it has a pretty plausible gait. And then again, you can start messing with it and see how much the entropy injection during training actually helped it acquire robust strategies. Of course, it's never seen slopes like this before, so it struggles a little bit, and if you give it a large enough perturbation it will eventually fail, but it has some degree of robustness, so it can walk downstairs. It can't actually walk upstairs, that will fail, but it can go downstairs. And it can very poorly play Jenga. [LAUGHTER] So, there we go. That's maximum entropy RL. Now, maximum entropy RL is not yet unsupervised reinforcement learning. We still have to give it a reward function, but subject to that reward function, it was learning a breadth of different behaviors. Next, we're gonna talk about how we can actually learn without any reward function at all. So let's think back to this scenario. You buy your household robot, you put it in your kitchen, you turn it on, and you let it work for a whole day, and then you come back in the evening and you give it a goal. And it's supposed to do something during its day that best prepares it for the goal that you're gonna give it. So what can it do? Well, before we talk about what it's gonna do during its day, let's actually first fast-forward and talk about what's gonna happen in the evening. In the evening, you need to give it an objective, and that means that you need an interface. Let's nail down a very simple interface: let's say that you're gonna show it a state that you want it to reach. So if it's doing things using camera images, then maybe what you're gonna show it is an image. If it's doing things using a vector of states of all the objects in the world, then you're gonna give it a desired vector of states of objects in the world. So you're gonna give it a goal state. I'm not necessarily gonna claim that this is a good interface, it's just a simple one that allows us to get started. It's a little weird to task your robot by giving it a photograph of the desired outcome, but at least it's simple and we can all understand it. So if that's the interface we're gonna use, then somewhere under the hood, your friendly robot needs to be able to compare two states. It needs to look at this picture, and it needs to look at this picture, and tell you how similar they are to each other. There are a variety of ways to do this, but the way that we're gonna try to do it is by building a latent variable model whose latent variables are a little bit more meaningful than the original observation space. If you compare the pixels of two images, the resulting distance doesn't really correspond to the physical proximity of the two states, but maybe you can embed those images into some latent space where distances are meaningful, and you have a variety of choices for how to do this. GANs will do this. We're gonna use something called variational autoencoders.
It's just another latent variable model, and there are many choices. Then we're gonna use the latent space of that variational autoencoder to compare two states. And since we actually have this latent variable model, the way that we're gonna get the robot to practice during the day is by generating samples from that latent variable model. So the robot has a model, it can sample from that model, it can essentially hallucinate potential images, set those as goals for itself, and then go and practice reaching them. That makes a lot of sense; I think this is a plausible way to practice goals. But there's a pretty substantial problem with this setup. Does anyone have ideas about what problem you might have if we go and do this? "It might [inaudible]?" Right. So one answer is that it might set easy goals for itself. Where's the latent variable model gonna come from? We have to train it on data. There's only one source of data, the robot itself, so it's gonna use its own experience to train that latent variable model. Maybe it did some random useless stuff. Maybe it broke the dishes, and now it has lots of data of breaking dishes. So it'll generate more goals of breaking dishes, and it'll just keep going in a loop, essentially doing the thing that it's done successfully before, setting those easy goals. So that's no good. What we need is some way to encourage it to diversify. So let's first outline this algorithm a little more precisely. We have our variational autoencoder. That variational autoencoder has a decoder distribution that I'm going to call p theta of x given z, so x is your state or image and z is your latent variable. It has a prior p of z, so if you want to sample, you first sample z from p of z and then x from p theta of x given z. And it also has an encoder, q phi of z given x. The way that training is going to proceed is that you're going to propose a goal. How do you propose a goal? You sample from your VAE: sample z from p of z, then sample the goal x_g from p theta of x given z. Then you're going to attempt to reach that goal using your policy. Your policy now is pi of a given x comma x_g, so it's conditioned on your current state and the goal. You're going to learn this giant policy that will attempt to reach all possible goals. Of course, your policy in general will not necessarily succeed; it won't reach that goal perfectly, it'll reach some other state, x bar. If it's a very good policy, then x bar will be very similar to x_g; if it's a very bad policy, then x bar will be some random thing. And then you're gonna learn: you're going to use your data to update your policy, and you're going to use your data to update your VAE, which means that you're going to change p theta and q phi. Now, in practice, the way that we would implement this is with a replay buffer: when you attempt to reach the goal, you just add this to your buffer, and then you use all of your data so far to update your policy and your VAE. But for now, for simplicity, let's think of this as an online algorithm, that will just keep things a little simpler, and then of course, you repeat.
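Here is a hedged sketch of the loop that was just outlined. The object interfaces (vae.sample_prior, vae.decode, policy.act, and so on) are illustrative stand-ins I'm assuming for the sketch, not the actual API of the system being described:

```python
def unsupervised_goal_practice(vae, policy, env, n_episodes, horizon):
    buffer = []  # replay buffer of (reached state, attempted goal) pairs
    for _ in range(n_episodes):
        # Step 1: propose a goal by sampling the generative model:
        # z ~ p(z), then goal x_g ~ p_theta(x | z).
        z = vae.sample_prior()
        x_g = vae.decode(z)

        # Step 2: attempt to reach the goal with the goal-conditioned
        # policy pi(a | x, x_g).
        x = env.reset()
        for _ in range(horizon):
            a = policy.act(x, x_g)
            x = env.step(a)
        x_bar = x  # the state actually reached

        # Steps 3 and 4: use the data to update the policy and the VAE
        # (theta and phi); in practice both updates draw on the whole buffer.
        buffer.append((x_bar, x_g))
        policy.update(buffer)
        vae.update([x_reached for x_reached, _ in buffer])
```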
So this easy-goals problem happens because we use the data that we obtained from this policy to update our goal proposer, which will then propose more goals that look like the states we reached, and we'll start going in a circle and not actually do anything interesting. Intuitively, what we'd like to do is up-weight those goals that seem more interesting to us. So if your robot lives in a 2D environment, and these little circles denote the states that it has visited, you have a big clump of stuff over here and you have these really rare things that you reached out on the fringes. Intuitively, what you'd like to do is somehow up-weight these rare things and do those more often, because those are probably the harder, more interesting goals. So maybe if you somehow up-weight them, then fit your model, and then propose from that model, when you go to reach those goals you might actually end up putting more mass onto the tails of your distribution and gradually expand. And we'll make this very vague notion a little bit more precise later. In order to do something like that, all we really have to do is modify this last step, because step four is the step that actually uses your data to change which goals we propose in the future. We need to modify step four so that if the state x bar that you reached was very interesting, it gets a bigger weight, and if it was very boring, it gets a smaller weight. Now, interesting and boring here means: is it a state that you've reached often, or is it a state that you've reached rarely? So that's this step. Here's how you can do it. Standard maximum likelihood estimation will basically train your VAE to maximize some approximation to the log likelihood of x bar. Now, if it's a VAE, you're actually using a variational lower bound to approximate this, because you can't evaluate log p of x bar exactly, but it's some approximation to log p of x bar. Weighted maximum likelihood estimation, exactly as the name might imply, will just put a little multiplier on this, and this multiplier is some function of x bar. By choosing that multiplier carefully, we can create this effect where the interesting or rare outcomes are up-weighted and the uninteresting or common outcomes are down-weighted. All we do is use our previous density estimator, our p theta of x bar or some approximation thereof, raised to some power alpha, where alpha is between negative 1 and 0, as the weight on our likelihood, so the weight is smaller for larger probabilities. That means that if a particular x bar is extremely probable, it's in this really dense region, then when you raise it to the power, let's say, negative 0.5, it'll get a much smaller weight. If it's very rare, its probability is something like 0.0001, then you raise it to some negative power and you get a really big weight. So that'll do exactly what we wanted: we'll up-weight these things in the tails. And the interesting result that you can prove, which is described in the second paper, Dalal, Pong, Lin et al., is that for any alpha between negative 1 and 0, following this procedure actually increases the entropy of p theta of x.
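A minimal sketch of that weighting, assuming we have (approximate) log densities for a batch of reached states; the normalization of the weights is my own choice for the sketch, not a detail from the lecture:

```python
import numpy as np

def skew_weights(log_p, alpha=-0.5):
    # w(x_bar) = p_theta(x_bar) ** alpha with alpha in (-1, 0):
    # rare states (low log_p) get large weights, common states get small ones.
    log_w = alpha * np.asarray(log_p)
    w = np.exp(log_w - np.max(log_w))   # rescale for numerical stability
    return w / w.sum()

# Weighted maximum likelihood then just multiplies each state's (approximate)
# log-likelihood term by its weight before summing, e.g.
#   loss = -(skew_weights(log_p) * approx_log_likelihoods).sum()
```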
Now, if you increase the entropy every single step, and you can prove that it doesn't asymptote to something suboptimal, then eventually you get full entropy. You get the highest entropy distribution that you can have, and the highest entropy distribution that you can have is the uniform distribution over all valid states. Question? [inaudible] Right. So the question was, at each step, why don't you just uniformly sample from the state space? If you could uniformly sample from the state space, that would be a better algorithm. The trouble is that our observations or states here, these x's, might be, for example, images from the robot's camera. Uniformly sampling images by itself doesn't actually produce uniformly sampled valid states, because most random images are just static, they're white noise. What you want is uniformly sampled valid states, and that can be a really complex manifold in the space of all possible x vectors. So essentially, you can view this procedure as a really fancy way of discovering that uniform distribution. Any other questions? Okay. So now we've defined how this step is going to work, we just use this w which comes from our previous density estimator, and now we can ask the question: what is the objective for this procedure? We kinda cooked up this recipe, but what is this recipe doing? Is it actually optimizing some well-defined objective? And that's where things get a little bit interesting. If your entropy increases every step, you can actually show, under a few additional conditions, that what you're actually optimizing is the entropy of your goal distribution. So this p of g, this is the distribution of your goals, which are coming from your VAE. You're increasing that entropy, so you're maximizing that entropy, and you can show that this is actually true. But that's not all that you're doing. You're not just training a VAE, you're also training a policy. Your goals get higher entropy due to this Skew-Fit procedure, but what does RL do? Well, your policy pi, and I'm gonna rewrite it as pi of a given S, G just to make it very explicit that there's a state and there's a goal, is being trained to reach goals. If it does a very good job of reaching those goals, then it will reach a final state S that is close to the goal, and in fact might even be equal to the goal. What that means is that as your policy gets better and better, it becomes easier and easier to predict what the goal was from looking at the state that it reached. So if the policy is very good, if it always reaches the goal, then G is equal to S, and p of G given S is a very low entropy distribution. Now, where have we seen that before? We're making one entropy big and we're making some conditional entropy small. So if that's our objective, that's exactly the mutual information between S and G, and that's kinda cool. Somehow we had this cooked-up procedure involving VAEs and maximizing entropies and up-weighting things, and all that it turns into is just the mutual information between the last state that you reach in your trajectory and your goal. If you maximize that mutual information, then you end up successfully reaching a wide range of goals.
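In symbols, the objective that this procedure ends up optimizing is:

```latex
% Skew-Fit makes the goal distribution broad; good goal reaching makes the goal
% easy to infer from the final state.
\mathcal{I}(S; G) = \mathcal{H}\big(p(G)\big) - \mathcal{H}\big(p(G \mid S)\big)
```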
So maximizing mutual information between S and G leads to good exploration, because you're trying to make that first entropy term big, and it leads to effective goal reaching, because you're trying to make the second entropy small. And of course, there are other ways to approximate that mutual information, of which this algorithm is just one. Any questions about this? Okay. Again, I'm contractually obligated to show you robot videos. So we ran this thing on a robot, and here's what we have to show you. At zero hours, what is this robot doing, first of all? It's a robotic arm, and it's placed in front of a door. You might think that it's told to open the door; it's not told to open the door, it's just placed in front of the door, but that's a very suggestive way to set up your robot. So of course, initially what it's going to do is wiggle its hand a little bit, because that's the most obvious thing to do for an untrained reinforcement learning algorithm. But pretty quickly it figures out that hand wiggling results in images that have a high probability under its VAE, and it starts proposing all sorts of other goals that involve actually interacting with the door, because that allows it to achieve outcomes that have a low probability under its VAE. Then that probability increases, and it opens the door to all sorts of different angles. At the end, once it's all trained, it can actually open this door to any angle that you want, without actually being told in advance that door opening is what it's supposed to do. So you can think of this as a very, very limited version of that household robot, where it's not placed in a messy kitchen, it's placed in this admittedly rather pristine laboratory setting, but maybe it's a step in the right direction. And you can test it, you can give it different goal states. So here, a person is actually giving it pictures, and we're just testing to make sure that it can open the door to every desired angle. All right. So that was learning with goals. Now, the next thing that I'm going to talk about is a state distribution matching formulation of reinforcement learning that generalizes these ideas a little bit further. Remember, before, I talked about this MaxEnt RL thing, which matches action distributions, and I also talked about how we can maximize the entropy of possible goals. So can we somehow combine these ideas to match state distributions? So instead of just making your state distribution as broad as possible, can you match a desired distribution in state space rather than in action space? That's potentially quite useful, because then you can go and explore diverse states under a prior over good states. So maybe your household robot needs to be told in advance: I don't care about states where my dishes are broken, I don't care about states where you threw my vase out the window; all the other states, those I care about. So do everything except for breaking my dishes and throwing stuff out the window, and then you're gonna have a much more coherent exploration strategy. Before I describe how to do this, as an aside, let me describe something that does not solve this problem. It might seem like it solves this problem, but it doesn't, and it's something called intrinsic motivation. This is a common method for doing exploration in non-information-theoretic settings.
You incentivize your policy pi of a given s to explore diverse states before it sees any reward, and the way that you do it is by rewarding it for visiting states that are novel. At first this seems very, very similar to what I discussed before, but it actually leads to very different outcomes. If a state is visited often, it's not novel; if a state is visited rarely, then it is novel. And the way that we do this, this is a very standard reward bonus exploration scheme, is by adding a little term to our reward function based on the negative log p pi of s. So you estimate how often your policy visits each state and then subtract those log probabilities from your reward, and that'll get your policy to receive larger rewards for states that it has visited rarely. So you update your policy to maximize the expected value of this modified reward, update your p pi, your estimate of the state marginal, and then repeat. And what does this do? Well, let's say that initially your policy does this. You fit your state distribution and you get this green circle; it's a little faint on the slide, but there's a little green circle that I drew around this thing. Then anything inside that green circle gets a lower reward, because it has higher probability, and everything outside that circle gets a higher reward, because it has lower probability. So what does the policy do? Well, it goes somewhere else, and then your distribution changes, and then it goes somewhere else, and so on. So your policy does end up going to lots of different places, but at any given point in time, your policy is not actually covering the space, it's just going to some arbitrary location; it'll go somewhere else next time. The distribution keeps chasing your policy and your policy keeps changing, and at no instant in time do you actually have a policy that covers the space. So it kind of seems like it's very close to the right answer, but it's not quite the right answer; there's something missing. Can we fill in this missing piece? Can we somehow change this intrinsic motivation method so that it actually matches a desired state distribution? It turns out that it's not actually that hard. The state marginal matching problem is to learn a policy pi so as to minimize the KL divergence between that policy's state marginal and a desired state marginal. So you have a target density, and our idea is to somehow utilize this intrinsic motivation principle. We're going to define our reward function as the log of our desired probability minus the log of our current probability. This is starting to look a lot like a KL divergence, and it's also starting to look a lot like intrinsic motivation. Now, this by itself doesn't yet perform marginal matching, because your policy keeps getting chased around like this and never actually ends up matching the distribution, but it's very close. So the procedure is: you learn your pi_k, and I'm going to put the index k to denote the policy at iteration k, to maximize the expected value of your reward r_k. The subscript on r_k means that the p pi in it is actually p pi_k. Then you update your distribution, fitting the state marginal to get p pi_k plus 1, and then you repeat. And let's say this orange circle is my desired distribution. By itself, my policy will sort of try to escape from my distribution; it will keep getting chased around and never actually match the thing.
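For reference, here are the two reward constructions just described, written out:

```latex
% Intrinsic motivation / novelty bonus: reward states the policy rarely visits
\tilde{r}(s) = r(s) - \log p_{\pi}(s)

% State marginal matching: pseudo-reward built from the target density p^*(s)
r_k(s) = \log p^*(s) - \log p_{\pi_k}(s),
\qquad
\mathbb{E}_{s \sim p_{\pi}}\big[\log p^*(s) - \log p_{\pi}(s)\big]
  = -D_{\mathrm{KL}}\big(p_{\pi}(s) \,\|\, p^*(s)\big)
```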
So what if instead we update our distribution p Pi_k to fit all the states seen so far? That will take care of one side of the problem, because now we're not just matching the distribution of the current policy, we're matching the distribution of all policies. And then what we return at the end is not the last Pi_k, but the mixture of all the policies seen so far, because remember, these policies were running all over the state space. It went here, it went there, it went there; no one policy covers the space, but the ensemble of all policies together does actually cover the space. And it's not just a vague intuition: it turns out you can actually show that this ends up being the Nash equilibrium of a two-player game, where one player is trying to optimize this objective and the other one is trying to match the distribution. This is a technique called self-play, or what some refer to as historical averaging, where if you're optimizing these objectives in alternating fashion, the way that you actually recover the equilibrium, which has to be a stochastic policy, is by averaging together the iterates. So this is just a theorem that you can look up. A very simple modification to intrinsic motivation turns out to actually match distributions. So it's a game between two players, and the players are Pi_k and p Pi_k. And of course, there's a special case of this where, if you choose p_star to be a constant, meaning the uniform distribution, then you're just maximizing state entropy, and you recover the algorithm from the previous section.
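Here is a small, self-contained toy sketch of the loop just described, on a made-up 1-D chain of states. The environment, the crude "policy improvement" step, and the histogram density model are all stand-ins chosen for illustration, not the actual method from the paper; the point is just the structure: fit the density to all states seen so far, use log p_star minus log p_pi as the reward, and return the mixture of iterates rather than the last policy.

```python
import numpy as np

# Toy sketch of state marginal matching with historical averaging on a 1-D
# chain of states. The environment, the "policy improvement" step, and the
# histogram density model are illustrative stand-ins.

n_states = 10
p_star = np.ones(n_states) / n_states            # target: uniform over states

def rollout(policy_probs, n_episodes=200, horizon=20):
    """Collect visited states; at each step move left/right per the policy."""
    visits = []
    for _ in range(n_episodes):
        s = 0
        for _ in range(horizon):
            step = np.random.choice([-1, +1], p=policy_probs[s])
            s = int(np.clip(s + step, 0, n_states - 1))
            visits.append(s)
    return np.array(visits)

def improve_policy(reward):
    """Crude RL step: prefer moving toward the neighbor with higher reward."""
    probs = np.zeros((n_states, 2))
    for s in range(n_states):
        left = reward[max(s - 1, 0)]
        right = reward[min(s + 1, n_states - 1)]
        logits = np.array([left, right])
        probs[s] = np.exp(logits - logits.max())
        probs[s] /= probs[s].sum()
    return probs

policy = np.full((n_states, 2), 0.5)
policies = []
history = np.zeros(n_states)

for k in range(20):
    visits = rollout(policy)
    history += np.bincount(visits, minlength=n_states)
    p_pi = (history + 1e-6) / (history + 1e-6).sum()   # density over ALL states seen so far
    reward = np.log(p_star) - np.log(p_pi)             # r_k(s) = log p*(s) - log p_pi_k(s)
    policy = improve_policy(reward)
    policies.append(policy.copy())

# Return the mixture of all iterates, not the last policy. Averaging the
# per-state action probabilities is a crude summary; the more faithful mixture
# samples one stored policy uniformly at random for each episode.
mixture = np.mean(np.stack(policies), axis=0)
```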
Okay. So does this work? The answer is that it basically does. So this is a comparison between Soft Actor-Critic, which is a maximum entropy reinforcement learning algorithm that matches action distributions, versus the state marginal matching method, which matches state distributions. And there's a heat map visualizing this little quadrupedal robot running around in this little funny world with three hallways, and you can see that the heat map for state marginal matching is much more uniform. This is trying to match a uniform distribution. I'll skip over this. Here it's trying to match kind of a funny lopsided distribution: it's trying to move this little block to different places, and we tell it we want the left side to be twice as likely as the right side. Now, of course it can't match that perfectly. The red line is state marginal matching, and it has to contend with the dynamics of the real world, so you can't go to the left side without passing through at least a little bit of the right side. But you can see that it's skewed to the left: a lot more probability mass on the left side, a lot less on the right side. So it seems to be doing something. All right. But the next question that I want to ask, and I think it's an important one, is: is this actually a good way to explore? Intuitively it kind of makes sense. You want to see lots of different things in the world, so you want good coverage, so you want to maximize your state entropy. But is it actually a good idea to maximize state entropy? Is state entropy really a good objective, and if it's a good objective, what is it a good objective for? So, just to recap what we saw before, Skew-Fit does this mutual information between states and goals, which corresponds to maximizing goal entropy and minimizing goal-given-state entropy. State marginal matching in the special case where you choose p_star to be a constant also maximizes state entropy. So it seems like this is something we kind of know how to do; they're more or less the same thing. But when is it a good idea? So one place where it's a good idea, and I'm going to call this Eysenbach's theorem. It's not actually a theorem, it's just that Ben Eysenbach was the one who told me about it, and it follows trivially from a classic maximum entropy modeling result, so don't take the name, or the fact that I call it a theorem, too seriously. It says that if at test time an adversary will choose the worst goal, meaning that when you come home after letting your robot run all day, you know what your robot is good at and you're going to choose the worst possible thing, you're like the world's worst robot owner. If that's what you're going to do, then the best goal distribution to use for training is actually the one with the largest entropy. So if the robot knows that you're going to be the worst possible robot owner, then the optimal thing for that robot to do, without any other knowledge about what kind of goal you might propose, is to maximize the entropy of goals. And this follows directly from the equivalence between maximum entropy modeling and the adversarial formulation where you maximize p and minimize q. So if you read about maximum entropy models, it's basically a trivial corollary of that. If you want to read a little bit more about this idea, it's summarized in this paper called Unsupervised Meta-Learning for Reinforcement Learning, which I'll come back to later. But the short version is that there is actually a principled reason to suppose that maximizing state entropy is good, because it makes you robust in the face of the worst-case human, so to speak. And another excellent paper to read about some theoretical reasons why high state entropy might be a good idea is this paper called Provably Efficient Maximum Entropy Exploration by Hazan et al. So if you're interested in some of the theoretical foundations for this, I highly recommend that paper. Okay. So the next algorithmic idea I want to discuss a little bit is how we can go beyond just covering states and actually think about covering the space of possible skills. So far, all of the unsupervised RL algorithms we've talked about are attempting to reach goals or reach states, maximize state entropy, that sort of thing. But there's more to behavior than just reaching states. So next, I'm going to talk about learning diverse skills. Now, I'm going to define "skill" in a very peculiar way. Don't take this too seriously, but I need a term for it, so I'm going to call it a skill. I'm going to have a policy Pi that gives a distribution over actions, conditioned on states and conditioned on Z. Before, this Z was a goal, but now it's just going to be a task index. What's a task index? It's an integer. It's going to be like zero, one, two, three. And intuitively, if you want diverse skills, then what you want this Pi to do is to do different things for different values of Z. So maybe if you give it a z of 0, you should have it go up here; if you give it a z of 1, it should go over here; a z of 2 should go there. So basically, different z's should do different things. Okay. That kind of makes sense.
Now, can we do this with maximum entropy reinforcement learning, or with that goal-reaching thing? Well, maximum entropy RL just does action entropy, and action entropy is not the same as covering the space of skills. You'll take different actions, but you might actually end up in very similar places from taking very different actions. So that's not quite what we want: you can take different actions and land in similar states. Reaching diverse goals is not the same as performing diverse tasks either, and that's a little bit more subtle. The reason is that not all behaviors can be captured by goal reaching. Let me give you an example of a behavior that cannot be captured by goal reaching. This is my favorite quadrupedal robot. It needs to go to this green location and it needs to avoid this red thing at all costs. So maybe the red thing is the fire pit of death; that's where you're not allowed to go. The green thing is your favorite object in the world. If you just try to reach the green state, then you'll go straight through the bad red thing. So you cannot frame this as a pure goal-reaching problem. Now, this task can still be defined in terms of states; it's just that the state distribution that you want is more complicated. There are some states that you never want to land in, and some other states that you want to land in as often as possible. And then to go back to the MaxEnt stuff: MaxEnt RL policies are stochastic, but they're not always controllable. If you remember that ant running in all possible directions, we couldn't really tell it which direction to run; it just randomly runs in random directions. Here we want something very controllable: we want different task indices to correspond to different coherent behaviors. So the intuition behind the method that I'm going to describe is that different skills, different values of z, should visit different state-space regions. Not necessarily different goals, but different regions. And this generalizes the notion of goals. So here's how we can do it. We're going to use a diversity-promoting reward function. When you train your policy, you're always going to train it the same way: you're going to train it with RL, maybe maximum entropy RL, but some kind of reward-maximizing thing, maybe with an entropy term. And the thing that you get to choose is the reward. So we're going to train our policy to be the argmax of some expected reward, and of course we have many different z's, so we'll sum that over all possible Z's. The choice we have to make is the choice of r, and our r will depend on Z. So different tasks have different rewards, but of course we can't design them manually; we have to somehow acquire them automatically. So we'll make a very simple choice: we'll reward states that are unlikely for the other Z's. We'll say you get a big reward if you go to a state that's very likely for the current Z and very unlikely for the other Z's, and that will encourage the Z's to specialize. So the way that you do this, if you want to think about how it can be done mechanically, is you train a classifier. You train a classifier that predicts p of z given s. What is p of z given s? Well, z is an integer and s is a state, so you can treat z as basically a categorical label, and p of z given s classifies which integer was fed into your policy such that it resulted in visiting the state s.
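One concrete way to write "likely for the current Z, unlikely for the other Z's" as a reward, and I believe this matches the form used in the Diversity is All You Need paper he names later, though the exact expression here is a reconstruction rather than a quote from the slides, is:

\[
r(s, z) \;=\; \log q_\phi(z \mid s) \;-\; \log p(z),
\]

where q_phi is the learned classifier and p(z) is the distribution the skills are drawn from. If z is sampled uniformly, the second term is just a constant baseline, so the policy is rewarded exactly for visiting states from which the classifier can recover which skill it was given.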
So if I see you going somewhere because somebody gave you a command, what I'm trying to guess is what command you were given that caused you to go over there. So then you can imagine this whole system, the training of the classifier, the training of the policy, and so on, as a loop where you have your policy or agent that takes actions. Those go to the environment, the environment responds with states, the policy at the first timestep of every episode is given a skill, an integer, a z, and the state goes to this discriminator, this classifier, which attempts to guess what z was given to the policy. And they are colluding; they're cooperating with each other. The policy is trying to do the thing that makes the classifier get the right answer, and the classifier is trying to look at what the policy did and guess the right answer. So now let's think about what this interaction will actually do to the behavior of the policy. Let's say that I have only two values of Z, just for simplicity. So I have a green value and a blue value. Initially, you have a random policy, and whether you condition on green or on blue, it kind of does the same random stuff. But of course, just due to random chance, sometimes the green lines will be a little bit further this way and the blue lines a little further that way. So when you train your classifier, the classifier will say, I have no idea what's going on, but if you really want me to draw a decision boundary, I'm going to draw this one, because there's a little more green stuff over here and a little more blue stuff over here. It really has no idea, but it's saying, maybe this is like 51% versus 49%. So there's just a little bit of a tiebreaker through random chance. When the policy gets to go, now the policy is trying to help this classifier. It's saying, oh, you think this side is 51% green and that side is 49% green? Okay, I can do that. I can send more green stuff here and more blue stuff there. And then the classifier gets to go again, and it will draw the line a little more cleanly, and then the policy gets to go and it will separate out even more, and so on. Eventually, this process will cause the policy and the classifier to agree on a separation of the state space. And of course, this is an example with two skills, but now imagine you did this with like 2,000 skills. Then you'd have skills that go to all the different places in the world and also avoid all the different places, and basically do all possible combinations of state-space region coverage.
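Here is a rough sketch of that loop in code. The skill-conditioned policy, the rollout function, the environment, and the RL update are hypothetical placeholders (any RL algorithm would do there); only the discriminator and the reward computation are spelled out, and the reward follows the form given above.

```python
# Sketch of the skill-discovery loop: sample a skill, roll out the
# skill-conditioned policy, train a classifier to guess the skill from the
# visited states, and feed log q(z|s) back to the policy as reward.
# `SkillPolicy`, `rollout`, `rl_update`, and `env` are placeholders.
import numpy as np
import torch
import torch.nn as nn

n_skills, state_dim = 20, 8

discriminator = nn.Sequential(          # q_phi(z | s): state -> skill logits
    nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_skills))
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
ce = nn.CrossEntropyLoss()

policy = SkillPolicy(state_dim, n_skills)    # pi(a | s, z), placeholder

for iteration in range(10_000):
    z = np.random.randint(n_skills)          # uniform z keeps H(Z) maximal
    states, actions = rollout(env, policy, z)

    # Diversity reward: how confidently can the classifier recover z from s?
    with torch.no_grad():
        logits = discriminator(torch.as_tensor(states, dtype=torch.float32))
        log_q = torch.log_softmax(logits, dim=-1)[:, z]
    rewards = log_q.numpy() - np.log(1.0 / n_skills)   # log q(z|s) - log p(z)

    # The policy tries to make the classifier's job easy...
    rl_update(policy, states, actions, rewards, z)

    # ...and the classifier tries to guess which z produced these states.
    logits = discriminator(torch.as_tensor(states, dtype=torch.float32))
    labels = torch.full((len(states),), z, dtype=torch.long)
    loss = ce(logits, labels)
    disc_opt.zero_grad(); loss.backward(); disc_opt.step()
```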
Okay. Any questions about this method? Yes. Is it possible to do this with, like, meaningful latent features of some kind? That's a really interesting question. So the question, for those of you who didn't hear, is: is it possible to do this with some kind of meaningful latent features? I think that could be possible, yes. You could imagine, for example, doing something where you impose structure on the Z's. You could, instead of having the Z's be integers, have them be vectors of categorical variables, almost like vectors of attributes. You could supervise some of the Z's; you could say, well, these entries correspond to these semantically meaningful features, and the other entries could correspond to whatever you want. So you could probably impose a lot more structure on the Z's. We've tried a little bit of that, but not very much. I can tell you a little bit more after the lecture if you want. Any other questions? Yeah. Don't different Z's just end up corresponding to different states? Right. [inaudible] Yeah, yeah. So the question was: well, I made a big deal out of how goal reaching is not the same as skills, but now I'm telling you that different skills correspond to different states, so what's up? It's a very good question, and the answer is a little subtle, because while this is trying to make different tasks correspond to different states, it's not trying to make different tasks correspond to different goals. So you could have, for example, a state visitation profile for this blue line which is: avoid this region and end up in that region. That's a little bit more elaborate than just reaching goals. Now, there are other versions of this kind of method that could be formulated. For example, you could imagine training your classifier to predict Z not conditioned on a single state but conditioned on an entire sequence of states. Then you have non-Markovian rewards, which is a little tough, but then you would actually get this thing to separate the policy into all sorts of different trajectories, which would maybe be a little closer to what you might want. There's a subtlety here, though, which is that different trajectories might also not be exactly what you want, because you can have two trajectories that look very different but achieve very similar outcomes. So there's a choice that you have to make. In fact, people have proposed variants on this objective. There were a couple of other papers; I think I have a citation for that somewhere. Let's see where I have that. Okay. So here there's a citation to Gregor et al., Variational Intrinsic Control. If you check out that paper, it actually proposes a slightly different discriminator, which is conditioned on the current state and, like, the last state or something. And you can do all sorts of combinations, conditioning on trajectories, current state, future state, whatever, and they all end up with slightly different definitions of what a task is. Okay. Any other questions? Okay. Let's look at what this thing does. So we can train it on everybody's favorite Half Cheetah task. I'm not sure whether you guys had to do any of the MuJoCo RL things in this course. Raise your hand if you know what Half Cheetah is. Does it mean anything to you... okay. Okay. Good. [LAUGHTER] So the Half Cheetah is supposed to run; that's its goal in life. But if you don't know what the reward function for Half Cheetah is and you just run this unsupervised skill extraction method, sometimes you get skills that run in a very festive way, and sometimes you get skills that actually run backwards. It's very easy to tell whether you're running backwards or forwards, so the discriminator really likes that. But sometimes it'll also do, like, a weird front flip, because that's also very easy to distinguish from running forward and backward. So basically, it tries to do the most extreme things that are the easiest to distinguish. Here is the ant. The ant has a lot more different things that it could do, so it ends up doing these kind of funny wiggles, where sometimes it'll walk in, like, a letter S shape, sometimes a letter J shape. So it sort of walks in different wiggles.
If you run this on a really simple task like mountain car, well, mountain car is a pretty simple task; there are only so many things you can do. So usually you get one of the extracted skills corresponding to just outright solving the task. So this skill is just doing the task, it's basically reaching the goal, and the other skills are doing other weird stuff, like this one, which appears to be trying to go as fast as possible. So something is actually working, and the skills are actually distinguishable from one another. You can also use this for hierarchical RL: you can train up some skills for the cheetah and then make the cheetah do hurdles by treating the skills, the different Z's, as different actions for a higher-level policy. The same with the ant: you can take the skills for the ant and then train a higher-level policy that directs the ant to different places. [NOISE] So these are some of the learning curves for the hierarchical version. But perhaps for the purpose of today's lecture, one thing I do want to make sure to cover is the connection between this and all these other information-theoretic ideas that I touched on. So recall that our objective is to maximize the expected value of our reward, and our reward is log p of z given s. It turns out that this is also maximizing mutual information, but now instead of goals, you have Z's. So it's maximizing mutual information between z and s. Remember that mutual information is a difference of two entropies: the entropy of Z minus the entropy of Z given S. The first term, back when we had goals, was very hard. Now it's very easy, because these Z's are integers. How do you maximize the entropy of Z? You just pick your integers uniformly at random. If you pick them uniformly at random, that first term is maximized. So that's really easy; it's actually only the second term that's hard. [NOISE] And the second term is of course minimized by maximizing log p of z given s. If your classifier is perfect, then that second term has no entropy. So just make your classifier really, really good. Conveniently enough, that's exactly what the whole method is doing, because the classifier is training itself to be really good and the policy is changing itself so the classifier has an easier time classifying its states. So this method is also maximizing mutual information between z and s, where z is basically the replacement for goals now.
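Written out, the decomposition he is describing, together with the standard variational bound that justifies using the learned classifier in place of the true posterior (the inequality is the usual Barber-Agakov bound, added here for context rather than quoted from the slides), is:

\[
I(Z; S) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \;\ge\; \log |\mathcal{Z}| \;+\; \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\!\big[\log q_\phi(z \mid s)\big],
\]

where the first term is held at its maximum by sampling z uniformly from the |Z| skills, and the second term is exactly the reward the policy is maximizing.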
Okay. So the last thing I'm going to talk about, and this is going to be quite brief, is an application of this idea to meta-learning. So far, we've just talked about the unsupervised part, like how do you discover these different skills? But what if you actually want to use this in a meta-learning setting? I don't think I need to give a quick summary; you should all know what Meta-RL is, right? So I could just skip this. I think if you haven't been paying attention, then maybe it's too late for that. But let's talk about something else. [NOISE] Let's talk about the bad stuff. Let's talk about meta-overfitting, which is about as exciting as it sounds. So meta-learning requires task distributions. And where do the task distributions come from in meta-learning? Well, they come from you, right? You have to build them. If you're doing reinforcement learning, maybe you figured out that a good set of tasks to meta-train on is tasks that require the agent to run in different directions. And if you were smart enough to design those tasks, define rewards for all of them, and then use them for meta-training, then you might be able to adapt quickly to new tasks. So you might run something like MAML, and then the ant does some cool stuff where it runs in place, ready to go, and as soon as it's given even a tiny bit of reward for running in a certain direction, it immediately figures it out. But of course, that's only if you chose the right meta-training tasks. When there are too few meta-training tasks, you can meta-overfit, which means that you become really good at recalling your training tasks but really bad at figuring out new tasks. Just as with overfitting in supervised learning, you can have meta-overfitting in meta-learning. So specifying task distributions can actually be a very delicate choice, and it can be quite difficult, especially for reinforcement learning, because there the tasks correspond to reward functions which you have to program by hand. So it'd be very nice to be able to propose tasks automatically, [NOISE] so that you get as many tasks as you want and avoid meta-overfitting. Of course, you have to propose them properly, so that their distribution has something in common with the tasks that you'll want at test time. So you can imagine a kind of pipeline for unsupervised meta-reinforcement learning, where you're given a particular environment without a reward function, you automatically acquire some tasks, basically invent some things to do in that environment, and then meta-learn on those tasks, so that at test time, when somebody comes along and gives you a reward function, you can adapt very quickly to that reward function, without having received any reward functions during meta-training. So what does it mean to propose tasks? Well, it means making up some rewards r of s, z for the different tasks z, so different Z's should have different rewards. And Meta-RL just means: train your model to maximize the expectation, over all of your tasks, of the reward of the policy after the policy has been updated on that task. That's just the definition of Meta-RL. So how can you propose tasks? Well, you can do exactly the thing I described before: you can choose your r of s, z to maximize the mutual information between z and s, basically using any of the techniques covered in the lecture up to now. And there's a crucial difference here, which is that before, the result was a policy conditioned on the task. Before, the result was always something like Pi of a given s, g, or Pi of a given s, z. But if we're doing meta-learning, that's not what we're after. In meta-learning, what we're after is a policy that can adapt to rewards. So if we use those tasks for Meta-RL, then what we get is not a Pi of a given s, z; what we get is a policy that can learn from rewards. And that's really nice, because now we can define new tasks for this thing just by defining a reward. We don't even have to know what the Z's mean. So this basic idea does actually work. So this is an experiment that uses the skill discovery method, the mutual information between s and z method, called Diversity is All You Need, or DIAYN, in an unsupervised Meta-RL pipeline.
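For reference, one standard way to write the objective he just described verbally, with the automatically proposed rewards plugged in (the notation, in particular the adaptation operator f_theta, is shorthand rather than anything from the slides; for MAML-style Meta-RL, f_theta would be one or a few policy gradient steps on the proposed reward), is:

\[
\max_\theta \;\; \mathbb{E}_{z \sim p(z)}\Big[\, R_z\big(f_\theta(r_z)\big) \,\Big],
\qquad
r_z(s) \;=\; r(s, z) \;=\; \log q_\phi(z \mid s) - \log p(z),
\]

where R_z(phi) is the return of the adapted policy pi_phi under the reward r_z, and f_theta(r_z) denotes the policy parameters obtained after the meta-learner adapts to that reward.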
And what these comparisons show: the green curve shows the performance at adaptation time, so that's meta-testing, where the meta-training was done using task proposals from that mutual information method. The red line shows RL from scratch, and the blue line, that's a pretty funny one: it shows a setting where the meta-learning was done by proposing random tasks. The random tasks were proposed by initializing the discriminator with random weights, so it's sort of a random chopping-up of the space. The interesting thing is that even the random proposals actually work decently well in some cases: they work well for this simple 2D navigation task, and they work decently well for the ant. They don't seem to work so well for the cheetah, for some reason. The DIAYN ones are a little more consistent, but the main point is that these kinds of unsupervised proposals can actually get you some decently meaningful meta-learning models. So this is potentially a viable way to address the meta-overfitting issue. Okay. Well, that's it for the lecture. I'd be happy to take any questions from you guys, about either the last part, the unsupervised Meta-RL, or anything that preceded it. So if you have any questions, now's the time to ask. Yeah? Uh, in the [inaudible] [OVERLAPPING]. Yeah. [OVERLAPPING] Any ideas for how to use information theory to make that adjustment? [OVERLAPPING] Yeah, that's a really good question. So the question was: I picked the number of skills, well, I didn't pick it, Ben Eysenbach picked it [LAUGHTER], so that's my excuse, somewhat arbitrarily. Is there a better way to do it? I don't know. We did try something that might seem pretty obvious, which is to just use a continuous space of skills. That really didn't work, and we're not sure why. We also tried somewhat less obvious things, like using a combinatorial space of skills. So instead of having an integer, you might have a vector of binary values. If you have, say, 32 binary values, you can have 2 to the 32nd power different skills, still a discrete set, but a very large number, and that actually works. But I don't know how to select that number in a principled way. Yeah? So, um, around this idea of [inaudible] mutual information, we can optimize it in two ways, right? We can introduce [inaudible] a variational distribution [inaudible] introduce a forward model. [OVERLAPPING] And I don't know if you've [inaudible]. And when you introduce the forward model, we get this optimization with a reward that depends on the forward model, right? [OVERLAPPING] Then you talk about this other idea where we use the probability of a particular goal we reached as intrinsic motivation, [OVERLAPPING] which sounds really similar to this idea of curiosity. [OVERLAPPING] Um, and one thing I've been trying to reconcile about these two methods of intrinsic motivation is that, if you take mutual information with a forward model [inaudible] [OVERLAPPING] so you can take curiosity and other stuff like [OVERLAPPING] using the novelty of the states as your reward. Basically, [inaudible] over the state distribution, they optimize exactly the opposite thing, [OVERLAPPING] but it seems that both of these things [NOISE] create highly structured, um [OVERLAPPING] Right. Yeah. [OVERLAPPING] This is a very complex question that I could probably give you maybe a 20-30 minute answer to, but I'll give you the short answer. Well, there's a two-part short answer. One part is: you can get meaningful behavior out of seemingly contradictory objectives.
There are actually many more examples of that, which I could give you after the lecture if you like. So that's a thing that happens, and this isn't the only case where it happens. But in terms of trying to reconcile the relationships between these different techniques, something that I might actually recommend is a paper by Marc Bellemare, Unifying Count-Based Exploration and Intrinsic Motivation, which discusses this in a little bit more detail. I don't think he covers the model-error version of intrinsic motivation in a lot of detail, but he does talk a lot about the log p of s versus the counts and so on, and that might be a starting point for understanding how these things can be reconciled. Yeah? So this sort of diversity method seems like [inaudible] RL where you're kind of partitioning the state space, but can't you apply it in a more general sense to just any kind of data? [OVERLAPPING] [inaudible] Right. So the question was: can something like this mutual information diversity method be applied to general unstructured data? Well, clustering does exactly this. If you imagine doing, say, a Gaussian mixture model or k-means clustering, that's exactly what it's doing. Can it also, in the same sense, learn tasks that maximize entropy on this data as well? Right. Can it learn tasks? That is a more complex question. I mean, you could imagine clustering as learning tasks: when you cluster the space and say predict the cluster from the data points, that's a task. In fact, that would probably be the closest analogue, because if your classifier is doing p of z given s, then clustering into p of z given x is the closest analog. I think you can probably do more than that, but I don't know. It's a good idea. Any other questions? Yeah. [inaudible] make sure you're pretty good at everything, um, as opposed to trying to match some target distribution, um, and it seems like as your state space gets bigger and bigger, the strategy of sort of being conservative and, you know, preparing for the worst case is less and less useful, [OVERLAPPING] um, and particularly, if you go back to your kitchen example, you know, if your robot spends all day [LAUGHTER] preparing for your worst-case command, obviously that's going to include any command you can say. [OVERLAPPING] You might need a couple of days. Sorry? You might need a couple of days. [LAUGHTER] [inaudible] Um, so I'm curious, you know, if, to make these algorithms useful in practice, it seems like using this sort of pure maximum entropy exploration procedure probably isn't going to be as useful as something where we have [inaudible] some sort of prior about what are useful types of [inaudible] actually [inaudible] question, um, what is your thought on how to go from, how I want to prepare for likely things I might ask you to do in [inaudible] work, into an actual prior that we could inject into this algorithm? Yeah. So the question was essentially: how can we do better than just optimizing for the worst case in very complex environments, where optimizing for the worst case would take forever, basically, where there are a gajillion things you could do?
Two-part answer. One is a very lame answer, which is that the state distribution matching formulation of reinforcement learning kind of aims to do exactly that; it assumes that you're given some distribution. I think my question was, if you want to do that, how do you actually go from, in my brain I have this idea of which things are useful [OVERLAPPING] Yeah. [inaudible] [OVERLAPPING] Yeah. [inaudible] [OVERLAPPING] Yeah. Right. So how do you actually get that distribution? That's a really good question, and I don't really know. And I think that question also gets at something a bit deeper; it's really kind of a design question. Like, what's the right way to communicate these things? Should you use data, should you have some communication channel? But one other answer that I want to give here, which is maybe a little bit of lateral thinking on this topic, is that perhaps it's not so unreasonable to try to learn all the things, if you're doing a little bit better than just density estimation. Because what you can do is figure out that, even though you haven't broken every single dish yet, having broken one of the dishes, you can kind of guess what's going to happen if you break all the others. So maybe you don't have to break every single dish before deciding that you've mastered the dish-breaking skill and can move on to something more productive. So perhaps having this thing be somehow aware of its own capability to generalize might actually cover the space without, like, physically covering the space. Okay? Should we wrap up? Yeah, thanks Sergey. [APPLAUSE]
