Imran Rashid: Training intelligent game agents using deep reinforcement learning | PyData London 2019



Okay, hi everyone, it's a pleasure to be here. My talk is on training intelligent game agents with deep reinforcement learning. What can you expect? If there were a subtitle to this talk, I would call it "why deep reinforcement learning is so hard to get working". There's a fair bit of maths, and hopefully you've had some experience with deep learning and understand terms like likelihood and stochastic gradient, and you've worked with an autograd library. Even if you're quite new to machine learning, you should still be able to get the intuition for how we train agents with deep reinforcement learning. There's a lot to get through, so don't worry if you don't digest everything.

A quick intro about me: I currently work in a research and development team, I did my masters in physics, and I'm obsessed with deep reinforcement learning; hopefully I can convey some of that passion to you today.

The outline for this talk: first an introduction to what deep reinforcement learning is, then Markov decision processes (MDPs), which are the backbone of reinforcement learning in general, then the three main classes of methods used in reinforcement learning (policy gradients, value-based methods and actor-critic methods), and then we'll conclude.

To start off, it's a good idea to see where reinforcement learning fits into machine learning. Supervised learning is very much about building a model that maps data to labels, and you have a very well-defined task. Unsupervised learning is a bit broader: it usually involves understanding the structure or probability distribution of your data, and techniques like clustering, dimensionality reduction, density estimation and generating new samples fit into it. Reinforcement learning is about maximizing the rewards we get from the environment. It's not entirely supervised learning and it's not entirely unsupervised learning; it sits in its own domain. The main task is that we're trying to learn a behaviour, or policy, in the environment we're in. In supervised learning you already have the data and the labels to start with; here we interact with the environment and only then get data, and how we choose to interact with the environment changes what data we get, which in turn changes how we learn. A good example is a toddler who approaches a fire or a flame for the first time: they're not going to do it again, because they've learnt that that action corresponds to a negative reward. Other things we'll see later are that the feedback can be delayed, and that we usually have a goal in mind but don't necessarily have to. So we don't have labels per se; we have reward signals.

Here are a few typical machine learning tasks; let's discuss where they fall.
Detecting tumours from MRI scans is pretty clearly supervised learning, because you need positive or negative labels to learn what is and isn't a tumour. Compression has no labels, so it's unsupervised. Something like predicting house prices: you'd have features like distance to a station and other things that affect the house price, and you'd build a function approximator that maximizes the likelihood of getting the right prices, so that's supervised. A fake news generator is a weird one. Language models are unsupervised in the sense that you haven't got any true labels; you're maximizing the likelihood that your tokens, your words, your sentences come from your data. But the way you train one is very akin to supervised learning: a text generator is trained as an autoregressive model, where you factorize the probability distribution over your tokens and the label is the next token, so you're maximizing, for example, the probability of token 10 given tokens 0 to 9. Finally, a toddler learning to walk doesn't really fit into unsupervised learning, because you're trying to maximize a reward in your environment, but it's not supervised learning either, because you don't have labels to start with; the toddler interacts with the environment, changes how it moves and so on, and learns how to walk and how not to walk.

I'll go through this quickly. Here are some examples of reinforcement learning. The first one, top left, is the Nature article about AlphaGo, which I think was one of the first big feats of deep reinforcement learning. In robotics you would typically use reinforcement learning, for example, to pick up an object: you might give a reward of minus one everywhere and a large positive reward once the hand reaches the object. Personalised ad recommendations are an example of reinforcement learning, as are self-driving cars, and sometimes you can train machine translation using reinforcement learning too.

Deep reinforcement learning is literally deep learning plus reinforcement learning. You have an environment; say you're trying to solve a Rubik's Cube. You take an action, which is something you can control, and your action is given by your policy, which is what we're trying to learn: given a particular configuration, we're trying to learn an action, and ultimately a policy, that maximizes the total reward. So you have your Rubik's Cube, you take an action, and you end up with a different configuration of the cube; that's the next state. You also get a reward, which could be based, for example, on whether you've got one side all the same colour. Your goal is to have a policy, a set of actions, that maximizes your expected reward.
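To make that agent-environment loop concrete, here is a minimal sketch using the OpenAI Gym API that the notebooks use later in the talk. CartPole-v0 is just a stand-in environment, and the policy here is random rather than learned; this is an illustration, not code from the slides.

```python
import gym

env = gym.make("CartPole-v0")

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()              # a learned policy would map `state` to an action here
    next_state, reward, done, info = env.step(action)
    total_reward += reward                          # the return we are trying to maximize
    state = next_state

print("return for this rollout:", total_reward)
```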
That can be written down as finding the policy parameters theta (our policy is parametrized by theta) that maximize the expected reward, where the expectation is over the probability of a trajectory under the policy, which I've just factorized out.

I'll talk about this briefly. An MDP gives us a formulation of how an agent can interact with its environment, and it's the backbone of RL. In an MDP you're in a particular state, you choose an action, and you get to another state. Sometimes you're in one state, you choose an action and you land in one particular state, but the next time you pass by that state and choose the same action you end up in a different state: the dynamics are governed by transition probabilities. The key thing about MDPs is that they obey the Markov property, which says that the probability of entering the next state given your current state is always the same; it's independent of your history. That's something we need in reinforcement learning for things to work. The state is basically a sufficient statistic; for example, if you were using a recurrent neural network to model a sequence, the latest hidden state would be your sufficient statistic.

To be a bit more formal, an MDP has a set of possible states, for example all the places a delivery drone can be; a set of actions, which could be the different motions of moving forward, back, left, right, up, down, or staying stationary; and transition probabilities. Usually the transitions are deterministic, so if you're in a state and take an action you always end up in the same next state, but in some environments they're not, and you need to consider the transition probabilities carefully. Then you have your rewards, which are usually man-made: in this case, for example, you could give a positive reward when the drone delivers the package to the right house, or a minus-ten reward if it's doing nothing or delivering to the wrong house. Sometimes you also have states which are terminal, meaning you stop and your rollout is finished; an example would be the drone crashing into a tree, or completing its mission.

I'll just quickly run through this. We usually don't want to calculate the exact sum of the rewards, because with an infinite trajectory the sum would diverge if all your rewards are positive. So we apply a discount factor, usually something like 0.999, so that rewards further into the future are still relevant but the total return stays bounded. It also helps us reduce the variance. Usually we want to talk about the expected return rather than the return of a single rollout, and that's the Q function; it can be written as an expectation under your policy, where the pi denotes the policy. If you have a different policy then you'll have a different Q value, because your returns will be different. Another useful function is the value function, which is similar to the Q function, but it's the expected return given only the state you're in.
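To make the discounting concrete, here is a small sketch (my own, not from the slides) of the discounted rewards-to-go for a single rollout:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t of a rollout."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the already-discounted future sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps with a reward only at the end.
print(discounted_rewards_to_go([0.0, 0.0, 1.0], gamma=0.99))  # roughly [0.9801, 0.99, 1.0]
```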
We want to find a policy that visits optimal-value states, or equivalently (you should convince yourself it's the same thing) a policy that always chooses the actions that give us the maximum return. We label those as Q star: Q star is the optimal action value.

Just a quick side note: in the real world we're usually working with partially observable MDPs, which are MDPs that don't obey the Markov property. A good example: if your state is a picture of a car, you know where its location is, but you don't know whether it's driving right or left, backwards or forwards, and you don't know its acceleration. Usually, in order to turn a partially observable MDP into an MDP, we take its previous states into account as well; for example, if you know where the car was in the last few seconds, then you can work out its velocity, its acceleration, and everything you need to know.

The first reinforcement learning method we'll see is a policy gradient method called REINFORCE. Remember, our objective is to find the policy parameters theta that maximize the expected return in our environment; we call that expected return J, which is the usual name for the objective. To find a maximum we need to take its gradient, but that's not straightforward, because you can't directly take the gradient of an expectation. So we use something called the score function trick, also called the log-derivative trick or the REINFORCE trick: if you write the expectation as an integral, then grad of pi can be rewritten as pi times grad of log pi, and we can pull the pi back inside, leaving an expectation of something whose gradient we can compute. The nice thing is that when we factorize the probability of a trajectory under the policy, the initial state and the transition dynamics are independent of theta, so they drop out, and we're left with this objective.

So how do we train this? We can estimate the expectation with a sample average, but the variance of that estimate is very high. One trick is to notice that future actions don't determine past rewards, so at a particular state we only need the rewards-to-go from that point onwards. By the way, the thing inside the bracket, taken in expectation, is the Q function.

So we're effectively computing a weighted maximum-likelihood estimate of our policy. When we're using autograd libraries this objective is quite nice, because we can get rid of the grad and just take this surrogate loss. It can be seen as a product of two parts. The first part is very familiar from supervised learning: it's basically a supervised objective where the actions are the labels and the states are the data. If you're working in a discrete action space, the negative log-likelihood is usually computed with a cross entropy; in the continuous case you assume the distribution is Gaussian and take the squared error between your target and your estimate. The second part is the return, which weights the first: it means we take larger steps where we've got bigger returns.
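Written out, the log-derivative trick described above looks roughly like this (my notation, reconstructed from the spoken description rather than copied from the slides):

```latex
% Score-function (log-derivative) trick:
\nabla_\theta J(\theta)
  = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]

% The trajectory probability factorizes as
p_\theta(\tau) = p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),

% so the initial-state and transition terms do not depend on \theta and drop out of
% \nabla_\theta \log p_\theta(\tau). Using rewards-to-go as well:
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t' \ge t} r(s_{t'}, a_{t'}) \right]
```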
In the first notebook there's a task on implementing the policy gradient, so you might want to take a few minutes to have a look at that. There should have been an email sent out with the GitHub link, but I'll show it here as well; one second while I switch over. It's the first notebook, on the policy gradient.

I'll just quickly explain what's going on here. We're using an OpenAI Gym environment, and with the environment we can do rollouts and collect the states, actions and rewards as we go along. Over here we build a policy network; I tried to start with tf.layers.dense in TensorFlow 1.13 and for some reason it was ten times slower to do the forward pass, I'm not sure why. Here we're computing the policy as a function of our states, and the size of the output should be the number of actions, which is two. Our objective is here: it's the mean of the log policy of the actions taken, times the rewards-to-go. I've also added an entropy term; you don't need to worry about that part too much, but it helps with exploration and makes sure the policy doesn't collapse prematurely. You compute that loss, update your network, and then continue to do rollouts. In CartPole the aim of the game is to get the pole to balance, and after training it does.
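For reference, here is a minimal sketch along the lines of what the notebook builds, written in the TensorFlow 1.x style mentioned above. This is a paraphrase rather than the notebook's exact code: the layer sizes, learning rate and entropy coefficient are made-up values.

```python
import tensorflow as tf  # TF 1.x style, as used in the notebook

n_obs, n_actions = 4, 2  # CartPole observation size and number of actions

states = tf.placeholder(tf.float32, [None, n_obs])
actions = tf.placeholder(tf.int32, [None])           # actions taken during the rollout
rewards_to_go = tf.placeholder(tf.float32, [None])   # discounted returns from each step

hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, n_actions)           # one logit per action
log_probs = tf.nn.log_softmax(logits)

# log pi(a_t | s_t) for the actions that were actually taken
action_log_probs = tf.reduce_sum(tf.one_hot(actions, n_actions) * log_probs, axis=1)

# Surrogate loss: log-prob weighted by reward-to-go, plus a small entropy bonus
# to help exploration and keep the policy from collapsing too early.
entropy = -tf.reduce_sum(tf.exp(log_probs) * log_probs, axis=1)
loss = -tf.reduce_mean(action_log_probs * rewards_to_go + 0.01 * entropy)

train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)
```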
Okay, I'll go back to the slides; does everyone have access to the link? I'm going to try to get back to the PowerPoint now.

Unfortunately, the algorithm we're using is quite poor, because the variance of the gradients is very large. To show this, I compared two gradient methods: the one we're using, REINFORCE, and one called the reparameterization trick. We can't actually use the reparameterization trick here, because we can't differentiate our rewards with respect to our policy parameters, so I set up a slightly weird toy objective; you can think of it as a VAE with a fixed decoder, ignoring the KL divergence term, where we're trying to find the encoder parameters that maximize the likelihood. Both P and Q are Gaussian distributions, and Q's mean and standard deviation are parameterized by a neural network. I'll skip this part. You can ignore the green dot; the black dot is the target we want to reach, and the cross is our current estimate: when the cross converges to the black dot, we've maximized our likelihood. You can't really see it on this screen, but there's supposed to be a disc showing the standard deviation, and the points are all over the place. If you look at the means, the true mean is (5, 5), which is the point it's supposed to converge to, and the cross, the mean of our prediction, just ambles around it, bounces back and forth, and eventually gets there, but it's still taking huge strides; if you use a higher learning rate than this it just blows up and you get NaN values. In contrast, with the pathwise derivative, the reparameterization trick, it moves nicely and steadily towards the black dot. I've just interrupted it, so we're back to where we were. There's a very good paper that explains how to take gradients through Monte Carlo estimates; it's called Gradient Estimation Using Stochastic Computation Graphs, by Schulman et al.

So, can we train this on old data? Collecting new data is very expensive, so sometimes we might want to reuse our old data. But if we look at this objective, it says that the trajectories, the state and action samples, have to come from the current policy; that's why it's called an on-policy policy gradient. You can derive an off-policy version, but it's really nasty: you use something called importance sampling, which is more usually used when you want to re-weight or reduce the variance of an expectation. In this case pi-bar would be the new policy, with the new parameters, and an approximation to this gives us a really horrible objective. The really bad thing about it is that this term over here is a product that grows exponentially with time, so it can blow up or collapse to zero.

Using a baseline can also help reduce the variance of your gradients. The way you can show this is by taking the expectation of the grad-log of your policy times b and showing that it equals zero, so the expected gradient is unchanged. You might be wondering why you would want to do this: if you calculate the variance, as the expectation of the square minus the square of the expectation, you'll find that the variance can decrease with a good baseline. A good baseline is typically the average reward, and we'll come back to that later.
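Here is a tiny illustration (my own, with made-up numbers) of why subtracting a baseline helps: the expected gradient is unchanged, but the second moment of the weights multiplying grad-log-pi shrinks, and that second moment is what drives the variance of the score-function gradient estimate.

```python
import numpy as np

returns = np.array([12.0, 3.0, 8.0, 1.0])     # returns from a batch of rollouts

baseline = returns.mean()                      # action-independent baseline (the average return)
raw_weights = returns                          # multiplies grad log pi without a baseline
centred_weights = returns - baseline           # multiplies grad log pi with the baseline

# Same expected policy gradient either way, but the centred weights have a much
# smaller second moment, which is what lowers the gradient variance.
print(np.mean(raw_weights ** 2), np.mean(centred_weights ** 2))   # 54.5 vs 18.5
```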
The pros and cons of policy gradients: one big pro is that you can learn stochastic policies. The reason that's good is that, for example, in a game of rock-paper-scissors a deterministic policy can easily be exploited by an opponent: if you always decide to play scissors, your opponent can play rock and always win. We'll see another good example in just a bit. You can also work with both discrete and continuous action spaces; generally it's hard to modify an algorithm to work with continuous action spaces, but with this objective there's no reason you can't do both, and it works fine. The gradients are unbiased, which means you can converge to a local optimum. On the cons side, we've seen that the variance is very high, and we can take some measures to reduce that, like using a larger batch size or a smaller learning rate. It's also not straightforward to use off-policy data, and if you've got an environment that's deterministic, it learns less quickly than methods that learn deterministic policies.

Here's a good example of a game I found. You're in an environment where each of these squares is a state you can be in, and obviously you want to get to the money and avoid the skulls. This environment is partially observable: when the agent is on one of the grey squares, it doesn't know whether it's on the left one or the right one, because it sees the same thing either way. If you're learning a deterministic policy, you have to choose one direction for that observation, so if you start on the left you can get stuck in a loop and never reach the goal; with a stochastic policy you'd have an even probability of going left or right.

So we've looked at policy-based methods, and now we're going to look at value-based methods. The question is: can we avoid policy gradients completely? We return to our initial objective, which is that we want to visit high-Q-value states. Remember, the Q value, given a state and an action, is the expected return from that state, and we want to visit the optimal ones: whenever we're in a state and presented with different actions, we want to take the action whose Q value is highest. Obviously not all actions are equal, and if we know the optimal action values we've essentially solved the problem, so we can try to learn the optimal Q values directly; we'll talk about that in a bit.

One method we can use is something called bootstrapping, which says that we can get a better estimate of the value of our current state by using the value of the next state. The intuition is that rewards flow from future time steps back to the present time step. Another way of looking at it is that value estimates at later time steps are more reliable: if you're one step before a terminal state, the variance is much lower, because there are only a few actions you can take and a few rewards you can get, whereas at the start of your MDP there are lots of possible returns.

For a finite number of states, the Bellman optimality operator is defined so that the value of a state becomes the maximum over actions of the reward plus gamma times the expected value of the next state; the T here is the transition dynamics matrix, but you can think of it as taking an expectation over the next values. It can be shown that this is a contracting operator, which basically means that if we keep applying the Bellman optimality operator, we converge to the optimal value function, and the same applies to Q values too.
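Here is a minimal tabular sketch of repeatedly applying that backup on a made-up two-state, two-action MDP (an illustration, not an example from the talk), just to show the convergence:

```python
import numpy as np

# Made-up MDP: T[s, a, s'] are transition probabilities, R[s, a] are rewards.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(100):
    # Bellman optimality operator: V(s) <- max_a [ R(s, a) + gamma * E_{s'}[ V(s') ] ]
    V = np.max(R + gamma * (T @ V), axis=1)

print("converged state values:", V)
```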
Unfortunately, when we're using deep learning we don't have any guarantees on convergence. The reason is that we're usually working with continuous states, so we don't construct the matrix as in tabular Q-learning. Instead, y here is the target value, basically the reward plus the maximum next value (there should really be a Bellman operator B here), and we want to minimize the Bellman error. What you can show is that when you put these two pieces together, you no longer get a contraction; you can still expect reasonably good results, but this problem is known as the deadly triad, and if you want to know more about it you can look in Sutton and Barto's book.

For deep Q-networks, our target is the reward plus gamma times the maximum Q value over actions in the next state, and we want to minimize this error. I've marked the target Q separately because we don't actually want to backpropagate gradients through the target value; we're using it as a reference. Unfortunately, we're no longer doing maximum likelihood estimation, because the target y isn't the true value; it's still an estimate, so the target Q value is biased, and on top of that the target and estimated Q values are correlated.

To understand the instability problem in Q-learning a bit more, we can think of our Q values as a net, where the shape of the net is parameterized by theta, our neural network parameters. When we see a local Q value that's larger than our current estimate, we update the parameters using stochastic gradient descent and pull the net higher. Unfortunately, when we pull the net higher we also pull up the neighbouring points, and we might not want to do that. If we update the parameters at every time step with consecutive samples, we effectively keep pulling up the net in the same place.

One way to mitigate this is by randomizing the states, actions and rewards that we feed into our objective, and that's called a replay buffer. The replay buffer stores your (state, action, reward, next state) tuples, and when you want to update the Q value you sample tuples at random, which ensures you're not always picking states and state-action pairs that sit next to each other, so you don't get the problem we just discussed. It's really easy to implement and there's no correction term: we basically just change the expectation so that the tuples now come from the buffer, and that's our objective.
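A minimal sketch of such a replay buffer (my own version, not the notebook's; the done flag is an addition that implementations typically store as well):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```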
Another way we can mitigate the issue is by decoupling the target network parameters from the estimated Q network. There are two places we can get the target parameters from. One is a totally separate Q network trained on different samples, which would help increase stability; but what's usually used is an older saved copy of our Q network. The downside is that the older Q values are still somewhat correlated with the current Q values. One way of adjusting the target network parameters is Polyak averaging, where we slowly update the target parameters towards the current ones; alternatively, as in this diagram, after every K steps you transfer your current theta to your target Q network's theta.

Next, choosing an exploration policy. When we run the algorithm we need to make a decision about which actions to take. Initially you might think you want to take the actions that maximize the Q value, because then you're more likely to go into regions with higher reward, but sometimes, if you explore uncharted areas, they can turn out to have much higher Q values, and a good exploration policy can be the difference between a really bad algorithm and a really good one. One common choice is called epsilon-greedy: with a certain probability you choose the action with the maximum Q value, and otherwise you choose another action at random. There are some other exploration heuristics; a common one is called Boltzmann exploration, where you take each action with probability proportional to the exponential of the Q function for the different actions.

So that's an overview of deep Q-learning: you save your target network parameters, you run a rollout (and you can choose which exploration policy to use for the rollout), then you sample some tuples from your replay buffer and train on this loss. You can tweak the algorithm in different ways; for example, you can choose to loop from step four back to step three in a cycle if gathering data is very expensive.
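Here is a small sketch of the two exploration heuristics just mentioned, given a vector of Q values for the current state (the function names and parameter values are mine):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature).
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```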
Another problem in Q-learning is that we overestimate our Q values. When we minimize this Bellman error, or temporal difference error, we're constantly minimizing the entire Bellman error, but our Q values are noisy. Ideally you'd want to take the Bellman difference of the true Q values for your target and current states, that is, take an expectation of those Q values; what you're in fact doing is taking an expectation over the maxes, and that means you're overestimating your target Q value. Double Q-learning gets around this by decorrelating the noise in the actions we select from the noise in the values, using a fairly subtle method: they use a different network to select the action from the network used to evaluate its value. So that's our current target value, and the one thing we change is that we now take the argmax using the current Q network's parameters rather than the target parameters, and that works to decorrelate the noise. In the double Q-learning paper (I'm not sure how they obtained the true values) they found that when double Q-learning was applied, the value estimates were much more accurate.

There's a notebook for Q-learning as well; it's this one here, called double DQN in Breakout (I added the 3 later for my own reference), so you might want to take a couple of minutes to look at that.

There are a few things I want to point out here. We're working in the Breakout environment, and this is an example of a partially observable MDP: given our current state, the single frame that would usually be fed into the network, you don't know whether the ball is moving right or left. The way this issue is usually avoided for Atari games is by stacking the previous frames as well, so here you've got four previous frames stacked together. Another subtlety you might notice is that our Q function is a function of state and action, but we only feed the state into the network, because we want the Q values for all possible actions: the output of the network is the Q value for that particular state and each of the different actions. The reason we do it this way is that it makes the argmax very easy, because you can just select the highest Q value. Another subtlety: with normal DQN the target values are straightforward to implement, and the only extra thing you need for double DQN is to take the argmax of the action values from the one network and evaluate that action with the other. One more thing: we use tf.stop_gradient for our target values because we don't want gradients to propagate through them; actually, in this particular case, the way it's set up you don't need tf.stop_gradient, because when you've got a separate copy of the network the gradients can't backpropagate through it anyway.
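For reference, here is a minimal sketch of the double-DQN target just described, in plain NumPy for clarity; the argument names are my own and `q_online_next` / `q_target_next` simply stand for the two networks' Q-value outputs at the next state.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Standard DQN takes the max over q_target_next directly; double DQN instead
    selects the action with the online network and evaluates it with the target network."""
    best_actions = np.argmax(q_online_next, axis=1)                    # selection: online net
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]   # evaluation: target net
    return rewards + gamma * (1.0 - dones) * evaluated

# Example with a batch of two transitions and three actions.
q_online_next = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.3]])
q_target_next = np.array([[0.9, 1.5, 0.7], [0.4, 0.2, 0.6]])
print(double_dqn_targets(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                         q_online_next, q_target_next))
```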
Okay, so that connects everything. The last group of algorithms we're looking at is actor-critic algorithms. We've looked at policy-based methods and we've looked at value-based methods, and we want to know whether there's a way of combining the two. With the policy-based methods we saw that the gradients were unbiased but the variance was quite large; the intuition for why the variance is so large is that when you're taking rollouts you can get lots of different possible returns, so under the same policy you can have one very high return and one very low return. With Q values we found that they were biased, because the Bellman operator argument can't be applied in the deep learning setting, but the variance is very low, because you're just taking the mean squared error between your current Q value and the estimate of the next Q value, and the next Q value won't change much as you train.

If we look at our policy gradient objective, in place of the weighting term Phi we have the expected return: if we take Phi to be the expected return from running our policy, we have the policy gradient algorithm, and we can apply a control variate, or baseline, to reduce the variance. But this is exactly the same thing as the Q value. The reason I've put the Q value over there as well is that you can parameterize the Q value with a neural network, so it will be biased, but the objective is the same and you'll have lower variance. A good baseline to use is the value of your current state, and when you take Q minus V it's called the advantage. The intuition is that the advantage of a particular action in a particular state tells you how much better that action is compared to the other actions available. We saw that we can apply a baseline as long as it's independent of the actions, so the expectation of the gradients won't change, but their variance will be lower. With the advantage written as Q minus V you would train two different neural networks; if you want to use only one, you can write Q as the reward you get after taking a particular action plus the expected sum of rewards if you then run the policy, which is just the reward plus the discount factor times the value of the next state.

This is a typical algorithm for the Q actor-critic. One thing to point out is that you've got two different neural networks and you need to update Q as well, using the Bellman error, or temporal difference error; this objective is still biased because of that Q, but the variance is greatly reduced. Here's the one for the advantage actor-critic, and it's very similar except that we've used the advantage: on the last line, where you update theta using the weighting times the gradient of log pi, you just change the Q to the advantage.

There's a notebook for this particular algorithm too; it's the last one, called advantage actor-critic in the Kung Fu gym environment. You're training two different networks, one for your policy and one for your value, but other than that there's not much that's different from the policy gradient. I've trained it, so when you get home you can have a look at it as well, and it works pretty well. I'm going to go back to the slides.
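A minimal sketch of the advantage actor-critic update just described, in the same TensorFlow 1.x style as the earlier policy-gradient sketch; the shared trunk, sizes and coefficients are my own choices (the notebook trains two separate networks for the policy and the value).

```python
import tensorflow as tf  # TF 1.x style, matching the earlier sketch

n_obs, n_actions = 4, 2
gamma = 0.99

states = tf.placeholder(tf.float32, [None, n_obs])
actions = tf.placeholder(tf.int32, [None])
rewards = tf.placeholder(tf.float32, [None])
next_values = tf.placeholder(tf.float32, [None])   # V(s') evaluated by the critic beforehand

hidden = tf.layers.dense(states, 64, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, n_actions)                    # actor head
value = tf.squeeze(tf.layers.dense(hidden, 1), axis=1)         # critic head

# Advantage estimate: A(s, a) = r + gamma * V(s') - V(s)
advantage = rewards + gamma * next_values - value

log_probs = tf.nn.log_softmax(logits)
action_log_probs = tf.reduce_sum(tf.one_hot(actions, n_actions) * log_probs, axis=1)

# Actor maximizes log pi(a|s) * A; critic minimizes the squared temporal-difference error.
actor_loss = -tf.reduce_mean(action_log_probs * tf.stop_gradient(advantage))
critic_loss = tf.reduce_mean(tf.square(advantage))
train_op = tf.train.AdamOptimizer(1e-3).minimize(actor_loss + 0.5 * critic_loss)
```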
Let's take another look at the advantage estimates. We were minimizing the temporal difference error between the current value and the next value state; you can expand the next value state as a reward plus gamma times the value state two steps on, and you can keep doing that, and eventually you basically get the policy gradient with a baseline, which is your value function. So the range here is that you have high bias and low variance at the top, and low bias and high variance at the bottom. The reason you have high variance at the bottom is that the rewards from your rollout can have a high variance: under the same policy you can get a great range of returns. The reason you have low variance at the top is that the only variance comes from the reward you get when you take an action in that state and your estimate of the next state's value.

I'll skip this. One of the more advanced algorithms is called DDPG, deep deterministic policy gradient. If we look back at Q-learning, we had to take an argmax over the Q values; if you're working with a continuous action space it's very difficult to find a way to take that argmax. The intuition behind DDPG is that you use a deterministic policy, which depends on your state, and you differentiate the Q values with respect to your policy parameters, so you can do gradient ascent and get a policy that maximizes the Q values. So there are two losses: the critic's loss, which is just computed with the Bellman error, and the actor's loss, which you can compute using the chain rule.

I won't go too much into this other algorithm, which is called trust region policy optimization. Basically it looks at the high-variance issue in the usual policy gradient. The intuition behind it is that if you've got a policy parametrized by theta and you change theta by a small amount, you can end up with a completely different policy, especially if the change affects the actions towards the beginning of the trajectory. The solution is to constrain the new policy to stay close to the old policy with a KL divergence, constrained to be less than some delta, and it works much, much better than the other algorithms we've seen so far, including DDPG.
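Going back to the family of advantage estimators at the start of this section, here is a small sketch (my own, with made-up numbers) of an n-step advantage estimate: n = 1 is the low-variance, high-bias end of the range, and larger n moves towards the Monte Carlo, high-variance, low-bias end.

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_t ~ r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n}) - V(s_t).

    values[k] is the critic's estimate V(s_k); the rollout is assumed long enough."""
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * values[t + n] - values[t]

rewards = [1.0, 0.0, 0.0, 2.0, 0.0]
values  = [3.0, 2.5, 2.0, 2.2, 0.5, 0.0]
print(n_step_advantage(rewards, values, t=0, n=1))   # relies heavily on V: low variance, higher bias
print(n_step_advantage(rewards, values, t=0, n=3))   # uses more real rewards: higher variance, lower bias
```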
Lastly, I want to talk about why deep reinforcement learning isn't really ready to be deployed at large. It's really hard, for a number of reasons. Classical robotics and optimal control techniques often already outperform RL. It's data hungry: when you're working with complex environments you usually have to wait a very long time before you see your rewards start to pick up. It also requires a reward function, which is man-made, and there are environments where it's very difficult to see how to allocate rewards in the best way, so you can fall into suboptimal behaviours; I'll show an example of that on the next slide. It's poor at generalization and transfer learning, and it's very sensitive to hyperparameter tuning. That can make it really frustrating, because if you've built a new algorithm, you have to wait a very long time before you can work out whether it's any good: for a long time your rewards are just going to sit close to zero or plateau, and the variance and noise in reinforcement learning are very high.

Here's an example in the MuJoCo environment, which I encourage you to get if you can; if you're a student you can download it for free, and it's compatible with OpenAI Gym. For this particular environment I think they used an algorithm called normalized advantage functions, and it dropped into a suboptimal policy. When you're in a suboptimal policy like that it's very difficult to get out of it, because you're sitting in a relatively high-reward region.

Some ongoing research areas: multi-agent reinforcement learning and model-based reinforcement learning. Multi-agent reinforcement learning is when you've got lots of different agents, which could be cooperating or competing for some goal, and it's quite challenging because you're not really working in an MDP any more: your environment becomes dynamic with the introduction of the other agents, so you can't use your replay buffer or the states and actions the way you did before. Model-based RL is another area, where you learn a model of the environment, for example the transition dynamics, and sometimes that can really boost your performance. There's also meta-RL, which is about learning to reinforcement learn, representation learning, effective exploration techniques, imitation learning, inverse RL (given a policy, can you reverse-engineer what the rewards in the environment should be), and transfer learning.

I'd like to end with a nice quote from Alex Irpan, who works at Google Brain: "Several times now I've seen people get lured by recent work. They try deep reinforcement learning for the first time, and without fail they underestimate its difficulties. Without fail, the toy problem is not as easy as it looks, and the field destroys them a few times until they set realistic expectations." So, thanks for listening; that's the end of my talk. [Applause]
