Markov Decision Processes (MDPs) – Structuring a Reinforcement Learning Problem



What's up, guys? Welcome back to this series on reinforcement learning. In this video we're going to discuss Markov decision processes, or MDPs. This topic will lay the bedrock for our understanding of reinforcement learning, so let's get to it.

Markov decision processes give us a way to formalize sequential decision making. This formalization is the basis for problems that are solved with reinforcement learning. To kick things off, let's discuss the components involved in an MDP.

In an MDP, we have a decision maker, called an agent, that interacts with the environment it's placed in. These interactions occur sequentially over time. At each time step, the agent gets some representation of the environment's state, and given this representation, the agent selects an action to take. The environment is then transitioned into a new state, and the agent is given a reward as a consequence of its previous action.

So, to summarize, the components of an MDP include the environment, the agent, all the possible states of the environment, all the actions the agent can take in the environment, and all the rewards the agent can receive from taking actions in the environment. This process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, which creates something called a trajectory that shows the sequence of states, actions, and rewards throughout the process.

It's the agent's goal to maximize the total amount of reward it receives from taking actions in given states of the environment. This means that the agent wants to maximize not just the immediate reward, but the cumulative rewards it will receive over time.

Alright, let's get a bit mathy and represent an MDP with mathematical notation. This will make things easier for us going forward. We're now going to repeat what we just casually discussed, but in a more formal and mathematically notated way.

In an MDP, we have a set of states S, a set of actions A, and a set of rewards R. We'll assume that each of these sets has a finite number of elements. At each time step t, the agent receives some representation of the environment's state, S_t. Based on this state, the agent selects an action, A_t, and together this state and this action give us the state-action pair (S_t, A_t). Time is then incremented to the next time step, t+1, and the environment is transitioned into a new state, S_(t+1). At this time, the agent receives a numerical reward, R_(t+1), for the action taken from the previous state. Generally, we can think of this process of receiving a reward as an arbitrary function that maps state-action pairs to rewards.

The trajectory representing this sequential process of selecting an action from a state, transitioning to a new state, and receiving a reward can be written as S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ... The diagram shown here nicely illustrates the entire idea, so let's break it down into steps. Step 1: at time t, the environment is in state S_t. Step 2: the agent observes the current state and selects action A_t. Step 3: the environment transitions to state S_(t+1) and grants the agent reward R_(t+1). This process then starts over for the next time step, t+1.

Now, since the set of states and the set of rewards are finite, the random variables R_t and S_t that represent the reward and the state at time t have well-defined probability distributions. In other words, all the possible values that can be assigned to R_t and S_t have some associated probability.
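To make the interaction loop described above (observe the state, select an action, receive the next state and a reward) concrete, here is a minimal Python sketch. The Environment and Agent classes, the random transitions, and the five-step episode are made-up placeholders for illustration; they are not from the video or the deeplizard blog.

    import random

    class Environment:
        """A toy MDP with a small finite set of states, actions, and rewards."""
        def __init__(self):
            self.states = [0, 1, 2]            # the finite set S
            self.actions = ["left", "right"]   # the finite set A
            self.state = 0                     # current state S_t

        def step(self, action):
            # Transition to a new state and hand back a reward.
            # A real MDP would follow transition probabilities p(s', r | s, a);
            # here the dynamics are arbitrary placeholders.
            next_state = random.choice(self.states)
            reward = 1.0 if action == "right" else 0.0
            self.state = next_state
            return next_state, reward

    class Agent:
        """Selects actions; this toy agent just acts uniformly at random."""
        def select_action(self, state, actions):
            return random.choice(actions)

    env = Environment()
    agent = Agent()
    trajectory = []  # will hold (S_t, A_t, R_(t+1)) tuples

    state = env.state
    for t in range(5):
        action = agent.select_action(state, env.actions)  # agent observes S_t, selects A_t
        next_state, reward = env.step(action)             # environment emits S_(t+1) and R_(t+1)
        trajectory.append((state, action, reward))
        state = next_state

    print(trajectory)  # the sequence of states, actions, and rewards

Running this prints one short trajectory; the agent's goal in a real problem would be to choose actions that maximize the cumulative reward along such trajectories, not just the immediate reward at each step.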
These distributions depend on the preceding state and action that occurred in the previous time step, t-1. So, for example, suppose s' is a state within the set of all states and r is a reward within the set of all rewards. Then there is some probability that the state at time t will be s' and that the reward at time t will be r. This probability is determined by the particular values of the preceding state and preceding action. We have a bit more formal detail regarding transition probabilities in the corresponding blog post for this video on deeplizard.com, so be sure to check that out.

Alright, we now have a formal way to model sequential decision making. How do you feel about Markov decision processes so far? Some of this may take a bit of time to sink in, but if you can understand the relationship between the agent and the environment and how they interact with each other over time, then you're off to a great start. It's a good idea to use the blog for this video to get more familiar with the mathematical notation, because we'll be seeing it a lot in future videos. And while you're at it, check out the deeplizard hivemind for exclusive perks and rewards.

Like we discussed earlier, MDPs are the bedrock for reinforcement learning, so make sure to get comfortable with what we covered here. Next time, we'll build on the concept of cumulative rewards that we introduced earlier. Thanks for contributing to collective intelligence, and I'll see you in the next one.
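As a supplement to the transition probabilities mentioned above, they are usually written as follows. This uses the standard Sutton & Barto notation; I am assuming the deeplizard blog follows the same convention, so treat the exact symbols as an assumption rather than a quote from the video.

    % Probability that, at time t, the environment is in state s' and emits reward r,
    % given that at time t-1 it was in state s and the agent took action a:
    p(s', r \mid s, a) \doteq \Pr\{ S_t = s', \, R_t = r \mid S_{t-1} = s, \, A_{t-1} = a \}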

21 thoughts on “Markov Decision Processes (MDPs) – Structuring a Reinforcement Learning Problem”

  1. Check out the corresponding blog and other resources for this video at:
    http://deeplizard.com/learn/video/my207WNoeyA

  2. Excellent explanation. It would be great if you could make a video series on all the math concepts behind machine learning.

  3. this is by far the best tutorial I've seen about this topic. I'm about to watch the whole series 😀

  4. Are you sure this is Markov? I think you're thinking of Pavlov. I'm looking for Markovian on/off states.

  5. What I learned:
    1. An MDP formalizes the decision making process. (Yeah, everybody teaches the MDP first; nobody told me why until now. It's a strange world.)
    2. The reward R(t+1) results from A_t; before, I always thought R_t was paired with A_t.
    3. The agent cares about the cumulative reward. (For others who don't know.)
