To answer this question, let's revisit the components of a Markov Decision Process (MDP), the most typical decision-making framework for RL.
An MDP is typically defined by a 4-tuple (S, A, R, T), where

- S is the state/observation space of an environment
- A is the set of actions the agent can choose between
- R(s, a) is a function that returns the reward received for taking action a in state s
- T(s′ | s, a) is a transition probability function, specifying the probability that the environment will transition to state s′ if the agent takes action a in state s.
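To make the 4-tuple concrete, here is a minimal sketch (not from the original text) of a tiny two-state MDP written as plain Python dictionaries; the state and action names are invented purely for illustration.

```python
# A tiny hypothetical MDP with two states and two actions, stored as plain dicts.
# T[(s, a)] maps each next state s' to its probability; R[(s, a)] is the immediate reward.
states = ["s0", "s1"]
actions = ["stay", "move"]

T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```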
Our goal is to find a policy π that maximizes the expected future (discounted) reward.
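Written out as a formula (assuming the usual discount factor γ ∈ [0, 1), which the 4-tuple above leaves implicit), that objective is:

```latex
\pi^{*} = \arg\max_{\pi} \;
\mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} \, R(s_t, a_t)
\;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim T(\cdot \mid s_t, a_t) \right]
```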
Now, if we know what all those elements of an MDP are, we can just compute the solution before ever actually executing an action in the environment. In AI, we typically call computing the solution to a decision-making problem before executing an actual decision *planning*. Some classic planning algorithms for MDPs include Value Iteration, Policy Iteration, and a whole lot more.
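As one concrete illustration, here is a minimal Value Iteration sketch that operates on the dictionary representation from the earlier snippet; the function name, discount factor, and stopping tolerance are choices made for illustration, not something fixed by the algorithm.

```python
# Minimal Value Iteration sketch for a tabular MDP.
# Assumes T[(s, a)] is a dict of next-state probabilities and R[(s, a)] is a scalar reward,
# as in the two-state MDP sketched above.
def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # One-step lookahead value of each action, then take the best.
            q_values = [
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = {
        s: max(
            actions,
            key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()),
        )
        for s in states
    }
    return V, policy
```

Calling value_iteration(states, actions, T, R) on the two-state MDP above would return its optimal state values and a greedy policy, all without ever taking an action in the environment.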
But the RL problem isn’t so kind to us. What makes a problem an RL problem, rather than a planning problem, is that the agent does *not* know all the elements of the MDP, precluding it from being able to plan a solution. Specifically, the agent does not know how the world will change in response to its actions (the transition function T), nor what immediate reward it will receive for doing so (the reward function R). The agent will simply have to try taking actions in the environment, observe what happens, and somehow find a good policy from doing so.
So, if the agent knows neither the transition function T nor the reward function R, preventing it from planning out a solution, how can it find a good policy? Well, it turns out there are lots of ways!
One approach that might immediately strike you, after framing the problem like this, is for the agent to learn a *model* of how the environment works from its observations and then plan a solution using that model. That is, if the agent is currently in state s1, takes action a1, and then observes the environment transition to state s2 with reward r2, that information can be used to improve its estimate of T(s2 | s1, a1) and R(s1, a1), which can be performed using supervised learning approaches. Once the agent has adequately modelled the environment, it can use a planning algorithm with its learned model to find a policy. RL solutions that follow this framework are *model-based RL algorithms*.
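As a rough sketch of the model-learning step (the class name and interface here are illustrative assumptions), a tabular agent can estimate T and R simply by counting observed transitions and averaging observed rewards:

```python
from collections import defaultdict

# Sketch of tabular model learning from experience tuples (s, a, r, s2).
# Transition counts give an estimate of T; running averages give an estimate of R.
class TabularModel:
    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s2: count}
        self.reward_sum = defaultdict(float)                            # (s, a) -> total reward seen
        self.visit_count = defaultdict(int)                             # (s, a) -> number of visits

    def update(self, s, a, r, s2):
        self.transition_counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r
        self.visit_count[(s, a)] += 1

    def estimated_T(self, s, a):
        # Empirical transition probabilities for (s, a); assumes (s, a) has been visited.
        counts = self.transition_counts[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()}

    def estimated_R(self, s, a):
        # Average observed reward for (s, a); assumes (s, a) has been visited.
        return self.reward_sum[(s, a)] / self.visit_count[(s, a)]
```

Once enough experience has been gathered, the estimated T and R can be handed to a planner like the Value Iteration sketch above to produce a policy.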
As it turns out, though, we don’t have to learn a model of the environment to find a good policy. One of the most classic examples is *Q-learning*, which directly estimates the optimal *Q*-values of each action in each state (roughly, the utility of each action in each state), from which a policy may be derived by choosing the action with the highest Q-value in the current state.
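For illustration, a minimal tabular Q-learning sketch might look like the following; the learning rate, the epsilon-greedy exploration, and the hypothetical env.reset()/env.step() interface are assumptions for this example rather than part of the algorithm's definition.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: no model of T or R is ever built, only Q-values.
# `env` is assumed (hypothetically) to expose reset() -> state and step(a) -> (next_state, reward, done).
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration: mostly exploit the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Core update: move Q(s, a) toward the observed reward plus the
            # discounted value of the best action in the next state.
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q  # act greedily with respect to Q to get a policy
```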
*Actor-critic* and *policy search* methods directly search over policy space to find policies that result in better reward from the environment. Because these approaches do not learn a model of the environment, they are called *model-free algorithms*.
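As one simple example of policy search, here is a sketch of a REINFORCE-style policy-gradient update with a tabular softmax policy; the parameterization and the hypothetical env interface are assumptions made for illustration, not a definitive implementation.

```python
import numpy as np

# REINFORCE sketch: a softmax policy over a small discrete state/action space,
# updated directly from sampled returns, with no model of T or R.
# `env` is assumed (hypothetically) to expose reset() -> state index and
# step(a) -> (next_state_index, reward, done).
def reinforce(env, n_states, n_actions, episodes=1000, lr=0.01, gamma=0.99):
    theta = np.zeros((n_states, n_actions))  # one preference per state-action pair

    def policy(s):
        prefs = theta[s] - theta[s].max()  # subtract max for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    for _ in range(episodes):
        # Roll out one episode with the current policy.
        s, done, trajectory = env.reset(), False, []
        while not done:
            a = np.random.choice(n_actions, p=policy(s))
            s2, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s2
        # Walk backwards to compute discounted returns, and push up the
        # log-probability of each taken action in proportion to its return.
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            probs = policy(s)
            grad_log = -probs
            grad_log[a] += 1.0  # gradient of log softmax w.r.t. theta[s]
            theta[s] += lr * G * grad_log
    return theta
```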
So if you want a way to check whether an RL algorithm is model-based or model-free, ask yourself this question: after learning, can the agent make predictions about what the next state and reward will be before it takes each action? If it can, it’s a model-based RL algorithm. If it cannot, it’s a model-free algorithm.
This same idea may also apply to decision-making processes other than MDPs.