State
As the agent interacts with an environment, the process results in a sequence of observations ($O_t$), actions ($A_t$), and rewards ($R_t$), as described previously. At some time step $t$, what the agent knows so far is the sequence of observations, actions, and rewards that it has seen up to time step $t$. It intuitively makes sense to call this the history:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

What happens next at time step $t+1$ depends on the history. Formally, the information used to determine what happens next is called the state. Because it depends on the history up until that time step, it can be denoted as follows:

$$S_t = f(H_t)$$

Here, $f$ denotes some function.
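To make this concrete, here is a minimal Python sketch of the idea. The names used here (`History`, `f`, `step`) are hypothetical and chosen only for illustration: the agent appends every observation, reward, and action to a growing history $H_t$, and its state $S_t$ is whatever function $f$ it chooses to apply to that history.

```python
from typing import Any, List, Tuple

# A minimal sketch: the history H_t is the full record of what the agent has
# seen, and the state S_t is some function f of that history. The names below
# (History, f, step) are hypothetical and used only for illustration.
History = List[Tuple[Any, float, Any]]  # (observation, reward, action) triples

def f(history: History) -> Any:
    """One possible choice of f: keep only the most recent observation."""
    last_observation, _, _ = history[-1]
    return last_observation

history: History = []

def step(observation: Any, reward: float, action: Any) -> Any:
    """Record the latest (O_t, R_t, A_t) and return the state S_t = f(H_t)."""
    history.append((observation, reward, action))
    return f(history)

# Example: after two interactions, the state is simply the latest observation.
step(observation="o1", reward=0.0, action="left")
print(step(observation="o2", reward=1.0, action="right"))  # prints "o2"
```

Of course, keeping only the last observation is just one possible choice of $f$; the point is that the state is derived from the history by whatever rule the designer picks.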
There is one subtle piece of information that is important for you to understand before we proceed. Let's have another look at the general representation of a reinforcement learning system:
Now, you will notice that the two main entities in the system, the agent and the environment, each has its own representation of the state. The environment state, sometimes denoted by $S_t^e$, is the environment's own (private) representation, which the environment uses to pick the next observation and reward. This state is not usually visible/available to the agent. Likewise, the agent has its own internal representation of the state, sometimes denoted by $S_t^a$, which is the information the agent bases its actions on. Because this representation is internal to the agent, it is up to the agent to use any function to construct it. Typically, it is some function of the history that the agent has observed so far.

On a related note, a Markov state is a representation of the state that captures all the useful information from the history. By definition, a state $S_t$ is Markov, or Markovian, if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

which means that the future is independent of the past given the present. In other words, such a state is a sufficient statistic of the future: once the state is known, the history can be thrown away. Usually, the environment state, $S_t^e$, and the history, $H_t$, satisfy the Markov property.
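As an illustration of one common way an agent might construct $S_t^a$ from the history, the following sketch keeps a sliding window of the last k observations. The `StackedObservationState` class is a hypothetical name, not part of any particular library; stacking recent observations is a widely used trick when a single observation on its own is not Markov.

```python
from collections import deque

import numpy as np

# A hypothetical sketch of building an agent state S_t^a from the history:
# keep a sliding window of the last k observations. When a single observation
# is not Markov on its own, stacking recent observations often brings the
# agent state closer to satisfying the Markov property.
class StackedObservationState:
    def __init__(self, k: int = 4):
        self.k = k
        self.window = deque(maxlen=k)  # only the last k observations survive

    def update(self, observation: np.ndarray) -> np.ndarray:
        """Append the newest observation and return the agent state S_t^a."""
        self.window.append(observation)
        frames = list(self.window)
        # Pad with copies of the oldest frame until the window is full.
        while len(frames) < self.k:
            frames.insert(0, frames[0])
        return np.stack(frames)  # shape: (k, *observation.shape)

# Example usage: each call returns a stack of the four most recent observations.
state_fn = StackedObservationState(k=4)
s_a = state_fn.update(np.zeros((84, 84)))  # s_a.shape == (4, 84, 84)
```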
In some cases, the environment may make its internal state directly visible to the agent. Such environments are called fully observable environments. In cases where the agent cannot directly observe the environment state, the agent must construct its own state representation from what it observes. Such environments are called partially observable environments. For example, an agent playing poker can only observe the public cards and not the cards the other players hold, so poker is a partially observable environment. Similarly, an autonomous car with just a camera does not know its absolute location in its environment, which makes the environment only partially observable.
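The following toy sketch illustrates the distinction. The values and the `environment_state`/`observe` helpers are made up for illustration: the environment's private state contains four numbers, but the agent's observation exposes only the first two, so from the agent's point of view the environment is only partially observable.

```python
import numpy as np

# A toy illustration of partial observability. The environment's private state
# S_t^e holds four values, but the agent's observation O_t exposes only the
# first two; anything about the hidden part must be inferred from the history.
# The helpers below (environment_state, observe) are made up for illustration.
def environment_state() -> np.ndarray:
    # e.g. (cart position, pole angle, cart velocity, pole angular velocity)
    return np.array([0.1, 0.02, 1.5, -0.3])

def observe(env_state: np.ndarray) -> np.ndarray:
    """The agent sees positions and angles, but not the velocities."""
    return env_state[:2]

s_e = environment_state()  # full (private) environment state S_t^e
o_t = observe(s_e)         # the observation the agent actually receives, O_t
print(o_t)                 # [0.1  0.02] -- velocities are hidden from the agent
```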
In the next sections, we will learn about some of the key components of an agent.