State Representation

We want the Learning Agent to learn a general policy that works in any environment, independently of the locations of the landmarks and targets. Hence, our state representation must not directly employ the locations of the landmarks. Moreover, the robot cannot directly observe the complete state of the environment, which would include the locations of the robot, all obstacles, and all landmarks. Instead, the task of the robot is to learn, under conditions of incomplete knowledge, about the locations of obstacles, landmarks, and targets.

State spaces that encode incomplete knowledge are known as ``belief state spaces'' [15]. The purpose of a belief state representation is to capture the current state of knowledge of the agent rather than the current state of the external world. In our case, the Learning Agent is trying to move from a starting belief state, in which it knows nothing, to a goal belief state, in which it is confident that it is located at the target location. Along the way, it seeks to avoid getting lost, that is, reaching a belief state in which it does not know its location relative to the target position.

To explain our state representation, we begin by defining a set of belief state variables. We then explain how these are discretized into a small set of features, each taking on a small set of values, so that $P(s'\vert s,a)$ and $R(s,a,s')$ can be represented with small tables.
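As an illustration of this tabular requirement, the following minimal sketch (our own, not the implementation used in the thesis; the class name \texttt{TabularModel} is hypothetical) shows how $P(s'\vert s,a)$ and $R(s,a,s')$ could be estimated by maximum likelihood once states and actions are discretized:

\begin{verbatim}
from collections import defaultdict

class TabularModel:
    """Maximum-likelihood tables for P(s'|s,a) and R(s,a,s'),
    assuming discretized, hashable states and actions
    (e.g. tuples of feature values)."""

    def __init__(self):
        self.sa_counts = defaultdict(int)      # (s, a)     -> visits
        self.sas_counts = defaultdict(int)     # (s, a, s') -> visits
        self.reward_sums = defaultdict(float)  # (s, a, s') -> summed reward

    def update(self, s, a, s_next, r):
        # Record one observed transition and its reward.
        self.sa_counts[(s, a)] += 1
        self.sas_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a, s_next)] += r

    def P(self, s, a, s_next):
        n = self.sa_counts[(s, a)]
        return self.sas_counts[(s, a, s_next)] / n if n else 0.0

    def R(self, s, a, s_next):
        n = self.sas_counts[(s, a, s_next)]
        return self.reward_sums[(s, a, s_next)] / n if n else 0.0
\end{verbatim}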

At any given point in time, the headings to all objects (landmarks and the target position) are divided into six sectors of 60 degrees each. Since the field of view of the robot is also 60 degrees, the robot can observe exactly one sector at any point in time (see Figure 5.5). For each sector, we represent the number of landmarks believed to be in that sector and the precision of our beliefs about their headings and distances. This information is gathered from an initial version of the Visual Memory, to which the Learning Agent has access and which constantly updates the locations of the seen landmarks.

Figure 5.5: Division of the environment into sectors. The arrow shows the direction in which the robot is facing (direction of motion, not direction of gaze).
\includegraphics[height=2.7cm]{figures/RL/sectors}
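To make the sector geometry concrete, here is a minimal sketch of how the sector index of an observed object could be computed. It assumes headings measured in radians and takes sector 0 to be centered on the robot's facing direction; the thesis does not fix a particular indexing convention:

\begin{verbatim}
import math

SECTORS = 6
SECTOR_WIDTH = 2 * math.pi / SECTORS  # 60 degrees, matching the field of view

def sector_of(heading, robot_facing):
    """Return the sector index (0..5) of an object, given its absolute
    heading and the robot's facing direction, both in radians.
    Sector 0 is assumed to straddle the facing direction."""
    relative = (heading - robot_facing) % (2 * math.pi)
    # Shift by half a sector so that sector 0 is centered on the
    # facing direction rather than starting at it.
    return int(((relative + SECTOR_WIDTH / 2) % (2 * math.pi)) // SECTOR_WIDTH)
\end{verbatim}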

Given these sectors, the following state variables can be defined:

\begin{itemize}
\item $D(l)$, $D(t)$: the distance to a landmark $l$ or to the target $t$;
\item $H(l)$, $H(t)$: the heading to a landmark or to the target;
\item $I(l)$: the imprecision of a landmark, combining the imprecision of its distance, $I_d(l)$, and of its heading, $I_h(l)$;
\item $\overline{I}(s)$: the average imprecision of the landmarks in sector $s$.
\end{itemize}

We now explain each of these. The distance $D(l)$ to a landmark (or $D(t)$ to the target) is a fuzzy number in the range $[0,\infty)$. The heading $H(l)$ to a landmark (or $H(t)$ to the target) is a fuzzy number with range $[0,2\pi]$. For each of these, its imprecision ($I_d(l)$ for distance, $I_h(l)$ for heading) is defined as the size of the interval corresponding to the 70\% $\alpha$-cut of the fuzzy number.
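As a concrete example, assuming triangular fuzzy numbers (the section above does not prescribe a particular shape), the imprecision is the width of the interval on which the membership function is at least $0.7$:

\begin{verbatim}
def alpha_cut_width(a, b, c, alpha=0.7):
    """Width of the alpha-cut of a triangular fuzzy number with
    support [a, c] and peak b: the cut is the interval
    [a + alpha*(b - a), c - alpha*(c - b)]."""
    lo = a + alpha * (b - a)
    hi = c - alpha * (c - b)
    return hi - lo

# Imprecision of a distance believed to be "about 5 m,
# certainly between 3 m and 9 m":
I_d = alpha_cut_width(3.0, 5.0, 9.0)  # (1 - 0.7) * (9 - 3) = 1.8 m
\end{verbatim}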

The imprecision of a landmark is computed using Equation 3.3, already given in Section 3.2.2:

\begin{displaymath}
I(l)=\lambda \cdot \tanh(\beta \cdot I_d(l)) +
(1-\lambda) \cdot \frac{I_h(l)}{2\pi}
\end{displaymath}

For an explanation of the equation, see that section.
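A direct transcription of Equation 3.3 is straightforward; the values of $\lambda$ and $\beta$ below are placeholders, since the actual values are fixed in Section 3.2.2:

\begin{verbatim}
import math

def landmark_imprecision(I_d, I_h, lam=0.5, beta=1.0):
    """Equation 3.3: the distance imprecision is squashed into
    [0, 1) by tanh, the heading imprecision is normalized by 2*pi,
    and lam weights the two terms.  The values of lam and beta
    here are placeholders."""
    return lam * math.tanh(beta * I_d) + (1 - lam) * I_h / (2 * math.pi)
\end{verbatim}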

We summarize the agent's knowledge of the landmarks in each sector by averaging the imprecision of the four most precisely known landmarks. The function $Best: N \times 2^L \rightarrow 2^L$ selects a subset $B=Best(n,L)$ of a group of landmarks $L=\{l_1,\ldots,l_m\}$ such that $\vert B \vert \leq n \wedge \forall_{l \in B}\,\forall_{l' \in L-B}\, I(l) \leq I(l')$; that is, every selected landmark is at least as precise as every discarded one. Having four landmarks in one sector is already very good, since only three landmarks are needed to use the beta-coefficient system. Furthermore, we do not want these measures to be affected by bad landmarks when we have some that are good enough. That is why we use $Best(4,L(s))$, where $L(s)$ is the set of landmarks in sector $s$, when computing $\overline{I}(s)$.
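A minimal sketch of $Best(n,L)$ and of the resulting sector summary $\overline{I}(s)$ could look as follows; returning maximal imprecision for an empty sector is our own assumption, not something specified above:

\begin{verbatim}
def best(n, landmarks, imprecision):
    """Best(n, L): the (at most) n landmarks with lowest imprecision,
    so every selected landmark is at least as precise as every
    discarded one."""
    return sorted(landmarks, key=imprecision)[:n]

def sector_imprecision(landmarks, imprecision):
    """The state variable I-bar(s): average imprecision of the
    best 4 landmarks in a sector."""
    chosen = best(4, landmarks, imprecision)
    if not chosen:
        return 1.0  # assumption: maximal imprecision for an empty sector
    return sum(imprecision(l) for l in chosen) / len(chosen)
\end{verbatim}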
