We have employed the Webots simulator to perform our experiments. The environment contains a set of landmarks, one of which is designated as the target, and a wall that surrounds the region in which the robot navigates. The landmarks are the only objects in the environment; there are no obstacles, since obstacle avoidance is handled by the Pilot system. However, the robot can be blocked by the landmarks or by the wall. In each trial, the robot starts at a random location in this environment and has to reach the target. The trial terminates under any of three conditions: (a) the robot reaches the target (and is confident that it has reached it), (b) the robot takes 500 steps without reaching the target, or (c) the robot becomes blocked. When a trial finishes, the next one begins from another random initial location.
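The trial protocol can be summarized as a simple control loop. The sketch below is only an illustration, assuming a hypothetical `env`/`policy` interface (none of these names come from the system itself); the three termination conditions are the ones listed above.

```python
# A minimal sketch of the trial protocol described above; it is not the actual
# Webots controller. `env` and its methods (reset_to_random_location, execute,
# robot_confident_at_target, robot_blocked) are hypothetical stand-ins; only
# the three termination conditions come from the text.

MAX_STEPS = 500

def run_trial(env, policy):
    """Run one trial from a random start location and return its outcome."""
    state = env.reset_to_random_location()
    for _ in range(MAX_STEPS):
        action = policy(state)
        state = env.execute(action)
        if env.robot_confident_at_target():   # (a) target reached, and the robot is confident of it
            return "reached_target"
        if env.robot_blocked():               # (c) blocked by a landmark or the wall
            return "blocked"
    return "step_limit"                        # (b) 500 steps without reaching the target
```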
To see whether the performance of the system improves after learning, we compared it with a hand-coded policy. The hand-coded policy used the same discretized features as the learning algorithm (Target Distance, Landmark Count, Landmark Imprecision, and Target Location Imprecision). The following table shows the policy for choosing an action depending on the values of these features:
| Target Distance | Landmark Count | Landmark Imprecision | Target Location Imprecision | Action |
|---|---|---|---|---|
| … | … | … | … | MLL |
| … | … | … | … | MVL |
| … | … | … | … | MOT |
| … | … | … | … | MB |
| … | … | … | … | MVL |
| … | … | … | … | MVT |
| … | … | … | … | MVT |
| … | … | … | … | MB |
where the discretized values of these features are defined as follows:

| Variable | … | … | … |
|---|---|---|---|
| Target Distance | … | … | … |
| Target Location Imprecision | - | … | … |
| Landmark Count | - | … | … |
| Landmark Imprecision | - | … | … |
The reader should note that this hand-coded policy is not the same as the policy produced by the hand-coded bidding functions described in Chapter 4. We have chosen this policy because it allows us to debug and test the Learning Agent separately from the rest of the multi-agent system.
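Because the condition values in the tables above are not reproduced here, the sketch below only illustrates the general shape of such a hand-coded policy: discretize the four features and run the result through a fixed rule list. Every threshold, category name, and rule in it is a hypothetical placeholder, not the policy actually used; only the feature names and the five action labels (MB, MOT, MVT, MVL, MLL) come from the text.

```python
# Illustrative shape of a hand-coded policy over the four discretized features.
# All cut-points, category names, and rules are hypothetical placeholders; the
# real policy used the (unreproduced) values from the tables above.

def discretize(target_distance, landmark_count, lmk_imprecision, target_loc_imprecision):
    """Map raw feature values onto coarse categories (hypothetical cut-points)."""
    return {
        "target_distance": "near" if target_distance < 1.0 else
                           "medium" if target_distance < 3.0 else "far",
        "landmark_count": "few" if landmark_count < 3 else "enough",
        "lmk_imprecision": "low" if lmk_imprecision < 0.5 else "high",
        "target_loc_imprecision": "low" if target_loc_imprecision < 0.5 else "high",
    }

def hand_coded_policy(raw_features):
    """Return one of MB, MOT, MVT, MVL, MLL from a dict of raw feature values."""
    f = discretize(**raw_features)
    if f["landmark_count"] == "few":
        return "MLL"   # hypothetical rule: too few landmarks known, go looking for new ones
    if f["lmk_imprecision"] == "high":
        return "MVL"   # hypothetical rule: landmark estimates too imprecise, verify them
    if f["target_loc_imprecision"] == "high":
        return "MVT"   # hypothetical rule: target estimate too imprecise, verify it
    if f["target_distance"] == "far":
        return "MOT"   # hypothetical rule: far from the target, move towards it
    return "MB"        # hypothetical rule: close and confident, move blind (camera free)
```

Under these placeholder rules, for example, `hand_coded_policy({"target_distance": 2.4, "landmark_count": 5, "lmk_imprecision": 0.2, "target_loc_imprecision": 0.8})` returns `"MVT"`.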
The Learning Agent was trained for 2000 simulated trials. At regular intervals, the learned value function was tested by placing the robot in 100 randomly-chosen starting locations, running one trial from each location, and measuring the total reward, the total number of actions, and whether the robot succeeded in reaching the target position. The same set of 100 starting locations was employed in each testing period. The hand-coded policy was also evaluated on these 100 starting locations.
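A minimal sketch of this evaluation protocol, assuming a hypothetical `run_trial_from` helper that runs one trial from a given start location and reports its outcome, total reward, and number of actions; the fixed set of 100 start locations and the three measurements are the only details taken from the text.

```python
# Periodic evaluation on a fixed set of starting locations, as described above.
# `run_trial_from(env, policy, location)` is a hypothetical helper that returns
# (outcome, total_reward, n_actions) for one trial.

import random

def make_test_locations(n=100, seed=0, bound=5.0):
    """Draw the fixed set of start locations once; every test period reuses it."""
    rng = random.Random(seed)
    return [(rng.uniform(-bound, bound), rng.uniform(-bound, bound))
            for _ in range(n)]  # arena bounds are placeholders

def evaluate(env, policy, locations, run_trial_from):
    """Run one trial per start location and report aggregate statistics."""
    successes, total_reward, total_actions = 0, 0.0, 0
    for loc in locations:
        outcome, reward, n_actions = run_trial_from(env, policy, loc)
        successes += (outcome == "reached_target")
        total_reward += reward
        total_actions += n_actions
    n = len(locations)
    return {
        "success_rate": successes / n,
        "reward_per_trial": total_reward / n,
        "actions_per_trial": total_actions / n,
    }
```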
First, let us consider the fraction of successful trials. Figure 5.6 shows that even after only 100 trials, the Learning Agent already outperforms the hand-coded policy. After 2000 trials, the Learning Agent succeeds in reaching the target in 84 of the 100 test trials, compared to only 24 for the hand-coded policy. These results also show that our hand-coded policy was rather poor. Although we could have tried to rewrite the policy to improve its performance, the results show that Reinforcement Learning can greatly help in solving complex tradeoffs that are very difficult to handle manually.
A second way of analyzing the performance of the Learning Agent is to compute the average reward per trial, the number of actions per trial, and the number of actions of each type. Table 5.1 displays this information after 2000 training trials. Each value is averaged over five test runs; the only difference between test runs is the random number seed for the Webots simulator. We see that while the hand-coded policy receives an average of -858 units of reward per trial, the learned policy only receives -336 units, which is a huge improvement. In addition, the Learning Agent requires on average only 50 steps to terminate a trial (reach the goal, become blocked, or execute 500 steps), compared to 153 steps for the hand-coded policy. In fact, the Learning Agent never terminates a trial by reaching the 500-step limit.
|  | Reward per trial | Actions per trial | MB | MOT | MVT | MVL | MLL |
|---|---|---|---|---|---|---|---|
| HC | -858 | 153.33 | 4.94 | 18.59 | 0.52 | 121.96 | 7.32 |
| LA | -336 | 49.95 | 11.41 | 6.52 | 5.61 | 4.97 | 21.43 |

Table 5.1: Average reward, number of actions, and number of actions of each type per trial for the hand-coded policy (HC) and the Learning Agent (LA), averaged over five test runs.
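The per-action columns of Table 5.1 amount to counting how often each action type is executed in a trial and averaging over the five seeded test runs. A minimal sketch, assuming per-trial action logs (the data structure and function name are hypothetical):

```python
# Averaging the number of actions of each type per trial over several test runs,
# as reported in Table 5.1. `runs` is a hypothetical list of test runs, each a
# list of trials, each a list of executed action labels.

from collections import Counter

ACTIONS = ["MB", "MOT", "MVT", "MVL", "MLL"]

def average_action_counts(runs):
    per_run_averages = []
    for trials in runs:
        counts = Counter(action for trial in trials for action in trial)
        per_run_averages.append({a: counts[a] / len(trials) for a in ACTIONS})
    # Average each action's per-trial count over the runs (five in the experiments above).
    return {a: sum(r[a] for r in per_run_averages) / len(runs) for a in ACTIONS}
```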
Table 5.1 contains other interesting information. In particular, we see that the Learning Agent has learned to perform fewer MOT and MVL actions and more MB, MVT, and MLL actions. Note particularly that the Learning Agent is executing an average of 11.4 MB (Move Blind) actions per trial, compared to only 4.9 for the hand-coded policy. One of the goals of applying Reinforcement Learning was to find a policy that freed the camera for use by the low-level obstacle avoidance routines, and this is exactly what has happened: the hand-coded policy uses the camera 96% of the time, while the Learning Agent uses it only 77% of the time. On the other hand, we were surprised to see that the Learning Agent chooses to execute the most expensive action, MLL, so often (21.4 times per trial, compared to only 7.3 times per trial for the hand-coded policy). Certainly, it has found that a mix of MLL and MB gives better reward than the combination of MVL and MOT that is produced by the hand-coded policy. The Learning Agent spends much more time looking for new landmarks and much less time verifying the direction and distance to known landmarks.
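The camera-usage percentages quoted above follow directly from Table 5.1, assuming (as the figures suggest) that MB is the only action that does not use the camera: the camera is needed for all actions per trial except the MB actions.

```python
# Camera usage derived from Table 5.1, assuming MB (Move Blind) is the only
# action that does not require the camera.

def camera_usage(actions_per_trial, mb_per_trial):
    return (actions_per_trial - mb_per_trial) / actions_per_trial

print(f"hand-coded policy: {camera_usage(153.33, 4.94):.1%}")   # ~96.8%, quoted as 96% above
print(f"Learning Agent:    {camera_usage(49.95, 11.41):.1%}")   # ~77.2%, quoted as 77% above
```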