TM Reward Managers

Overview

The reward managers in PRLT extend the base class RewardManager (core/src/RewardManager.cpp). The most important methods in this class hierarchy are:

  • Reset
  • GetReward

There are also methods to set the current and previous state configurations as well as the executed actions. Given the current state configuration and the executed action, this class computes the reward to be assigned to each agent through the GetReward method, which is called by every agent. Notice that this class is used in both single-agent and multi-agent environments.
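
To fix ideas, the interface can be pictured along the following lines (a minimal sketch, not the actual declaration in core/src/RewardManager.cpp; the setter names, their signatures, and the State and JointAction types are assumptions):

class RewardManager
{
public:
  virtual ~RewardManager () {}
  // Clears the per-trial bookkeeping before a new trial starts.
  virtual void Reset () = 0;
  // Computes the reward for the agent identified by rewardFunction,
  // writing it into rReward; returns true if the trial is finished.
  virtual bool GetReward (const string rewardFunction, float& rReward) = 0;
  // Setters for the information GetReward relies on (assumed names).
  virtual void SetCurrentState (const State& state);
  virtual void SetPreviousState (const State& state);
  virtual void SetExecutedActions (const JointAction& actions);
};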

Reset Method

The Reset method is mainly used to reset internal variables of the derived classes that are used during the reward computation. For example, in the GridMultiTokenRewardManager class, the Reset method resets the variables mSum and mTime, which are used to compute the reward values returned to each agent.
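
A minimal sketch of such an override might look as follows (illustrative only; the actual PRLT implementation may differ):

void GridMultiTokenRewardManager::Reset ()
{
  // Clear the per-trial accumulators used by the reward computation.
  mSum = 0;   // tokens collected so far in the current trial
  mTime = 0;  // time step counter within the current trial
}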

GetReward Method

This method is the core of a generic reward manager. It is called by each agent after the execution of the joint action (or the single action in a single-agent environment). Given a string representation of the agent reward function (usually an identifier specified in the XML reward configuration), it computes the reward to be assigned to that agent from the current state configuration and the joint action just executed by the agents. Notice that the reward value is returned through a variable passed by reference to this function. The function returns true if the current trial is finished (a condition that depends strongly on the environment under examination), false otherwise.
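
From the agent side, a call typically looks something like this (a hypothetical usage sketch; pRewardManager, mRewardFunction and UpdatePolicy are illustrative names, not the actual PRLT API):

float reward = 0.0f;
// mRewardFunction encodes the reward type and the agent number,
// e.g. "global" + Utils::xSeparator + "1".
bool trialFinished = pRewardManager->GetReward (mRewardFunction, reward);
UpdatePolicy (reward);       // assumed learning-update hook
if (trialFinished)
  pRewardManager->Reset ();  // prepare the manager for the next trial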

Reward Function Examples

In the following, some reward functions used in multi-agent environments are described.

Bar Problem

The Bar Problem reward manager, given the string representation of the actual reward type for each agent, computes the reward to be assigned to that agent with the following function:

bool BarRewardManager::GetReward (const string rewardFunction, float& rReward)
{
  // The reward-function string ends with the agent number (1-based),
  // preceded by the separator character.
  size_t pos = rewardFunction.find(Utils::xSeparator, 0);
  assert(pos != string::npos);
  unsigned agent = atoi(rewardFunction.substr(pos + 1).c_str());
  // Only the first caller per step updates the global statistics
  // and writes the log entries.
  if (agent == 1)
  {
    mNumCalls++;
    mGlobalReward = ComputeGlobalReward();
    LogWorldUtility();
    LogBarConfiguration();
  }
  // The per-agent reward is computed from the 0-based agent index.
  rReward = ComputeReward(agent - 1);
  // No trial termination condition is defined for this environment.
  return false;
}

Notice that the function above only extracts the number of the calling agent and logs the relevant information of the reward manager (reward values, world utility value, ...). The actual reward computation is done by the ComputeReward function. In this environment the function always returns false, because no trial termination condition is defined.
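
To make the division of labour concrete, a ComputeReward for this environment might look roughly as follows (purely illustrative; the attendance-based utility and the members mAttending, mAttendance and mCapacity are assumptions, not the actual PRLT code):

float BarRewardManager::ComputeReward (unsigned agent)
{
  // Illustrative sketch (requires <cmath> for exp): agents that stayed
  // home get no reward; attendees share a utility that decays once the
  // attendance exceeds the bar capacity.
  if (!mAttending[agent])          // assumed member: per-agent choice
    return 0.0f;
  double n = (double)mAttendance;  // assumed member: attendees this step
  return (float)(n * exp(-n / mCapacity));  // mCapacity: assumed constant
}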

Multi Agent Grid World with Tokens

The reward manager for this environment depends on the current grid configuration:

bool GridMultiTokenRewardManager::GetReward (const string rewardFunction, float& rReward)
{
  // Extract the number of the calling agent (1-based) from the
  // reward-function string, after the separator character.
  size_t pos = rewardFunction.find(Utils::xSeparator, 0);
  assert(pos != string::npos);
  unsigned agent_caller = atoi(rewardFunction.substr(pos + 1).c_str());
  InitVariables();
  // Tokens collected up to now and total tokens laid on the grid.
  mSum = ComputeSumOfTokensTakenUntilNow();
  mOverallSum = ComputeOverallSumOfAllTokens();
  // Only the first caller per step advances the time counter.
  if (agent_caller == 1)
  {
    mTime++;
  }
  rReward = ComputeReward(agent_caller - 1);
  // No trial termination condition is implemented here (see below).
  return false;
}

This reward manager computes the sum of the tokens taken up to the current time step and the sum of all the tokens laid on the grid. Given these two values and the agent number, the ComputeReward function computes the reward to be assigned to each agent according to its reward function. Notice that also in this case the function always returns false. For this environment, however, a natural termination rule can be defined: stop the simulation when all the tokens have been taken. Once such a rule is adopted, the function returns true as soon as it is satisfied.
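
As a sketch, this termination rule could be implemented by replacing the final return statement of GetReward (illustrative only; it assumes mSum and mOverallSum hold the values computed above):

  // Hypothetical termination rule: the trial ends once every token
  // laid on the grid has been collected.
  return (mSum >= mOverallSum);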