
WareWolf Environment Flow

As shown in the following diagram, the workflow is straightforward. It is divided into 4 phases:

  1. Communication during Night : Non-villager agents (aka wolves) cast a vote.
  2. Execution during Night : Non-villager agents execute their action; the wolves eat an agent.
  3. Communication during Day : Every agent casts a vote.
  4. Execution during Day : Every agent votes and the majority target is executed.

These four phases are managed by the is_night and is_com boolean flags. Moreover, each phase has a unique id which is fed as an observation to the model.

(Diagram: the four-phase flow of the environment.)
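
A minimal sketch of how the phase id can be derived from the two flags, assuming the ordering used by the phase observation below ([com night, night, com day, day]); the function name is illustrative:

```python
def phase_id(is_night, is_com):
    # phase ordering assumed: [com night, night, com day, day]
    if is_night:
        return 0 if is_com else 1
    return 2 if is_com else 3

assert phase_id(is_night=True, is_com=True) == 0    # 1. communication during night
assert phase_id(is_night=True, is_com=False) == 1   # 2. execution during night
assert phase_id(is_night=False, is_com=True) == 2   # 3. communication during day
assert phase_id(is_night=False, is_com=False) == 3  # 4. execution during day
```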

Shuffling

Un/shuffling is done exclusively at the start/end of each env.step() call.

In a multi-agent env, each agent has to keep its role for the whole training phase in order to effectively learn the best strategy for that role. This requires static ids, which would let agents map ids to roles, defeating the purpose of the game.

To avoid this behavior, a shuffling dictionary is randomly initialized at every env.reset() call. This dictionary maps each agent id to another random id without repetition. Each time env.step() comes to an end, the returned ids are shuffled using this dictionary. The inverse (unshuffle) dictionary is then applied to the action_dict at the start of each env.step() to restore the correct indices.
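
A minimal sketch of how such a shuffle/unshuffle pair can be built and applied (function and variable names here are illustrative, not the environment's actual helpers):

```python
import random

def build_shuffle_maps(num_players):
    # map each internal agent id to a distinct random external id
    ids = list(range(num_players))
    permuted = ids[:]
    random.shuffle(permuted)
    shuffle = dict(zip(ids, permuted))              # internal id -> external id
    unshuffle = {v: k for k, v in shuffle.items()}  # external id -> internal id
    return shuffle, unshuffle

shuffle, unshuffle = build_shuffle_maps(5)

# on the way out of env.step(): key the returned observations by the shuffled ids
internal_obs = {i: f"obs_{i}" for i in range(5)}
external_obs = {shuffle[i]: o for i, o in internal_obs.items()}

# on the way into env.step(): map the incoming action_dict back to internal ids
action_dict = {i: [0] for i in range(5)}
internal_actions = {unshuffle[i]: a for i, a in action_dict.items()}
```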

Rewards

A reward dictionary is initialized with [agent_id] = 0 at the start of the env.step() function. Most of the rewards come from the penalties entry in the CONFIG dictionary:

```python
penalties=dict(
    # penalty to give for each day that has passed
    day=-1,
    # when a player dies
    death=-5,
    # victory
    victory=+25,
    # loss
    lost=-25,
    # penalty used to punish votes that are not chosen during execution/kill.
    # If agent1 outputs [4] as a target and agent2 gets executed, then agent1
    # gets a penalty of trg_accord
    trg_accord=-1,
),
```
    
  

Target Rewards

As written in the Action Space section, an agent's action is a list of discrete values in range [0, env.num_player-1].

There is a direct correlation between an agent's output and its reward:

Meaningful vote

To incentivize agents to vote in a meaningful way, at the end of each execution phase the agent is penalized based on the index of the executed agent in its output vector.

Example
  • agent output for execution: [0,3,4,4,2,1,5,6,0,8]
  • Executed agent = 4
  • Weight for penalty: w = indexOf(executed, output) = 2 (notice how the first occurrence is selected).
  • Agent penalty: trg_accord * w
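
A minimal sketch of this penalty computation (the helper name is illustrative), reproducing the numbers above:

```python
def target_penalty(output, executed, trg_accord=-1):
    # weight = index of the first occurrence of the executed agent in the vote vector
    w = output.index(executed)
    return trg_accord * w

print(target_penalty([0, 3, 4, 4, 2, 1, 5, 6, 0, 8], executed=4))  # -> -2
```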

Everyone Reward

  • penalty of day after each full phase cycle (during the day, after the execution)
  • penalty of death if the agent dies (eaten or executed)
  • reward of victory if the agent's group wins (either wolves or villagers)
  • penalty of lost if the agent's group loses (either wolves or villagers)
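
A minimal sketch of how the group win/loss rewards could be assigned at the end of a match, using the penalties dictionary above (the function and role labels are illustrative):

```python
penalties = dict(day=-1, death=-5, victory=+25, lost=-25, trg_accord=-1)

def end_of_match_rewards(agent_roles, wolves_won):
    # agent_roles maps agent id -> "wolf" or "villager"
    rewards = {}
    for agent_id, role in agent_roles.items():
        agent_won = (role == "wolf") == wolves_won
        rewards[agent_id] = penalties["victory"] if agent_won else penalties["lost"]
    return rewards

print(end_of_match_rewards({0: "wolf", 1: "villager", 2: "villager"}, wolves_won=True))
# -> {0: 25, 1: -25, 2: -25}
```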

Metrics

To better understand the nature of the learning, various metrics were added. All of them can be found in env.initialize_info() and are as follows:

```python
suicide=0,   # number of times a player votes for itself
win_wolf=0,  # number of times the wolves win
win_vil=0,   # number of times the villagers win
tot_days=0,  # total number of days before a match is over
accord=0,    # number of agents that voted for someone who was not killed
```

Some of them can be normalized to be in range [0,1].

Action Space

The action space for each agent is the following: `gym.spaces.MultiDiscrete([self.num_players] * (self.signal_length + 1))`

Which can be seen as a vector with two components:

  1. The first element is the target that the agent wants to be killed. It is an integer in range [0, num_player-1].
  2. From the second element on, the vector is a sequence of integers in the signal range. They constitute the basis for the communication system.

Using the ray.rllib.MultiAgentEnv wrapper, the action object passed to the env.step() function is a dictionary mapping an agent id to a vector.
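
For instance, with num_players = 4 and signal_length = 2, the action dictionary could look like this (ids and values are made up for illustration):

```python
# first entry of each vector is the target vote, the rest is the communication signal
action_dict = {
    0: [2, 1, 3],  # agent 0 votes for agent 2 and signals [1, 3]
    1: [2, 0, 0],
    2: [3, 2, 1],
    3: [2, 1, 1],
}
```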

This vector is used as a voting preference. The first n target values are used to decide which player to eat/execute: when the time comes to kill an agent, the first n targets from every agent are counted and the most common one is chosen (if there is no single most common target, a random one is picked instead).
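
A minimal sketch of this majority rule using collections.Counter (the tie-breaking among equally common targets is one possible reading):

```python
import random
from collections import Counter

def choose_target(votes):
    # votes: flat list of target ids collected from the agents' preference vectors
    counts = Counter(votes)
    best_count = counts.most_common(1)[0][1]
    # keep every target that ties for the highest count, then pick one at random
    candidates = [t for t, c in counts.items() if c == best_count]
    return random.choice(candidates)

print(choose_target([2, 2, 3, 1, 2]))  # -> 2 (clear majority)
print(choose_target([1, 2]))           # -> 1 or 2, picked at random
```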

Observations

The observation space is an instance of gym.spaces.Dict with the following entries:

```python
# number of days passed
day=spaces.Discrete(self.max_days),

# index is the agent id, value is a bool for the agent being alive
status_map=spaces.MultiBinary(self.num_players),

# number in range of the number of phases [com night, night, com day, day]
phase=spaces.Discrete(4),

# targets is a vector, with one element output by each agent
targets=gym.spaces.Box(low=-1, high=self.num_players, shape=(self.num_players,), dtype=np.int32),

# signal is a matrix of dimension [num_players, signal_length]
signal=gym.spaces.Box(low=-1, high=self.signal_range - 1, shape=(self.num_players, self.signal_length),
                      dtype=np.int32),

# own id
own_id=gym.spaces.Discrete(self.num_players),
```

Most of the observations are straightforward but will be described nonetheless:

  • day: Discrete number counting the days passed during a match. It is in range [1, max_days] and is converted to a one-hot vector when fed to the model. A day passes when the last phase (number 4) is concluded and an agent has been executed.
  • status_map: MultiBinary vector of length num_players. Maps each agent, by index, to its alive status (0 dead, 1 alive).
  • phase: Discrete in range [0,3], mapped to a one-hot vector in the model.
  • targets: one element per agent; each element is an integer in range [0, num_players - 1].
  • signal: for each agent, a vector of length signal_length; each element is an integer in range [0, signal_range - 1].

PA Wrapper

To prevent the model from choosing invalid actions in the target vector, a parametric-action wrapper is used around the original environment.

The wrapper keeps the original observations, flattening them into a numpy array, and adds a boolean action mask of size [num_players, 1] which is then used in the model to mask out the logits of invalid actions.

This speeds up training and makes reward shaping for such invalid actions unnecessary.
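
A minimal sketch of this kind of logit masking, a common parametric-action pattern (not necessarily the exact code used in the wrapper or model here):

```python
import numpy as np

def mask_logits(logits, action_mask):
    # action_mask[i] == 1 if agent i is a valid target, 0 otherwise.
    # Invalid entries get a very large negative value, so that softmax
    # assigns them (numerically) zero probability.
    inf_mask = np.where(action_mask > 0, 0.0, np.finfo(np.float32).min)
    return logits + inf_mask

logits = np.array([0.3, 1.2, -0.5, 0.8], dtype=np.float32)
mask = np.array([1, 0, 1, 1], dtype=np.float32)  # agent 1 is dead, hence an invalid target
print(mask_logits(logits, mask))
```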

Eval Wrapper

An evaluation wrapper is used to implement logging, custom metrics and more. It is built around the PA Wrapper.

Complete Example

A diagram of an example game can be found here. The image shows how observations and actions are handled by the environment during a match.