
Evaluation methodology

When you submit an agent for evaluation, it is processed by what is essentially a black box, equipped with submission limits and delays. The main reason is to prevent agents from optimizing for the reward algorithm and to let you focus on the goal and the overall behavior of the agent.

There are, however, agent traits that you should strive for:

  • consistency
  • efficiency
  • stealthiness

What does this mean in practice?

Agent evaluation and scoring (aside from the final one) is done on your machines, separately for each scenario. Each evaluation consists of a given number of runs, and each run has a hard limit on the number of actions and on its duration (in virtual seconds). But don't worry - these limits are quite high and exist mostly to stop runs that are going nowhere.
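
For illustration, a local evaluation run with such limits could be sketched as follows; the limit values and the dummy environment are assumptions for the sake of a runnable example, not the actual framework API.

```python
import random

# Hypothetical sketch of one evaluation run with hard limits; the limit
# values and the dummy environment are illustrative, not the real API.
MAX_ACTIONS = 1_000           # assumed cap on actions per run
MAX_VIRTUAL_SECONDS = 10_000  # assumed cap on run duration in virtual time

class DummyEnv:
    """Stand-in environment: each action consumes some virtual time."""
    def __init__(self):
        self.done = False

    def step(self, action):
        if random.random() < 0.001:
            self.done = True             # pretend the goal was reached
        return random.uniform(0.5, 5.0)  # virtual seconds elapsed

def run_once():
    env, actions, virtual_time = DummyEnv(), 0, 0.0
    while not env.done:
        if actions >= MAX_ACTIONS or virtual_time >= MAX_VIRTUAL_SECONDS:
            break                        # cut off a run that is going nowhere
        virtual_time += env.step("some_action")
        actions += 1
    return actions, virtual_time
```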

Once a set of evaluation runs is finished, the run data is sent to our servers for analysis. The final score is then calculated according to the following general approach:

Positive rewards

  • As each scenario variant has a fixed attack graph, there is a clear path that leads from the start to the goal. For every correct step on this path, the agent's score is increased.

Negative rewards

The score is decreased (sometimes substantially, sometimes only a little) whenever the following happens:

  • The agent fails to finish an episode of the run.
  • The agent misrepresents its results (when sending a signal to the environment).
  • The agent repeats the same actions with the same parameters (e.g., repeated scanning of the same network).
  • The agent fails deterministic actions (e.g., applying the wrong exploit).
  • The agent is detected or blocked by the active defenders in the scenario (there is a difference between being initially blocked and trying to act when already blocked).

The score is also decreased relative to the number of messages the agent sends, i.e., the fewer messages, the better the score.
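
To make the structure concrete, a per-episode score along these lines could look like the sketch below. All constants (STEP_REWARD, PENALTIES, MESSAGE_COST) are made-up placeholders, since the actual factors and weights are not disclosed.

```python
# Hypothetical per-episode scoring sketch; every weight here is invented
# for illustration -- the real factors and weights are not published.
STEP_REWARD = 10.0        # reward for each correct step on the attack graph
PENALTIES = {
    "unfinished_episode": 50.0,
    "misreported_result": 30.0,
    "duplicate_action": 5.0,
    "failed_deterministic_action": 5.0,
    "detected_or_blocked": 20.0,
}
MESSAGE_COST = 0.1        # every message sent costs a little

def episode_score(correct_steps, events, messages_sent):
    score = STEP_REWARD * correct_steps
    score -= sum(PENALTIES[e] for e in events)   # events: list of penalty keys
    score -= MESSAGE_COST * messages_sent
    return score
```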

Final calculation

Once the score is determined for each episode, a weighted average is calculated for the final score, with an additional penalty applied when the episode scores differ too much from one another.
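
As a rough sketch, assuming it is the spread of episode scores that gets penalized, the final calculation might resemble the following; the weights and the spread penalty are illustrative assumptions, not the actual formula.

```python
# Hypothetical final-score sketch: weighted average of episode scores with an
# extra penalty when the scores vary too much between episodes.
def final_score(episode_scores, weights, spread_penalty=1.0):
    weighted_avg = sum(w * s for w, s in zip(weights, episode_scores)) / sum(weights)
    spread = max(episode_scores) - min(episode_scores)  # assumed measure of inconsistency
    return weighted_avg - spread_penalty * spread
```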

Comments

While you are not given the exact factors and weights, you can infer the impact of some of the negative rewards. Take, for example, the difference in the heuristic agent's performance between scenario no. 2 and no. 3. The third scenario introduces an active defender that can block messages after a certain number have been sent. While the heuristic agent is able to discover the blocking duration and thus avoid refreshing the block, it makes no attempt to find out what the message-sending threshold is and resends blocked actions whenever the block expires. In effect, it often duplicates actions and sends a needless number of messages, so its score takes a nosedive. (On the other hand, this is your opportunity to shine.)
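
One simple way to avoid the duplicated actions described above is to remember what has already been sent. A minimal, hypothetical sketch follows; the action representation is a placeholder for whatever the framework actually uses.

```python
# Minimal deduplication sketch; the (action_type, params) representation is a
# hypothetical stand-in for the framework's actual action objects.
sent_actions = set()

def should_send(action_type, params):
    key = (action_type, tuple(sorted(params.items())))
    if key in sent_actions:
        return False          # already tried -- repeating it only hurts the score
    sent_actions.add(key)
    return True
```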