Minimax solves for optimal opponent play, minimizing the best move an opponent could make. But what if we knew the opponent wouldn’t make the best move? What if we knew what the opponent would do ahead of time? In that case, we could beat them faster by playing moves which take advantage of this fact, even if our move isn’t objectively the best move. Don’t play the game, play the man.

For the purposes of this problem let’s assume it is a chess algorithm but this concept can be generalized to any algorithm which can use minimax. Anyone who has played a game against another person understands the idea of “downloading” your opponent: playing against a person enough that you feel like you know what they will do. Hikaru Nakamura does.

It seems like this type of player would have many possible applications, including training for specific skill levels and even trainers personalized to an individual’s playing style.

## Prior work

There has been previous work in terms of opponent modeling in chess. Maia chess has been trained to model several Elo levels of opponent play which makes it ideal to be used as a backbone in my experiments.

## The Policy Function

We can think of the policy function as a model of the opponent. We will therefore be optimizing an algorithm to play against this model, rather than the optimal opponent.

The specifics of the policy function are as follows. The function takes two parameters, a board position B and an opponent skill level L. The parameter L is important because the nature of the opponent’s suboptimal play is important. All levels of chess players play sub-optimally, but the types of mistakes a beginner or intermediate-level player will make are different from a grandmaster. The algorithm should exploit this fact. In practice, this algorithm should play in a much more “trappy” way than the methodical slow grind of a typical chess engine.

Formally, at move $i$ there will be $n_i$ possible moves. For each of the $n_i$ moves, we should calculate an evaluation weighted by the opponent’s probability of playing responses to that move:

$e_i(m_i) = p_L(M_{i+1}) \cdot r(M_{i+1})$

Where $M_{i+1}$ is the set of possible moves the opponent can make in response to move $m_i$, $p_L$ is the probability distribution of each move in $M_i$ being made, and $r$ is some real evaluation of the board position.

Let’s use an example. Suppose that at some stage in the game the algorithm must choose between 2 possible moves. For simplicity, the opponent will have 4 possible replies after each of the two possible moves. After the algorithm makes the the first possible move, the probabilities look like this:

### Opponent move distribution at position 1

Opponent Move Probability of play Resulting position evaluation
1 0.6 5
2 0.3 1
3 0.1 0.5
4 0.1 -1

$\begin{bmatrix} 0.6 \\ 0.3 \\ 0.1 \\ 0.1 \end{bmatrix} \cdot \begin{bmatrix} 5 \\ 1 \\ 0.5 \\ -1 \end{bmatrix} = 3.25$

### Opponent move distribution at position 2

And after the second possible move, the opponent move distribution looks like this:

Opponent Move Probability of play Resulting position evaluation
1 0.9 4
2 0.05 0.5
3 0.03 0
4 0.02 -2

$\begin{bmatrix} 0.9 \\ 0.05 \\ 0.03 \\ 0.02 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ 0.5 \\ 0 \\ -2 \end{bmatrix} = 3.585$

In these tables positions are evaluated from the algorithm’s point of view, so higher evaluation=better. Assume the evaluation function is a standard evaluation from a chess engine e.g. Stockfish.

In this example it is clear that the algorithm should choose move 2, because it is highly likely that after making this move the algorithm will have a decisive advantage.

## Checkmates

All of this works quite well in the middle-game, where position evaluations are all within a reasonable range. But what happens when one of the opponent’s moves leads to checkmate? How much value should we give to the possibility that an opponent might blunder into checkmate? The above calculation only works for finite numbers. The evaluation of a checkmate is +/- infinity. Instead, let’s choose a large, finite number for checkmate.

The behavior of the algorithm depends heavily on our choice for this checkmate evaluation, the CHECKMATE_WEIGHT. If the number is too high, the algorithm will favor positions where the opponent might get checkmated, playing risky moves. If the number is too low, the algorithm will choose other branches which win less quickly. After a bit of trial and error, I settled on +10 for checkmate. This is enough for the algorithm to seek out these types of positions.

A further question is what to do once checkmate is inevitable. That is, what should the algorithm do once it will definitely win? One option is to continue to sample from the distribution, continuing to play the move which maximizes winning chances. At the beginning, this was my intuition. Unfortunately, this leads to many draws by repetition.

Instead, once the game is in a forced-win situation (“mate in N moves”) the algorithm will stop sampling from the distribution and simply play the moves which lead to checkmate. While this will inevitably lead to a win, it changes the character of the algorithm somewhat. Therefore, I have limited the lookahead to checkmates in the immediate next few moves (specifically, mate in 5 or fewer moves). Since the original goal was to create an algorithm which is used for training, this seems reasonable.

## Games

I simulated 400 matches testing this opponent modeling approach, which I wrote about here.