dav1312 edited this page Feb 6, 2023 · 61 revisions

Interpretation of the Stockfish evaluation

The evaluation of a position that results from search has traditionally been measured in pawns or centipawns (1 pawn = 100 centipawns). A value of 1 implied a 1 pawn advantage. However, with engines being so strong, and the NNUE evaluation being much less tied to material value, a new scheme was needed. The normalized evaluation is now linked to the probability of winning, with a 1.0 pawn advantage corresponding to a win probability of 0.5 (that is, 50%). An evaluation of 0.0 means equal chances of a win or a loss, but also a nearly 100% chance of a draw.

Some GUIs will be able to show the win/draw/loss probabilities directly, when the UCI_ShowWDL engine option is set to True.
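For readers without such a GUI, the option can be enabled over UCI with `setoption name UCI_ShowWDL value true`, after which the engine's `info` lines carry a `wdl` field with three per-mille values (win, draw, loss). The following is a minimal sketch of reading that field; `parse_wdl` is a hypothetical helper and the sample line is made up for illustration, not real engine output.

```python
def parse_wdl(info_line):
    """Return (win, draw, loss) as fractions, or None if the line has no wdl field."""
    tokens = info_line.split()
    if "wdl" not in tokens:
        return None
    i = tokens.index("wdl")
    # The three integers after "wdl" are per-mille values.
    win, draw, loss = (int(t) for t in tokens[i + 1:i + 4])
    return win / 1000, draw / 1000, loss / 1000


# Made-up sample line, shaped like a UCI "info" line with UCI_ShowWDL enabled.
sample = "info depth 20 score cp 35 wdl 120 870 10 nodes 1000000 pv e2e4 e7e5"
print(parse_wdl(sample))
```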

The full plots of win, loss and draw probability are given below. From these probabilities, one can also obtain the expected match score.

[Figures: win, draw and loss probabilities; expected match score]

The probability of winning or drawing a game, of course, depends on the opponent and the time control. In bullet games the draw rate will be lower, and against a weak opponent even a negative score could result in a win. These graphs have been generated from a model derived from Fishtest data for Stockfish playing against Stockfish (so an equally strong opponent), at 60+0.6s per game. The curves are expected to evolve: as the engines get stronger, an evaluation of 0.0 will approach the 100% draw limit. These curves are for SF15.1 (Dec 2022).
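The step from win/draw/loss probabilities to an expected match score is simple arithmetic: a win counts as 1 point, a draw as half a point, a loss as 0. A minimal sketch follows; `expected_score` is a hypothetical helper, and the input numbers are illustrative rather than read off the Stockfish curves.

```python
def expected_score(win, draw, loss):
    """Expected points per game from win/draw/loss probabilities (or counts)."""
    total = win + draw + loss
    if total <= 0:
        raise ValueError("win, draw and loss must sum to a positive value")
    # Win = 1 point, draw = 0.5 points, loss = 0 points.
    return (win + 0.5 * draw) / total


# Illustrative only: a 50% win probability with the remainder split
# between draws and losses (this split is an assumption).
print(expected_score(0.50, 0.49, 0.01))
```

The same function accepts per-mille values or raw game counts, since it normalizes by the total.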


Optimal settings

To get the best possible evaluation or the strongest move for a given position, the key is to let Stockfish analyze long enough, using a recent release (or development version), properly selected for the CPU architecture.

The following settings are important as well:

Threads

tl;dr: Maximum - (1 or 2 threads).

Set the number of threads to the maximum available, possibly leaving 1 or 2 threads free for other tasks. SMT or Hyper-threading is beneficial, so normally the number of threads available is twice the number of cores available. Consumer hardware typically has at least 4-8 threads; Stockfish supports hundreds of threads.

More detailed results on the efficiency of threading are available.

Hash

tl;dr: Maximum - (1 or 2 GiB RAM).

Set the hash to nearly the maximum amount of memory (RAM) available, leaving some memory free for other tasks. The Hash can be any value, not just powers of two. The value is specified in MiB, and typical consumer hardware will have several GiB of RAM. For a system with 8 GiB of RAM, one could use 6000 as a reasonable value for the Hash.

More detailed results on the cost of too little hash are available.

MultiPV

tl;dr: 1.

A higher value weakens the quality of the best move computed, as resources are used to compute other moves.

More detailed results on the cost of MultiPV are available.
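The three recommendations above can be turned into UCI `setoption` commands roughly as follows. This is a sketch under stated assumptions: `suggested_options` and its parameters are hypothetical, the amount of installed RAM is passed in by the caller rather than detected, and the reserves (1 thread, 2048 MiB) are just one reading of "leave 1 or 2 free".

```python
import os


def suggested_options(total_ram_mib, logical_cpus=None,
                      reserve_threads=1, reserve_ram_mib=2048):
    """Build UCI setoption commands following the tl;dr rules above."""
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs (it may return None).
        logical_cpus = os.cpu_count() or 1
    threads = max(1, logical_cpus - reserve_threads)
    # Hash is given in MiB; never go below 1 MiB (arbitrary floor for the sketch).
    hash_mib = max(1, total_ram_mib - reserve_ram_mib)
    return [
        f"setoption name Threads value {threads}",
        f"setoption name Hash value {hash_mib}",
        "setoption name MultiPV value 1",
    ]


# Example: a machine with 8 GiB of RAM and 8 logical CPUs.
for command in suggested_options(8 * 1024, logical_cpus=8):
    print(command)
```

For the 8 GiB example this yields a Hash of 6144 MiB, in line with the value of about 6000 suggested above.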


The Elo rating of Stockfish

"What is the Elo of Stockfish?": A seemingly simple question, with no easy answer. First, the obvious: it is higher than any human Elo, and when SF 15.1 ranked with more than 4000 Elo on some rating lists, YouTube knew.

To answer the question in more detail, some background info is needed. In its simplest form, the Elo rating system predicts the score of a match between two players, and conversely, a match between two players will give information about the Elo difference between them. The Elo difference will depend on the conditions of the match. For human players, the time control (blitz vs classical TC) or the variant (standard chess vs Fischer random chess) are well-known factors that influence the Elo difference between the two players. Needless to say, one needs sufficiently many games to confidently measure the Elo difference or match score. Finally, given an Elo difference between two players, one needs to know the Elo rating of one of them to know the Elo rating of the other. More generally, one needs an anchor or reference within a group of opponents, and if that reference is different in different groups, the Elo numbers cannot be compared.

The same observations hold for computing the Elo rating of Stockfish, with caveats related to the fact that engines play very high-level chess, and are able to draw the majority of games between engines of similar strength. From the starting position or any other very balanced condition, a draw rate of 100% has essentially been reached between top engines, especially at rapid or longer TCs, even more so on powerful hardware. This results in small Elo differences between top engines, e.g. a +19 -2 =79 match score is a convincing win, but a small Elo difference. Carefully constructed books of starting positions that have a clear advantage for one side can reduce that draw rate significantly and increase Elo differences. The book used in the match is thus an important factor in the computed Elo difference. Similarly, the pool of opponents and their ranking has a large impact on the Elo rating, and Elo ratings computed with different pools of opponents can hardly be compared, especially if weaker (but different) engines are part of that pool. Finally, in order to accurately compute Elo differences at this level, a very large number of games (typically tens of thousands of games) is needed, as small samples of games (independent of the time control) will lead to large relative errors.
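To make the link between match score and Elo difference concrete, the standard logistic Elo formula can be inverted: an expected score s corresponds to a difference of -400 * log10(1/s - 1). The sketch below applies this to the +19 -2 =79 example; `elo_difference` is a hypothetical helper, and it gives only a point estimate with no error bars.

```python
import math


def elo_difference(wins, losses, draws):
    """Point estimate of the Elo difference implied by a match result."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if not 0.0 < score < 1.0:
        # A 0% or 100% score implies an unbounded Elo difference.
        raise ValueError("score must be strictly between 0 and 1")
    # Inverse of the logistic Elo expectation s = 1 / (1 + 10 ** (-d / 400)).
    return -400.0 * math.log10(1.0 / score - 1.0)


# The +19 -2 =79 match: score 58.5%, roughly 60 Elo.
print(round(elo_difference(19, 2, 79)))
```

With only 100 games the uncertainty on this estimate is large, which is why many more games are needed for accurate measurements.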

Having introduced all these caveats, accurately measuring Elo differences is central to the development of Stockfish, and our Fishtest framework constantly measures with great precision the Elo difference of Stockfish and its proposed improvements. These performance improvements are accurately tracked over time on the regression testing wiki page. The same page also links to various external websites that rank Stockfish against a wide range of other engines.

Finally, rating Stockfish on a human scale (e.g. FIDE Elo) has become an almost impossible task, as strength differences between engines and humans are now so large that they can hardly be measured. After all, this would require a human to play Stockfish for long enough to have at least a handful of draws and wins.


Stockfish crashed

Stockfish may crash if fed incorrect FENs, or FENs with illegal positions. Full validation code is complex to write, and within the UCI protocol there is no established mechanism to communicate such an error back to the GUI. Therefore Stockfish is written with the expectation that the input FEN is correct.

On the other hand, the GUI must carefully check FENs. If you find a GUI through which you can crash Stockfish or any other engine, then by all means report it to that GUI's developers.
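As an illustration of the kind of check a GUI might run before passing a FEN to the engine, here is a deliberately partial sketch. `fen_looks_sane` is a hypothetical helper that only tests the board layout, king counts and side to move. It assumes all six FEN fields are present and does not detect illegal positions (for example, the side not to move being in check), so it is not a substitute for full validation.

```python
def fen_looks_sane(fen):
    """Cheap structural checks on a FEN string; not a full legality test."""
    fields = fen.split()
    if len(fields) != 6:
        return False
    board, side = fields[0], fields[1]
    ranks = board.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)      # run of empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1            # one piece
            else:
                return False            # unknown character
        if squares != 8:
            return False
    # Exactly one king per side.
    if board.count("K") != 1 or board.count("k") != 1:
        return False
    return side in ("w", "b")


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(fen_looks_sane(start))
```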


Does Stockfish support chess variants?

The official Stockfish engine only supports standard chess and Chess960, also known as Fischer Random Chess (FRC). However, various forks based on Stockfish support variants, most notably the Fairy-Stockfish project.


Can Stockfish use my GPU?

No, Stockfish is a chess engine that uses only the CPU for chess evaluation. Its NNUE evaluation (see this in-depth description) is very effective on CPUs. With extremely short inference times (sub-microsecond), this network cannot be efficiently evaluated on GPUs, in particular with the alpha-beta search that Stockfish employs. However, for training networks, Stockfish employs GPUs with effective code that is part of the NNUE pytorch trainer. Other chess engines require GPUs for effective evaluation, as they are based on large convolutional or transformer networks, and use a search algorithm that allows for batching evaluations. See also the Leela Chess Zero (Lc0) project.
