I found and exterminated a bug that caused MadChess to crash in about one quarter of one percent of games. I noticed the engine crashes on line 6 of this code in the Board.IsMoveLegal(ref ulong Move)
method.
First Hypothesis : Illegal Move Causing King Capture?
I suspected the crash was caused by kingSquare == Square.Illegal
(meaning king not on any square) due to a missing king. Perhaps a move two ply earlier exposed its own king to check, MadChess erroneously considered the move legal, then captured the king the previous ply? If that were true, in the current ply, when MadChess tests whether a pseudo-legal move is legal (doesn’t leave its own king in check) it would crash due to an IndexOutOfRange exception inside the IsSquareAttacked(int Square)
method when calling pawnAttackMask = BlackPawnAttackMasks[Square];
(or the white pawn equivalent code). The value of Square.Illegal
is 64. The black pawn attack masks array holds 64 values (max index == 63) so attempting to retrieve a value at the 64th index causes an exception.
Logging
To test my hypothesis, I added code that logs the entire sequence of moves leading to a position missing a king. After a few hundred games MadChess logged this bizarre sequence.
The abbreviations in the Played Move line above describe aspects of the move stored in bits of a ulong
type. The abbreviations and their bit ranges are explained in the following code comment.
In the above position, the black king moves from g7 to h8 and promotes to a pawn. What the hell? A few hundred games later, MadChess logged another bizarre sequence.
The white king on f5 transforms into a pawn when it moves to e6. What’s going on here?
Kings are transforming into pawns. What’s going on here?
Second Hypothesis : Zobrist Hash Key Collision?
Well, the logs contradict my hypothesis that a previous move was erroneously considered legal, allowing capture of a king on the following move. No, the problem here is not capture of a king. The problem is a king transforming into a pawn. Furthermore, the logs demonstrate while the problem is related to promotion (nonsensically to a pawn) it is not confined to pawns promoting or to moves to the eighth rank. In the second example a king transforms into a pawn when moving to the sixth rank.
In the first log, I noticed the illegal move that caused the engine to crash was marked as a best move (B = True). This led me to believe a Zobrist hash key collision caused Cache.GetPosition(ulong Key)
to return an invalid position to the search, and the engine, due to insufficient move validity testing, blindly played the cached position’s best move (illegal in the actual position on the board), corrupting board state (removing a king from the board). This would explain the rarity of the bug (1/4 of 1% of games) because Zobrist hash key collisions (where two different board positions hash to the same ulong
value) are highly unlikely. So I enhanced the validity testing done prior to playing any move not generated by the engine (from the move Cache
, from a PGN file, from a position
UCI command, etc).
The ValidateMove(ref ulong Move)
code below serves two purposes.
- It asks can this piece reasonably move from here to there? Though as I say in a code comment, it’s not an exhaustive examination.
- It sets aspects of the move relied upon by downstream code, such as
PrioritizeMoves(Position Position, ulong[] Moves, ...
) andBoard.PlayMove(ulong Move)
.
The engine that crashed already included move validity testing. If a cached position contains an invalid best move, the best move is not played. Instead, the engine generates moves. I improved the move validity code by more closely examining pawn promotions, king moves, and captures of kings- the last one obviously illegal. I ran another tournament with a version of the engine that included missing king logging and the following move validity code.
If you’re paying close attention you’ll know the enhanced move validity testing did not prevent the engine from crashing. The second log disproves the Zobrist hash collision / insufficient move validity testing hypothesis. In the second log, the illegal move that caused the engine to crash was not marked as a best move (B = False), meaning it did not come from the cache. So what’s causing the crash?
Third Hypothesis : Bitwise Overflow?
I noticed both crashes involved the engine playing a move that, nonsensically, promotes to a pawn. In both cases the Promo bit range was set to 1, indicating Promo = P, white pawn. Pieces are encoded as follows.
Hmm, that’s a strange coincidence. Recognizing the second log indicates the bug is related to move generation (not cached moves) and thinking of the programming adage, “Debugging is like being the detective in a crime movie where you are also the murderer,” I resisted the temptation to blame the GUI and contemplated how I could have killed my own chess engine?
Debugging is like being the detective in a crime movie where you are also the murderer.
Well, the Board.PlayMove(ulong Move)
method simply plays the move it is given, trusting it’s a valid move. If it promotes a piece to a pawn it does so because it was “told” to do it. Knowing this, I examined ulong
move encoding, focusing on bits near the Promo bit range. What’s immediately beneath (to the right, less signifcant bits) the Promo bit range? Killer Move. Is it possible code setting the Killer Move aspect overflows its bit range and spills into the Promo range?
No, not possible. Killer Move can have only three values: 2, 1, or 0, all of which fit inside 2 bits (2 Pow (50 – 49 + 1) = 4). What’s beneath Killer Move? Move History. Is it possible code setting the History aspect overflows its bit range and spills into the Killer Move and Promo bit ranges?
Yes.
No code restricts Move History to an upper limit. I didn’t bother restricting move history scores to a defined integer range because I thought 27 bits (2 Pow (48 – 22 + 1) = 134,217,728) would accommodate any history score likely to occur in the engine. Perhaps even at bullet time control this limit is exceeded? Both crashes did occur late in a game: at ply 305 (move 153) in the first log and ply 174 (move 87) in the second log. Perhaps in the course of a game a move history score became so large it required an additional three bits (three more doublings to a value >= 1,073,741,824) to accomodate it, spilling into the first bit of the Promo aspect, setting it to 1, indicating promote to white pawn?
Bug Fix
To test this hypothesis (and prevent the engine from crashing) I began to write code that explicitly tests- prior to setting the move history bits- if the move history score exceeds 134,217,728. Then I realized I don’t have to explicitly test a limit. The engine already employs a technique that ensures values fit within a defined range: a bit mask. So I applied the move history mask when setting its value (line 9).
This fixed the bug. In 4,000 games at bullet time control, MadChess never crashed. The engine searches moves at a rate of about 5,000,000 per second. If the average game at 2 min + 1 sec / move time control lasts 60 moves, that’s 180 seconds per game x 5,000,000 moves searched per second x 4,000 games = 3.6 trillion moves searched without encountering a kingless board position.
Preventing Bitwise Overflow
Why does the bit mask fix the bug? Let me illustrate the problem by aligning move bits with history score bits. If the history score has reached 1,073,741,823, its binary representation is 111111111111111111111111111111. (The problem actually occurs earlier, when history == 536,870,912 decimal == 100000000000000000000000000000 binary. But let’s use a binary value consisting of all ones to make the alignment examples below easier to read.) When the history score is shifted then “or’d” into the ulong
move via Move |= ((ulong)history << _historyShift);
, the upper bits of the history score overwrite bits in the Killer and Promo ranges as illustrated below.
If we first apply ("and") the history mask to the shifted history score, then "or" it into the ulong
move via Move |= ((ulong)history << _historyShift) & _historyMask;
, the mask prevents any upper bits of the history score from overwriting bits in the Killer and Promo ranges as illustrated below.
- H means "History Score"
- & means "bitwise and"
- = means "the result"
Align the result from line 3 above to the ulong
move and you'll notice in line 11 the upper bits of the history score have been truncated (effectively limiting the score to a maximum value of 1,073,741,823) and do not overwrite bits in the Killer and Promo ranges.
Hmm... Now I'm wondering how the hell history score can reach 500 million+ merely by incrementing by the square of the distance to the horizon? Does MadChess not clear history scores at the beginning of a new game? Does the CuteChess GUI not send the ucinewgame
command when starting games in a tournament? Or are these high history scores to be expected in drawn positions (like KRkr) where the engine can reach depth 64 (incrementing by 4,096 each time a beta cutoff occurs)? I'll investigate those questions later.
Because a move's history value is set in the Search.PrioritizeMoves(Position Position, ulong[] Moves...)
method, I'll categorize this as a Search update. Due to the rarity of the bug, the fix did not have any appreciable effect on engine strength. MadChess remains 2513 Elo at bullet time control.
Hopefully you found this blog post helpful. Happy chess programming!
See my Engine Crash Detective Story post on the TalkChess forum for a discussion of this blog post.
Feature | Category | Date | Commit1 | WAC2 | Elo Rating3 | Improvement |
---|---|---|---|---|---|---|
Crash Bug | Search | 2020 Aug 29 | 2d855ec | 288 | 2513 | +0 |
King Safety | Evaluation | 2020 Aug 16 | 6794c89 | 288 | 2513 | +63 |
Eval Param Tuning | Evaluation | 2020 Jul 23 | bef88d5 | 283 | 2450 | +30 |
Late Move Pruning | Search | 2020 Feb 08 | 6f3d17a | 288 | 2420 | +29 |
Piece Mobility | Evaluation | 2020 Feb 01 | 5c5d4fc | 282 | 2391 | +62 |
Passed Pawns | Evaluation | 2018 Dec 27 | 103 | 279 | 2329 | +119 |
Staged Move Generation | Search | 2018 Dec 15 | 93 | 275 | 2210 | +39 |
History Heuristics | Search | 2018 Dec 03 | 84 | 275 | 2171 | +28 |
Eval Param Tuning | Evaluation | 2018 Nov 24 | 75 | 272 | 2143 | +47 |
Sophisticated Search Material and Piece Location |
Baseline | 2018 Nov 08 | 58 | 269 | 2096 | 0 |
- GitHub commit (hash) or Subversion source code revision (integer)
- Win At Chess position test, 3 seconds per position
- Bullet chess, 2 min / game + 1 sec / move
i>Bullet chess, 2 min / game + 1 sec / move