Statistical Methods and Algorithms in Fishtest
This document outlines the core statistical models, testing methodologies, and algorithms employed by Fishtest for chess engine evaluation and parameter tuning.
Engine Match Modeling
Fishtest utilizes advanced models to accurately represent the outcome probabilities in chess engine matches:
- Pentanomial Model:
- Fishtest employs the pentanomial model for analyzing engine match results.
- This model considers five outcomes (e.g., ll, ld, dd/wl, wd, ww), providing a more accurate representation than the traditional trinomial model (win, draw, loss).
- Benefit: Leads to a substantial saving of testing resources compared to the trinomial model.
- Opening Book Bias Estimation:
- The difference between the results predicted by the pentanomial and trinomial models allows for an estimation of the Root Mean Square (RMS) value of biases present in the opening book.
- This calculation is based on the accounting identity [archive].
Sequential Testing Framework (GSPRT)
For efficient sequential testing (determining if a change is statistically significant as quickly as possible), Fishtest uses the Generalized Sequential Probability Ratio Test (GSPRT):
- Core Principle (GSPRT):
- Fishtest uses the GSPRT, a generalization of the standard SPRT.
- In GSPRT, unknown parameters within the statistical hypotheses (H0 and H1) are replaced by their maximum likelihood estimates.
- It's shown (loc. cit.) that GSPRT asymptotically behaves like SPRT when the error probabilities approach zero.
- The asymptotic guarantees of GSPRT do not depend on the Elo differences being tested being small.
- Normalized Elo for Bounds:
- Elo Estimation from GSPRT Results:
- To estimate Elo differences from completed GSPRT tests, Fishtest uses formula (6.1) in this document [archive].
- The underlying model approximates the GSPRT process as a random walk, specifically using formula (2.1) from this document [archive].
- This approximation essentially treats the GSPRT like an SPRT for the drift of a Brownian motion, where the infinitesimal variance is estimated from the sample data.
- SPRT Calculator and Resource Estimation:
- The web-based SPRT calculator and the resource consumption estimation tables rely on formulas detailed in this document [archive].
Parameter Tuning
Fishtest uses the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm for parameter tuning process:
- Normalized parameter space:
- Users choose per‑axis
c_i
soθ_i ± c_i Δ_i
gives a small, measurable Elo gap (≈ a few Elo); this is equivalent to use normalized parametersφ_i = θ_i / c_i
with the Elo gap computed withφ_i ± Δ_i
. - Gradient signal.
wins − losses
over the task supplies the two‑sided finite‑difference signal; for small Elo gaps,g_φ_i ≈
wins − losses· Δ_i
(constant coefficients absorbed intor_i
). - In φ-space the learning rate
r_i
has the same magnitude order, we can use a global learning rate value. - Mapped back to the θ-space, the θ‑update is
Δθ_i = r_i · c_i · g_φ_i
. - Newton‑like behavior. Per‑axis
c_i
encodes axis scale/curvature, yielding diagonal quasi‑Newton steps in θ (larger on flatter axes, smaller on steeper ones) even while tuning with a global learning rate.
- Users choose per‑axis
- Parallel/out‑of‑order updates:
- Many heterogeneous workers run in parallel.
- Each task is tied to its dispatch snapshot and applies its update on arrival (order‑agnostic).
Quality Control
To ensure the reliability of test results contributed by distributed workers, Fishtest uses a χ2 (Chi-squared) test to detect statistically anomalous workers whose results deviate significantly from the norm.
Validation and Simulation
Many of the formulas and statistical models used in Fishtest have been validated through simulation: