📚 Methodology

How the model works

A Bayesian hierarchical Dixon-Coles bivariate Poisson model, fitted on more than three decades of international football, simulated forward 50,000 times. This page explains every layer of the pipeline, the data, the calibration, and where it can still improve.

The model in one paragraph

For each match between teams h (home) and a (away), we model the number of goals scored by each side as Poisson with team-specific rates. Each team has a latent attack parameter and a latent defense parameter; goal rates are exponentials of the appropriate combination plus a global home-advantage bonus and intercept:

log(λ_home) = α + att[h] − def[a] + γ · (1 − is_neutral)
log(λ_away) = α + att[a] − def[h]
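As a minimal sketch of the two rate equations above (the function name and the parameter values are illustrative assumptions, not fitted estimates or the production code):

```python
import math

def goal_rates(alpha, att_h, def_h, att_a, def_a, gamma, is_neutral):
    """Return (lambda_home, lambda_away) from the log-linear rate equations."""
    # Home advantage gamma only applies when the venue is not neutral.
    log_home = alpha + att_h - def_a + gamma * (1 - int(is_neutral))
    log_away = alpha + att_a - def_h
    return math.exp(log_home), math.exp(log_away)

# Made-up parameter values, purely for illustration.
lam_h, lam_a = goal_rates(alpha=0.1, att_h=0.4, def_h=0.3,
                          att_a=0.0, def_a=-0.1, gamma=0.22, is_neutral=False)
```

On a neutral ground the γ term vanishes, so two teams with identical att and def get identical goal rates.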

The two scorelines are not independent at the low end. International football shows small but persistent deviations in the frequency of 0-0, 1-1, 1-0 and 0-1 results compared to what independent Poisson predicts (typically more low-scoring draws and fewer one-goal wins). Dixon & Coles (1997) handle this with a correction factor τ that adjusts only the four lowest-score cells:

P(X=x, Y=y) ∝ τ(x, y; λ_h, λ_a, ρ) · Pois(x | λ_h) · Pois(y | λ_a)
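The page does not spell out τ, so the sketch below assumes the standard form from Dixon & Coles (1997). The function names, the truncation at 10 goals, and the example values of λ and ρ are illustrative assumptions.

```python
import math

def pois_pmf(k, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def tau(x, y, lam_h, lam_a, rho):
    """Dixon-Coles low-score correction; equals 1 outside the four lowest cells."""
    if x == 0 and y == 0:
        return 1.0 - lam_h * lam_a * rho
    if x == 0 and y == 1:
        return 1.0 + lam_h * rho
    if x == 1 and y == 0:
        return 1.0 + lam_a * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def scoreline_matrix(lam_h, lam_a, rho, max_goals=10):
    """probs[x][y] = P(home scores x, away scores y), renormalised after truncation."""
    raw = [[tau(x, y, lam_h, lam_a, rho) * pois_pmf(x, lam_h) * pois_pmf(y, lam_a)
            for y in range(max_goals + 1)]
           for x in range(max_goals + 1)]
    total = sum(sum(row) for row in raw)
    return [[p / total for p in row] for row in raw]

# Illustrative values only: a negative rho boosts 0-0 and 1-1, trims 1-0 and 0-1.
probs = scoreline_matrix(lam_h=1.4, lam_a=1.1, rho=-0.1)
p_draw = sum(probs[k][k] for k in range(len(probs)))
```

The four corrected cells gain and lose the same total mass, so the renormalisation only compensates for cutting the grid off at max_goals.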

We're Bayesian about it: every parameter is given a prior, and we fit by sampling the joint posterior with NUTS (the No-U-Turn Sampler, an adaptive form of Hamiltonian Monte Carlo). Each simulated tournament draws a fresh value of every parameter from that posterior, so the final probabilities propagate the model's full uncertainty, not just a point estimate.

Why this model, and not XGBoost or a neural net

Scoreline distributions for free. A classifier for 1X2 outcomes tells you P(W/D/L); a goal-rate model tells you the entire distribution over scorelines, which falls out naturally into clean sheets, knockout-stage extra time, penalty shootout edges, and Golden Boot expected goals.
Calibrated uncertainty. XGBoost gives a point probability per match; Bayesian inference gives a posterior over the entire function. When we draw 50,000 different posterior samples and simulate 50,000 different tournaments, we're propagating both the model's parametric uncertainty and the inherent randomness of football outcomes.
Sparse-data teams behave correctly. Curaçao has played far fewer international matches than France. A flat estimator over-fits to its noise; the hierarchical model shrinks Curaçao's att/def toward a population mean and inflates its uncertainty automatically.
The literature. Dixon-Coles is the practitioner baseline. FiveThirtyEight's SPI, Octosport, and most peer-reviewed football models use it or close variants. Beating it consistently with neural nets requires event-level data (StatsBomb-grade) that doesn't exist for most international fixtures.
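The shrinkage behaviour described for sparse-data teams can be illustrated with a conjugate normal-normal toy model. This is an analogy under simplifying assumptions (known variances, Gaussian observations), not the Poisson model itself, and every number is made up.

```python
def shrunk_estimate(sample_mean, n_matches, obs_var, prior_mean, prior_var):
    """Posterior mean and variance of a team effect under a normal-normal model."""
    data_precision = n_matches / obs_var   # grows with the number of matches
    prior_precision = 1.0 / prior_var      # pull toward the population mean
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * sample_mean
                            + prior_precision * prior_mean)
    return post_mean, post_var

# Same raw signal (0.6), very different amounts of data.
sparse = shrunk_estimate(0.6, n_matches=10, obs_var=1.0, prior_mean=0.0, prior_var=0.25)
rich = shrunk_estimate(0.6, n_matches=400, obs_var=1.0, prior_mean=0.0, prior_var=0.25)
```

The team with 10 matches ends up closer to the population mean and with a wider posterior than the team with 400.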

Priors: how Elo + EA FC 25 inform the model

Plain Dixon-Coles has a famous identifiability problem: adding a constant to every att and subtracting it from the intercept yields the same likelihood. To fix this and inject external information, we use:

ZeroSumNormal priors on att and def, forcing them to sum to zero by construction. Now the intercept α has the unambiguous interpretation "average log-goal-rate across all teams" and att/def are pure deviations.
Informative priors on the att/def means, anchored by current Elo (z-scored, 70%) and EA FC 25 squad strength (top-23 mean overall, z-scored, 30%). This is the cleanest way to bring 2026-current information into a model whose training data goes back to the 1990s.
Hyperpriors on the spread (HalfNormal scale, σ ~ HalfNormal(0.5)) control how far team strengths can deviate from the prior mean. The data drives this. If matches strongly disagree with Elo, σ inflates and the data wins.
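A minimal sketch of the 70/30 blend of z-scored Elo and squad strength described above. The team labels and ratings are hypothetical, and how the blended score is scaled into the att/def prior means is not stated on this page, so that step is left out.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardise a {team: value} mapping to zero mean and unit variance."""
    nums = list(values.values())
    mu, sd = mean(nums), pstdev(nums)
    return {team: (v - mu) / sd for team, v in values.items()}

def prior_strength(elo, squad, w_elo=0.7, w_squad=0.3):
    """Blend z-scored Elo and z-scored squad rating into one prior score per team."""
    z_elo, z_squad = zscore(elo), zscore(squad)
    return {team: w_elo * z_elo[team] + w_squad * z_squad[team] for team in elo}

# Hypothetical inputs: Elo ratings and top-23 mean overall ratings.
elo = {"A": 2100, "B": 1900, "C": 1700}
squad = {"A": 84, "B": 80, "C": 70}
blend = prior_strength(elo, squad)
```

Because both inputs are z-scored, the blended scores are centred on zero, which sits naturally next to the sum-to-zero constraint on att and def.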

Match weights: time decay × tournament importance

International football data goes back to 1872. A national side from the 1880s has nothing to do with its 2026 counterpart. We weight every match in the likelihood by:

weight = exp(−ln(2) · age_years / 2.5) · importance

The time-decay half-life is 2.5 years, much longer than in club football (typically 3-6 months), because national-team rosters turn over slowly. The importance term follows the Elo K-factor convention: World Cup matches at 1.0, major continental tournaments at 0.85, qualifiers at 0.65, and friendlies at 0.30 (managers experiment, so the signal is weaker).
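The weighting rule can be sketched as follows. The dictionary keys are hypothetical labels chosen for this example; the numeric values are the ones quoted above.

```python
import math

HALF_LIFE_YEARS = 2.5
IMPORTANCE = {
    "world_cup": 1.0,
    "continental": 0.85,
    "qualifier": 0.65,
    "friendly": 0.30,
}

def match_weight(age_years, kind):
    """Likelihood weight: exponential time decay times tournament importance."""
    decay = math.exp(-math.log(2) * age_years / HALF_LIFE_YEARS)
    return decay * IMPORTANCE[kind]

w_recent_friendly = match_weight(0.1, "friendly")
w_old_world_cup = match_weight(7.5, "world_cup")   # three half-lives: 1/8 weight
```

A World Cup match loses half its weight every 2.5 years, so after three half-lives it counts for less than a friendly played last month.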

Pre-tournament friendlies in May and June 2026 will be the highest-signal data we ever get for the actual matches we're about to predict. The model will keep updating as those land.

The simulator: from match probabilities to tournament probabilities

The Bayesian model gives us per-match scoreline distributions. To get tournament-level numbers:

Draw one random sample from the posterior (one set of att, def, intercept, home_adv, ρ values).
Sample all 6 group-stage fixtures for each of the 12 groups via Dixon-Coles-corrected Poisson, apply the full FIFA tiebreak chain (points → GD → GF → head-to-head → FIFA rank → drawing of lots).
Determine the 8 best third-placed teams by FIFA's published criteria and assign them to compatible R32 slots.
Run the bracket through R32 → R16 → QF → SF → Final, with extra-time goal rates scaled to 30/90 of regulation, and a small skill edge in penalty shootouts (~55/45 for the favorite).
Repeat 50,000 times. Aggregate.
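The knockout step above can be sketched as below. It is a simplified illustration, not the production simulator: it uses plain Poisson draws without the Dixon-Coles correction, and it assumes the shootout "favorite" is the side with the higher regulation goal rate, which this page does not specify.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for football-sized rates."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def play_knockout(lam_a, lam_b, rng, p_fav_pens=0.55):
    """Return 'a' or 'b': regulation, then extra time, then penalties."""
    goals_a, goals_b = sample_poisson(lam_a, rng), sample_poisson(lam_b, rng)
    if goals_a == goals_b:
        # Extra time: 30 minutes at the regulation scoring rate.
        goals_a += sample_poisson(lam_a * 30 / 90, rng)
        goals_b += sample_poisson(lam_b * 30 / 90, rng)
    if goals_a != goals_b:
        return "a" if goals_a > goals_b else "b"
    # Shootout: small edge for the assumed favorite.
    fav, other = ("a", "b") if lam_a >= lam_b else ("b", "a")
    return fav if rng.random() < p_fav_pens else other

rng = random.Random(2026)
results = [play_knockout(1.8, 0.9, rng) for _ in range(5000)]
share_a = results.count("a") / len(results)
```

In the full simulator the two rates would come from a single posterior draw, so each of the 50,000 rollouts sees a slightly different version of every team.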

The Alternate Realities page shows 100 of these tournaments individually so you can see what the spread looks like.

Calibration: how do we know it's actually good?

We back-tested by re-fitting the model on data available before each World Cup, then predicting all 64 matches:

Year	n	Brier ↓	Log-loss ↓	Accuracy ↑	Goal MAE ↓
2018	64	0.582	0.978	54.7%	1.20
2022	64	0.569	0.971	57.8%	1.42
Naive (1/3-1/3-1/3)	N/A	0.667	1.099	33.3%	N/A

Brier score on a 3-class (W/D/L) outcome is bounded above by 2.0; the naive uniform forecast gets 0.667. A Brier of 0.55-0.60 is the published benchmark for international football. FiveThirtyEight's SPI sits in that range, top Kaggle competition entries do too, and bookmaker markets settle there. Our 0.57-0.58 falls inside that range.
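For reference, here is a minimal sketch of the two per-match scores, assuming the multi-class Brier definition that sums squared errors over the three outcomes (the definition consistent with the 2.0 bound and the 0.667 baseline). The function names are illustrative.

```python
import math

def brier(probs, outcome):
    """Multi-class Brier score for one match; outcome is the index of what happened."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

def log_loss(probs, outcome):
    """Negative log-probability assigned to the realised outcome."""
    return -math.log(probs[outcome])

uniform = [1 / 3, 1 / 3, 1 / 3]
naive_brier = brier(uniform, 0)        # 0.667, the naive row of the table
naive_log_loss = log_loss(uniform, 0)  # 1.099 = ln(3)
worst_case = brier([1.0, 0.0, 0.0], 1) # 2.0, a confident and wrong forecast
```

The table values are presumably these per-match scores averaged over the 64 matches of each tournament.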

The model's biggest misses are the famous WC upsets: Argentina 1-1 Iceland (2018), South Korea 2-0 Germany (2018), Argentina 1-2 Saudi Arabia (2022), Cameroon 1-0 Brazil (2022). Bookmakers got those wrong too. Tournament football has irreducible variance.

Data sources

49,256 international match results 1872 → 2026-03-31 from martj42/international_results. 30,997 of these (1990 →) are used in the actual fit, covering the 48 qualified teams plus all 176 of their direct opponents (the 1-hop neighbourhood). Including the wider universe lets transitive evidence inform the qualified teams' strength.
Per-goal records with scorer, minute, penalty/own-goal flags from the same repo's goalscorers.csv, used by the Golden Boot module.
Live Elo ratings from eloratings.net for all 48 qualified teams, scraped from the public TSV endpoint.
EA FC 25 player ratings for 18,205 footballers from Kaggle aniss7/fifa-player-data-from-sofifa-2025-06-03, aggregated to top-23 mean overall per nation.
Official FIFA Final Draw (5 December 2025, Washington DC). The 12 groups of 4 are hardcoded in src/wc2026/draw.py.

Roadmap: improvements shipped & pending

Shipped

Sum-to-zero priors on att/def. Fixes identifiability between team effects and intercept. Brier −0.003 on 2018, +3pp accuracy on 2022.
Wider training universe (1-hop neighbourhood). Fit now uses 30,997 matches across 224 teams instead of just the 4,226 between qualified teams. Major posterior shifts: Brazil rose from 9.7% to 13.7% champion probability, Argentina took #1 (16.7%), and France's prior-driven 12.9% came down to 6.9% as recent on-field results carry more weight.
Probabilistic Golden Boot via 20k Monte Carlo sims of per-player Poisson goal draws. Gives P(wins Golden Boot) per player rather than a point estimate. Top 3: Haaland 14.1%, Kane 13.9%, Mbappé 9.9%.
FC 25 ↔ historical scorer blending. A Dirichlet-Multinomial conjugate update floors emerging stars (Yamal, Wirtz) with attacking-attribute priors so they're not penalized for a short international scoring history.
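The probabilistic Golden Boot idea from the list above can be sketched as a small Monte Carlo. The player labels and expected-goal values are hypothetical, and splitting credit equally among tied top scorers is an assumption made for this example, not the module's documented rule.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def golden_boot_probs(expected_goals, n_sims, rng):
    """Estimate P(top scorer) per player from independent Poisson goal draws."""
    wins = dict.fromkeys(expected_goals, 0.0)
    for _ in range(n_sims):
        goals = {p: sample_poisson(lam, rng) for p, lam in expected_goals.items()}
        best = max(goals.values())
        leaders = [p for p, g in goals.items() if g == best]
        for p in leaders:
            wins[p] += 1.0 / len(leaders)   # assumed tie rule: split the credit
    return {p: w / n_sims for p, w in wins.items()}

# Hypothetical tournament expected goals per player.
xg = {"Player A": 5.0, "Player B": 3.0, "Player C": 1.0}
boot = golden_boot_probs(xg, n_sims=5000, rng=random.Random(7))
```

A fuller version would also simulate how far each player's team advances, since tournament minutes drive the expected-goal totals.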

In flight

Cross-validated time-decay half-life. Sweeping [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]-year half-lives across the 2018 + 2022 back-tests. The winner becomes the production default.

Per-team home advantage. Currently γ is a single global parameter (~0.22 log-goals). Empirical data (Kneafsey & Mueller 2017) shows it varies from 0.2 to 0.5 across nations. A per-team γ also lets us model partial home advantage at host venues (Mexico games near Los Angeles, Argentina games in Miami).
Confederation-level shrinkage targets. Sparse-data teams currently shrink toward a global mean. Shrinking Curaçao toward a CONCACAF mean and France toward a UEFA mean would respect the structural difference in baselines.
Squad list updates. When 26-man rosters drop ~2026-06-01, recompute squad strength using only listed players (currently uses the broader player pool). Mbappé-out / Messi-out style sensitivity scenarios should be a one-flag flip.
Ensemble with an XGBoost or neural-net challenger, blended via a constrained log-pool that minimizes back-test log-loss. Ensembling typically buys 0.005-0.01 Brier in the football literature.
StatsBomb xG events for matches where they exist (recent WCs, Euros, Women's WC). Per-shot data is much higher signal-per-match than the final score.
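For the ensemble item in the list above, a two-model log-pool could look like the sketch below. Since this item is not built yet, everything here is an assumption: the pooling weight w is constrained to [0, 1], it is chosen by a simple grid search on back-test log-loss, and all forecast probabilities are assumed strictly positive.

```python
import math

def log_pool(p_model, p_challenger, w):
    """Weighted geometric blend of two W/D/L forecasts, renormalised to sum to 1."""
    logits = [w * math.log(a) + (1.0 - w) * math.log(b)
              for a, b in zip(p_model, p_challenger)]
    top = max(logits)                        # subtract max for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def mean_log_loss(forecasts_a, forecasts_b, outcomes, w):
    """Average log-loss of the pooled forecast over a back-test set."""
    losses = [-math.log(log_pool(a, b, w)[y])
              for a, b, y in zip(forecasts_a, forecasts_b, outcomes)]
    return sum(losses) / len(losses)

def best_weight(forecasts_a, forecasts_b, outcomes, steps=20):
    """Grid-search the pooling weight in [0, 1] that minimises back-test log-loss."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: mean_log_loss(forecasts_a, forecasts_b, outcomes, w))

# Tiny made-up back-test: two forecasters, three matches, outcome indices 0/1/2.
fa = [[0.6, 0.25, 0.15], [0.3, 0.3, 0.4], [0.5, 0.3, 0.2]]
fb = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [0.7, 0.2, 0.1]]
ys = [0, 2, 1]
w_star = best_weight(fa, fb, ys)
```

Because the grid includes w = 0 and w = 1, the pooled forecast can never score worse on the back-test set than the better of the two inputs; whether that holds out of sample is exactly what the back-tests would need to show.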

Code & reproducibility

The whole pipeline is a handful of scripts. An end-to-end run takes ~25 min on this server:

python scripts/download_data.py    # martj42 + Elo + EA FC 25 + goalscorers (~1 min)
python scripts/train.py            # PyMC, 4 chains × 2k draws on 30k matches (~10 min)
python scripts/simulate.py         # 50,000 tournament rollouts (~6 min)
python scripts/compute_matchups.py # 2,256 pairwise probabilities (~2 min)
python scripts/golden_boot.py      # top-scorer expectations + P(Golden Boot)
python scripts/sweep_halflife.py   # cross-validate time-decay (optional)
python scripts/generate_eda.py     # 10 EDA charts
python scripts/generate_report.py  # PDF technical report

The Bayesian fit uses PyMC's NUTS sampler with 4 chains, each running 2,000 tuning steps followed by 2,000 draws. Production fits get 0 divergences with target_accept = 0.95.

Every output JSON the frontend reads (/results.json, /matchups.json, /team_meta.json, /golden_boot.json) is a symlink into the Python pipeline's output/ directory, so re-running the pipeline immediately refreshes the dashboard.

Full technical report: report.pdf.

References

Dixon, M. J. & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Applied Statistics 46(2), 265-280.

Baio, G. & Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics 37(2), 253-264.

Kneafsey, M. & Mueller, S. (2017). Neutral grounds in international football: tournament-by-tournament home advantage analysis.

Berrar, D., Lopes, P. & Dubitzky, W. (2019). Incorporating domain knowledge in machine learning for soccer outcome prediction. Machine Learning 108, 97-126.