Methodology

How the model works

A Bayesian hierarchical Dixon-Coles bivariate Poisson model, fitted on a century of international football, simulated forward 50,000 times. This page explains every layer of the pipeline, the data, the calibration, and where it can still improve.

The model in one paragraph

For each match between teams h (home) and a (away), we model the number of goals scored by each side as Poisson with team-specific rates. Each team has a latent attack parameter and a latent defense parameter; goal rates are exponentials of the appropriate combination plus a global home-advantage bonus and intercept:

log(λhome) = α + att[h] − def[a] + γ · (1 − is_neutral)
log(λaway) = α + att[a] − def[h]

The two scorelines are not independent at the low end. In international football the 0-0 / 1-1 / 0-1 / 1-0 results occur at slightly but persistently different rates from what independent Poisson predicts. Dixon & Coles (1997) handle this with a correction factor τ that adjusts only the four lowest-score cells:

P(X=x, Y=y) ∝ τ(x, y; λh, λa, ρ) · Pois(x | λh) · Pois(y | λa)
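
As a minimal sketch of the three formulas above, the snippet below builds a scoreline probability matrix. The τ cell definitions follow the standard form in Dixon & Coles (1997); the function names, the truncation at 10 goals, and every parameter value are assumptions made for illustration, not the pipeline's code or posterior estimates.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def goal_rates(alpha, att_h, def_h, att_a, def_a, gamma, is_neutral):
    """Goal rates from the two log-linear equations above."""
    lam_home = math.exp(alpha + att_h - def_a + gamma * (1 - is_neutral))
    lam_away = math.exp(alpha + att_a - def_h)
    return lam_home, lam_away

def tau(x, y, lam_h, lam_a, rho):
    """Dixon-Coles adjustment: only the four lowest-score cells differ from 1."""
    if x == 0 and y == 0:
        return 1.0 - lam_h * lam_a * rho
    if x == 0 and y == 1:
        return 1.0 + lam_h * rho
    if x == 1 and y == 0:
        return 1.0 + lam_a * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def score_matrix(lam_h, lam_a, rho, max_goals=10):
    """matrix[x][y] = P(home scores x, away scores y), renormalised after truncation."""
    raw = [[tau(x, y, lam_h, lam_a, rho) * poisson_pmf(x, lam_h) * poisson_pmf(y, lam_a)
            for y in range(max_goals + 1)]
           for x in range(max_goals + 1)]
    total = sum(sum(row) for row in raw)
    return [[p / total for p in row] for row in raw]

# Made-up parameter values for a non-neutral venue (illustration only).
lam_h, lam_a = goal_rates(alpha=0.1, att_h=0.3, def_h=0.2, att_a=0.1, def_a=0.1,
                          gamma=0.27, is_neutral=0)
probs = score_matrix(lam_h, lam_a, rho=-0.05)

# Collapse the scoreline matrix into win / draw / loss probabilities.
p_home = sum(probs[x][y] for x in range(11) for y in range(11) if x > y)
p_draw = sum(probs[x][x] for x in range(11))
p_away = 1.0 - p_home - p_draw
```

The same matrix can be summed in other ways (clean sheets, exact scores), which is the "scoreline distributions for free" property discussed in the next section.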

We're Bayesian about it: every parameter is given a prior, and we fit by sampling the joint posterior with NUTS (the modern descendant of HMC). Each simulated tournament draws fresh values for every parameter, so the final probabilities propagate the model's full uncertainty, not just a point estimate.

Why this model, and not XGBoost or a neural net
  • Scoreline distributions for free. A classifier for 1X2 outcomes tells you P(W/D/L); a goal-rate model tells you the entire distribution over scorelines, which falls out naturally into clean sheets, knockout-stage extra time, penalty shootout edges, and Golden Boot expected goals.
  • Calibrated uncertainty. XGBoost gives a point probability per match; Bayesian inference gives a posterior over the entire function. When we draw a posterior sample for each of the 50,000 simulated tournaments, we're propagating both the model's parametric uncertainty and the inherent randomness of football outcomes.
  • Sparse-data teams behave correctly. Curaçao has played far fewer international matches than France. A flat estimator over-fits to its noise; the hierarchical model shrinks Curaçao's att/def toward a population mean and inflates its uncertainty automatically.
  • The literature. Dixon-Coles is the practitioner baseline. FiveThirtyEight's SPI, Octosport, and most peer-reviewed football models use it or close variants. Beating it consistently with neural nets requires event-level data (StatsBomb-grade) that doesn't exist for most international fixtures.
Priors: how Elo + EA FC 25 inform the model

Plain Dixon-Coles has a famous identifiability problem: adding a constant to every att and subtracting it from the intercept yields the same likelihood. To fix this and inject external information, we use:

  • ZeroSumNormal priors on att and def, forcing them to sum to zero by construction. Now the intercept α has the unambiguous interpretation "average log-goal-rate across all teams" and att/def are pure deviations.
  • Informative priors on the att/def means, anchored by current Elo (z-scored, 70%) and EA FC 25 squad strength (top-23 mean overall, z-scored, 30%). This is the cleanest way to bring 2026-current information into a model whose training data goes back to the 1990s.
  • Hyperpriors on the spread (HalfNormal scale, σ ~ HalfNormal(0.5)) control how far team strengths can deviate from the prior mean. The data drives this. If matches strongly disagree with Elo, σ inflates and the data wins.
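
The 70/30 blend of z-scored Elo and z-scored squad strength described above can be sketched as follows. This is an illustration under assumptions: the function names and the three input values are hypothetical, the population standard deviation is an arbitrary choice, and how the blended score is scaled into the att/def prior means is not specified here.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise a list to mean 0 and (population) standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def prior_strength(elo, squad, w_elo=0.7, w_squad=0.3):
    """Blend z-scored Elo and z-scored squad strength with the 70/30 weights."""
    return [w_elo * e + w_squad * s
            for e, s in zip(z_scores(elo), z_scores(squad))]

# Hypothetical inputs for three unnamed teams (not real ratings).
elo = [2100.0, 1900.0, 1700.0]
squad = [84.0, 80.0, 73.0]   # e.g. top-23 mean overall
strength = prior_strength(elo, squad)
```

Because both inputs are z-scored, the blended values sum to zero across teams, which is compatible with the sum-to-zero constraint on att and def.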
Match weights: time decay × tournament importance

International football data goes back to 1872. Spain in 1880 has nothing to do with Spain in 2026. We weight every match in the likelihood by:

weight = exp(−ln(2) · age_years / 4.0) · importance

The time-decay half-life is 4 years (cross-validated against the 2018 and 2022 World Cup back-tests; best avg Brier 0.5745 at 4y vs 0.5748 at 2.5y), much longer than club football (typically 3–6 months) because national team rosters turn over slowly. The importance term follows the Elo K-factor convention: World Cup matches at 1.0, major continental tournaments at 0.85, qualifiers at 0.65, friendlies at 0.30 (managers experiment, the signal is weaker).
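
A minimal sketch of the weight formula, using the half-life and importance values quoted above. The dictionary keys and the function name are hypothetical labels. One common way to use such a weight is to multiply each match's log-likelihood term by it; whether the pipeline applies it exactly that way is an assumption.

```python
import math

# Importance values as quoted in the text; the keys are illustrative labels.
IMPORTANCE = {
    "world_cup": 1.0,
    "continental": 0.85,
    "qualifier": 0.65,
    "friendly": 0.30,
}

def match_weight(age_years, match_type, half_life_years=4.0):
    """weight = exp(-ln(2) * age / half_life) * importance."""
    decay = math.exp(-math.log(2) * age_years / half_life_years)
    return decay * IMPORTANCE[match_type]

w_recent_friendly = match_weight(0.0, "friendly")    # no decay, importance only
w_old_world_cup = match_weight(4.0, "world_cup")     # exactly one half-life old
w_old_qualifier = match_weight(8.0, "qualifier")     # two half-lives old
```

A World Cup match from four years ago carries half the weight of one played today, and an eight-year-old qualifier carries a quarter of 0.65.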

Pre-tournament friendlies in May and June 2026 will be the highest-signal data we ever get for the actual matches we're about to predict. The model will keep updating as those land.

The simulator: from match probabilities to tournament probabilities

The Bayesian model gives us per-match scoreline distributions. To get tournament-level numbers:

  1. Draw one random sample from the posterior (one set of att, def, intercept, home_adv, ρ values).
  2. Sample all 6 group-stage fixtures for each of the 12 groups via Dixon-Coles-corrected Poisson, apply the full FIFA tiebreak chain (points → GD → GF → head-to-head → FIFA rank → drawing of lots).
  3. Determine the 8 best third-placed teams by FIFA's published criteria and assign them to compatible R32 slots.
  4. Run the bracket through R32 → R16 → QF → SF → Final, with extra-time goal rates scaled to 30/90 of regulation, and a small skill edge in penalty shootouts (~55/45 for the favorite).
  5. Repeat 50,000 times. Aggregate.
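
Step 4 can be sketched for a single knockout tie as below. It is a simplification under stated assumptions: goals are drawn from independent Poisson distributions (the Dixon-Coles correction is omitted), the rates are fixed rather than redrawn from the posterior, and the shootout "favorite" is taken to be the side with the higher goal rate. All names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for football-sized rates."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def play_knockout(lam_a, lam_b, rng, favourite_shootout_p=0.55):
    """Return 'A' or 'B' for the side that advances."""
    goals_a = sample_poisson(lam_a, rng)
    goals_b = sample_poisson(lam_b, rng)
    if goals_a == goals_b:
        # Extra time: 30 minutes at 30/90 of the regulation goal rates.
        goals_a += sample_poisson(lam_a * 30 / 90, rng)
        goals_b += sample_poisson(lam_b * 30 / 90, rng)
    if goals_a != goals_b:
        return "A" if goals_a > goals_b else "B"
    # Penalty shootout: small edge for the stronger side (assumed: higher rate).
    p_a = favourite_shootout_p if lam_a >= lam_b else 1.0 - favourite_shootout_p
    return "A" if rng.random() < p_a else "B"

rng = random.Random(2026)
n_sims = 2000
wins_a = sum(play_knockout(3.0, 0.2, rng) == "A" for _ in range(n_sims))
share_a = wins_a / n_sims   # fraction of simulated ties in which A advances
```

Running the full bracket repeats this for every tie in every round, with a fresh posterior draw per simulated tournament.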

The Alternate Realities page shows 100 of these tournaments individually so you can see what the spread looks like.

Calibration: how do we know it's actually good?

We back-tested by re-fitting the model on data available before each World Cup, then predicting all 64 matches:

Year                  n     Brier ↓   Log-loss ↓   Accuracy ↑   Goal MAE ↓
2018                  64    0.564     0.950        56.2%        1.20
2022                  64    0.564     0.971        54.7%        1.40
Naive (1/3-1/3-1/3)   N/A   0.667     1.099        33.3%        N/A

Brier score on a 3-class (W/D/L) outcome is bounded above by 2.0; the naive uniform forecast gets 0.667. A Brier of 0.55-0.60 is the published benchmark range for international football: FiveThirtyEight's SPI sits there, as do top Kaggle competition entries and bookmaker markets. Our back-test scores (0.564 in the table above, 0.57-0.58 across the half-life sweep) fall inside that range.
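
For reference, a short sketch of the multi-class Brier score and log-loss as used here (squared error summed over the three outcome classes, then averaged over matches). The function names are illustrative; the uniform forecast reproduces the naive row, and a confident wrong forecast hits the upper bound of 2.0.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean over matches of the squared error summed across W/D/L classes."""
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        total += sum((p - (1.0 if i == outcome else 0.0)) ** 2
                     for i, p in enumerate(probs))
    return total / len(forecasts)

def log_loss(forecasts, outcomes):
    """Mean negative log-probability assigned to the observed outcome."""
    return -sum(math.log(probs[o])
                for probs, o in zip(forecasts, outcomes)) / len(forecasts)

# Outcomes encoded as 0 = home win, 1 = draw, 2 = away win.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
outcomes = [0, 1, 2]
naive_brier = brier_score(uniform, outcomes)   # 2/3, about 0.667
naive_log_loss = log_loss(uniform, outcomes)   # ln(3), about 1.099

# A fully confident forecast on the wrong class gives the worst case.
worst_brier = brier_score([[1.0, 0.0, 0.0]], [1])   # 2.0
```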

The model's biggest misses are the famous WC upsets: Argentina 1-1 Iceland (2018), South Korea 2-0 Germany (2018), Argentina 1-2 Saudi Arabia (2022), Cameroon 1-0 Brazil (2022). Bookmakers got those wrong too. Tournament football has irreducible variance.

Data sources
  • 49,256 international match results 1872 → 2026-03-31 from martj42/international_results. 30,997 of these (1990 →) are used in the actual fit, covering the 48 qualified teams plus all 176 of their direct opponents (the 1-hop neighbourhood). Including the wider universe lets transitive evidence inform the qualified teams' strength.
  • Per-goal records with scorer, minute, penalty/own-goal flags from the same repo's goalscorers.csv, used by the Golden Boot module.
  • Live Elo ratings from eloratings.net for all 48 qualified teams, scraped from the public TSV endpoint.
  • EA FC 25 player ratings for 18,205 footballers from Kaggle aniss7/fifa-player-data-from-sofifa-2025-06-03, aggregated to top-23 mean overall per nation.
  • Official FIFA Final Draw (5 December 2025, Washington DC). The 12 groups of 4 are hardcoded in src/wc2026/draw.py.
Half-life sensitivity (back-test)
Avg Brier across the 2018 + 2022 World Cup back-tests at each candidate half-life. Lower is better.
  • 1.0y: 0.5881
  • 1.5y: 0.5822
  • 2.0y: 0.5790
  • 2.5y: 0.5748
  • 3.0y: 0.5762
  • 4.0y (prod): 0.5745

4.0y wins by a hair (Brier 0.5745), 2.5y is essentially tied (0.5748), and going below 2.0y costs roughly 0.008-0.014 Brier relative to 4.0y (0.5822 at 1.5y, 0.5881 at 1.0y). Production is set at 4.0y; the model is robust in the 2.5-4.0y range.

EA FC squad-strength contribution (ablation)
How much does the EA FC squad-strength weight (30% of the prior) actually move the forecast? This is the production model vs. a refit with squad=None (Elo-only prior). Both at 50,000 simulations.

Top 12 — production vs. pure Elo

Team          Prod     Δ pure-Elo vs prod (pp)
Spain         15.2%    +0.20
Brazil        14.9%    -1.21
Argentina     12.9%    +0.42
England       9.0%     -0.46
France        7.8%     -0.22
Portugal      6.0%     +0.05
Germany       5.2%     -0.12
Netherlands   4.4%     -0.16
Colombia      4.1%     +0.65
Morocco       3.1%     -0.13
Belgium       2.9%     -0.16
Uruguay       2.4%     +0.03

Biggest movers (any rank)

Team        Prod      Pure-Elo   Δ pp
Brazil      14.93%    13.72%     -1.21
Colombia    4.12%     4.77%      +0.65
England     9.00%     8.54%      -0.46
Argentina   12.89%    13.30%     +0.42
Ecuador     1.44%     1.73%      +0.29
France      7.84%     7.62%      -0.22
Spain       15.20%    15.40%     +0.20
Belgium     2.94%     2.78%      -0.16

Reading. No team shifts by more than about 1.2pp. Brazil drops most when squad strength is removed (-1.21pp) and Colombia rises most (+0.65pp). The top 12 stay in nearly the same order; the one swap implied by the deltas is Colombia (4.77%) moving ahead of the Netherlands (about 4.25%). The squad-strength signal is modest but real: it favours teams with deep, well-rated rosters (Brazil, England, France) over teams whose Elo runs ahead of their underlying squad depth (Colombia, Argentina, Ecuador).

Squad-strength source: EA FC 25 vs Transfermarkt
Spearman rank correlation: 0.961 (1.0 = identical ranks, 0.0 = unrelated)

EA FC 25 ratings (annual scout panel + performance data) vs. Transfermarkt market values (weekly-refreshed crowd-sourced valuations). Across all 48 qualified teams the two priors agree strongly on rank order — most meaningful disagreements concern teams whose Elo and on-paper depth diverge.
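
For readers unfamiliar with the statistic, the Spearman rank correlation quoted above can be computed as in the sketch below. It uses the textbook no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)); the five value pairs are invented for illustration and are not the 48-team data.

```python
def ranks(values):
    """Rank 1 = largest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical squad-strength values for five unnamed teams.
ea_fc = [84.1, 82.5, 80.0, 77.3, 74.9]          # mean overall rating
market = [950.0, 720.0, 810.0, 300.0, 120.0]    # market value, arbitrary units
rho = spearman(ea_fc, market)   # one adjacent swap among five teams gives 0.9
```

With real data that contains ties, a tie-aware implementation such as scipy.stats.spearmanr is the safer choice.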

Biggest UP movers — TM ranks higher than EA FC

  • South Africa: +0.73σ
  • Ivory Coast: +0.40σ
  • Norway: +0.36σ
  • United States: +0.35σ
  • Ecuador: +0.35σ
  • Sweden: +0.26σ

Biggest DOWN movers — TM ranks lower than EA FC

  • Panama: -0.43σ
  • Iran: -0.36σ
  • Egypt: -0.33σ
  • Saudi Arabia: -0.33σ
  • South Korea: -0.33σ
  • Australia: -0.31σ

Top 12 — by each prior

#     EA FC 25      Transfermarkt
1     France        France
2     Spain         England
3     Germany       Spain
4     Brazil        Brazil
5     Portugal      Germany
6     England       Portugal
7     Argentina     Netherlands
8     Netherlands   Argentina
9     Belgium       Belgium
10    Uruguay       Ivory Coast
11    Croatia       Morocco
12    Morocco       Norway

Interpretation. Most movement is directionally consistent with current bookmaker consensus and recent form: England up (deep Premier League pool), Saudi Arabia/Iran/Korea down (TM more sceptical of underrated domestic leagues), Norway / Ivory Coast / Morocco up (Haaland, Diaby, Hakimi command real market value). Where the two disagree most is on age-curve teams: TM rewards young valuations (Yamal, Bellingham, Mbappé) while EA FC rates current ability irrespective of contract value.

Champion probability — full retrain on each prior

Both priors were used to fit the full Bayesian model and run 50,000 tournament simulations. Top-12 numbers from each:

Team          EA FC (prod)   Transfermarkt   Δ pp
Spain         15.20%         14.84%          -0.35
Brazil        14.93%         14.28%          -0.65
Argentina     12.89%         13.11%          +0.22
England       9.00%          8.89%           -0.11
France        7.84%          8.02%           +0.18
Portugal      6.00%          6.02%           +0.02
Germany       5.15%          5.01%           -0.15
Netherlands   4.41%          4.38%           -0.04
Colombia      4.12%          4.33%           +0.21
Morocco       3.14%          3.07%           -0.08
Belgium       2.94%          2.76%           -0.18
Uruguay       2.39%          2.41%           +0.02

Maximum shift on any team: 0.65pp. The top-12 rank order is preserved entirely. Spain still leads, Brazil and Argentina stay 2–3, England and France stay 4–5. The forecast is robust to the choice of squad-strength source.

Why EA FC stays as production

  • Ability-focused. EA FC scout ratings target current on-pitch ability. TM market values bake in age, contract length, league inflation, and hype — a 17yo with potential can outvalue a 32yo who is the actual starter.
  • Forecast is robust. Spearman 0.961 and a max 0.65pp champion-% shift mean the choice is methodologically minor. Either prior produces essentially the same numbers.
  • TM is wired and available. The pipeline supports --squad-source tm; users who prefer the market-value framing can reproduce the TM run from the same code.
Per-team home advantage: posterior estimates

Mean γ by confederation

  • CONMEBOL (n=10): 0.275
  • AFC (n=47): 0.274
  • OFC (n=11): 0.274
  • CAF (n=54): 0.274
  • UEFA (n=57): 0.274
  • CONCACAF (n=41): 0.273
  • OTHER (n=4): 0.274

Bars scaled to the (small) spread between confederations. CONMEBOL trends highest, in line with the altitude/away-fan effects in the Latin American football literature.

Top 10 teams by posterior γ

  1. Bolivia: 0.280
  2. New Zealand: 0.279
  3. Guatemala: 0.279
  4. Scotland: 0.279
  5. Qatar: 0.279
  6. Chile: 0.278
  7. Wales: 0.277
  8. Malaysia: 0.277
  9. United States: 0.277
  10. Honduras: 0.277

What this shows. Per-team home advantage γᵢ is hierarchical (Normal around a global mean of ≈ 0.274 log-goals, posterior σ ≈ 0.025). The data doesn’t pull individual teams far from the global mean, but it is enough to surface a plausible ordering. Bolivia (La Paz, 3,640 m) tops the list and Chile sits in the top ten, Ecuador and Peru are above average, and CONMEBOL as a whole has the highest confederation-mean γ.

Why so flat? National teams play relatively few home matches (5-15 a year), and many qualifiers are at neutral or quasi-neutral venues. The hierarchical shrinkage prior pulls thin-data teams toward the global mean. The directional signal is correct; the magnitudes are honest about how much the data supports.
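
The flattening effect can be illustrated with the textbook Normal-Normal partial-pooling formula, in which the posterior mean is a precision-weighted average of the global mean and the team's own estimate. This is a simplified stand-in, not the model's actual Poisson likelihood: the global mean and σ are the values quoted above, while the team's raw estimate (0.50), the per-match noise (0.5), and the match count are hypothetical.

```python
def shrunk_mean(global_mean, global_sd, team_mean, obs_sd, n_matches):
    """Posterior mean of a team effect under a Normal prior and Normal noise."""
    prior_precision = 1.0 / global_sd ** 2
    data_precision = n_matches / obs_sd ** 2
    return ((prior_precision * global_mean + data_precision * team_mean)
            / (prior_precision + data_precision))

# A team whose raw home matches suggest gamma = 0.50 (hypothetical),
# observed over 20 home matches with noisy per-match evidence.
gamma_20 = shrunk_mean(0.274, 0.025, team_mean=0.50, obs_sd=0.5, n_matches=20)

# With no home matches at all, the estimate is simply the global mean.
gamma_0 = shrunk_mean(0.274, 0.025, team_mean=0.50, obs_sd=0.5, n_matches=0)
```

Under these assumed numbers, 20 matches move the estimate from 0.274 to only about 0.285, which is the same order of magnitude as the spread in the top-10 list.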

What’s NOT in the model

Honesty about absences matters as much as accuracy about what’s included. The model deliberately does not use:

  • Injuries and suspensions. No public real-time feed is reliable enough to incorporate without injecting noise. The model is blind to whether a key player is fit on matchday.
  • Final tournament squads. Coaches name 26-player squads in late May / early June 2026. Those squads are not yet known when the model is fit, so it works from EA FC 25’s top-23 by overall rating per nation, which is a proxy for the “most likely squad” rather than the actual one.
  • In-tournament results. Once the World Cup begins, the dashboard is a snapshot. To reflect actual matchday outcomes, the pipeline must be re-run manually; there is no live update loop.
  • Recent form / momentum beyond what is implicit in Elo and the 4-year time-decay weight. There is no separate “hot streak” multiplier.
  • Lineup-specific xG / per-shot data. StatsBomb event data exists for some matches but isn’t consistently available for international football over the full historical window, so the model uses final-score Poisson rates only.
  • Manager / coaching effects. No per-coach term. Tournament specialists (e.g. teams that consistently overperform in knockouts) get no boost beyond what the data has already imparted to that team’s att/def parameters.
  • Travel and rest. 2026’s three-host format means some teams cross more time zones than others. The model treats every match as venue-neutral once the per-team home advantage term is accounted for.
  • Position-aware EA FC ratings. The Kaggle dump exposes a single overall_rating per player (their primary-position OVR). Versatile players (e.g. a left-back with attacking attributes who could play CAM at a higher OVR) get counted at their lower primary-position rating. Computing per-position OVR requires either per-player sofifa scraping or replicating EA’s undocumented position-fit formula from raw attributes. Given the squad term is only a 30% prior weight and hierarchical shrinkage damps the effect, the team-level impact is small.

Adding noisy or partial signals to a calibrated Bayesian model usually hurts calibration unless the new signal has a well-formed prior. Several of these are on the roadmap below; the others stay out unless they can be added without degrading the back-test Brier.

Roadmap: improvements shipped & pending
Shipped
  • Sum-to-zero priors on att/def. Fixes identifiability between team effects and intercept. Brier −0.003 on 2018, +3pp accuracy on 2022.
  • Wider training universe (1-hop neighbourhood). Fit now uses 30,997 matches across 224 teams instead of just the 4,226 between qualified teams. Major posterior shifts: Brazil rose from 9.7% to 13.7% champion probability, Argentina took #1 (16.7%), and France's prior-driven 12.9% came down to 6.9% as recent on-field results carry more weight.
  • Probabilistic Golden Boot via 20k Monte Carlo sims of per-player Poisson goal draws. Gives P(wins Golden Boot) per player rather than a point estimate. Top 3: Haaland 14.1%, Kane 13.9%, Mbappé 9.9%.
  • FC 25 ↔ historical scorer blending. Dirichlet-Multinomial conjugate update floors emerging stars (Yamal, Wirtz) with attacking-attribute priors so they're not penalized for short international scoring history.
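
The probabilistic Golden Boot item above can be sketched as a Monte Carlo over per-player Poisson goal draws. This is an illustration only: the player labels and expected-goal values are hypothetical, tournament length is ignored (each player gets one fixed rate), and splitting ties equally among the top scorers is an assumption rather than the module's documented rule.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def golden_boot_probs(expected_goals, n_sims=20000, seed=0):
    """Estimate P(player finishes as top scorer); ties are shared equally."""
    rng = random.Random(seed)
    wins = {player: 0.0 for player in expected_goals}
    for _ in range(n_sims):
        goals = {p: sample_poisson(lam, rng) for p, lam in expected_goals.items()}
        top = max(goals.values())
        leaders = [p for p, g in goals.items() if g == top]
        for p in leaders:
            wins[p] += 1.0 / len(leaders)
    return {p: w / n_sims for p, w in wins.items()}

# Hypothetical tournament goal expectations for three unnamed players.
rates = {"Player A": 4.0, "Player B": 2.5, "Player C": 1.0}
boot = golden_boot_probs(rates, n_sims=5000, seed=42)
```

The output is a probability per player that sums to 1, which is what distinguishes this from ranking players by expected goals alone.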
Just shipped
  • Tournament-context offset. Hierarchical α_match[type] across five match categories (friendly / qualifier / continental / WC group / WC knockout), following Ley et al. (2019) and Groll et al. (2019). Avg back-test Brier dropped from 0.5685 to 0.564 (−0.0045), just under the 0.005-0.012 improvement range reported in the literature; 2018 log-loss improved 0.024 vs. baseline.
  • Cross-validated time-decay half-life. Swept [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]-year half-lives on 2018 + 2022 back-tests. 4.0y wins (avg Brier 0.5745) but 2.5y was nearly identical (0.5748); the model is robust in the 2.5-4.0y range. Production default bumped to 4.0y. Going below 2.0y costs roughly 0.008-0.014 Brier relative to 4.0y.
  • Per-team home advantage as a hierarchical Normal: γ_i ~ Normal(γ_μ, σ_γ) instead of a single global γ. Empirical data (Kneafsey & Mueller 2017) shows the home bonus ranges 0.2 to 0.5 log-goals across nations. The model now extracts that variance from training data.
  • Confederation-level shrinkage priors. Each team's att/def now shrinks toward its confederation mean (UEFA, CONMEBOL, CONCACAF, CAF, AFC, OFC) on top of the global ZeroSumNormal prior. Sparse-data teams (Curaçao, Cape Verde) are pulled toward their continental baseline rather than a global mean — and the top-end gets compressed: Spain 14.7%, Brazil 14.7%, Argentina 12.9% on the latest run, the closest to the bookmaker market we've had.
Next
  • Squad list updates. When 26-man rosters drop ~2026-06-01, recompute squad strength using only listed players (currently uses the broader player pool). Mbappé-out / Messi-out style sensitivity scenarios should be a one-flag flip.
  • Ensemble with an XGBoost or neural-net challenger, blended via a constrained log-pool that minimizes back-test log-loss. Ensembling typically buys 0.005-0.01 Brier in the football literature.
  • StatsBomb xG events for matches where they exist (recent WCs, Euros, Women's WC). Per-shot data is much higher signal-per-match than the final score.
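
The constrained log-pool mentioned in the ensemble item is not built yet; the sketch below shows one way it could work, assuming two models that each output strictly positive W/D/L probabilities and a single blend weight w restricted to [0, 1]. The function names, the grid search, and the toy forecasts are all hypothetical.

```python
import math

def log_pool(p_a, p_b, w):
    """Weighted geometric mean of two forecasts, renormalised to sum to 1."""
    unnorm = [a ** w * b ** (1.0 - w) for a, b in zip(p_a, p_b)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def best_weight(forecasts_a, forecasts_b, outcomes, grid=21):
    """Grid-search w in [0, 1] for the lowest average back-test log-loss."""
    def avg_log_loss(w):
        return -sum(math.log(log_pool(a, b, w)[o])
                    for a, b, o in zip(forecasts_a, forecasts_b, outcomes)) / len(outcomes)
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates, key=avg_log_loss)

# Toy back-test: model A is sharp and right, model B is uninformative.
model_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
model_b = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
observed = [0, 1]   # 0 = home win, 1 = draw, 2 = away win
w_star = best_weight(model_a, model_b, observed)   # all weight goes to model A
blend = log_pool(model_a[0], model_b[0], 0.5)
```

Restricting w to [0, 1] keeps the blend between the two models; a finer grid or a scalar optimiser would replace the 21-point search in practice.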
Code & reproducibility

The whole pipeline is a handful of scripts. End-to-end reproduces in ~25 min on this server:

python scripts/download_data.py    # martj42 + Elo + EA FC 25 + goalscorers (~1 min)
python scripts/train.py            # PyMC, 4 chains × 2k draws on 30k matches (~10 min)
python scripts/simulate.py         # 50,000 tournament rollouts (~6 min)
python scripts/compute_matchups.py # 2,256 pairwise probabilities (~2 min)
python scripts/golden_boot.py      # top-scorer expectations + P(Golden Boot)
python scripts/sweep_halflife.py   # cross-validate time-decay (optional)
python scripts/generate_eda.py     # 10 EDA charts
python scripts/generate_report.py  # PDF technical report

The Bayesian fit uses PyMC's NUTS sampler with 4 chains, each running 2,000 tuning steps followed by 2,000 draws. Production fits get 0 divergences with target_accept = 0.95.

Every output JSON the frontend reads (/results.json, /matchups.json, /team_meta.json, /golden_boot.json) is a symlink into the Python pipeline's output/ directory, so re-running the pipeline immediately refreshes the dashboard.

Full technical report: report.pdf.

Cite this work

If you use this forecaster in a paper, post, or article, a link back is appreciated.

Bennour, N. (2026). WC 2026 Forecaster: a hierarchical Bayesian model for the 2026 FIFA World Cup. https://github.com/0xNadr/wc2026
References

Dixon, M. J. & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Applied Statistics 46(2), 265-280.

Baio, G. & Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics 37(2), 253-264.

Kneafsey, M. & Mueller, S. (2017). Neutral grounds in international football: tournament-by-tournament home advantage analysis.

Berrar, D., Lopes, P. & Dubitzky, W. (2019). Incorporating domain knowledge in machine learning for soccer outcome prediction. Machine Learning 108, 97-126.