For AI forecasting and agent researchers
ForecastBench scores AI forecasters publicly, daily, on a contamination-resistant set of real-world questions.
The systems near the top retrieve dated, multilingual evidence at inference, filtered to publications dated on or before the forecast date.
NOSIBLE WORLD provides that retrieval.
Papers from 2024 to 2026 all point the same way: dated, point-in-time, multilingual retrieval at inference, paired with structured aggregation, lowers Brier scores.
Karger et al. 2024 designed ForecastBench to resist contamination. Questions refresh daily from prediction markets and real-world time series. Systems with point-in-time retrieval at inference outperform systems without.
| # | System | Brier | Retrieval | Set |
|---|---|---|---|---|
| ▮ 01 | Superforecaster median | 0.111 | Human, open web | FB-7-21 |
| ▮ 02 | AIA Forecaster · agentic PIT search | 0.113 | Agentic, point-in-time | FB-7-21 |
| 03 | GPT-4.5 · market-prompted | 0.101 | Mixed, contaminated | ForecastBench |
| 04 | AIA Forecaster · search removed | 0.123 | None | FB-7-21 |
| 05 | Naive 0.5 baseline | 0.250 | None | ForecastBench |
Row 03. GPT-4.5 reaches 0.101 only when the prompt includes prediction-market forecasts. The model then copies them at a 0.994 correlation. Remove the market forecasts from the prompt and the ranking inverts. AIA Forecaster comes within 0.002 of the superforecaster median when it runs point-in-time retrieval at inference.
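For readers new to the metric, the Brier score in the table is the mean squared error between forecast probabilities and binary outcomes, so lower is better. The sketch below is a minimal illustration, not ForecastBench's scoring code; the function name `brier_score` and the sample outcomes are assumptions made for this example. It shows why a forecaster that always answers 0.5 scores exactly 0.250.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Illustrative outcomes only (1 = resolved yes, 0 = resolved no).
outcomes = [1, 0, 0, 1]

# A constant 0.5 forecast is always 0.5 away from the outcome: 0.5 ** 2 = 0.25.
naive = brier_score([0.5] * len(outcomes), outcomes)
print(naive)  # 0.25
```

Because every term of the naive forecast is 0.25 regardless of how the question resolves, the baseline does not depend on the question set.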
0.113
AIA Forecaster with agentic dated search. Within 0.002 of the human superforecaster median on FB-7-21.
0.123
The same model, the same prompts, the same calibration stack, the search call removed. Brier rises by 0.010.
An undated forecast is unverifiable. ForecastBench scores the source as much as the model.
A timestamped, multilingual event index built from the open web. Every event carries a verified publication timestamp. Every claim links to its primary source. The retrieval index freezes to any forecast date you choose, so your agent only sees publications dated on or before that timestamp.
Mandarin-language signals on pneumonia clusters surface in open sources weeks before global risk repricing.
Evergrande missed-coupon and onshore-bond signals are visible in Chinese filings ahead of the offshore cross-default.
Russian invasion of Ukraine resolves a public forecasting tournament question with a dated track record.
The builds described in the AIA, Halawi, and Chandak papers. Wire each against a pinned NOSIBLE WORLD API version and replay it months later for an auditor or a research panel.
Resolve every question against a dated open web. Pass a forecast date and get evidence cards filtered to publications on or before that timestamp, in 95 source languages. The Halawi 2024 method, on a pinned index.
Retrieval · point in time: Removes the AIA search-removed delta; Brier 0.123 returns to 0.113.
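The point-in-time rule described above can be sketched as a plain filter over timestamped evidence. This is an illustration of the idea only, not the NOSIBLE WORLD API: the `EvidenceCard` type, its fields, the `point_in_time` function, and the sample URLs are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class EvidenceCard:
    """Hypothetical evidence record; field names are assumptions."""
    source_url: str
    language: str
    published_at: datetime  # timezone-aware publication timestamp


def point_in_time(cards: Iterable[EvidenceCard], forecast_date: datetime) -> List[EvidenceCard]:
    """Keep only cards published on or before the forecast date, oldest first."""
    kept = [c for c in cards if c.published_at <= forecast_date]
    return sorted(kept, key=lambda c: c.published_at)


cards = [
    EvidenceCard("https://example.org/a", "zh", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    EvidenceCard("https://example.org/b", "en", datetime(2024, 2, 1, tzinfo=timezone.utc)),
    EvidenceCard("https://example.org/c", "ru", datetime(2024, 3, 5, tzinfo=timezone.utc)),
]

# Freeze the view at 1 February 2024: the March card must not leak through.
visible = point_in_time(cards, datetime(2024, 2, 1, tzinfo=timezone.utc))
print([c.source_url for c in visible])
```

The comparison uses `<=` so that a publication stamped exactly at the forecast date is still admissible, matching the "on or before" wording.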
Wire every forecast and resolution into a running Brier and Expected Calibration Error trace per question and per model. A private scoreboard for your agent fleet, scored on questions that resolved after training cutoff.
Calibration · live trace: Catches calibration drift before a production agent ever ships.
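Expected Calibration Error can be computed in several ways; the sketch below assumes the common equal-width binning variant, where forecasts are grouped by probability and each bin contributes the gap between its mean forecast and its observed frequency, weighted by bin size. The function name and the default of ten bins are choices made for this example, not a description of any product's scoring.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted |mean forecast - observed rate| per bin."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_rate = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_forecast - observed_rate)
    return ece


# Forecasts of 1.0 that resolve yes only half the time are overconfident by 0.5.
print(expected_calibration_error([1.0, 1.0], [1, 0]))  # 0.5
```

Recomputing this value per model as questions resolve gives the kind of running trace the paragraph describes; a rising value is one signal of calibration drift.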
Every probability the agent emitted, every evidence card it retrieved, every publication timestamp it cited, pinned to an API version that replays months later. Hand an auditor or a research panel a frozen view of the inputs.
Audit · pinned replay: Makes every prediction replayable and auditable months later.
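One way to make such a record checkable on the consumer side is to store each forecast with its inputs and a content digest, so a later replay can be compared byte for byte. The sketch below is an assumption about how a team might do this locally; `ForecastRecord`, its fields, the version string, and `record_digest` are hypothetical and do not describe the NOSIBLE WORLD API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ForecastRecord:
    """Hypothetical audit record; all field names are assumptions."""
    question_id: str
    forecast_date: str                    # ISO 8601 date the index was frozen to
    probability: float                    # probability the agent emitted
    api_version: str                      # pinned retrieval API version
    evidence: Tuple[Tuple[str, str], ...]  # (source_url, published_at) pairs


def record_digest(record: ForecastRecord) -> str:
    """SHA-256 over a canonical JSON encoding of the record."""
    payload = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = ForecastRecord(
    question_id="q-001",
    forecast_date="2024-02-01",
    probability=0.62,
    api_version="v1-pinned-example",
    evidence=(("https://example.org/a", "2024-01-10T00:00:00Z"),),
)
digest = record_digest(original)
```

If a replay months later reproduces the same record, the digests match; any change to the probability, the evidence list, or the pinned version produces a different digest.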
AI forecasting is benchmarked publicly, and the systems near the top retrieve dated, multilingual evidence at inference. Models with that retrieval beat models without. NOSIBLE WORLD provides it.
Build AI forecasters that win on ForecastBench.
Talk to us about wiring NOSIBLE WORLD into your forecasting agent stack.