§02 Agentic Superforecasters / 13 · Live · Global

For AI forecasting and agent researchers

Build AI forecasters that
win on ForecastBench.

ForecastBench scores AI forecasters publicly, daily, on a contamination-resistant set of real-world questions.

The systems near the top retrieve dated, multilingual evidence at inference, filtered to publications dated before the forecast date.

NOSIBLE WORLD provides that retrieval.

§01/ 06

ForecastBench evidence · 2024–2026

What the recent papers find.

Papers from 2024 to 2026. The results point the same way: Brier improves with dated, point-in-time, multilingual retrieval at inference, paired with structured aggregation.

ICLR 2025 · 2024 · Karger et al.

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Contamination-resistant benchmark with daily-refreshed questions from prediction markets. Superforecaster median 0.081 Brier; top frontier model 0.101.

ARXIV · 2025 · Adaptive Intelligence Agents

AIA Forecaster: Technical Report on AI Judgmental Forecasting

Multi-agent forecaster with point-in-time search reaches 0.113 Brier, within 0.002 of the superforecaster median; removing search alone moves it to 0.123.

NEURIPS 2024 · 2024 · Halawi, Zhang, Yueh-Han & Steinhardt

Approaching Human-Level Forecasting with Language Models

Retrieve articles dated before the forecast date, summarise, decompose, aggregate. The first paper where a language model neared the crowd.

ARXIV · 2025 · Chandak, Goel, Prabhu, Hardt & Geiping

Scaling Open-Ended Reasoning To Predict the Future

Trains OpenForecaster 8B on 52K open-ended questions from a date-controlled open web; matches proprietary frontier models on accuracy and calibration.

ARXIV · 2025 · Paleka, Goel, Geiping & Tramèr

Pitfalls in Evaluating Language Model Forecasters

Catalogues temporal leakage and backtest extrapolation as the two failure classes that can contaminate even a carefully run forecaster evaluation.

ICLR 2025 · 2024 · Paleka et al.

Consistency Checks for Language Model Forecasters

An arbitrage-based consistency metric correlates with future Brier, letting you score forecasters instantly without waiting for resolution.

ARXIV · 2025 · Turtel, Franklin & Schoenegger

LLMs Can Teach Themselves to Better Predict the Future

Outcome-driven self-play with Direct Preference Optimization improves the forecasting accuracy of Phi-4 14B and DeepSeek-R1 14B by 7 to 10 percent.

ARXIV · 2025 · Lu

Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

On 464 Metaculus questions, frontier models beat the crowd on Brier score but trail expert forecasters. Retrieval is the gap.

§02/ 06

The leaderboard

How ForecastBench scores forecasters.

Karger et al. 2024 designed ForecastBench to resist contamination. Questions refresh daily from prediction markets and real-world time series. Systems with point-in-time retrieval at inference outperform systems without.

Snapshot · ForecastBench · 2026
Brier score, lower is better. Source: Karger et al. 2024 and AIA Forecaster 2025.

#	System	Brier	Retrieval	Set
▮ 01	Superforecaster median	0.111	Human, open web	FB-7-21
▮ 02	AIA Forecaster · agentic PIT search	0.113	Agentic, point-in-time	FB-7-21
03	GPT-4.5 · market-prompted	0.101	Mixed, contaminated	ForecastBench
04	AIA Forecaster · search removed	0.123	None	FB-7-21
05	Naive 0.5 baseline	0.250	None	ForecastBench

Row 03. GPT-4.5 reaches 0.101 only when the prompt includes prediction-market forecasts, and its outputs then track those forecasts at a 0.994 correlation. That score is on the ForecastBench set, not FB-7-21, which is why it sits below rows 01 and 02. Remove the market prime and the order inverts. AIA Forecaster comes within 0.002 of the superforecaster median when it runs point-in-time retrieval at inference.
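
For readers new to the metric, the Brier scores in the table can be read through a short sketch. It assumes binary questions, where the score is the mean squared difference between the forecast probability and the 0/1 outcome. The outcomes and forecasts below are made up for illustration and are not taken from any of the papers; the point is only that a constant 0.5 forecast always scores 0.250, which is why row 05 is the floor to beat.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(probabilities) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical resolved questions: 1 = happened, 0 = did not.
outcomes = [1, 0, 0, 1]

# Always answering 0.5 gives (0.5)^2 = 0.25 on every question, whatever the outcome.
naive = brier_score([0.5] * len(outcomes), outcomes)

# A forecaster that leans the right way on each question scores lower (better).
sharper = brier_score([0.8, 0.1, 0.3, 0.7], outcomes)

print(f"naive 0.5 baseline: {naive:.3f}")
print(f"sharper forecaster: {sharper:.4f}")
```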

▮ With point-in-time retrieval

0.113

AIA Forecaster with agentic dated search. Within 0.002 of the human superforecaster median on FB-7-21.

▮ Same system, no search

0.123

The same model, the same prompts, the same calibration stack, the search call removed. Brier rises by 0.010.

▮ Field note

An undated forecast is unverifiable. ForecastBench scores the source as much as the model.

§04/ 06

NOSIBLE WORLD · the retrieval index

Dated, multilingual, point-in-time retrieval.

A timestamped, multilingual event index built from the open web. Every event carries a verified publication timestamp. Every claim links to its primary source. The retrieval index freezes to any forecast date you choose, so your agent only sees publications dated on or before that timestamp.
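
The freeze described above amounts to a filter on publication timestamps. The sketch below shows that filter on an in-memory list. It is not the NOSIBLE WORLD API: `EvidenceCard`, `freeze_to`, and the example URLs are hypothetical names chosen for illustration, and a real index would apply the cut-off server-side.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class EvidenceCard:
    """Hypothetical evidence record; field names are assumptions."""
    url: str
    language: str
    published_at: datetime  # timezone-aware publication timestamp


def freeze_to(cards: List[EvidenceCard], forecast_date: datetime) -> List[EvidenceCard]:
    """Keep only cards published on or before the forecast date, newest first."""
    visible = [card for card in cards if card.published_at <= forecast_date]
    return sorted(visible, key=lambda card: card.published_at, reverse=True)


cards = [
    EvidenceCard("https://example.org/a", "zh", datetime(2021, 9, 20, tzinfo=timezone.utc)),
    EvidenceCard("https://example.org/b", "en", datetime(2021, 9, 24, tzinfo=timezone.utc)),
    EvidenceCard("https://example.org/c", "en", datetime(2021, 10, 5, tzinfo=timezone.utc)),
]

forecast_date = datetime(2021, 9, 24, tzinfo=timezone.utc)
visible = freeze_to(cards, forecast_date)

# Card "c" was published after the forecast date, so the agent never sees it.
for card in visible:
    print(card.published_at.date(), card.language, card.url)
```

Using timezone-aware timestamps throughout avoids a card slipping past the cut-off because of a local-time offset.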

100M+ · Events

300K+ · Sources

95 · Languages at source

30 years · Point-in-time depth

▮ Named cases

Pandemic
Wuhan · 2020·01
Mandarin-language signals on pneumonia clusters surface in open sources weeks before global risk repricing.
Credit
Shenzhen · 2021·09
Evergrande missed-coupon and onshore-bond signals visible in Chinese filings ahead of the offshore cross-default.
Geopolitical
Kyiv · 2022·02
Russian invasion of Ukraine resolves a public forecasting tournament question with a dated track record.

§05/ 06

What you build

Agents to build.

The builds described in the AIA, Halawi, and Chandak papers. Wire each against a pinned NOSIBLE WORLD API version and replay it months later for an auditor or a research panel.

§01 Retrieval · point in time

Point-in-time retrieval-augmented forecaster

Resolve every question against a dated open web. Pass a forecast date and get evidence cards filtered to publications on or before that timestamp, in 95 source languages. The Halawi 2024 method, on a pinned index.
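
Once the dated evidence cards are in hand, the last step of a retrieve, summarise, aggregate pipeline is to combine several model forecasts into one probability. The sketch below assumes a trimmed mean as the aggregator, which is one common choice and not necessarily what any cited paper uses. The predictors are stubs that return fixed numbers; in a real build each would be a model call that reads the question and the evidence. All names here are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A predictor takes the question text and the evidence context, returns a probability.
Predictor = Callable[[str, str], float]


def trimmed_mean(values: Sequence[float], trim: int = 1) -> float:
    """Drop the `trim` lowest and highest values, then average the rest."""
    ordered = sorted(values)
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return mean(ordered)


def forecast(question: str, evidence: List[str], predictors: Sequence[Predictor]) -> float:
    """Ask every predictor for a probability and aggregate with a trimmed mean."""
    context = "\n".join(evidence)
    # Clamp each raw answer into [0, 1] before aggregating.
    raw = [min(1.0, max(0.0, predict(question, context))) for predict in predictors]
    return trimmed_mean(raw)


# Stub predictors standing in for model calls (assumption: five ensemble members).
predictors = [lambda q, ctx, p=p: p for p in (0.10, 0.30, 0.35, 0.40, 0.90)]

probability = forecast(
    "Will X happen by date Y?",
    ["evidence card one", "evidence card two"],
    predictors,
)
print(f"aggregated probability: {probability:.2f}")
```

Trimming the extremes keeps a single overconfident ensemble member from dominating the final number.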

▮ Payoff

Targets the AIA search-removed delta: Brier 0.123 without search, 0.113 with it.

§02 Calibration · live trace

Calibration tracking on a live leaderboard

Wire every forecast and resolution into a running Brier and Expected Calibration Error trace per question and per model. A private scoreboard for your agent fleet, scored on questions that resolved after training cutoff.
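
A running trace of this kind can be sketched in a few lines. The example below assumes the standard equal-width binned definition of Expected Calibration Error (the weighted average gap between mean forecast and observed frequency per bin) and ten bins by default. `CalibrationTrace` is a hypothetical class name, and the resolved forecasts are invented for illustration.

```python
from typing import List, Tuple


class CalibrationTrace:
    """Hypothetical running scoreboard for one model: Brier and binned ECE."""

    def __init__(self) -> None:
        self.resolved: List[Tuple[float, int]] = []

    def record(self, probability: float, outcome: int) -> None:
        """Store one forecast once its question has resolved (outcome is 0 or 1)."""
        self.resolved.append((probability, outcome))

    def brier(self) -> float:
        return sum((p - o) ** 2 for p, o in self.resolved) / len(self.resolved)

    def ece(self, bins: int = 10) -> float:
        buckets: List[List[Tuple[float, int]]] = [[] for _ in range(bins)]
        for p, o in self.resolved:
            index = min(int(p * bins), bins - 1)  # p == 1.0 goes in the last bin
            buckets[index].append((p, o))
        total = len(self.resolved)
        error = 0.0
        for bucket in buckets:
            if not bucket:
                continue
            mean_forecast = sum(p for p, _ in bucket) / len(bucket)
            observed_rate = sum(o for _, o in bucket) / len(bucket)
            error += len(bucket) / total * abs(mean_forecast - observed_rate)
        return error


trace = CalibrationTrace()
# Made-up history: 0.9 forecasts came true 3 times in 4, 0.1 forecasts 1 time in 4.
for probability, outcome in [(0.9, 1), (0.9, 1), (0.9, 0), (0.9, 1),
                             (0.1, 0), (0.1, 0), (0.1, 0), (0.1, 1)]:
    trace.record(probability, outcome)

print(f"Brier {trace.brier():.3f}  ECE {trace.ece():.3f}")
```

Keeping one trace per model and per question family makes drift visible as a change in ECE even when the headline Brier barely moves.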

▮ Payoff

Catches calibration drift before a production agent ever ships.

§03 Audit · pinned replay

Audit-ready prediction trail

Every probability the agent emitted, every evidence card it retrieved, every publication timestamp it cited, pinned to an API version that replays months later. Hand an auditor or a research panel a frozen view of the inputs.
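
One way to make such a trail tamper-evident is to store each prediction as a canonical JSON record with a content hash. The sketch below is an illustration under assumptions, not the NOSIBLE WORLD format: the field names, the version string, and the `audit_record` and `verify` helpers are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def _digest(payload: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(question_id: str, forecast_date: str, api_version: str,
                 evidence: List[Dict[str, str]], probability: float) -> Dict[str, Any]:
    """Bundle everything the agent saw and emitted, plus a hash of that bundle."""
    payload = {
        "question_id": question_id,
        "forecast_date": forecast_date,
        "api_version": api_version,  # the pinned version to replay against
        "evidence": sorted(evidence, key=lambda card: card["url"]),
        "probability": probability,
    }
    return {**payload, "sha256": _digest(payload)}


def verify(record: Dict[str, Any]) -> bool:
    """Recompute the hash and compare it with the stored one."""
    payload = {key: value for key, value in record.items() if key != "sha256"}
    return _digest(payload) == record["sha256"]


record = audit_record(
    question_id="q-001",                      # hypothetical identifier
    forecast_date="2022-02-01T00:00:00Z",
    api_version="example-pinned-version",     # placeholder, not a real version string
    evidence=[{"url": "https://example.org/b", "published_at": "2022-01-30T08:00:00Z"},
              {"url": "https://example.org/a", "published_at": "2022-01-28T12:00:00Z"}],
    probability=0.62,
)

tampered = {**record, "probability": 0.90}
print(verify(record), verify(tampered))
```

An auditor who receives the record can rerun `verify` without trusting the system that produced it, then replay the retrieval against the pinned version to check the evidence list.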

▮ Payoff

Makes every prediction replayable and auditable months later.

▮ §06 · The disclosure

AI forecasting is benchmarked publicly, and the systems near the top retrieve dated, multilingual evidence at inference. Models with that retrieval beat models without. NOSIBLE WORLD provides it.

Build AI forecasters that
win on ForecastBench.

§06 · Get started

Talk to us about wiring NOSIBLE WORLD into your forecasting agent stack.
