Are our forecasts accurate?
We publish rolling verification scores so you can judge our track record. Every metric is computed from resolved outcomes - not cherry-picked examples.
System-wide summary (rolling 90 days)
| Metric | What it measures |
|---|---|
| Brier score | Lower is better. 0 = perfect, 0.25 = no skill. |
| AUC | Area under the ROC curve. 1.0 = perfect discrimination. |
| Skill score | Improvement over the climatological baseline. |
| Calibration error | Average absolute difference between predicted and observed rates. |
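For the Brier score, one way to read the 0.25 no-skill mark is that a forecast which always says 50% scores exactly 0.25 no matter what happens:

$$
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2,\quad o_i \in \{0,1\};\qquad
p_i \equiv 0.5 \;\Rightarrow\; (0.5 - o_i)^2 = 0.25 \;\Rightarrow\; \mathrm{BS} = 0.25 .
$$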
Earthquake verification
Scores
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 7.2% | 142 | +2.2% |
| 10–20% | 16.8% | 89 | +1.8% |
| 20–30% | 23.1% | 64 | -1.9% |
| 30–40% | 36.4% | 38 | +1.4% |
| 40–50% | 42.9% | 21 | -2.1% |
| 50+% | 55.6% | 9 | +0.6% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive (event predicted ≥50%, occurred) | 5 | 55.6% |
| True negative (predicted <20%, did not occur) | 131 | 92.3% |
| False alarm (predicted ≥40%, did not occur) | 14 | 46.7% |
| Miss (predicted <20%, occurred) | 11 | 7.7% |
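To illustrate how a tally like this is produced, the sketch below applies the thresholds named in the table to a list of resolved forecasts. The `ResolvedForecast` structure and the function name are ours for illustration, not the production pipeline.

```python
from dataclasses import dataclass

@dataclass
class ResolvedForecast:
    prob: float      # forecast probability issued for the event
    occurred: bool   # did the event occur within the valid window?

def tally(forecasts: list[ResolvedForecast]) -> dict[str, int]:
    """Count outcomes using the thresholds named in the table above."""
    counts = {"true_positive": 0, "true_negative": 0, "false_alarm": 0, "miss": 0}
    for f in forecasts:
        if f.prob >= 0.50 and f.occurred:
            counts["true_positive"] += 1   # predicted >=50%, occurred
        if f.prob >= 0.40 and not f.occurred:
            counts["false_alarm"] += 1     # predicted >=40%, did not occur
        if f.prob < 0.20 and not f.occurred:
            counts["true_negative"] += 1   # predicted <20%, did not occur
        if f.prob < 0.20 and f.occurred:
            counts["miss"] += 1            # predicted <20%, occurred anyway
    return counts
```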
Sharpness distribution
How spread out are our forecasts? Sharper forecasts commit to higher or lower probabilities rather than staying near the base rate.
| Range | % of forecasts |
|---|---|
| 0–10% | 39.1% |
| 10–25% | 35.0% |
| 25–40% | 16.5% |
| 40–60% | 7.0% |
| 60+% | 2.4% |
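A minimal sketch of how a sharpness distribution like the one above can be computed: bin each issued probability into the listed ranges and report the share of forecasts per bin. Bin edges follow the table; the helper name is ours.

```python
def sharpness_distribution(probs: list[float]) -> dict[str, float]:
    """Percent of forecasts whose issued probability falls in each range."""
    edges = {"0-10%": (0.00, 0.10), "10-25%": (0.10, 0.25), "25-40%": (0.25, 0.40),
             "40-60%": (0.40, 0.60), "60+%": (0.60, 1.01)}
    counts = {name: 0 for name in edges}
    for p in probs:
        for name, (lo, hi) in edges.items():
            if lo <= p < hi:
                counts[name] += 1
                break
    total = len(probs) or 1  # avoid division by zero on an empty list
    return {name: 100.0 * n / total for name, n in counts.items()}
```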
Hurricane rapid intensification (RI) verification
Scores
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 5.1% | 98 | +0.1% |
| 10–20% | 18.4% | 72 | +3.4% |
| 20–30% | 27.3% | 55 | +2.3% |
| 30–40% | 33.8% | 40 | -1.2% |
| 40–50% | 46.2% | 26 | +1.2% |
| 50+% | 58.3% | 12 | +3.3% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive | 7 | 58.3% |
| True negative | 90 | 91.8% |
| False alarm | 12 | 31.6% |
| Miss | 5 | 8.2% |
Sharpness distribution
| Range | % of forecasts |
|---|---|
| 0–10% | 32.3% |
| 10–25% | 28.7% |
| 25–40% | 19.1% |
| 40–60% | 13.2% |
| 60+% | 6.7% |
Tornado verification
Two models are evaluated: the day-ahead formation model (hp-tornado-meso-v1.4.2, AUC 0.644) and the storm-object coherence model (hp-tornado-coherence-v1, AUC 0.894 on 2024 strict temporal holdout).
Scores (day-ahead formation: hp-tornado-meso-v1.4.2)
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 8.4% | 210 | +3.4% |
| 10–20% | 14.1% | 128 | -0.9% |
| 20–30% | 22.8% | 74 | -2.2% |
| 30–40% | 31.2% | 32 | -3.8% |
| 40+% | 44.4% | 9 | +4.4% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive | 4 | 44.4% |
| True negative | 196 | 93.3% |
| False alarm | 18 | 43.9% |
| Miss | 14 | 6.7% |
Sharpness distribution
| Range | % of forecasts |
|---|---|
| 0–10% | 46.4% |
| 10–25% | 33.6% |
| 25–40% | 13.2% |
| 40–60% | 4.9% |
| 60+% | 1.9% |
Storm-object coherence model (hp-tornado-coherence-v1)
This model scores ProbSevere storm objects using Coherence Field Theory diagnostics. It is in research status; we are still accumulating verification data.
False alarm debt
We track every false alarm, not to hide it but to learn from it. High false alarm rates erode trust, so we report ours publicly.
| Hazard | False alarms (90d) | False alarm rate | Trend |
|---|---|---|---|
| Earthquake | 14 | 46.7% | Improving (-3.2% from prior quarter) |
| Hurricane RI | 12 | 31.6% | Improving (-1.8%) |
| Tornado | 18 | 43.9% | Worsening (+2.1%) |
False alarm rate = false alarms / (false alarms + true positives) at the 40%+ decision threshold. We aim to reduce this over time through better calibration and model promotion discipline.
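As code, that definition is a one-liner; the counts below are made up purely for illustration.

```python
def false_alarm_rate(false_alarms: int, true_positives: int) -> float:
    """Share of at-or-above-threshold forecasts that did not verify."""
    return false_alarms / (false_alarms + true_positives)

# Illustrative counts only: 6 false alarms against 14 hits -> 0.30
print(false_alarm_rate(6, 14))  # 0.3
```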
How we verify
Resolution process
Every forecast has a defined valid window. After that window closes, we check whether the predicted event occurred using authoritative sources (USGS catalog, NHC best track, SPC storm reports). The outcome is recorded immutably in the verification ledger.
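As a rough picture of what a resolved entry carries, here is a sketch of a ledger record; the field names and types are illustrative, not the actual ledger schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: once resolved, an entry is never modified
class LedgerEntry:
    forecast_id: str       # identifier of the original forecast
    hazard: str            # e.g. "earthquake", "hurricane_ri", "tornado"
    predicted_prob: float  # probability issued for the event
    window_start: str      # ISO 8601 start of the valid window
    window_end: str        # ISO 8601 end of the valid window
    occurred: bool         # outcome checked after the window closed
    source: str            # e.g. "USGS catalog", "NHC best track", "SPC storm reports"
```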
Scoring
We use Brier score (mean squared probability error), log score (information-theoretic), and calibration tables (reliability). AUC measures discrimination ability. Sharpness measures how decisive our forecasts are. All metrics are computed over rolling windows and broken down by hazard type.
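A compact sketch of two of these computations, assuming resolved forecasts arrive as (predicted probability, observed outcome) pairs; the bin edges and function names are illustrative.

```python
def brier_score(pairs: list[tuple[float, bool]]) -> float:
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - float(o)) ** 2 for p, o in pairs) / len(pairs)

def calibration_table(pairs: list[tuple[float, bool]],
                      edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0)):
    """Observed event rate and forecast count per predicted-probability bin."""
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        outcomes = [o for p, o in pairs if lo <= p < hi]  # note: p = 1.0 falls outside
        if outcomes:
            rows.append((f"{lo:.0%}-{hi:.0%}", sum(outcomes) / len(outcomes), len(outcomes)))
    return rows
```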