Verification

Are our forecasts accurate?

We publish rolling verification scores so you can judge our track record. Every metric is computed from resolved outcomes - not cherry-picked examples.

System-wide summary (rolling 90 days)

0.143
Mean Brier score

Lower is better. 0 = perfect; 0.25 = no skill (what a constant 50% forecast scores).

0.772
Mean AUC

Area under the ROC curve. 1.0 = perfect discrimination; 0.5 = no better than chance.

+0.17
Mean Brier skill score

Improvement over climatological baseline.

2.8%
Mean calibration error

Average absolute difference between predicted probabilities and observed event rates across calibration bins.
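
For reference, the summary metrics above can be computed from resolved (probability, outcome) pairs along the lines below. This is a minimal sketch with illustrative names and bin edges, not our production code; AUC is sketched under "Scoring" at the end of this page.

    import numpy as np

    def brier_score(p, y):
        """Mean squared difference between probabilities p and 0/1 outcomes y."""
        p, y = np.asarray(p, float), np.asarray(y, float)
        return float(np.mean((p - y) ** 2))

    def brier_skill_score(p, y):
        """1 - BS/BS_ref, where the reference always forecasts the base rate."""
        y = np.asarray(y, float)
        reference = np.full(len(y), y.mean())  # climatological baseline
        return 1.0 - brier_score(p, y) / brier_score(reference, y)

    def calibration_error(p, y, edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.01)):
        """Mean absolute gap between mean predicted and observed rates per bin."""
        p, y = np.asarray(p, float), np.asarray(y, float)
        gaps = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (p >= lo) & (p < hi)
            if in_bin.any():
                gaps.append(abs(float(p[in_bin].mean() - y[in_bin].mean())))
        return float(np.mean(gaps))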

Earthquake verification

Scores

Brier score          0.142
Brier skill score    +0.18
AUC                  0.733
Log score            -0.384
Sharpness            0.072
Forecasts resolved   363

Calibration (reliability)

Predicted   Observed   N     Error
0–10%       7.2%       142   +2.2%
10–20%      16.8%      89    +1.8%
20–30%      23.1%      64    -1.9%
30–40%      36.4%      38    +1.4%
40–50%      42.9%      21    -2.1%
50+%        55.6%      9     +0.6%

Hit / miss tally

Outcome                                          Count   Rate
True positive (event predicted ≥50%, occurred)   5       55.6%
True negative (predicted <20%, did not occur)    131     92.3%
False alarm (predicted ≥40%, did not occur)      14      46.7%
Miss (predicted <20%, occurred)                  11      7.7%
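
A minimal sketch of the classification logic behind the tally, using the thresholds stated above (names illustrative):

    def tally(p, y):
        """Classify resolved forecasts; p: probabilities, y: True if the event occurred."""
        tp   = sum(1 for pi, yi in zip(p, y) if pi >= 0.50 and yi)      # true positive
        tn   = sum(1 for pi, yi in zip(p, y) if pi < 0.20 and not yi)   # true negative
        fa   = sum(1 for pi, yi in zip(p, y) if pi >= 0.40 and not yi)  # false alarm
        miss = sum(1 for pi, yi in zip(p, y) if pi < 0.20 and yi)       # miss
        return {"tp": tp, "tn": tn, "fa": fa, "miss": miss}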

Sharpness distribution

How spread out are our forecasts? Sharper forecasts commit to higher or lower probabilities rather than staying near the base rate. A sketch of how the distribution is binned follows the table below.

Range    % of forecasts
0–10%    39.1%
10–25%   35.0%
25–40%   16.5%
40–60%   7.0%
60+%     2.4%
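
A minimal sketch of the binning behind this distribution, with edges matching the table ranges (names illustrative):

    def sharpness_distribution(p, edges=(0.0, 0.10, 0.25, 0.40, 0.60, 1.01)):
        """Fraction of forecasts falling in each probability range."""
        return [sum(1 for pi in p if lo <= pi < hi) / len(p)
                for lo, hi in zip(edges[:-1], edges[1:])]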

Hurricane RI verification

Scores

Brier score          0.168
Brier skill score    +0.22
AUC                  0.938
Log score            -0.421
Sharpness            0.089
Forecasts resolved   303

Calibration (reliability)

Predicted   Observed   N    Error
0–10%       5.1%       98   +0.1%
10–20%      18.4%      72   +3.4%
20–30%      27.3%      55   +2.3%
30–40%      33.8%      40   -1.2%
40–50%      46.2%      26   +1.2%
50+%        58.3%      12   +3.3%

Hit / miss tally

Outcome         Count   Rate
True positive   7       58.3%
True negative   90      91.8%
False alarm     12      31.6%
Miss            5       8.2%

Sharpness distribution

Range    % of forecasts
0–10%    32.3%
10–25%   28.7%
25–40%   19.1%
40–60%   13.2%
60+%     6.7%

Tornado verification

Two models are evaluated: the day-ahead formation model (hp-tornado-meso-v1.4.2, AUC 0.644) and the storm-object coherence model (hp-tornado-coherence-v1, AUC 0.894 on 2024 strict temporal holdout).

Scores (day-ahead formation: hp-tornado-meso-v1.4.2)

Brier score          0.231
Brier skill score    +0.12
AUC                  0.644
Log score            -0.342
Sharpness            0.054
Forecasts resolved   453

Calibration (reliability)

Predicted   Observed   N     Error
0–10%       8.4%       210   +3.4%
10–20%      14.1%      128   -0.9%
20–30%      22.8%      74    -2.2%
30–40%      31.2%      32    -3.8%
40+%        44.4%      9     +4.4%

Hit / miss tally

Outcome         Count   Rate
True positive   4       44.4%
True negative   196     93.3%
False alarm     18      43.9%
Miss            14      6.7%

Sharpness distribution

Range    % of forecasts
0–10%    46.4%
10–25%   33.6%
25–40%   13.2%
40–60%   4.9%
60+%     1.9%

Storm-object coherence model (hp-tornado-coherence-v1)

This model scores ProbSevere storm objects using Coherence Field Theory diagnostics. Research status: still accumulating verification data.

AUC        0.894 (2024 strict temporal holdout)
Features   41
Status     Research / Accumulating

False alarm debt

We track every false alarm, not to hide it but to learn from it. High false alarm rates erode trust, so we report ours publicly.

Hazard         False alarms (90d)   False alarm rate   Trend
Earthquake     14                   46.7%              Improving (-3.2% from prior quarter)
Hurricane RI   12                   31.6%              Improving (-1.8%)
Tornado        18                   43.9%              Worsening (+2.1%)

False alarm rate = false alarms / (false alarms + true positives) at the 40%+ decision threshold. We aim to reduce this over time through better calibration and model promotion discipline.
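
Expressed as code, that definition reads roughly as follows (illustrative sketch):

    def false_alarm_rate(p, y, threshold=0.40):
        """False alarms / (false alarms + true positives) at the decision threshold."""
        fa = sum(1 for pi, yi in zip(p, y) if pi >= threshold and not yi)
        tp = sum(1 for pi, yi in zip(p, y) if pi >= threshold and yi)
        return fa / (fa + tp) if fa + tp else 0.0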

How we verify

Resolution process

Every forecast has a defined valid window. After that window closes, we check whether the predicted event occurred using authoritative sources (USGS catalog, NHC best track, SPC storm reports). The outcome is recorded immutably in the verification ledger.
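
As an illustration of what a resolved entry carries, a hypothetical record shape is sketched below; the field names are ours for this example only and are not the actual ledger schema.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)  # frozen=True: an entry is never mutated once written
    class ResolvedForecast:
        forecast_id: str
        hazard: str            # e.g. "earthquake", "hurricane_ri", "tornado"
        probability: float     # probability issued for the valid window
        window_start: datetime
        window_end: datetime
        occurred: bool         # resolved against the authoritative source
        source: str            # e.g. "USGS catalog", "NHC best track", "SPC storm reports"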

Scoring

We use Brier score (mean squared probability error), log score (information-theoretic), and calibration tables (reliability). AUC measures discrimination ability. Sharpness measures how decisive our forecasts are. All metrics are computed over rolling windows and broken down by hazard type.
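
Minimal sketches of the two remaining scores (names illustrative): the log score as the mean log-likelihood of the observed outcomes, so values are negative and closer to zero is better, and a rank-based AUC, read as the probability that a randomly chosen event case receives a higher forecast than a randomly chosen non-event case.

    import math

    def log_score(p, y, eps=1e-9):
        """Mean log-likelihood of the observed outcomes; closer to 0 is better."""
        total = 0.0
        for pi, yi in zip(p, y):
            pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
            total += math.log(pi if yi else 1.0 - pi)
        return total / len(p)

    def auc(p, y):
        """Rank-based AUC: P(forecast for a random positive > forecast for a random negative)."""
        pos = [pi for pi, yi in zip(p, y) if yi]
        neg = [pi for pi, yi in zip(p, y) if not yi]
        wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
                   for pp in pos for pn in neg)
        return wins / (len(pos) * len(neg))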