Are our forecasts accurate?
We publish rolling verification scores so you can judge our track record. Every metric is computed from resolved outcomes - not cherry-picked examples.
System-wide summary (rolling 90 days)
| Metric | What it measures |
|---|---|
| Brier score | Lower is better. 0 = perfect, 0.25 = no skill. |
| AUC | Area under the ROC curve. 1.0 = perfect discrimination. |
| Skill score | Improvement over the climatological baseline. |
| Calibration error | Average absolute difference between predicted and observed rates. |
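For the Brier score, one way to read the 0.25 no-skill mark is that a forecast which always says 50% scores exactly 0.25 no matter what happens:

$$
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2,\quad o_i \in \{0,1\};\qquad
p_i \equiv 0.5 \;\Rightarrow\; (0.5 - o_i)^2 = 0.25 \;\Rightarrow\; \mathrm{BS} = 0.25 .
$$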
Earthquake verification
Scores
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 7.2% | 142 | +2.2% |
| 10–20% | 16.8% | 89 | +1.8% |
| 20–30% | 23.1% | 64 | -1.9% |
| 30–40% | 36.4% | 38 | +1.4% |
| 40–50% | 42.9% | 21 | -2.1% |
| 50+% | 55.6% | 9 | +0.6% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive (event predicted ≥50%, occurred) | 5 | 55.6% |
| True negative (predicted <20%, did not occur) | 131 | 92.3% |
| False alarm (predicted ≥40%, did not occur) | 14 | 46.7% |
| Miss (predicted <20%, occurred) | 11 | 7.7% |
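To illustrate how a tally like this is produced, the sketch below applies the thresholds named in the table to a list of resolved forecasts. The `ResolvedForecast` structure and the function name are ours for illustration, not the production pipeline.

```python
from dataclasses import dataclass

@dataclass
class ResolvedForecast:
    prob: float      # forecast probability issued for the event
    occurred: bool   # did the event occur within the valid window?

def tally(forecasts: list[ResolvedForecast]) -> dict[str, int]:
    """Count outcomes using the thresholds named in the table above."""
    counts = {"true_positive": 0, "true_negative": 0, "false_alarm": 0, "miss": 0}
    for f in forecasts:
        if f.prob >= 0.50 and f.occurred:
            counts["true_positive"] += 1   # predicted >=50%, occurred
        if f.prob >= 0.40 and not f.occurred:
            counts["false_alarm"] += 1     # predicted >=40%, did not occur
        if f.prob < 0.20 and not f.occurred:
            counts["true_negative"] += 1   # predicted <20%, did not occur
        if f.prob < 0.20 and f.occurred:
            counts["miss"] += 1            # predicted <20%, occurred anyway
    return counts
```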
Sharpness distribution
How spread out are our forecasts? Sharper forecasts commit to higher or lower probabilities rather than staying near the base rate.
| Range | % of forecasts |
|---|---|
| 0–10% | 39.1% |
| 10–25% | 35.0% |
| 25–40% | 16.5% |
| 40–60% | 7.0% |
| 60+% | 2.4% |
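A minimal sketch of how a sharpness distribution like the one above can be computed: bin each issued probability into the listed ranges and report the share of forecasts per bin. Bin edges follow the table; the helper name is ours.

```python
def sharpness_distribution(probs: list[float]) -> dict[str, float]:
    """Percent of forecasts whose issued probability falls in each range."""
    edges = {"0-10%": (0.00, 0.10), "10-25%": (0.10, 0.25), "25-40%": (0.25, 0.40),
             "40-60%": (0.40, 0.60), "60+%": (0.60, 1.01)}
    counts = {name: 0 for name in edges}
    for p in probs:
        for name, (lo, hi) in edges.items():
            if lo <= p < hi:
                counts[name] += 1
                break
    total = len(probs) or 1  # avoid division by zero on an empty list
    return {name: 100.0 * n / total for name, n in counts.items()}
```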
Hurricane rapid intensification (RI) verification
Scores
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 5.1% | 98 | +0.1% |
| 10–20% | 18.4% | 72 | +3.4% |
| 20–30% | 27.3% | 55 | +2.3% |
| 30–40% | 33.8% | 40 | -1.2% |
| 40–50% | 46.2% | 26 | +1.2% |
| 50+% | 58.3% | 12 | +3.3% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive | 7 | 58.3% |
| True negative | 90 | 91.8% |
| False alarm | 12 | 31.6% |
| Miss | 5 | 8.2% |
Sharpness distribution
| Range | % of forecasts |
|---|---|
| 0–10% | 32.3% |
| 10–25% | 28.7% |
| 25–40% | 19.1% |
| 40–60% | 13.2% |
| 60+% | 6.7% |
Tornado verification
Two models are evaluated: the day-ahead formation model (hp-tornado-meso-v1.4.2, AUC 0.644) and the storm-object coherence model (hp-tornado-coherence-v1, AUC 0.894 on 2024 strict temporal holdout).
Scores (day-ahead formation: hp-tornado-meso-v1.4.2)
Calibration (reliability)
| Predicted probability | Observed rate | N | Error |
|---|---|---|---|
| 0–10% | 8.4% | 210 | +3.4% |
| 10–20% | 14.1% | 128 | -0.9% |
| 20–30% | 22.8% | 74 | -2.2% |
| 30–40% | 31.2% | 32 | -3.8% |
| 40+% | 44.4% | 9 | +4.4% |
Hit / miss tally
| Outcome | Count | Rate |
|---|---|---|
| True positive | 4 | 44.4% |
| True negative | 196 | 93.3% |
| False alarm | 18 | 43.9% |
| Miss | 14 | 6.7% |
Sharpness distribution
| Range | % of forecasts |
|---|---|
| 0–10% | 46.4% |
| 10–25% | 33.6% |
| 25–40% | 13.2% |
| 40–60% | 4.9% |
| 60+% | 1.9% |
Storm-object coherence model (hp-tornado-coherence-v1)
This model scores ProbSevere storm objects using Coherence Field Theory diagnostics. It is in research status; we are still accumulating verification data.
False alarm debt
We track every false alarm, not to hide it but to learn from it. High false alarm rates erode trust, so we report ours publicly.
| Hazard | False alarms (90d) | False alarm rate | Trend |
|---|---|---|---|
| Earthquake | 14 | 46.7% | Improving (-3.2% from prior quarter) |
| Hurricane RI | 12 | 31.6% | Improving (-1.8%) |
| Tornado | 18 | 43.9% | Worsening (+2.1%) |
False alarm rate = false alarms / (false alarms + true positives) at the 40%+ decision threshold. We aim to reduce this over time through better calibration and model promotion discipline.
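As code, that definition is a one-liner; the counts below are made up purely for illustration.

```python
def false_alarm_rate(false_alarms: int, true_positives: int) -> float:
    """Share of at-or-above-threshold forecasts that did not verify."""
    return false_alarms / (false_alarms + true_positives)

# Illustrative counts only: 6 false alarms against 14 hits -> 0.30
print(false_alarm_rate(6, 14))  # 0.3
```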
How we verify
Resolution process
Every forecast has a defined valid window. After that window closes, we check whether the predicted event occurred using authoritative sources (USGS catalog, NHC best track, SPC storm reports). The outcome is recorded immutably in the verification ledger.
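As a rough picture of what a resolved entry carries, here is a sketch of a ledger record; the field names and types are illustrative, not the actual ledger schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: once resolved, an entry is never modified
class LedgerEntry:
    forecast_id: str       # identifier of the original forecast
    hazard: str            # e.g. "earthquake", "hurricane_ri", "tornado"
    predicted_prob: float  # probability issued for the event
    window_start: str      # ISO 8601 start of the valid window
    window_end: str        # ISO 8601 end of the valid window
    occurred: bool         # outcome checked after the window closed
    source: str            # e.g. "USGS catalog", "NHC best track", "SPC storm reports"
```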
Scoring
We use Brier score (mean squared probability error), log score (information-theoretic), and calibration tables (reliability). AUC measures discrimination ability. Sharpness measures how decisive our forecasts are. All metrics are computed over rolling windows and broken down by hazard type.
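A compact sketch of two of these computations, assuming resolved forecasts arrive as (predicted probability, observed outcome) pairs; the bin edges and function names are illustrative.

```python
def brier_score(pairs: list[tuple[float, bool]]) -> float:
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - float(o)) ** 2 for p, o in pairs) / len(pairs)

def calibration_table(pairs: list[tuple[float, bool]],
                      edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0)):
    """Observed event rate and forecast count per predicted-probability bin."""
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        outcomes = [o for p, o in pairs if lo <= p < hi]  # note: p = 1.0 falls outside
        if outcomes:
            rows.append((f"{lo:.0%}-{hi:.0%}", sum(outcomes) / len(outcomes), len(outcomes)))
    return rows
```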