The calibration log shows what happened. This page shows the work. Every number from the checkpoint gets reproduced from the raw outcomes, then the v0.5 probability table gets fitted in the open. If you want to check our math, this is where you do it.
A public model should survive an audit. Here is the June 18 checkpoint, recomputed from the resolved outcomes.
| Check | Published | Recomputed | Verdict |
|---|---|---|---|
| Players scored | 28 | 28 | Match |
| Reached MLB in 90 days | 14 | 14 | Match |
| Raw Brier score | 0.2692 | 0.2679 | Match, within band rounding |
| Ordering | Every band above the one below it | Confirmed, strictly increasing | Match |
| Top band call-up rate | 100% | 6 of 6 | Match |
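
The count and ordering checks can be sketched in a few lines of Python. This version works from the band-level tallies published in the fitting table on this page, not from the per-player outcomes, so it can confirm the totals and the ordering but not the exact raw Brier figure. The `BANDS` name and tuple layout are illustrative assumptions, not the model's own schema.

```python
# Band-level tallies from this page: (March average, called up, players).
# Names and layout are illustrative assumptions.
BANDS = [
    (0.07, 0, 4),
    (0.14, 2, 8),
    (0.22, 6, 10),
    (0.41, 6, 6),
]

players_scored = sum(players for _, _, players in BANDS)
reached_mlb = sum(called_up for _, called_up, _ in BANDS)
observed_rates = [called_up / players for _, called_up, players in BANDS]

# "Every band above the one below it" means strictly increasing observed rates.
strictly_increasing = all(
    lower < upper for lower, upper in zip(observed_rates, observed_rates[1:])
)

print(players_scored, reached_mlb, strictly_increasing)
```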
The June 18 result said the model ranks players correctly but prints numbers that run too low. The fix is a remap: keep every ranking, redraw the probabilities. Three steps, all shown.
Step one, tame the small samples. The under-10 band went 0 for 4. The 30-plus band went 6 for 6. Neither is the truth. Four players cannot prove a zero and six cannot prove a lock. Before fitting, each band gets two phantom players added at its own original prediction, which pulls the observed rate toward that number. The 6-for-6 band becomes 85 percent, not 100. The 0-for-4 band becomes 2 percent, not zero.
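The shrinkage step is small enough to check by hand. A minimal sketch, assuming the two phantom players are added as pseudo-observations at the band's March average (the function and variable names are illustrative):

```python
def shrink(called_up, players, prior, phantom=2):
    """Add `phantom` pseudo-players who were called up at the rate `prior`."""
    return (called_up + phantom * prior) / (players + phantom)

# (March average, called up, players) for each band, from the table on this page.
bands = [(0.07, 0, 4), (0.14, 2, 8), (0.22, 6, 10), (0.41, 6, 6)]
shrunk = [shrink(k, n, p) for p, k, n in bands]

for (p, k, n), s in zip(bands, shrunk):
    print(f"{p:.0%} band: {k} of {n} -> {round(100 * s)}%")
```

Rounded to whole percentages, this gives 2, 23, 54 and 85, the same values as the "After shrinkage" column.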
Step two, fit the line where probabilities behave. Probabilities get squeezed at both ends of the scale, so the fit happens in log-odds space, weighted by band size. The result is one straight line: new log-odds equal 2.90 plus 2.38 times the old log-odds. Two numbers describe the entire correction.
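For readers who want to rerun the fit, here is a hedged sketch of a weighted least-squares line in log-odds space. It assumes the weights are the raw band sizes and that the inputs are the rounded band averages shown on this page. With only those rounded inputs, it lands close to the published 2.90 and 2.38 but not exactly on them. All names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x."""
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

march = [0.07, 0.14, 0.22, 0.41]                      # band averages
shrunk = [0.14 / 6, 2.28 / 10, 6.44 / 12, 6.82 / 8]   # step-one rates, unrounded
sizes = [4, 8, 10, 6]                                 # assumed weights: band sizes

intercept, slope = weighted_line_fit(
    [logit(p) for p in march], [logit(p) for p in shrunk], sizes
)
print(f"new log-odds = {intercept:.2f} + {slope:.2f} * old log-odds")
```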
Step three, clamp and label. Nothing prints below 2 percent or above 95 percent. And no March prediction ran higher than about 41 percent, so anything the line produces for an input above that range is an extrapolation. Those get a low-confidence label until real outcomes exist up there.
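Put together, the published line plus the clamp and the label might look like the sketch below. The coefficients and limits come from this page. Two details are assumptions: the extrapolation test is applied to the March input, and `"normal"` is a made-up label for anything that is not low confidence.

```python
import math

INTERCEPT, SLOPE = 2.90, 2.38   # published correction line
FLOOR, CEILING = 0.02, 0.95     # published clamp
FITTED_MAX = 0.41               # approximate top of the March prediction range

def remap(march_p):
    """Return (corrected probability, confidence label) for a March number."""
    old_log_odds = math.log(march_p / (1.0 - march_p))
    new_log_odds = INTERCEPT + SLOPE * old_log_odds
    corrected = 1.0 / (1.0 + math.exp(-new_log_odds))
    corrected = min(CEILING, max(FLOOR, corrected))
    # Assumption: an input above the fitted range counts as extrapolation.
    label = "low" if march_p > FITTED_MAX else "normal"
    return corrected, label

for p in (0.07, 0.14, 0.22, 0.41, 0.60):
    corrected, label = remap(p)
    print(f"{p:.0%} -> {corrected:.1%} ({label})")
```

Fed the rounded band averages, this sketch lands within about two points of the published corrected odds. The gap presumably comes from the rounded inputs and rounded coefficients it has to work with.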
| Band, March avg | Actually called up | After shrinkage | Corrected odds |
|---|---|---|---|
| 7% | 0 of 4 | 2% | 3% |
| 14% | 2 of 8 | 23% | 20% |
| 22% | 6 of 10 | 54% | 49% |
| 41% | 6 of 6 | 85% | 89% |
Scoring the corrected odds against the same 28 outcomes gives a Brier of 0.148, down from 0.269. That is a 45 percent cut with zero ranking changes. You may notice a rawer version of this exercise scores 0.142. That version treats 6 for 6 as literally 100 percent. We do not publish it, because it flatters the model with sample luck. The 0.148 is the number built to hold up on the next cohort.
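The before-and-after Brier scores can be approximated from the band table alone. The sketch below assumes every player in a band carries that band's printed number, which is a simplification of the per-player scoring. Under that assumption it gives roughly 0.270 before and 0.147 after, close to the published 0.269 and 0.148.

```python
def brier_from_bands(bands):
    """Mean squared error when every player in a band shares one forecast."""
    total_error = 0.0
    total_players = 0
    for forecast, called_up, players in bands:
        total_error += called_up * (1.0 - forecast) ** 2        # players who reached MLB
        total_error += (players - called_up) * forecast ** 2    # players who did not
        total_players += players
    return total_error / total_players

# (forecast, called up, players), taken from the table on this page.
march = [(0.07, 0, 4), (0.14, 2, 8), (0.22, 6, 10), (0.41, 6, 6)]
corrected = [(0.03, 0, 4), (0.20, 2, 8), (0.49, 6, 10), (0.89, 6, 6)]

before = brier_from_bands(march)
after = brier_from_bands(corrected)
print(f"{before:.3f} -> {after:.3f}, cut of {1 - after / before:.0%}")
```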
The humility band. Aidan Miller taught this one. A player flagged as recovering at run date now gets 30 percent of his number pulled toward a 35 percent prior, and his confidence is marked Low. The model cannot read an injury report from the inside. It should not pretend to.
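As a rough sketch of the humility band, assuming the pull is a straight blend in probability space (the page does not say whether it happens there or in log-odds), with illustrative names:

```python
RECOVERY_PRIOR = 0.35   # published prior
PULL = 0.30             # published share of the number that moves

def humility_band(probability, recovering):
    """Blend a recovering player's number toward the prior and mark it Low."""
    if not recovering:
        return probability, None   # confidence is left to the other rules
    blended = (1.0 - PULL) * probability + PULL * RECOVERY_PRIOR
    return blended, "Low"

print(humility_band(0.20, recovering=True))
print(humility_band(0.20, recovering=False))
```

Under that assumption, a recovering player at 20 percent moves to 24.5 percent, and a player at 60 percent drops to 52.5 percent. The pull works in both directions.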
The extrapolation cap. The new table produces bold numbers at the top, and none of them have been tested against outcomes yet. They stay capped at 95 percent and labeled low confidence until a cohort resolves up there.
The locked ledger. Every run is appended to a chained log where each entry carries a fingerprint of the one before it. Edit any closed entry and the chain snaps in public. Rule one of this model has always been that runs are never rewritten. Now it is enforced by math, not manners.
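A chained log of this kind can be sketched with a standard hash function. The example below uses SHA-256 over a JSON encoding. Both choices are assumptions made for illustration, as are the field names and the sample payloads, and none of it is the ledger's actual format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder fingerprint that precedes the first entry

def fingerprint(payload, prev_hash):
    """Hash an entry's payload together with the previous fingerprint."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_run(ledger, payload):
    """Append a run, chaining it to the last entry's fingerprint."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev": prev_hash, "hash": fingerprint(payload, prev_hash)}
    )

def verify(ledger):
    """Recompute every fingerprint; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = fingerprint(entry["payload"], prev_hash)
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
append_run(ledger, {"run": "example-run-1", "players_scored": 28})  # hypothetical payload
append_run(ledger, {"run": "example-run-2", "brier": 0.148})        # hypothetical payload
print(verify(ledger))

ledger[0]["payload"]["players_scored"] = 29  # tamper with a closed entry
print(verify(ledger))
```

One design note: a chain like this only exposes edits if the latest fingerprint is also published somewhere the editor cannot quietly rewrite. Otherwise the whole chain could be recomputed after a change.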
The correction slope is steep, 2.38, and it was fitted on one unusually aggressive promotion year. If 2027 promotes more slowly, that slope comes down. It needs one more resolved cohort before we trust it fully. The Kevin McGonigle miss remains a hypothesis, not a fix: elite spring plate discipline for contact hitters may deserve more weight, and Sub-Batch B is scoring it both ways in the background. And v1.0, the genuinely trained model, still waits on the same gate it always has: 50 resolved outcomes. We have 28. Sub-Batch B brings 14 more on September 9, which makes 42, still eight short of the gate.