Calibration Check™ · Public Methodology
Calibration Methodology: How CEOS measures its own predictive accuracy
CEOS produces two scores for every project: a pre-build prediction from the exploration loop, and an independent post-build evaluation after the prototype (or Quick-mode spec) is generated. The gap between them (Δ) is the signal we audit ourselves against. Below is the full dataset, open methodology, anonymized.
Proof of method: synthetic seed batch
57 runs: plausible startup ideas spread across 10 industry verticals. The delta distribution below shows how the calibration system behaves on this 57-project sample. How the batch is run is defined under Methodology below.
- Mean Δ: -2.17
- Median |Δ|: 2.00
- Std dev: 1.03
- Min / Max: -4.8 / +0.8
- % with |Δ| ≥ 1.5: 79%
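As a rough illustration of how the summary figures above could be derived from a list of signed deltas, here is a minimal sketch. The function name `summarize`, the use of the sample standard deviation, and taking the standard deviation over signed Δ (rather than |Δ|) are assumptions; the page does not state which variants CEOS uses. The input values are made up for the example and are not rows from the dataset.

```python
import statistics

def summarize(deltas, threshold=1.5):
    """Summary statistics for a list of signed calibration deltas."""
    abs_deltas = [abs(d) for d in deltas]
    share_large = sum(a >= threshold for a in abs_deltas) / len(deltas)
    return {
        "mean_delta": statistics.mean(deltas),
        "median_abs_delta": statistics.median(abs_deltas),
        # Assumption: sample standard deviation of the signed deltas.
        "std_dev": statistics.stdev(deltas),
        "min": min(deltas),
        "max": max(deltas),
        "pct_abs_ge_threshold": 100 * share_large,
    }

# Illustrative values only, not taken from the published dataset.
example_deltas = [-4.0, -2.0, -1.0, 0.5, -2.5]
stats = summarize(example_deltas)
print(stats)
```

Keeping the signed mean and the absolute median side by side separates two questions: whether the pre-build score is biased in one direction, and how large a typical revision is.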
Delta distribution: synthetic seed batch
Each bar shows how many projects fell into that absolute-delta range. Bars toward the right are significant revisions: they indicate the post-build evaluation is doing real work, not rubber-stamping.
Vertical breakdown
Different verticals calibrate differently. Verticals with <10 projects are noisy; treat their stats as preliminary.
| Vertical | Projects | Mean Δ | Median Δ |
|---|---|---|---|
| B2B SaaS | 15 | -2.11 | -2.20 |
| Marketplace | 12 | -2.69 | -2.80 |
| Climate / Energy | 10 | -2.04 | -2.20 |
| AI Tool (noisy) | 8 | -1.46 | -1.30 |
| FinTech (noisy) | 8 | -2.43 | -2.00 |
| DeepTech (noisy) | 6 | -1.43 | -1.00 |
| HealthTech (noisy) | 6 | -2.33 | -2.20 |
| EdTech (noisy) | 4 | -1.80 | -1.80 |
| Consumer / B2C (noisy) | 4 | -2.65 | -2.80 |
| Hardware (noisy) | 3 | -2.87 | -2.80 |
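The per-vertical breakdown and the "noisy" label for verticals with fewer than 10 projects can be sketched as a simple group-by. This is an illustration under assumptions: the function name `vertical_breakdown`, the `(vertical, delta)` row shape, and taking the median over signed deltas (the table values are negative) are not confirmed by the page.

```python
from collections import defaultdict
from statistics import mean, median

NOISY_BELOW = 10  # verticals with fewer projects are treated as preliminary

def vertical_breakdown(rows):
    """rows: iterable of (vertical, delta) pairs -> per-vertical stats."""
    groups = defaultdict(list)
    for vertical, delta in rows:
        groups[vertical].append(delta)

    table = []
    for vertical, deltas in groups.items():
        table.append({
            "vertical": vertical,
            "projects": len(deltas),
            "mean_delta": mean(deltas),
            "median_delta": median(deltas),  # assumption: signed median
            "noisy": len(deltas) < NOISY_BELOW,
        })
    # Largest verticals first, matching the ordering of the table.
    return sorted(table, key=lambda r: r["projects"], reverse=True)

# Illustrative rows only, not taken from the published dataset.
rows = [("B2B SaaS", -2.0)] * 10 + [("Hardware", -3.0), ("Hardware", -1.0)]
breakdown = vertical_breakdown(rows)
for r in breakdown:
    print(r)
```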
Top 10 largest |Δ|
Anonymized: only the vertical, delta, and flags are shown.
| # | Vertical | Δ | \|Δ\| | Flags |
|---|---|---|---|---|
| 1 | Marketplace | -4.8 | 4.8 | synthetic, simulated |
| 2 | Hardware | -4.0 | 4.0 | simulated |
| 3 | FinTech | -4.0 | 4.0 | synthetic, simulated |
| 4 | B2B SaaS | -3.8 | 3.8 | |
| 5 | Climate / Energy | -3.8 | 3.8 | synthetic, simulated |
| 6 | HealthTech | -3.8 | 3.8 | synthetic, simulated |
| 7 | Marketplace | -3.8 | 3.8 | synthetic, simulated |
| 8 | FinTech | -3.8 | 3.8 | synthetic, simulated |
| 9 | B2B SaaS | -3.6 | 3.6 | synthetic, simulated |
| 10 | Marketplace | -3.5 | 3.5 | simulated |
Organic dataset: growing
n=19 organic projects collected. Public Calibration Check™ unlocks at n=10, the first point where per-vertical signal stabilises.
Live organic stats: mean |Δ| = 2.13. 19/50 organic projects collected. STOP-condition not yet evaluable.
- Mean Δ: -2.13
- Median |Δ|: 2.00
- Std dev: 1.10
- Min / Max: -4.0 / -0.2
- % with |Δ| ≥ 1.5: 74%
STOP-condition: if mean |Δ| on organic data stays below 0.5 after 50+ projects, the calibration step isn't adding signal and we revisit the methodology.
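The STOP-condition can be read as a three-way check. The sketch below assumes a hypothetical function `stop_condition` and hypothetical status strings; only the 50-project minimum and the 0.5 threshold come from the text above.

```python
from statistics import mean

MIN_ORGANIC = 50      # organic projects required before the check applies
STOP_THRESHOLD = 0.5  # mean |Δ| below this suggests no added signal

def stop_condition(organic_deltas):
    """Return a status string for a list of organic (non-synthetic) deltas."""
    if len(organic_deltas) < MIN_ORGANIC:
        return "not yet evaluable"
    mean_abs = mean([abs(d) for d in organic_deltas])
    if mean_abs < STOP_THRESHOLD:
        return "stop: revisit methodology"
    return "continue"

# Illustrative inputs only: 19 projects is below the 50-project minimum.
print(stop_condition([-2.0] * 19))
```

With 19 organic projects, the check returns "not yet evaluable" regardless of the deltas, which matches the live status reported above.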
Methodology
Pre-build prediction: the best score across the feasibility loop, generated by the Strategist agent before any prototype work begins.
Post-build evaluation: an independent feasibility agent re-scores the project across 5 criteria after the prototype is built (or, in Quick Reality Check, after the spec is generated without a prototype).
Δ (calibration delta): final_score − preliminary_score, where final_score is the post-build evaluation and preliminary_score is the pre-build prediction. Negative = pre-build was too optimistic. Positive = pre-build underestimated.
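A minimal sketch of the delta definition and its sign convention follows. The function names and the example scores are hypothetical, and the scoring scale is not stated on this page.

```python
def calibration_delta(preliminary_score: float, final_score: float) -> float:
    """Δ = final_score − preliminary_score."""
    return final_score - preliminary_score

def interpret(delta: float) -> str:
    """Map the sign of Δ to the reading given in the methodology."""
    if delta < 0:
        return "pre-build was too optimistic"
    if delta > 0:
        return "pre-build underestimated"
    return "scores agree"

# Hypothetical scores for illustration.
delta = calibration_delta(preliminary_score=8.0, final_score=5.5)
print(delta, interpret(delta))  # -2.5 pre-build was too optimistic
```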
Synthetic vs organic: seed projects we ran ourselves to bootstrap the system are flagged synthetic_batch=true. The seed batch comprises 57 plausible startup ideas distributed across 10 industry verticals (B2B SaaS, AI tool, Fintech, Health tech, Climate, Marketplace, Consumer, Deep tech, plus Ed tech and Hardware). Each ran through the full pipeline, so the pre-build and post-build scores reflect the same methodology as a live submission. The STOP-condition (mean |Δ| < 0.5) applies only to organic data (real user submissions), so the batch can't accidentally certify the model.
Simulated feasibility: when a project is run in Quick Reality Check (no prototype), the post-build score is generated by an LLM judge on the spec alone, not on a real prototype. These rows are flagged separately.
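The two flags described above can be pictured as independent booleans on each row. In this sketch, `synthetic_batch` is the field name given in the text; the `Row` class, the `simulated` field name, and the helper `organic_only` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Row:
    vertical: str
    delta: float
    synthetic_batch: bool = False  # seed project run to bootstrap the system
    simulated: bool = False        # Quick Reality Check: spec judged, no prototype

    def flags(self) -> str:
        """Render the flags the way the Top 10 table displays them."""
        names = []
        if self.synthetic_batch:
            names.append("synthetic")
        if self.simulated:
            names.append("simulated")
        return ", ".join(names)

def organic_only(rows):
    """Rows eligible for the STOP-condition: real user submissions only."""
    return [r for r in rows if not r.synthetic_batch]

# The first two rows of the Top 10 table, expressed as Row objects.
top = [
    Row("Marketplace", -4.8, synthetic_batch=True, simulated=True),
    Row("Hardware", -4.0, simulated=True),
]
print([r.flags() for r in top])
print([r.vertical for r in organic_only(top)])
```

Because the flags are independent, an organic project can still be simulated, as in the Hardware row.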
Updated 08/06/2026, 19:49:11 · 76 total rows · open methodology