
Calibration Check™ · Public Methodology

Calibration Methodology: How CEOS measures its own predictive accuracy

CEOS produces two scores for every project: a pre-build prediction from the exploration loop, and an independent post-build evaluation made after the prototype (or Quick-mode spec) is generated. The gap between them (Δ) is the signal we audit ourselves against. The full anonymized dataset and the open methodology follow.

Proof of method: synthetic seed batch

57 runs

The seed batch contains 57 plausible startup ideas across 10 industry verticals. The delta distribution below shows how the calibration system behaves on those 57 runs. How the batch is run is defined in Methodology below.

Mean Δ: -2.17
Median |Δ|: 2.00
Std dev: 1.03
Min / Max: -4.8 / +0.8
% with |Δ| ≥ 1.5: 79%

Delta distribution: synthetic seed batch

Each row shows how many projects fell into that absolute-delta range, with organic and synthetic counts listed separately. Large revisions (the lower rows) indicate that the post-build evaluation is doing real work, not rubber-stamping.

| Absolute Δ range | Organic | Synthetic |
|---|---|---|
| [0, 0.5) | 2 | 2 |
| [0.5, 1.0) | 2 | 5 |
| [1.0, 1.5) | 1 | 5 |
| [1.5, 2.0) | 2 | 11 |
| [2.0, 3.0) | 6 | 24 |
| ≥3.0 | 6 | 10 |
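As a quick consistency check, the published bin counts can be summed to recover the sample sizes and the share of projects with |Δ| ≥ 1.5. The sketch below only re-uses the counts shown above; the names `BINS`, `SIGNIFICANT`, and `share_significant` are illustrative and not part of the CEOS pipeline.

```python
# Bin counts as published above: (range label, organic count, synthetic count).
BINS = [
    ("[0, 0.5)", 2, 2),
    ("[0.5, 1.0)", 2, 5),
    ("[1.0, 1.5)", 1, 5),
    ("[1.5, 2.0)", 2, 11),
    ("[2.0, 3.0)", 6, 24),
    (">=3.0", 6, 10),
]

# Bins whose lower edge is at or above the 1.5 threshold.
SIGNIFICANT = {"[1.5, 2.0)", "[2.0, 3.0)", ">=3.0"}


def share_significant(column: int) -> tuple[int, int, float]:
    """Return (hits, total, percent) for one count column (1 = organic, 2 = synthetic)."""
    total = sum(row[column] for row in BINS)
    hits = sum(row[column] for row in BINS if row[0] in SIGNIFICANT)
    return hits, total, 100 * hits / total


organic = share_significant(1)
synthetic = share_significant(2)
print(f"organic:   {organic[0]}/{organic[1]} = {organic[2]:.0f}%")
print(f"synthetic: {synthetic[0]}/{synthetic[1]} = {synthetic[2]:.0f}%")
```

The totals come out to 19 organic and 57 synthetic projects, and the rounded shares agree with the 74% and 79% figures reported in the summary statistics.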

Vertical breakdown

Different verticals calibrate differently. Verticals with <10 projects are noisy; treat their stats as preliminary.

Calibration delta by industry vertical: project count, mean delta, and median absolute delta per vertical.
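| Vertical | Projects | Mean Δ | Median abs. Δ |
|---|---|---|---|
| B2B SaaS | 15 | -2.11 | -2.20 |
| Marketplace | 12 | -2.69 | -2.80 |
| Climate / Energy | 10 | -2.04 | -2.20 |
| AI Tool (noisy) | 8 | -1.46 | -1.30 |
| FinTech (noisy) | 8 | -2.43 | -2.00 |
| DeepTech (noisy) | 6 | -1.43 | -1.00 |
| HealthTech (noisy) | 6 | -2.33 | -2.20 |
| EdTech (noisy) | 4 | -1.80 | -1.80 |
| Consumer / B2C (noisy) | 4 | -2.65 | -2.80 |
| Hardware (noisy) | 3 | -2.87 | -2.80 |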
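A per-vertical breakdown of this kind can be sketched as a simple group-by. The example below is a minimal illustration, assuming rows arrive as `(vertical, delta)` pairs; the function name, the output keys, and the sample deltas are hypothetical, and only the "<10 projects is noisy" rule is taken from the text.

```python
from collections import defaultdict
from statistics import mean, median

NOISY_BELOW = 10  # from the text: verticals with <10 projects are treated as noisy


def vertical_breakdown(rows):
    """Group (vertical, delta) pairs and compute per-vertical stats."""
    by_vertical = defaultdict(list)
    for vertical, delta in rows:
        by_vertical[vertical].append(delta)

    out = []
    for vertical, deltas in by_vertical.items():
        out.append({
            "vertical": vertical,
            "projects": len(deltas),
            "mean_delta": round(mean(deltas), 2),
            "median_abs_delta": round(median([abs(d) for d in deltas]), 2),
            "noisy": len(deltas) < NOISY_BELOW,
        })
    # Largest verticals first, matching the ordering of the table above.
    return sorted(out, key=lambda r: r["projects"], reverse=True)


# Made-up deltas purely for illustration, not the real dataset.
sample = [("Hardware", -4.0), ("Hardware", -2.0), ("Hardware", -3.0)]
sample += [("B2B SaaS", -2.0)] * 10
for row in vertical_breakdown(sample):
    print(row)
```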

Top 10 largest |Δ|

Anonymized: only the vertical, delta, and flags are shown.

Top 10 projects by largest absolute calibration delta, anonymized to vertical, delta, and flags.
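| # | Vertical | Δ | Abs. Δ | Flags |
|---|---|---|---|---|
| 1 | Marketplace | -4.8 | 4.8 | synthetic, simulated |
| 2 | Hardware | -4.0 | 4.0 | simulated |
| 3 | FinTech | -4.0 | 4.0 | synthetic, simulated |
| 4 | B2B SaaS | -3.8 | 3.8 | none |
| 5 | Climate / Energy | -3.8 | 3.8 | synthetic, simulated |
| 6 | HealthTech | -3.8 | 3.8 | synthetic, simulated |
| 7 | Marketplace | -3.8 | 3.8 | synthetic, simulated |
| 8 | FinTech | -3.8 | 3.8 | synthetic, simulated |
| 9 | B2B SaaS | -3.6 | 3.6 | synthetic, simulated |
| 10 | Marketplace | -3.5 | 3.5 | simulated |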

Organic dataset: growing

n=19

19 organic projects collected. Public Calibration Check™ unlocks at n=10, the first point where per-vertical signal stabilises.

Live organic stats: mean |Δ| = 2.13, based on 19 of the 50 organic projects required. The STOP-condition cannot be evaluated yet.

Mean Δ: -2.13
Median |Δ|: 2.00
Std dev: 1.10
Min / Max: -4.0 / -0.2
% with |Δ| ≥ 1.5: 74%

STOP-condition: if mean |Δ| on organic data stays below 0.5 after 50+ projects, the calibration step isn't adding signal and we revisit the methodology.
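The STOP-condition can be read as a three-way check. The sketch below is one possible reading, assuming a single snapshot of organic deltas is evaluated (the text says "stays below", which may imply a sustained check over time); the function name and return labels are hypothetical.

```python
from statistics import mean

MIN_ORGANIC = 50      # from the text: evaluated only after 50+ organic projects
STOP_THRESHOLD = 0.5  # from the text: mean |Δ| below 0.5 triggers a review


def stop_condition(organic_deltas):
    """Classify a snapshot of organic deltas against the STOP-condition."""
    if len(organic_deltas) < MIN_ORGANIC:
        return "not_evaluable"
    mean_abs = mean([abs(d) for d in organic_deltas])
    if mean_abs < STOP_THRESHOLD:
        return "revisit_methodology"
    return "calibration_adds_signal"


# With only 19 organic projects the condition cannot be evaluated yet.
print(stop_condition([-2.13] * 19))
```

Only organic deltas should be passed in; synthetic seed rows are excluded by design, as described under Methodology.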

Methodology

Pre-build prediction: the best score across the feasibility loop, generated by the Strategist agent before any prototype work begins.

Post-build evaluation: an independent feasibility agent re-scores the project across 5 criteria after the prototype is built (or, in Quick Reality Check, after the spec is generated without a prototype).

Δ (calibration delta): final_score − preliminary_score. Negative = pre-build was too optimistic. Positive = pre-build underestimated.
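The delta definition and the summary statistics reported on this page can be sketched in a few lines. This is a minimal illustration, not the production code: the score pairs are made up, the function names are hypothetical, and the use of the sample standard deviation is an assumption (the page does not say whether sample or population is used).

```python
from statistics import mean, median, stdev


def calibration_delta(preliminary_score, final_score):
    """Δ = final_score − preliminary_score; negative means pre-build was too optimistic."""
    return final_score - preliminary_score


def summarize(deltas, threshold=1.5):
    """Compute the summary statistics shown in the stat blocks on this page."""
    abs_deltas = [abs(d) for d in deltas]
    return {
        "mean_delta": mean(deltas),
        "median_abs_delta": median(abs_deltas),
        "std_dev": stdev(deltas),  # assumption: sample standard deviation
        "min": min(deltas),
        "max": max(deltas),
        "pct_significant": 100 * sum(a >= threshold for a in abs_deltas) / len(deltas),
    }


# Hypothetical (preliminary, final) score pairs, for illustration only.
pairs = [(8.0, 6.0), (7.0, 7.5), (9.0, 5.0)]
deltas = [calibration_delta(pre, post) for pre, post in pairs]
stats = summarize(deltas)
print(deltas)
print(stats)
```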

Synthetic vs organic: seed projects we ran ourselves to bootstrap the system are flagged synthetic_batch=true. The seed batch comprises 57 plausible startup ideas distributed across 10 industry verticals (B2B SaaS, AI Tool, FinTech, HealthTech, Climate / Energy, Marketplace, Consumer / B2C, DeepTech, EdTech, and Hardware). Each ran through the full pipeline, so the pre-build and post-build scores reflect the same methodology as a live submission. The STOP-condition (mean |Δ| < 0.5) applies only to organic data (real user submissions), so the seed batch can't accidentally certify the model.

Simulated feasibility: when a project is run in Quick Reality Check (no prototype), the post-build score is generated by an LLM judge on the spec alone, not on a real prototype. These rows are flagged separately.
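The two flags are independent: a row can be synthetic, simulated, both, or neither. A row record along these lines would capture that. In this sketch only `synthetic_batch` is a name taken from the text; the `Row` class, the `simulated` field name, and the helper functions are hypothetical. The three example rows are taken from the Top 10 table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    vertical: str
    delta: float
    synthetic_batch: bool = False  # flag name from the methodology text
    simulated: bool = False        # hypothetical name for the Quick Reality Check flag

    def flags(self) -> str:
        """Render flags the way the Top 10 table displays them."""
        names = []
        if self.synthetic_batch:
            names.append("synthetic")
        if self.simulated:
            names.append("simulated")
        return ", ".join(names)


def organic_only(rows):
    """Keep only real user submissions, the rows the STOP-condition applies to."""
    return [r for r in rows if not r.synthetic_batch]


rows = [
    Row("Marketplace", -4.8, synthetic_batch=True, simulated=True),
    Row("Hardware", -4.0, simulated=True),
    Row("B2B SaaS", -3.8),
]
for r in rows:
    print(r.vertical, r.delta, r.flags() or "none")
print(len(organic_only(rows)), "organic rows")
```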

Download dataset as CSV →

Updated 08/06/2026, 19:49:11 · 76 total rows · open methodology