The Confidence Interval Your Exec Team Needs to See
Telling your CEO 'accuracy is 89%' isn't enough. They need to know: Is that 89% ± 2pp or 89% ± 15pp? Here's how to communicate AI uncertainty to non-technical stakeholders.
The Exec Question That Caught You Off-Guard
CEO: "You said the AI is 89% accurate. How confident are you in that number?"
PM: "Very confident. We tested it thoroughly."
CEO: "Okay, but is it 89% ± 1% or 89% ± 20%? Because if it's the latter, we could actually be at 69%, which changes everything."
PM: (Realizes they never calculated confidence intervals.)
The Fix: Always report AI metrics with confidence intervals.
What Exec Teams Actually Need
Bad Slide:
AI Accuracy: 89%
Good Slide:
AI Accuracy: 89% (95% CI: 86-92%)
Translation: We're 95% confident the true accuracy is between 86% and 92%.
Why This Matters: The ±3pp range tells execs whether to trust the number or demand more testing.
The Three Confidence Levels
Narrow Confidence Interval (High Confidence)
Accuracy: 92% (95% CI: 91-93%)
Range: ±1pp
What This Means: We tested on 10,000+ examples. The number is rock-solid.
Exec Decision: Ship it. The uncertainty is negligible.
Moderate Confidence Interval (Acceptable)
Accuracy: 89% (95% CI: 85-93%)
Range: ±4pp
What This Means: We tested on a few hundred examples (roughly 200-300). Some uncertainty, but tolerable.
Exec Decision: Ship with monitoring (track whether production accuracy stays in range).
Wide Confidence Interval (Red Flag)
Accuracy: 87% (95% CI: 72-95%)
Range: roughly ±12pp (asymmetric because the sample is so small)
What This Means: We tested on only a few dozen examples, well under 100. The number is unreliable.
Exec Decision: Don't ship. Get more test data first.
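If it helps to make the tiers explicit, here's a minimal sketch of the rule of thumb as code (the thresholds are the illustrative ones above, not an industry standard):

```python
def ci_verdict(half_width_pp: float) -> str:
    """Map a 95% CI half-width (in percentage points) to the tiers above."""
    if half_width_pp <= 1:
        return "Ship it: uncertainty is negligible"
    if half_width_pp <= 4:
        return "Ship with monitoring: tolerable uncertainty"
    return "Don't ship: get more test data first"
```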
Real Example: Healthcare Diagnostic AI
Feature: AI predicts patient diagnosis from symptoms.
Initial Report (Bad):
Accuracy: 91%
Test Set: 50 patients
CEO's Question: "Is 91% reliable enough to deploy?"
PM's Honest Answer (Good):
Accuracy: 91% (95% CI: 81-96%)
With only 50 patients, we're 95% confident the true accuracy is somewhere between 81% and 96%.
If true accuracy is 81%, we'd have 1 in 5 misdiagnoses—unacceptable for healthcare.
Recommendation: Test on 500+ patients to narrow confidence interval to ±3pp before launch.
CEO's Decision: "Get 500 patients. Then we'll revisit."
Outcome: After testing on 500 patients:
Accuracy: 88% (95% CI: 85-91%)
CEO: "88% with ±3pp uncertainty. That's a meaningful drop from 91%, but the narrow CI gives me confidence. Ship with physician review required."
How to Calculate Confidence Intervals (For PMs)
You Don't Need a PhD in Statistics. Use This Formula:
For Binary Classification (Correct/Incorrect):
Confidence Interval = p ± 1.96 × sqrt(p × (1-p) / n)
Where:
- p = accuracy (e.g., 0.89 for 89%)
- n = test set size (e.g., 500)
- 1.96 = Z-score for 95% confidence
Example:
p = 0.89
n = 500
CI = 0.89 ± 1.96 × sqrt(0.89 × 0.11 / 500)
= 0.89 ± 1.96 × 0.014
= 0.89 ± 0.027
= 0.86 to 0.92
Result: 89% (95% CI: 86-92%)
Tool: Use an online calculator (Google "confidence interval calculator") or ask your data scientist.
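If you or your data scientist would rather script it, here's a minimal Python sketch of the same formula (the function name is mine, not a standard API):

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a proportion via the normal (Wald) approximation."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = accuracy_ci(0.89, 500)
print(f"89% (95% CI: {low:.0%}-{high:.0%})")  # 89% (95% CI: 86%-92%)
```

One caveat: this normal approximation gets shaky on small samples (under ~100); if your team uses Python, statsmodels' proportion_confint with method="wilson" is a sturdier choice there.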
The Sample Size Decision Tree
Goal: Achieve confidence interval of ±3pp or better.
Required Sample Size:
- For ±1pp: ~10,000 examples
- For ±2pp: ~2,500 examples
- For ±3pp: ~1,000 examples
- For ±5pp: ~400 examples
Trade-Off: More examples = narrower CI = more confidence, but more labeling cost.
PM Decision:
- High-stakes AI (healthcare, legal, finance): Target ±2pp (2,500+ examples)
- Medium-stakes AI (enterprise SaaS): Target ±3pp (1,000+ examples)
- Low-stakes AI (recommendations, search): Target ±5pp (400+ examples)
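Those sample sizes come from inverting the CI formula at the worst case, p = 0.5, which maximizes the margin. A quick sketch, assuming 95% confidence:

```python
import math

def required_n(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Worst-case sample size for a target CI half-width (margin as a fraction)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

for m in (0.01, 0.02, 0.03, 0.05):
    print(f"±{m * 100:.0f}pp -> {required_n(m):,} examples")
# ±1pp -> 9,604, ±2pp -> 2,401, ±3pp -> 1,068, ±5pp -> 385
```

If your accuracy is well above 50%, the true requirement is somewhat lower, so treat these as conservative budgets.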
When to Report Multiple Metrics with CIs
Bad Report:
Precision: 87%
Recall: 91%
F1: 0.89
Good Report:
Precision: 87% (95% CI: 84-90%)
Recall: 91% (95% CI: 88-94%)
F1: 0.89 (95% CI: 0.86-0.92)
Why: Each metric carries its own uncertainty (precision and recall are computed on different denominators, so their CIs can differ in width), and execs can see at a glance how much to trust each number.
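One caveat: F1 is not a simple proportion, so the formula above doesn't apply to it directly. A common workaround is a bootstrap CI; here's a minimal sketch (binary labels, percentile method, all names are mine):

```python
import random

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for F1 on binary labels."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        tp = sum(y_true[i] == 1 and y_pred[i] == 1 for i in idx)
        fp = sum(y_true[i] == 0 and y_pred[i] == 1 for i in idx)
        fn = sum(y_true[i] == 1 and y_pred[i] == 0 for i in idx)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]
```

The same resampling trick works for precision and recall too, which keeps all three CIs methodologically consistent.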
The "Is This Good Enough?" Framework
Exec asks: "Is 89% accuracy good enough to ship?"
PM's Answer (With CI):
1. Accuracy: 89% (95% CI: 86-92%)
2. Worst-case scenario: 86% (lower bound of CI)
3. Baseline (manual process): 82%
4. Improvement: 86% - 82% = 4pp (even at worst case, we beat baseline)
5. Recommendation: Ship. Even if true accuracy is at lower bound, we're still better than status quo.
Why This Works: You've de-risked the decision by showing that even the pessimistic estimate wins.
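The same check fits in a few lines if you want it in a notebook (numbers match the example above; the function name is mine):

```python
import math

def worst_case_beats_baseline(p: float, n: int, baseline: float, z: float = 1.96):
    """Ship test: does the 95% CI lower bound still beat the baseline?"""
    lower = p - z * math.sqrt(p * (1 - p) / n)
    return lower > baseline, lower

ok, lower = worst_case_beats_baseline(p=0.89, n=500, baseline=0.82)
print(f"Lower bound {lower:.0%}, ship={ok}")  # Lower bound 86%, ship=True
```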
The Production Monitoring Strategy
Pre-Launch CI:
Accuracy: 89% (95% CI: 86-92%) on test set
Post-Launch Monitoring:
Week 1: 91% (95% CI: 89-93%) on production data
Week 4: 87% (95% CI: 85-89%)
Week 8: 84% (95% CI: 82-86%) ← Alert!
Alert Trigger: Production accuracy drops below lower bound of pre-launch CI (86%).
Action: Model is degrading. Retrain or roll back.
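As a sketch, the alert rule itself is a few lines, assuming you log weekly correct/total counts (the numbers below match the Week 8 row):

```python
import math

def weekly_check(correct: int, total: int, prelaunch_lower: float = 0.86, z: float = 1.96):
    """Weekly production accuracy with its own 95% CI; alert when the
    point estimate falls below the pre-launch CI lower bound."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, (p - margin, p + margin), p < prelaunch_lower

acc, (low, high), alert = weekly_check(correct=840, total=1000)
print(f"{acc:.0%} (95% CI: {low:.0%}-{high:.0%}), alert={alert}")
# 84% (95% CI: 82%-86%), alert=True
```

A stricter variant alerts only when the production CI's upper bound drops below the pre-launch lower bound, which cuts false alarms from week-to-week noise.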
Common PM Mistakes
Mistake 1: Reporting Point Estimates Without Uncertainty
- Bad: "Accuracy is 89%"
- Good: "Accuracy is 89% (95% CI: 86-92%)"
Mistake 2: Testing on Too-Small Sample
- Reality: 50 examples → a CI of roughly ±8-14pp, depending on accuracy (useless for go/no-go calls)
- Fix: Budget for 1,000+ labeled examples
Mistake 3: Ignoring CI Width When Making Go/No-Go Decisions
- Bad: "89% beats our 85% target. Ship it."
- Good: "89% ± 12pp means we could be at 77%. Don't ship until CI narrows."
The Exec-Friendly Slide Template
AI PERFORMANCE SUMMARY
Metric: Accuracy
Result: 89%
Confidence: 95% CI: 86-92%
Translation:
- We're 95% confident the true accuracy is between 86% and 92%
- Even at the low end (86%), we beat the manual baseline (82%)
Sample Size: 500 examples
Recommendation: Ship with post-launch monitoring
Risk: If production accuracy drops below 86%, we'll retrain or roll back.
Time to Prepare This Slide: 10 minutes (after you have the CI calculation).
Time Saved in Exec Meetings: 30 minutes of "But how confident are you?" back-and-forth.
Checklist: Is Your AI Metric Report Exec-Ready?
- All metrics include 95% confidence intervals
- Sample size is documented (and sufficient for ±3pp CI)
- Worst-case scenario (lower CI bound) is still acceptable
- Comparison to baseline (manual process or previous model)
- Production monitoring plan (alert if accuracy exits CI range)
- Plain-English translation (no jargon like "p-value" or "Z-score")
If any box is unchecked, your exec team will have follow-up questions.
Alex Welcing is a Senior AI Product Manager in New York who reports AI metrics with confidence intervals, not just point estimates. His exec reviews end faster because stakeholders trust the numbers.