The Confidence Interval Your Exec Team Needs to See
Telling your CEO 'accuracy is 89%' isn't enough. They need to know: Is that 89% ± 2pp or 89% ± 15pp? Here's how to communicate AI uncertainty to non-technical stakeholders.
The Exec Question That Caught You Off-Guard
CEO: "You said the AI is 89% accurate. How confident are you in that number?"
PM: "Very confident. We tested it thoroughly."
CEO: "Okay, but is it 89% ± 1% or 89% ± 20%? Because if it's the latter, we could actually be at 69%, which changes everything."
PM: (Realizes they never calculated confidence intervals.)
The Fix: Always report AI metrics with confidence intervals.
What Exec Teams Actually Need
Bad Slide:
AI Accuracy: 89%
Good Slide:
AI Accuracy: 89% (95% CI: 86-92%)
Translation: We're 95% confident the true accuracy is between 86% and 92%.
Why This Matters: The ±3pp range tells execs whether to trust the number or demand more testing.
The Three Confidence Levels
Narrow Confidence Interval (High Confidence)
Accuracy: 92% (95% CI: 91-93%)
Range: ±1pp
What This Means: We tested on 10,000+ examples. The number is rock-solid.
Exec Decision: Ship it. The uncertainty is negligible.
Moderate Confidence Interval (Acceptable)
Accuracy: 89% (95% CI: 85-93%)
Range: ±4pp
What This Means: We tested on a few hundred examples (roughly 200-300). Some uncertainty, but tolerable.
Exec Decision: Ship with monitoring (track whether production accuracy stays in range).
Wide Confidence Interval (Red Flag)
Accuracy: 87% (95% CI: 72-95%)
Range: roughly ±12pp (asymmetric because the sample is so small)
What This Means: We tested on only a few dozen examples, well under 100. The number is unreliable.
Exec Decision: Don't ship. Get more test data first.
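If it helps to make the tiers explicit, here's a minimal sketch of the rule of thumb as code (the thresholds are the illustrative ones above, not an industry standard):

```python
def ci_verdict(half_width_pp: float) -> str:
    """Map a 95% CI half-width (in percentage points) to the tiers above."""
    if half_width_pp <= 1:
        return "Ship it: uncertainty is negligible"
    if half_width_pp <= 4:
        return "Ship with monitoring: tolerable uncertainty"
    return "Don't ship: get more test data first"
```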
Real Example: Healthcare Diagnostic AI
Feature: AI predicts patient diagnosis from symptoms.
Initial Report (Bad):
Accuracy: 91%
Test Set: 50 patients
CEO's Question: "Is 91% reliable enough to deploy?"
PM's Honest Answer (Good):
Accuracy: 91% (95% CI: 81-96%)
With only 50 patients, we're 95% confident the true accuracy is somewhere between 81% and 96%.
If true accuracy is 81%, we'd have 1 in 5 misdiagnoses—unacceptable for healthcare.
Recommendation: Test on 500+ patients to narrow confidence interval to ±3pp before launch.
CEO's Decision: "Get 500 patients. Then we'll revisit."
Outcome: After testing on 500 patients:
Accuracy: 88% (95% CI: 85-91%)
CEO: "88% with ±3pp uncertainty. That's a meaningful drop from 91%, but the narrow CI gives me confidence. Ship with physician review required."
How to Calculate Confidence Intervals (For PMs)
You Don't Need a PhD in Statistics. Use This Formula:
For Binary Classification (Correct/Incorrect):
Confidence Interval = p ± 1.96 × sqrt(p × (1-p) / n)
Where:
- p = accuracy (e.g., 0.89 for 89%)
- n = test set size (e.g., 500)
- 1.96 = Z-score for 95% confidence
Example:
p = 0.89
n = 500
CI = 0.89 ± 1.96 × sqrt(0.89 × 0.11 / 500)
= 0.89 ± 1.96 × 0.014
= 0.89 ± 0.027
= 0.86 to 0.92
Result: 89% (95% CI: 86-92%)
Tool: Use an online calculator (Google "confidence interval calculator") or ask your data scientist.
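If you or your data scientist would rather script it, here's a minimal Python sketch of the same formula (the function name is mine, not a standard API):

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a proportion via the normal (Wald) approximation."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = accuracy_ci(0.89, 500)
print(f"89% (95% CI: {low:.0%}-{high:.0%})")  # 89% (95% CI: 86%-92%)
```

One caveat: this normal approximation gets shaky on small samples (under ~100); if your team uses Python, statsmodels' proportion_confint with method="wilson" is a sturdier choice there.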
The Sample Size Decision Tree
Goal: Achieve confidence interval of ±3pp or better.
Required Sample Size:
- For ±1pp: ~10,000 examples
- For ±2pp: ~2,500 examples
- For ±3pp: ~1,000 examples
- For ±5pp: ~400 examples
Trade-Off: More examples = narrower CI = more confidence, but more labeling cost.
PM Decision:
- High-stakes AI (healthcare, legal, finance): Target ±2pp (2,500+ examples)
- Medium-stakes AI (enterprise SaaS): Target ±3pp (1,000+ examples)
- Low-stakes AI (recommendations, search): Target ±5pp (400+ examples)
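Those sample sizes come from inverting the CI formula at the worst case, p = 0.5, which maximizes the margin. A quick sketch, assuming 95% confidence:

```python
import math

def required_n(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Worst-case sample size for a target CI half-width (margin as a fraction)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

for m in (0.01, 0.02, 0.03, 0.05):
    print(f"±{m * 100:.0f}pp -> {required_n(m):,} examples")
# ±1pp -> 9,604, ±2pp -> 2,401, ±3pp -> 1,068, ±5pp -> 385
```

If your accuracy is well above 50%, the true requirement is somewhat lower, so treat these as conservative budgets.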
When to Report Multiple Metrics with CIs
Bad Report:
Precision: 87%
Recall: 91%
F1: 0.89
Good Report:
Precision: 87% (95% CI: 84-90%)
Recall: 91% (95% CI: 88-94%)
F1: 0.89 (95% CI: 0.86-0.92)
Why: Each metric carries its own uncertainty (precision and recall are computed on different denominators, so their CIs can differ in width), and execs can see at a glance how much to trust each number.
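One caveat: F1 is not a simple proportion, so the formula above doesn't apply to it directly. A common workaround is a bootstrap CI; here's a minimal sketch (binary labels, percentile method, all names are mine):

```python
import random

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for F1 on binary labels."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        tp = sum(y_true[i] == 1 and y_pred[i] == 1 for i in idx)
        fp = sum(y_true[i] == 0 and y_pred[i] == 1 for i in idx)
        fn = sum(y_true[i] == 1 and y_pred[i] == 0 for i in idx)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]
```

The same resampling trick works for precision and recall too, which keeps all three CIs methodologically consistent.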
The "Is This Good Enough?" Framework
Exec asks: "Is 89% accuracy good enough to ship?"
PM's Answer (With CI):
1. Accuracy: 89% (95% CI: 86-92%)
2. Worst-case scenario: 86% (lower bound of CI)
3. Baseline (manual process): 82%
4. Improvement: 86% - 82% = 4pp (even at worst case, we beat baseline)
5. Recommendation: Ship. Even if true accuracy is at lower bound, we're still better than status quo.
Why This Works: You've de-risked the decision by showing that even the pessimistic estimate wins.
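The same check fits in a few lines if you want it in a notebook (numbers match the example above; the function name is mine):

```python
import math

def worst_case_beats_baseline(p: float, n: int, baseline: float, z: float = 1.96):
    """Ship test: does the 95% CI lower bound still beat the baseline?"""
    lower = p - z * math.sqrt(p * (1 - p) / n)
    return lower > baseline, lower

ok, lower = worst_case_beats_baseline(p=0.89, n=500, baseline=0.82)
print(f"Lower bound {lower:.0%}, ship={ok}")  # Lower bound 86%, ship=True
```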
The Production Monitoring Strategy
Pre-Launch CI:
Accuracy: 89% (95% CI: 86-92%) on test set
Post-Launch Monitoring:
Week 1: 91% (95% CI: 89-93%) on production data
Week 4: 87% (95% CI: 85-89%)
Week 8: 84% (95% CI: 82-86%) ← Alert!
Alert Trigger: Production accuracy drops below lower bound of pre-launch CI (86%).
Action: Model is degrading. Retrain or roll back.
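As a sketch, the alert rule itself is a few lines, assuming you log weekly correct/total counts (the numbers below match the Week 8 row):

```python
import math

def weekly_check(correct: int, total: int, prelaunch_lower: float = 0.86, z: float = 1.96):
    """Weekly production accuracy with its own 95% CI; alert when the
    point estimate falls below the pre-launch CI lower bound."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, (p - margin, p + margin), p < prelaunch_lower

acc, (low, high), alert = weekly_check(correct=840, total=1000)
print(f"{acc:.0%} (95% CI: {low:.0%}-{high:.0%}), alert={alert}")
# 84% (95% CI: 82%-86%), alert=True
```

A stricter variant alerts only when the production CI's upper bound drops below the pre-launch lower bound, which cuts false alarms from week-to-week noise.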
Common PM Mistakes
Mistake 1: Reporting Point Estimates Without Uncertainty
- Bad: "Accuracy is 89%"
- Good: "Accuracy is 89% (95% CI: 86-92%)"
Mistake 2: Testing on Too-Small Sample
- Reality: 50 examples → a CI of roughly ±8-14pp, depending on accuracy (useless for go/no-go calls)
- Fix: Budget for 1,000+ labeled examples
Mistake 3: Ignoring CI Width When Making Go/No-Go Decisions
- Bad: "89% beats our 85% target. Ship it."
- Good: "89% ± 12pp means we could be at 77%. Don't ship until CI narrows."
The Exec-Friendly Slide Template
AI PERFORMANCE SUMMARY
Metric: Accuracy
Result: 89%
Confidence: 95% CI: 86-92%
Translation:
- We're 95% confident the true accuracy is between 86% and 92%
- Even at the low end (86%), we beat the manual baseline (82%)
Sample Size: 500 examples
Recommendation: Ship with post-launch monitoring
Risk: If production accuracy drops below 86%, we'll retrain or roll back.
Time to Prepare This Slide: 10 minutes (after you have the CI calculation).
Time Saved in Exec Meetings: 30 minutes of "But how confident are you?" back-and-forth.
Checklist: Is Your AI Metric Report Exec-Ready?
- All metrics include 95% confidence intervals
- Sample size is documented (and sufficient for ±3pp CI)
- Worst-case scenario (lower CI bound) is still acceptable
- Comparison to baseline (manual process or previous model)
- Production monitoring plan (alert if accuracy exits CI range)
- Plain-English translation (no jargon like "p-value" or "Z-score")
If any box is unchecked, your exec team will have follow-up questions.
Alex Welcing is a Senior AI Product Manager in New York who reports AI metrics with confidence intervals, not just point estimates. His exec reviews end faster because stakeholders trust the numbers.