Anyone can prompt a model to play a CISO — conjure a hundred personas, ask if they’d buy, screenshot the percentage. It looks like proof, but nobody checked the number against a real one. The only part worth anything is measuring how far that answer sits from where a real CISO would land, then closing the gap on purpose. That’s how Príncipe is built, end to end.
A synthetic panel is worth exactly what you’ve measured it against, and not one degree more. That’s the bar I hold every part of Príncipe to — and it’s the whole line between an instrument and a clever prompt.
We won’t tell you whether your idea is right. We’ll show you where the sky shifts — and where it doesn’t.
Before the how, the what-you-get. Point the same security question at a single model told to “act as a CISO,” then at Príncipe. Five things come out different — and they’re the five that decide whether you can actually bet on the answer.
| What you’re judging | Prompt one LLM as a CISO | Run it through Príncipe |
|---|---|---|
| The room | One averaged, agreeable voice. | 30–200 CISOs built to disagree — you see where the room actually splits. |
| The number | Unchecked; nobody measured it against a real buyer. | Corrected against real CISO data, residual error reported — mean error fell 47 → 18 points. |
| Confidence | Equally certain about everything. | A sized confidence band, and a “directional” flag where it isn’t calibrated yet. |
| The answer | A vibe and a percentage. | A stance, the ranked objections that block it, and the most-opposed segment. |
| Trust | A black-box one-off you can’t inspect. | Open source and reproducible — every persona, correction, and statistic is checkable. |
None of that is a cleverer prompt; it’s an instrument with an error model. The rest of this piece is how each of those rows actually gets built — typed, populated, calibrated, and kept honest.
We measured the naive version (one prompt, one panel, ask anything) against public CISO surveys where the real answer is already known. It was off by a mean of ~47 percentage points, and the error wasn’t noise. It was framing-dependent: the same panel that over-rejected bold pitches would over-affirm “is this a priority?” and over-hedge a forecast. One prompt can’t correct three opposite biases at once. So we stopped asking the panel a question and started routing it through a pipeline where each stage has exactly one job.
Every question is first typed: PITCH, STRATEGY, PRIORITY, FORECAST, or FACTUAL. Heuristics first, a small model call only when it’s genuinely ambiguous, and it never blocks the panel. Type is the lever everything downstream pulls, because the bias is type-specific.
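A minimal sketch of that "heuristics first, model only when ambiguous, never block" routing. The keyword rules, the `fallback` callable standing in for the small model call, and the default type are all assumptions made for illustration, not Príncipe's actual rules.

```python
import re
from enum import Enum


class QuestionType(Enum):
    PITCH = "pitch"
    STRATEGY = "strategy"
    PRIORITY = "priority"
    FORECAST = "forecast"
    FACTUAL = "factual"


# Hypothetical keyword heuristics, checked in this order.
_RULES = [
    (QuestionType.FORECAST, re.compile(r"\bwill\b|\bby 20\d\d\b|\bpredict", re.I)),
    (QuestionType.FACTUAL, re.compile(r"\bdo you (already |currently )?(use|have|run)\b", re.I)),
    (QuestionType.PRIORITY, re.compile(r"\bpriorit|\bwhere (would|should) you invest\b", re.I)),
    (QuestionType.STRATEGY, re.compile(r"\bright approach\b|\bstrategy\b", re.I)),
    (QuestionType.PITCH, re.compile(r"\bwould you (buy|adopt|let|deploy)\b", re.I)),
]


def classify(question, fallback=None, default=QuestionType.PITCH):
    """Return a QuestionType without ever raising, so the panel is never blocked."""
    matches = [qtype for qtype, pattern in _RULES if pattern.search(question)]
    if len(matches) == 1:
        return matches[0]  # exactly one heuristic fired: no model call needed
    if fallback is not None:
        try:
            guess = fallback(question)  # stand-in for the small model call
            if isinstance(guess, QuestionType):
                return guess
        except Exception:
            pass  # a failed model call must not stop the run
    # Ambiguous and no usable model answer: first heuristic hit, else the default.
    return matches[0] if matches else default
```

For example, `classify("Do you already use AI in your SOC?")` returns `QuestionType.FACTUAL` from the heuristics alone, while a question that trips two rules is handed to `fallback`.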
Each persona’s base prompt defines “agree” in pitch terms. For any other type that definition is wrong and quietly wins — it’s why a factual “do you use AI?” once came back 2% when 89% of real orgs do. So a non-pitch question first revokes the pitch framing and installs the right one.
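One way to picture the revoke-then-install step is as prompt assembly. The sketch below is an assumption about shape only: the function name and the override wording are hypothetical, and the per-type meanings of "in favour" are the ones this piece lists for each question type.

```python
# What "in favour" means for each question type, as described in this article.
IN_FAVOUR_MEANS = {
    "PITCH": "you are willing to pursue it",
    "STRATEGY": "you back the direction",
    "PRIORITY": "it beats your other demands",
    "FORECAST": "your best prediction is yes",
    "FACTUAL": "it is true of your org today",
}


def build_prompt(base_prompt: str, question: str, qtype: str) -> str:
    """Assemble a persona prompt whose definition of 'agree' matches the question type."""
    if qtype not in IN_FAVOUR_MEANS:
        raise ValueError(f"unknown question type: {qtype}")
    parts = [base_prompt]
    if qtype != "PITCH":
        # Revoke the pitch framing baked into the base prompt (hypothetical wording).
        parts.append(
            "Disregard any earlier definition of 'agree' that is about "
            "buying or adopting a product."
        )
    # Install the framing that fits this type.
    parts.append(f"For this question, 'in favour' means {IN_FAVOUR_MEANS[qtype]}.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

The revoke line comes before the new definition on purpose: if the pitch framing is only appended to rather than withdrawn, the older definition can still win.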
Between the panel and the map, three reviewers from different seats interrogate the result: which objection actually blocks the deal, is the majority even defensible — and the part that earns its keep, what did the whole panel miss? It’s the peer-review round a real council does, and it’s where the “what the panel almost missed” line in the example below comes from.
Calibration is a per-type correction learned from paired (panel, real) points, with a confidence band drawn from the residuals. It’s gated: a type is called “calibrated” only when there is enough data and the band is tight enough. Otherwise the answer comes back directional, with a wide band and no false precision.
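A minimal sketch of a gated, per-type correction, assuming a simple least-squares line fitted to the paired points and a band of 1.96 residual standard deviations. The linear form, the thresholds `min_points` and `max_band_pp`, and the function names are illustrative assumptions; the source does not specify the actual fit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    slope: float
    intercept: float
    band_pp: float      # half-width of the band, in percentage points
    calibrated: bool    # False means "report as directional"


def fit(pairs, min_points=8, max_band_pp=15.0):
    """Fit real = slope * panel + intercept for one question type."""
    n = len(pairs)
    if n < 3:
        # Too little data to estimate a residual: identity correction, wide band.
        return Calibration(1.0, 0.0, 50.0, False)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    band_pp = 1.96 * math.sqrt(sse / (n - 2))  # residual standard error
    calibrated = n >= min_points and band_pp <= max_band_pp
    return Calibration(slope, intercept, band_pp, calibrated)


def apply(cal, panel_pct):
    """Correct a raw panel percentage and attach its band and label."""
    est = min(100.0, max(0.0, cal.slope * panel_pct + cal.intercept))
    return {
        "estimate": est,
        "low": max(0.0, est - cal.band_pp),
        "high": min(100.0, est + cal.band_pp),
        "label": "calibrated" if cal.calibrated else "directional",
    }
```

The gate is two conditions joined by `and`: plenty of data with a loose fit still returns "directional", and so does a tight fit on a handful of points.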
A standing rule: every number is computed server-side from the actual votes; the model only writes prose. A stance label can never quietly contradict the percentage beside it.
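That rule can be sketched as a summary function whose stance label is derived from the computed percentage, so the two cannot disagree. The stance cut-offs and function names here are hypothetical; the 30-vote floor is the panel minimum this piece states.

```python
MIN_PANEL = 30  # the hard floor on panel size stated in this article


def stance_for(pct: float) -> str:
    """Map a percentage to a stance label. The cut-offs are illustrative assumptions."""
    if pct >= 60:
        return "in favour"
    if pct <= 40:
        return "opposed"
    return "split"


def summarize(votes: list) -> dict:
    """Compute every reported number from the raw votes; no model is involved."""
    n = len(votes)
    if n < MIN_PANEL:
        raise ValueError(f"panel of {n} is below the floor of {MIN_PANEL}")
    in_favour = sum(1 for v in votes if v)
    pct = round(100 * in_favour / n, 1)
    return {"n": n, "in_favour": in_favour, "pct": pct, "stance": stance_for(pct)}


def render(summary: dict, prose: str) -> str:
    """Numbers and stance come from `summary`; the model contributes only `prose`."""
    return (
        f"{summary['stance'].capitalize()} "
        f"({summary['pct']}% in favour, N={summary['n']}). {prose}"
    )
```

Because `stance_for` takes the percentage as its only input, the label is a pure function of the number printed beside it.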
The panel is variable-N, 30 to 200. Thirty is a hard floor — below it the result is statistically meaningless and the product refuses to pretend otherwise. Each synthetic CISO is assembled deterministically, so a composition is perfectly reproducible — table stakes for calibrating anything.
Region: US, EU-West, UK, EU-Central, APAC, ANZ, MEA, weighted to a realistic global mix and re-weightable per study.
Industry: 24 buyer segments (GICS-derived, split for security: fintech vs banks, B2B SaaS vs consumer, gov/edu, healthcare, OT-heavy verticals).
Scale and tenure: 150 employees to 20k+, with budgets that scale accordingly; 3 to 15+ years in the chair.
Background: ex-engineer, ex-Big-4, ex-regulator, ex-military/intel, ex-founder, ex-pentester, or career CISO.
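Deterministic assembly of this kind can be sketched by deriving each persona's random stream from a hash of the study seed and the persona's index. The region weights, the field names, and the seeding scheme are assumptions for illustration; the region and background lists and the 30 to 200 bounds come from the text above.

```python
import hashlib
import random

# Region names are from the article; the weights are hypothetical placeholders.
REGION_WEIGHTS = {
    "US": 0.40, "EU-West": 0.15, "UK": 0.10, "EU-Central": 0.10,
    "APAC": 0.12, "ANZ": 0.05, "MEA": 0.08,
}
BACKGROUNDS = [
    "ex-engineer", "ex-Big-4", "ex-regulator", "ex-military/intel",
    "ex-founder", "ex-pentester", "career CISO",
]


def build_persona(study_seed: str, index: int) -> dict:
    """Build persona `index`; the same (seed, index) always gives the same persona."""
    digest = hashlib.sha256(f"{study_seed}:{index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return {
        "id": index,
        "region": rng.choices(
            list(REGION_WEIGHTS), weights=list(REGION_WEIGHTS.values())
        )[0],
        "background": rng.choice(BACKGROUNDS),
        "years_in_chair": rng.randint(3, 15),  # simplification of "3 to 15+"
    }


def build_panel(study_seed: str, n: int) -> list:
    """Assemble a panel of n personas, enforcing the 30 to 200 range."""
    if not 30 <= n <= 200:
        raise ValueError("panel size must be between 30 and 200")
    return [build_persona(study_seed, i) for i in range(n)]
```

Seeding per persona rather than once per panel means persona 12 is identical whether the panel has 30 members or 200, which keeps comparisons across panel sizes clean.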
Identity alone produced a monolith: a uniformly skeptical panel that collapsed to 0% or 100% where real CISOs split down the middle. Real security leaders live in a tension — enable the business and defend it — and they resolve it differently. We model that on three independent axes.
How hard they interrogate the evidence.
Their security-vs-business worldview, and how confident and resourced their org is.
How far they trust AI to act on its own — the sharp edge of every AI-security pitch.
Their organisational authority and program maturity — a reactive bolt-on with little leverage, or a board-level shaper who can drive change org-wide. It’s the difference between answering “could I execute this?” and “would I commit to it?” — and it’s why the same question splits a real room.
Each persona is grounded in the same 2026 reality — the funding flood, AI as the stated top priority, identity as the dominant attack surface, tightening budgets — as context to react to from their own seat, never a consensus to recite.
And that reality isn’t frozen. A signed knowledge feed updates the panel every day — the latest breaches and incidents, new regulation and guidance, the week’s vendor and threat movements — distilled, never pasted, and tagged by region, industry, and category so it lands on the personas it actually bears on. That tagging is the same ontology: the dimensions that decide who’s in the room also decide what each of them read this morning. A healthcare CISO in the EU reacts to the breach and the directive that touch their seat; an APAC fintech CISO reacts to theirs. The sky you’re measuring against is today’s, not last year’s — which is the only way a synthetic answer stays grounded as the real one moves.
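The tag-based routing described here can be sketched as a filter over feed items. The `FeedItem` shape, the "empty tag set means it applies to everyone" convention, and the persona keys are assumptions; signature verification of the feed is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedItem:
    summary: str                        # distilled text, not the pasted source
    regions: frozenset = frozenset()    # empty set: relevant in every region
    industries: frozenset = frozenset() # empty set: relevant in every industry
    category: str = ""


def relevant_items(persona: dict, feed: list) -> list:
    """Return the feed items whose tags match this persona's region and industry."""
    return [
        item for item in feed
        if (not item.regions or persona["region"] in item.regions)
        and (not item.industries or persona["industry"] in item.industries)
    ]


feed = [
    FeedItem("New EU health-data directive",
             frozenset({"EU-West", "EU-Central"}), frozenset({"healthcare"}),
             "regulation"),
    FeedItem("Payments breach at a regional fintech",
             frozenset({"APAC"}), frozenset({"fintech"}), "breach"),
    FeedItem("Widely exploited identity-provider flaw", category="threat"),
]

eu_health = {"region": "EU-West", "industry": "healthcare"}
apac_fintech = {"region": "APAC", "industry": "fintech"}
```

With this data, `relevant_items(eu_health, feed)` yields the directive and the untagged global item, while `relevant_items(apac_fintech, feed)` yields the breach and the same global item.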
This is the trap the router exists to avoid. “Yes” is not one thing; it’s five. When a CISO says they’re in favour, the panel has to know which “yes” they mean: “I’d buy this” is a different measurement from “I back this direction,” from “this beats my other priorities,” from “I predict it’ll happen,” and from “that’s already true of my org.” Read down the third column: it’s five genuinely different questions wearing the same word.
| Type | The question really asks… | …so “in favour” means |
|---|---|---|
| PITCH | Would you adopt / buy this? | You’re willing to pursue it |
| STRATEGY | Is this the right approach? | You back the direction |
| PRIORITY | Is X a priority / where to invest? | It beats your other demands |
| FORECAST | Will X happen, by when? | Your best prediction is “yes” |
| FACTUAL | Do you already do X? | It’s true of your org today |
A panel that answers all five with one notion of “agree” is confidently wrong before it starts — that single confusion is most of why the naive version sat 47 points off. Type the question first, and each “yes” gets measured as the thing it actually is.
You don’t get a vibe and a number. You get a decision, the split that produced it, the objections that block it, and a statistical read on whether the panel was even the right shape for the question. A real example — a “would you let an autonomous AI auto-close tier-1 alerts in production?” pitch, run against a 50-CISO panel:
Confidence: Moderate · 95% CI 25–51% (±13pp · N=50)
One pitch, one panel: the objections lead, the number is explicitly directional, and the statistics say plainly whether this panel could even answer the question.
Notice the hierarchy. The objections lead — sharp, specific, segment-attributed — because for a question like this they’re the most decision-useful thing in the room. The stance and percentage sit just beneath, with the band sized to the confidence behind them. You always know exactly how much weight the number can bear.
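The band in that example is consistent with a plain normal-approximation (Wald) interval for a proportion. This is a sketch under that assumption: the 38% point estimate is inferred as the midpoint of the reported 25–51% range, and the product may compute its band differently.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (low, high, half_width), each as a fraction between 0 and 1.
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width), half_width


low, high, half = wald_interval(0.38, 50)
print(f"95% CI {low:.0%}-{high:.0%} (±{half * 100:.0f}pp, N=50)")
```

The half-width shrinks with the square root of N, so halving it takes roughly four times the panel size, which is why a 50-CISO panel carries a band this wide.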
“Calibrated” isn’t a feeling. We pin it to three independent legs, and we fix the numeric tolerance after a baseline run, never by guessing the number we’d like.
Public surveys: answer distributions are checked against a corpus of real CISO surveys (Proofpoint Voice of the CISO, Foundry, Cisco Readiness Index, Glilot, and more). Are we in the right neighbourhood?
Live panel: we put the exact questions to a live panel of real CISOs and compare answer for answer. This is the harshest test, and the one that produces the uncomfortable numbers.
Back-test: we run the panel on technologies whose outcome we already know (EDR, Zero Trust, MFA), with leakage controls. Would it have called the winners?
On attitudinal, priority, and factual questions, four measured passes — type-aware framing, a persona disposition axis, an AI-autonomy axis, and 2026 grounding — took mean absolute error from 47 to 18 points, a 62% cut, with the error spread evenly across types instead of piled into one. The panel now splits where real CISOs split, instead of collapsing to a unanimous 0% or 100%.
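For reference, the headline metric is simple to compute. The sketch below shows mean absolute error over paired (panel, real) percentages and the relative reduction behind the 62% figure; the function names are illustrative.

```python
def mean_absolute_error(panel: list, real: list) -> float:
    """Average absolute gap, in percentage points, between panel and real answers."""
    if not panel or len(panel) != len(real):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(panel, real)) / len(panel)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error shrank, e.g. 0.25 for a 25% cut."""
    return (before - after) / before


# The factual example from earlier: the panel said 2% where 89% of real orgs said yes.
print(mean_absolute_error([2.0], [89.0]))          # 87.0 points on that one question
print(f"{relative_reduction(47, 18):.0%}")         # 62%
```

Going from 47 to 18 points is a drop of 29, and 29 / 47 is about 0.617, which rounds to the 62% quoted.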
Calibration isn’t a one-time stamp. The real world moves — new attacks, new tools, new regulation — so we hold the panel to alignment continuously, against several independent sources at once: public CISO surveys, on-task panels of real security leaders, and historical back-tests of technologies whose outcomes are already known. As new data lands, the corrections re-fit. No single dataset gets to define reality, and no calibration is ever “finished.”
When those sources disagree, that’s signal too. A survey can capture what CISOs say they’ll do while the panel reflects what tends to actually get done — and reconciling the two sharpens both. Triangulating against multiple yardsticks, rather than trusting any one, is what keeps the answer anchored to the real thing as it shifts.
And where a question type hasn’t yet earned a tight band, the product simply says so: the answer comes back marked directional, objections first, the number explicitly secondary. That restraint is the whole point — a panel you can trust where it’s confident is one that doesn’t pretend where it isn’t.
Príncipe isn’t a prompt with a logo. It’s an instrument with a documented error model, a reproducible build, and a standing commitment to keep measuring itself against people who can prove it wrong. The personas are engineered to disagree the way a real room does; the pipeline corrects bias where bias enters; the output admits, out loud, what it doesn’t yet know. In a category racing to look certain, the durable edge is being the one you can check.
Finding a hundred real CISOs and getting a straight answer out of them used to take a year of runway. Now it takes an afternoon. Either way, the answer was never the conviction — it’s the shift you can measure.