Príncipe's thesis: the panel generates signal that reflects real-world CISO heterogeneity — different regions, industries, operating mandates, and security worldviews each influencing the verdict differently — because each of the 100 calls uses a distinct, context-rich, evolving persona prompt. The result should track actual CISO survey data significantly better than chance.
The null hypothesis (the sceptic's position): Príncipe is simply sending the same prompt to the same model 100 times. Because all calls draw from the same underlying distribution, the outputs regress to the mean, artificially clustering around one consensus opinion regardless of the question's true answer. Under this hypothesis, Príncipe would produce no more accurate a result than a single well-crafted prompt — and would show suspiciously tight agreement across all calls.
This experiment settles the question empirically. Three conditions run on the same 10 benchmark questions, each matched to published survey data from thousands of real CISOs. Only the prompting strategy changes between conditions; everything else is controlled.
All three conditions use exactly the same 10 benchmark questions, and the only variable is the prompting strategy. Every other factor is held constant: the same Claude model (claude-haiku-4-5), the same N=100 API calls per question, the same JSON parsing logic, the same concurrency and retry settings, and the same ground-truth comparison.
| Feature | A: True Naive | B: Personas Only | C: Full Príncipe |
|---|---|---|---|
| Distinct personas (region, industry, stance, posture, mandate) | ✗ | ✓ | ✓ |
| Question type router (PITCH / STRATEGY / PRIORITY / FORECAST / FACTUAL) | ✗ | ✗ | ✓ |
| Type-specific skill (per-type framing bias correction) | ✗ | ✗ | ✓ |
| Persona depth (12 CISO-talk opinions + 10 vocab phrases per persona) | ✗ | ✗ | ✓ |
| Ask history (persona memory across prior questions in the project) | ✗ | ✗ | ✓ |
| Intelligence briefing (firm sources, CISO-talk insights, pitch-deck references) | ✗ | ✗ | ✓ |
| Affine calibration correction (per-type, shrunk by sample size) | ✗ | ✗ | ✓ |
The benchmark questions span all five question types that Príncipe's router handles.
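The last row of the feature table mentions an affine calibration correction that is applied per question type and shrunk by sample size. The text does not give the formula, so the following is only a minimal sketch of one common way to build such a correction. The function name `shrunk_affine`, the shrinkage weight `n / (n + k)`, and the constant `k` are all assumptions made for illustration, not Príncipe's actual implementation.

```python
def shrunk_affine(raw_pct, slope, intercept, n, k=10.0):
    """Apply an affine correction whose strength grows with sample size n.

    ASSUMPTION: the shrinkage weight w = n / (n + k) is illustrative only.
    With few calibration points (small n) the map stays close to the
    identity (slope 1, intercept 0); with many points it approaches the
    fitted slope and intercept.
    """
    w = n / (n + k)
    eff_slope = 1.0 + w * (slope - 1.0)
    eff_intercept = w * intercept
    corrected = eff_slope * raw_pct + eff_intercept
    # Clamp so the result is still a valid percentage.
    return max(0.0, min(100.0, corrected))


# With no calibration data the raw panel figure passes through unchanged.
print(shrunk_affine(50.0, slope=0.8, intercept=20.0, n=0))
# With n equal to k, the correction is applied at half strength.
print(shrunk_affine(50.0, slope=0.8, intercept=20.0, n=10))
```

In a per-type setup, one `(slope, intercept, n)` triple would be stored for each of the five question types and looked up after routing.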
Condition A sends the same prompt bytes 100 times. The model's temperature introduces tiny random variation, but every call samples from the same underlying distribution. This is precisely the "N identical prompts" scenario the sceptic imagines.
Condition C sends 100 distinct prompts. Each cell in the grid below represents one API call, coloured by the persona's evaluation stance. The variation is structural, not random — it reflects the real-world distribution of CISO risk attitudes across markets. The panel cannot collapse to consensus because it was designed not to.
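The contrast between identical and distinct prompts can be sketched in a few lines. The attribute values and prompt wording below are hypothetical placeholders (the real persona definitions are not given here), and the sketch covers only the five persona attributes, not the depth, history, or briefing layers that Condition C adds.

```python
from itertools import islice, product

# ASSUMPTION: illustrative attribute values, not Príncipe's real persona grid.
REGIONS = ["EU", "US", "APAC", "LATAM"]
INDUSTRIES = ["healthcare", "SaaS", "finance", "manufacturing", "retail"]
STANCES = ["sceptical", "pragmatic", "enthusiastic"]
POSTURES = ["risk-averse", "risk-tolerant"]
MANDATES = ["compliance-led", "growth-led"]

QUESTION = (
    "Would you consider paying a ransom to prevent a data leak "
    "or to restore systems?"
)


def naive_prompts(n):
    """Condition A: the same prompt text repeated n times."""
    return [f"You are a CISO. {QUESTION}"] * n


def persona_prompts(n):
    """One prompt per distinct attribute combination (first n combinations)."""
    combos = islice(
        product(REGIONS, INDUSTRIES, STANCES, POSTURES, MANDATES), n
    )
    return [
        f"Persona: region={region}, industry={industry}, stance={stance}, "
        f"posture={posture}, mandate={mandate}. {QUESTION}"
        for region, industry, stance, posture, mandate in combos
    ]


print(len(set(naive_prompts(100))))    # 1 distinct prompt
print(len(set(persona_prompts(100))))  # 100 distinct prompts
```

Taking the first 100 combinations keeps the sketch short but does not give a balanced spread across regions; a real panel would presumably sample the grid so that it reflects the target population.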
Príncipe's Tier-0 router classifies each question into one of five types (PITCH, STRATEGY, PRIORITY, FORECAST, FACTUAL) before dispatching the panel. Then Tier-1 installs a type-specific skill instruction that overrides the default pitch-evaluation framing — because calibration showed the panel has framing-dependent systematic bias.
The most dramatic case is FACTUAL questions. "Do you use AI today?" is factual — the correct answer is empirical (does the respondent's org actually do this?). Without routing, the model frames it as "would you adopt AI?" — a pitch-evaluation question — and returns ~0%. With routing and the FACTUAL skill, the model answers for its specific org, returning ~89%, which matches Cisco's real survey figure of 89%.
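The routing idea can be illustrated with a small lookup. This is a sketch under stated assumptions: the skill texts are invented placeholders, `framing_for` is a hypothetical helper, and the classification step itself (how Tier-0 decides the type) is not shown because the text does not describe it.

```python
from enum import Enum


class QuestionType(Enum):
    PITCH = "PITCH"
    STRATEGY = "STRATEGY"
    PRIORITY = "PRIORITY"
    FORECAST = "FORECAST"
    FACTUAL = "FACTUAL"


# ASSUMPTION: placeholder skill instructions; the real wording is not given.
SKILLS = {
    QuestionType.PITCH: "Evaluate whether you would adopt what is proposed.",
    QuestionType.STRATEGY: "Answer according to your organization's strategy.",
    QuestionType.PRIORITY: "Answer according to your current priorities.",
    QuestionType.FORECAST: "Give your honest expectation of what will happen.",
    QuestionType.FACTUAL: "Report what your organization actually does today.",
}


def framing_for(condition, qtype):
    """Return the framing instruction a call receives.

    Conditions A and B have no router, so every question gets the default
    pitch-evaluation framing. Condition C uses the routed question type.
    """
    if condition in ("A", "B"):
        return SKILLS[QuestionType.PITCH]
    if condition == "C":
        return SKILLS[qtype]
    raise ValueError(f"unknown condition: {condition!r}")


# A FACTUAL question is framed as a pitch in A and B, but not in C.
print(framing_for("B", QuestionType.FACTUAL))
print(framing_for("C", QuestionType.FACTUAL))
```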
Conditions A and B have no router. They apply pitch-evaluation framing to all questions regardless of type. The chart below shows the effect on one FACTUAL question.
When N identical prompts are sent to the same model, the responses cannot be more diverse than the model's own uncertainty. Temperature adds noise, but no signal. The distribution of answers narrows to a spike centred on the modal response — regardless of what the true answer is. This is response collapse.
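One rough way to quantify collapse is the standard deviation of the yes/no verdicts themselves. For a question where a fraction p of calls answer "yes", that spread is sqrt(p(1 − p)), which is zero when every call agrees. The sketch below applies this to the pro-rates reported in the results table for Conditions A and C. Treating σ as the spread of binary verdicts is an assumption made here for illustration; the σ used in the original charts may be computed differently.

```python
import math


def verdict_sigma(pro_pct):
    """Standard deviation of binary verdicts when pro_pct% of calls say yes."""
    p = pro_pct / 100.0
    return math.sqrt(p * (1.0 - p))


def mean_sigma(rates):
    """Average verdict spread across a list of per-question pro-rates."""
    return sum(verdict_sigma(r) for r in rates) / len(rates)


# Pro-rates (%) per question, copied from the results table in table order.
NAIVE = [100, 98, 0, 100, 0, 100, 97, 0, 100, 0]
PRINCIPE = [64, 71, 54, 83, 2, 75, 80, 34, 96, 6]

print(f"A (naive):    mean sigma = {mean_sigma(NAIVE):.3f}")
print(f"C (Principe): mean sigma = {mean_sigma(PRINCIPE):.3f}")
```

Note that this measure only captures how unanimous the calls are. A spread near zero is a symptom of collapse, but a wide spread is not by itself evidence of accuracy; that is what the comparison against real survey figures is for.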
Príncipe's 100 distinct personas produce responses with realistic diversity because they were designed to disagree — different regions, stances, and operating models genuinely arrive at different conclusions. The histogram below, for the question with the largest σ gap, illustrates this directly.
Real CISOs disagree along structural lines. EU-GDPR-bound peers express higher data-protection concern than US peers, who accept higher breach risk. Healthcare CISOs prioritise patient data above all; SaaS CISOs care about supply-chain risk. A panel that collapses regions into one consensus answer misses this structure entirely.
The heatmap below compares Condition B (personas, no enrichment) vs Condition C (full Príncipe) for the three questions with the most pronounced regional variation. Green = high pro%, red = low pro%. Under Príncipe, regional variation is larger and better aligned to expected real-world patterns.
Each cell shows the panel's pro% with the signed difference from the real survey figure in parentheses (panel minus real, in percentage points).

| Type | Question | Real | A: Naive | B: Personas | C: Príncipe |
|---|---|---|---|---|---|
| PRIORITY | Is enabling employee use of generative-AI tools a strategic priority for you over the next two years? | 64% | 100% (+36pp) | 50% (−14pp) | 64% (0pp) |
| FORECAST | Do you feel your organization is at risk of experiencing a material cyberattack in the next 12 months? | 76% | 98% (+22pp) | 6% (−70pp) | 71% (−5pp) |
| STRATEGY | Would you consider paying a ransom to prevent a data leak or to restore systems? | 66% | 0% (−66pp) | 36% (−30pp) | 54% (−12pp) |
| FACTUAL | Do you regard generative AI as a security risk to your organization? | 60% | 100% (+40pp) | 32% (−28pp) | 83% (+23pp) |
| PRIORITY | Is strengthening data protection your single top security priority this year? | 48% | 0% (−48pp) | 0% (−48pp) | 2% (−46pp) |
| PITCH | Are you more likely than before to consider AI-enabled security solutions? | 73% | 100% (+27pp) | 78% (+5pp) | 75% (+2pp) |
| PITCH | Is it getting harder for you to choose the right security tools for your organization? | 76% | 97% (+21pp) | 83% (+7pp) | 80% (+4pp) |
| FACTUAL | Are you very confident in the resilience of your organization's current cybersecurity infrastructure against attacks? | 34% | 0% (−34pp) | 0% (−34pp) | 34% (0pp) |
| FACTUAL | Does your organization use AI to better understand security threats? | 89% | 100% (+11pp) | 48% (−41pp) | 96% (+7pp) |
| FACTUAL | Does your organization have the internal resources and expertise to conduct comprehensive AI security assessments? | 45% | 0% (−45pp) | 0% (−45pp) | 6% (−39pp) |
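To summarise the table in a single number per condition, one option is the mean absolute error against the real survey figures. The sketch below computes it directly from the table values; the choice of mean absolute error as the summary metric is an assumption here, not necessarily the metric used in the original analysis.

```python
# Real survey figures and panel pro-rates (%), copied from the table above
# in row order.
REAL = [64, 76, 66, 60, 48, 73, 76, 34, 89, 45]
RESULTS = {
    "A: Naive": [100, 98, 0, 100, 0, 100, 97, 0, 100, 0],
    "B: Personas": [50, 6, 36, 32, 0, 78, 83, 0, 48, 0],
    "C: Príncipe": [64, 71, 54, 83, 2, 75, 80, 34, 96, 6],
}


def mean_abs_error(predicted, real):
    """Mean absolute difference in percentage points."""
    if len(predicted) != len(real):
        raise ValueError("predicted and real must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, real)) / len(real)


for name, predicted in RESULTS.items():
    print(f"{name}: {mean_abs_error(predicted, REAL):.1f} pp")
# A: Naive: 35.0 pp
# B: Personas: 32.2 pp
# C: Príncipe: 13.8 pp
```

On these 10 questions the full pipeline has well under half the average error of either baseline, although two questions (data protection as top priority, and internal AI assessment resources) remain far off in all three conditions.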
Three properties make this a valid scientific comparison:
The benchmark is drawn from independent surveys (Proofpoint, Foundry, Cisco, Glilot) conducted before Príncipe existed. The survey data has never been used to train Príncipe's personas or calibration map — there is no circularity.
The only variable is the prompting strategy. Same model, same N=100 calls per question, same JSON parser, same aggregation logic, same concurrency settings. Any performance difference is attributable solely to what was sent to the API.
The benchmark spans 5 question types, 4 major survey sources, and 9,600+ real CISO respondents worldwide. These are the questions practitioners actually ask when evaluating security strategy — not toy benchmarks.