// Playbook 01 — Diagnose
The maturity assessment.
A six-week playbook for the work most transformations skip — the honest diagnosis before the work begins. Score against the six-by-six maturity model. Calibrate openly. Surface the gap between the deck and the desk. Worked example throughout: Halcyon Financial's pre-programme assessment.
Meet Halcyon Financial.
Worked example · the pre-programme assessment
A 700-person fintech with more ambition than evidence.
Halcyon Financial is a regulated digital financial services platform, the kind of mid-cap fintech that has scaled past start-up and is now building the operating muscle it didn't need before. The board has just signed off on a twelve-month transformation programme. The phrase "be AI-Native within twelve months" appears in the deck.
No one has yet asked the harder question: where does Halcyon actually start from?
The CEO believes they're well-positioned. The CTO has a long list of caveats. The CX team, closest to the work, has been neither asked nor noticed. The CFO is funding the programme on the strength of the deck. None of these views is wrong. None of them is evidence.
This playbook is the work of replacing that ambiguity with a defensible baseline — across six dimensions, scored honestly by a cohort, calibrated openly, briefed back to the executive in a way that lands.
Six dimensions · strategy to people
Six stages · unaware to native
A cohort · not just the exec
One brief · not a dashboard
// And the people you'll meet · perspectives across the organisation
The Transformation Lead
You, the reader
Brought in to run the programme. The role this playbook is written for. Holds the pen on the assessment, but not the answers.
Diana Whitfield
CEO · the sponsor
Signed off on the programme. Believes the organisation is ready. The person whose mental model the brief has to actually move.
Sam Patel
CTO · the sceptic
Has the longest list of caveats. Scores Data & Architecture lower than anyone. Often right, and easy to dismiss as a brake.
Helen Bautista
COO · the operator
Runs the day-to-day. Scores Governance & Risk at 4. Reads the policy; doesn't yet see what's happening at the desks.
Priya Nair
Head of CX · the listener
Sits between the frontline and the executive. Brings Maya into the cohort because she knows the desk-level reality the policy can't see.
Maya Chen
CX Officer · the frontline
Has been using ChatGPT to triage tickets for six months. No one has asked. No one knows. Scores Governance & Risk at 2.
Anna Petrović
Head of Compliance · the line
Holds the regulator-tolerance position. Scores Governance & Risk at 3, between policy and practice. The honest reading.
Daniel Okafor
Head of People & Culture
Owns the People Impact dimension. The voice that asks "what does this mean for the twenty-three roles in the call centre?" when no one else does.
Robert Lin
CFO · the funder
Approved the budget on the strength of the deck. The brief has to land with him too, and the diagnosis may change the budget shape.
External perspective
Optional · advisor or peer
One outside voice (a non-exec director, a peer transformation lead, an industry advisor) who can score without inheriting the politics.
Four phases. Frame, gather, score, brief.
The framework
// The framework is deliberately small. Most assessments fail not because they're too shallow but because they collapse into surveys, dashboards, or political theatre. The discipline is to stay narrow and honest.
Frame
Week 1
Decide what you're assessing, against what version of the model, and with which cohort. The frame is the diagnosis-of-the-diagnosis.
Gather
Weeks 2–3
Three input streams. Structured interviews. Document review. A short self-assessment survey. The same six dimensions, three ways.
Score
Week 4
A calibration session. Honest stages, defended scores. Where the cohort disagrees is the signal; don't average it away.
Brief
Weeks 5–6
One short brief. Current state, the gap, the one or two dimensions where the gap matters most. The output is the input to Playbook 02.
The assessment matrix.
Six × six
// Six dimensions down the side, six stages across the top. The grid is the instrument the whole assessment runs on: interviews, documents, and the pulse all score against it. The discipline isn't precision. It's a defensible relative position, captured as a range where the cohort splits, never flattened to an average.
[Grid: six dimension rows (labels as extracted: Ambition, Sponsorship, Risk, Architecture, Fluency, Impact) scored across six stage columns.]
// Read it as Halcyon's cohort scored it. Governance & Risk and People Impact carry the widest spreads — policy at one end, practice at the other. Those two bands are why the brief in Phase 4 points at Playbooks 02 and 06. The grid doesn't decide; it makes the disagreement impossible to hide.
Phase one · Frame.
Week 1
The first week is the cheapest place to fix a bad assessment. Decide what you're assessing, why, against what version of the model, and with which cohort. The frame is the diagnosis-of-the-diagnosis. Get it wrong here and the rest of the playbook produces a confident answer to the wrong question.
Name the purpose of the assessment.
Is it diagnosis before a new programme? A health-check on a programme already running? A pulse to decide whether to invest? Each has a different cohort, a different time horizon, and a different threshold for honesty. Most assessments fail because they conflate the three. Write a one-sentence purpose at the top of the brief and refuse to drift from it.
Confirm the model fits.
The methodology's six-by-six grid is a starting point, not a contract. Read each dimension and stage with the executive sponsor. Where the language doesn't land for this organisation, adapt the language — not the structure. The dimensions earn their place; the words can change to match the vocabulary of the business. Resist re-engineering the grid itself.
Choose the cohort.
Eight to twelve people. Three rules. One — perspective over rank. A frontline officer who uses AI daily is more useful than a director who reads about it. Two — at least one sceptic. If everyone in the room agrees the organisation is ready, the room is wrong. Three — one external voice. A non-exec director, peer transformation lead, or industry advisor who can score without inheriting the politics.
Set the truth-seeking norms publicly.
Three norms, named in the kick-off. Confidence is part of the score. A Stage 3 with 80% confidence is different from a Stage 3 with 50%. Disagreement is data. Where the cohort splits, that's the diagnosis; don't average it away. The exec doesn't break ties. When the CEO and a CX officer disagree on Governance, the gap is the finding. Publishing these norms in week one makes them safe to enforce in week four.
A two-page framing memo, signed by three people.
The Transformation Lead lands at Halcyon on a Monday. By Friday of week one: a two-page framing memo, agreed and signed by Diana (CEO), Robert (CFO), and Daniel (Head of P&C). The third signature is deliberate — having P&C sign protects the People Impact dimension from being treated as optional. The memo names what the assessment is for, who's in the cohort, and the three norms above. It is also the document the brief in Phase 4 will be measured against.
- Purpose: Diagnosis before the twelve-month programme. Not a health-check on existing AI work.
- Cohort: 10 people · CEO, CTO, COO, CFO, Head of CX, CX Officer, Compliance, P&C, plus one external advisor
- Timebox: Six weeks · brief delivered by end of week 6
- Output: One brief, three pages · maturity scores with confidence, two priority dimensions, recommendation on Playbook 02
Most assessments are framed too narrowly — "score us on AI readiness" — and produce surface-level findings. The frame that earns its weight is one stage broader: "score where this organisation actually is, across the six dimensions, with calibrated honesty about confidence." The broader frame invites disagreement; the narrow one buries it. Sources: this pattern shows up across the reputable models — Gartner's five-level, MIT CISR's four-stage, McKinsey's six scaling practices. The breadth matters more than the count.
Phase two · Gather.
Weeks 2–3
Three inputs, the same six dimensions. Structured interviews with the cohort. Document review across strategy, comms, and hiring. A short self-assessment survey to a broader sample. The three triangulate. Where they agree, the picture is steady. Where they disagree, you've found the gap that matters.
Run the interviews.
Forty-five minutes per cohort member. Same script for all. For each of the six dimensions: where do you score the organisation today, on the one-to-six scale, and what evidence supports that score? Don't accept a score without evidence. The evidence is the data; the score is the summary. Record the score, the evidence, and one quote per dimension. Interview in pairs — one person leads, one takes notes, switch halfway.
Review the documents.
Six artefacts. Current strategy document. Most recent all-staff comms about AI. A sample of job ads from the last six months. The AI policy (or absence of one). Recent board pack on transformation. A sample of recent staff-survey results. The documents speak for themselves. Score each dimension against the evidence in them. The gap between document scores and interview scores is the gap between what the organisation says and what it does.
Send the pulse survey.
Eight questions. Six dimensions plus two open boxes. "For each of the six dimensions below, where do you think the organisation sits today, on the one-to-six scale? And what's one thing you'd want the assessment to surface?" Send to 50–100 staff across functions and levels — not the cohort. Anonymised. The survey is the broad pulse; the interviews are the depth. Treat the survey as direction, not data.
Hold a frontline shadow.
Half a day each, with three people doing the actual work the transformation will change. Not a meeting. Sit at the desk. Watch the workflow. Note where AI shows up, where it doesn't, what people work around, what tools are open in the background. Most maturity gaps are visible at the desk and invisible in the doc. Shadow notes won't appear in the final scores but they will shape every interpretation in Phase 3.
The Governance & Risk score has a 2-point spread.
By end of week three, the Transformation Lead has run ten interviews, reviewed all six artefacts, and shadowed three frontline workflows. The pulse survey has 68 responses.
The first signal arrives in the Governance & Risk row. The COO (Helen) and CFO (Robert) both score Halcyon at 4 — based on the policy document. Compliance (Anna) scores it at 3 — between policy and practice. The CX Officer (Maya) and the Head of CX (Priya) both score it at 2 — because they've watched staff paste customer data into consumer AI tools for six months and no one in the organisation has noticed.
The document review confirms the policy exists, was last updated in March, and references generative AI in a single paragraph. The frontline shadow confirms the consumer-AI workflows. The pulse survey shows 41% of respondents have used a consumer AI tool for work in the past month.
- Cohort spread: 2 to 4 across the ten scorers · two-point spread
- Document score: 4 · policy exists, lightly applied
- Pulse signal: 41% using consumer AI for work · unmeasured
- The diagnosis: Stage 2 in practice · Stage 4 on paper · the gap is the work
The most common failure mode in Phase 2 is scoring without evidence. "I'd say we're a 3" — based on what? Senior people are skilled at producing confident scores that are mostly intuition. The rule for the assessor: every score must be defensible against at least one piece of evidence — a document, a workflow observation, a specific interview quote. Where the evidence runs out, the score gets a confidence label of "low" and goes onto the watchlist for Phase 3.
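The evidence rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the playbook's tooling: the `Score` record, its field names, and `apply_evidence_rule` are all hypothetical, and the sample scorer and evidence string are placeholders.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Score:
    """One cohort member's score for one dimension (hypothetical record)."""
    scorer: str
    dimension: str
    stage: int                        # 1 to 6 on the maturity scale
    confidence: str                   # "low", "medium", or "high"
    evidence: Tuple[str, ...] = ()    # documents, shadow notes, interview quotes


def apply_evidence_rule(score: Score) -> Tuple[Score, bool]:
    """Return the score and whether it belongs on the Phase 3 watchlist.

    A score with no evidence keeps its stage but is relabelled "low".
    """
    if score.evidence:
        return score, False
    return replace(score, confidence="low"), True


# Placeholder example: a confident score with nothing behind it.
intuition = Score("Exec A", "Governance & Risk", 3, "high")
checked, on_watchlist = apply_evidence_rule(intuition)
print(checked.confidence, on_watchlist)
```

The stage is left untouched on purpose: the rule questions how much weight the number deserves, not the number itself.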
When the CEO scores the organisation high and a frontline officer scores it low, the room often defers to seniority. Don't. The frontline officer is closer to the workflow the score is supposed to describe. Senior scores reflect the deck; junior scores reflect the desk. The gap between them is the most useful single number in the whole assessment. Capture both. Average neither.
Phase three · Score.
Week 4
A half-day calibration session with the cohort. Honest stages, defended scores, confidence labels per dimension. The temptation in this phase is to manufacture consensus. Don't. Where the cohort splits, that's the finding: capture it, don't average it. Two scores three points apart are more useful than a single average that hides the gap.
Run the calibration session.
Four hours, the full cohort, in person if possible. The grid on the wall. One dimension at a time. Each cohort member shows their score and the one piece of evidence behind it. The room debates. Scores can move. They are not required to converge. Use the same truth-seeking norms from Phase 1 — confidence stated, dissent rewarded, no exec tie-breaks.
Capture confidence alongside the score.
For each cell, the cohort records two things: the stage (1–6) and the confidence (low / medium / high). A 4-with-low-confidence is a different finding from a 4-with-high-confidence. The first goes onto the watchlist. The second goes into the brief. Confidence is the calibration on the calibration — it tells the executive how much weight to put on each number.
Map the disagreement.
For each dimension, plot the spread of scores from the cohort on the wall. Where the spread is one point — agreement. Where it's two or three — investigate. Where it's four or more — the dimension is doing two different jobs in the organisation, and that itself is the finding. Disagreement maps are the most useful artefact most assessments never produce.
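The spread bands above translate directly into a small classifier. The sketch below is one possible reading, with assumptions flagged: the function name and labels are hypothetical, a spread of zero is treated as agreement (the text only names one point), and the sample score sets are illustrative rather than Halcyon data.

```python
from typing import Sequence


def classify_spread(stages: Sequence[int]) -> str:
    """Label a dimension by the gap between its highest and lowest score."""
    if not stages:
        raise ValueError("need at least one score")
    spread = max(stages) - min(stages)
    if spread <= 1:        # assumption: zero counts as agreement too
        return "agreement"
    if spread <= 3:        # two or three points
        return "investigate"
    return "two jobs"      # four or more: the dimension is doing two jobs


# Illustrative score sets, one per band.
for stages in ([3, 4, 3], [2, 4, 3, 4], [1, 5, 2]):
    print(stages, "->", classify_spread(stages))
```

On a one-to-six scale the largest possible spread is five, so the last band only covers spreads of four and five.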
Decline the urge to average.
When the cohort splits, the temptation is to call it a 3 and move on. Resist. The grid records the range, not the mean. The brief in Phase 4 explains the range. The executive can see for themselves where the organisation tells the same story and where it tells two. This is the discipline most assessments quietly drop. Holding it is the difference between a defensible diagnosis and a comfortable one.
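A short sketch shows what the average hides. It uses the Governance & Risk interview scores for the five scorers the worked example names (the other cohort members' scores are not given, so they are left out); the `grid_entry` helper and its output format are assumptions for illustration.

```python
from statistics import mean
from typing import Dict


def grid_entry(scores: Dict[str, int]) -> str:
    """Format a dimension as a stage range, collapsing only on full agreement."""
    low, high = min(scores.values()), max(scores.values())
    return f"Stage {low}" if low == high else f"Stage {low} to {high}"


# Governance & Risk, as scored in the Phase 2 interviews.
governance = {"Helen": 4, "Robert": 4, "Anna": 3, "Priya": 2, "Maya": 2}

print(mean(list(governance.values())))   # the comfortable number
print(grid_entry(governance))            # what the grid records instead
```

The mean of these five scores is exactly 3, the "clean number" a room is tempted to settle on. The range keeps both the policy view and the desk view on the page.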
The Governance & Risk row gets three numbers.
The calibration session lasts four hours. Diana (CEO) chairs the room but doesn't break ties — the norm is held. By the end of the session, every row on the grid has a stage range, a confidence label, and a one-line interpretation.
The Governance & Risk row is the one the room spends longest on. Helen and Robert hold at 4. Anna moves from 3 to "3 with caveats." Maya holds firm at 2. The room briefly debates landing on 3 to "give the assessment a clean number." Anna, who has taken the discipline of the framing memo to heart, calls it out: "If we land on 3, the brief tells the executive nothing they don't already think they know. The 2-to-4 spread is the story."
The room agrees. The Governance & Risk row is logged as Stage 2 – 4, low confidence, spread of 2 points. The interpretation, captured live: "policy at 4, practice at 2; the gap is unmeasured shadow AI and the absence of governance over consumer-tool usage. Highest-priority dimension for Playbook 02 to address."
- Helen (COO): Stage 4 · the policy exists and is current
- Anna (Compliance): Stage 3 with caveats · between policy and practice
- Maya (CX): Stage 2 · staff use consumer AI tools daily, unmeasured
- The brief entry: Stage 2 – 4 · low confidence · highest-priority gap
The discipline here is borrowed from Annie Duke's Thinking in Bets. The job of the assessment isn't to produce the "right" score; it's to produce the most defensible score given the evidence available. A 3 with high confidence and ten pieces of evidence is a different finding from a 3 with low confidence and one. Capture both. Resist the temptation, in the year-end review, to judge the assessment by whether the score "turned out to be right." Judge it by whether the reasoning was sound at the time. This is the same discipline the prioritisation forum in Playbook 03 will run. Naming it here builds the muscle for everything downstream.
When a dimension lands at exactly Stage 3 with medium confidence, the cohort has often agreed to disagree without saying so. Pressure-test it. Ask the room: "if we had to defend this score to a sceptical board member tomorrow, what would we say?" If the defence is thin, the score is comfort, not calibration. Split it into a range. Or label the confidence honestly as low.
Phase four · Brief.
Weeks 5–6
The assessment is only useful if the brief lands. One short document. Current state, the gap, the one or two dimensions where the gap matters most. The brief isn't a dashboard. It's a decision aid for the executive sponsor and the input to Playbook 02. If the brief can't be read in fifteen minutes and discussed in an hour, it's the wrong length.
Write the three-page brief.
Page one: the scored grid and a one-paragraph headline finding. Page two: the two priority dimensions, each with the evidence and the disagreement spread. Page three: the recommendation — which of Playbooks 02 through 06 the diagnosis points at, and in what order. Nothing else. No appendix. No executive summary that paraphrases what the next page says. The discipline is to make every word load-bearing.
Hold the brief-back session.
90 minutes with the executive sponsor and the cohort. Read the brief together — physically, in the room. Then debate, dimension by dimension, what the brief means for the next six months of investment. Resist the urge to present. The cohort has done the work; the brief speaks for it. The lead's job in this session is to keep the room honest about the findings, not to defend them.
Capture what the sponsor heard.
At the end of the session, ask the sponsor to summarise the diagnosis back to the room — in their own words. If the summary matches the brief, the assessment landed. If it doesn't, you've found the gap between the work and the reception, and that's the gap to close before Playbook 02 starts. This is the single most-skipped step in any assessment, and the one that determines whether the brief becomes action or wallpaper.
Recommend the next playbook.
The output of this playbook is the input to Playbook 02 (Strategic Planning · OKRs). The recommendation names the two dimensions the OKRs should move, the starting stage of each, and the target stage at twelve months. Concrete enough to write OKRs against. Honest enough that the sponsor can defend them. If the diagnosis points at a different playbook — say, Playbook 06 (Change Readiness) — because the people-impact gap dominates, the recommendation says so.
Three pages, three signatures, one clear recommendation.
The Transformation Lead writes the brief on the Tuesday of week six. Three pages. Diana, Robert, and Daniel — the three signatories on the framing memo — receive it forty-eight hours before the brief-back session, so they can read it in advance.
The brief-back session is held on the Friday. Ninety minutes. The cohort is in the room. The grid is on the wall. Diana opens by reading her own summary back: "We're better on strategy and capability than I thought. We're materially worse on governance and people-impact than I thought. Those two are where the OKRs need to focus." The room agrees. The CFO commits to a budget shape that follows the diagnosis. The CTO, who came in expecting to be vindicated on data-architecture risk, finds he's been vindicated — and that he's also being asked to support the governance work he hadn't expected to lead.
The recommendation: Playbook 02 (Strategic Planning) is the next step. The OKRs to come out of it will target Governance & Risk and People Impact as the two priority dimensions. Capability & Fluency is named as the supporting dimension; the other three are flagged as watchlist for the next assessment, twelve months out.
- Headline finding: Governance gap (Stage 2 to 4) and people-impact gap (Stage 2 to 3) are the priority
- Recommendation: Playbook 02 next · OKRs target Governance and People Impact
- Re-assessment: Twelve months · same cohort · same dimensions
- Sponsor read: Diana's summary matched the brief · the assessment landed
The most-read line in the brief is usually the worst-written one: the headline finding. Most assessments default to "our overall maturity is Stage 3." That sentence does no work. The headline that earns its keep names a specific gap, between two specific perspectives, on a specific dimension, with a specific implication. Halcyon's headline isn't "we're at Stage 3 overall." It's "we score Stage 4 on policy and Stage 2 on practice in Governance & Risk; the gap is forty-one percent of pulse-survey respondents using consumer AI tools for work, unmeasured." One sentence. Three pieces of evidence. A direction.
Assessments that produce beautiful artefacts and no decisions are worse than assessments that don't happen. The test, three months after the brief: has the diagnosis changed anything? If the brief is in a SharePoint folder and the programme is being run on the original ambition, the assessment was wallpaper. Schedule the three-month check-in at the brief-back session. Put it in the calendar before the room leaves.
The hardest part of an assessment isn't producing the score — it's resisting the urge to average away the disagreement.
Three cadences. Quarterly, annual, programme-end.
After the first assessment
// A maturity assessment is a project, not a system. But the diagnosis goes stale, and the gap between this assessment and the next is where most transformations quietly lose the plot. Three cadences keep the assessment doing work after the brief lands.
The check-in.
- Has the diagnosis changed anything?
- Is the priority dimension still the priority?
- Any new evidence on the watchlist?
- Three lines · captured, circulated
The re-assessment.
- Same cohort, same dimensions
- Score against the prior year's grid
- Which stages moved · and which didn't
- Lighter than the first · same discipline
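Scoring against the prior year's grid comes down to a dimension-by-dimension comparison of stage ranges. The sketch below is one way to do it, under assumptions: the `movement` function and its four labels are hypothetical, the prior ranges are the two priority gaps from Halcyon's brief, and the year-two ranges are invented purely to show the comparison.

```python
from typing import Dict, Tuple

StageRange = Tuple[int, int]  # (lowest, highest) stage recorded for a dimension


def movement(prior: Dict[str, StageRange],
             current: Dict[str, StageRange]) -> Dict[str, str]:
    """Compare two grids and say which dimensions moved."""
    result = {}
    for dimension, (p_low, p_high) in prior.items():
        c_low, c_high = current[dimension]
        if (c_low, c_high) == (p_low, p_high):
            result[dimension] = "unchanged"
        elif c_low >= p_low and c_high >= p_high:
            result[dimension] = "moved up"
        elif c_low <= p_low and c_high <= p_high:
            result[dimension] = "moved down"
        else:
            result[dimension] = "spread changed"  # range narrowed or widened
    return result


prior = {"Governance & Risk": (2, 4), "People Impact": (2, 3)}    # from the brief
current = {"Governance & Risk": (3, 4), "People Impact": (2, 3)}  # hypothetical
print(movement(prior, current))
```

Comparing ranges rather than single numbers keeps the re-assessment consistent with the first one: a floor that rises from 2 to 3 shows up as movement even when the ceiling stays at 4.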
The retrospective.
- Where did the assessment get it right?
- Where did it miss?
- Judge the reasoning · not the outcome
- What we'd ask differently next time
Four ways this fails.
Common pitfalls
// Every maturity assessment that didn't change anything failed in one of these four ways. Watch for them across the six weeks, and especially in the brief-back session.
The comfort average.
The cohort scores split badly on a dimension, and the room agrees to call it a 3 to "produce a clean number." The split was the diagnosis; the average is wallpaper. By the time anyone notices, the OKRs have been written against a fiction.
Make the disagreement-map a required artefact. The brief reports ranges where the cohort splits. The discomfort of writing "Stage 2–4, low confidence" is the discomfort that protects the diagnosis.
The executive halo.
The CEO scores high, the frontline scores low, and the room defers to seniority. The senior view describes the deck; the frontline view describes the desk. Averaging them produces a number that's accountable to neither.
Capture both scores by name in the brief. The CEO sees that the gap is real — and that closing it is what the programme is for. The norm "the exec doesn't break ties" was published in Phase 1 for this exact moment.
The evidence-free score.
Confident scores produced by intuition alone. The cohort says "I'd put us at a 4" and the room nods. Three months later, the OKRs aren't moving and no one can remember why the score was a 4. The assessment was an opinion poll dressed as a diagnosis.
No score without one piece of evidence behind it. The evidence column is required, not optional. Where the evidence runs out, the confidence label drops to "low" — and the dimension goes onto the watchlist for the re-assessment.
The shelved brief.
The brief is delivered, the room nods, the SharePoint folder receives it. Three months later, the programme is running on the original ambition. The diagnosis didn't lose to a better argument — it lost to inertia.
Schedule the quarterly check-in inside the brief-back session, before the room leaves. Capture the sponsor's own summary — in their words — and circulate it. The next session opens with "three months ago, you said..."
Three disciplines underneath.
What the methodology page covers in full
// The four phases are the mechanics. These three are the thinking habits that keep them honest. They're surfaced in full on the methodology page; named here so the playbook is honest about what it rests on.
Score the evidence, not the intuition.
- Every score, one piece of evidence
- Where evidence runs out, label low
- Documents vs. desk · triangulate
- Don't score what can't be defended
Judge the assessment, not the number.
- A good call can be a wrong score
- A lucky guess can read as right
- Judge the reasoning at the time
- Retrospective on the assessment, annually
Make the cohort truth-seeking.
- State confidence levels
- Reward dissent · the gap is data
- Accuracy over agreement
- The exec doesn't break ties
By week six, you should have.
Starter checklist
// If you can tick all fifteen, the assessment is defensible. Anything missing at week six is debt: it'll surface in month three when the OKRs land and the cohort has moved on.
Using this in practice.
Closer
The assessment is the floor, not the ceiling.
This playbook is a starting point, not a prescription. Every organisation has its own gravity — political, technical, cultural — that bends the assessment in different ways. Halcyon Financial is one shape. Yours will be different.
What travels is the discipline, not the artefacts. The framing memo will be a different two pages. Your cohort will have your faces around the table, your sceptic, your CEO who scored too high. The shape of the work — frame, gather, score, brief — is the thing that doesn't move. Neither does the discipline underneath it: score the evidence, judge the assessment, keep the cohort honest.
If you're running an assessment and want to talk through where it's getting stuck — particularly the calibration session, which is where most assessments quietly collapse — I'm happy to.