Playbook · Evals for AI work

Knowing if it works.

You've built a Skill. You've stood up an MCP connector. Both look great in the demo. Then real inputs arrive — and you can't tell what's working from what's quietly failing. The fix is small, cheap, and underused: an eval. A handful of real test cases, a clear rubric, a re-run every time something changes. Half a day to set up. Pays for itself the first time it catches a regression.

// Who it's for

Anyone running Claude in real work

// Time investment

½ day to build · 30 min/wk

// You'll need

10–30 real inputs from your work

// You'll walk away with

A working eval suite

01 —

// The framework

Four phases. Define, collect, score, iterate.

An eval is just a test set, a rubric, and the discipline to re-run them. The framework below keeps each piece small. Many teams stop at Phase 3, which still leaves them with a scored baseline, but Phase 4 is where the real return on the investment shows up.

Define

Steps 1–3

Write down what "good" means before you build the test set. Three sample ideal outputs, by hand. Top failure modes named.

Collect

Steps 4–7

Pull 10–30 real cases from your actual work. Mostly boring middle. Some edges. A few the system should refuse.

Score

Steps 8–11

Run every case through the system. Grade each output against a two- or three-criterion rubric. Tally. Look for patterns.

Iterate

Steps 12–14

Change one thing. Re-run the whole eval. Compare scores. Ship only when improvements outweigh regressions.

02 —

// Phase one

Decide what good means.

Skip this step and the rest collapses. Most evals fail because the team starts scoring before they've agreed what they're scoring against. Half an hour writing this down beats a week of arguing in code review.

Before you write a single test case, write three things by hand: a description of the task, three "ideal" outputs, and the failure modes you're most worried about. The exercise is less about producing artefacts than forcing a real decision about what success looks like.

01

Describe the task in one paragraph.

What is this Skill, prompt, or MCP integration meant to do? One paragraph, in plain language, no jargon. If you can't write it cleanly, the eval will inherit the confusion.
02

Write three ideal outputs by hand.

Take three real inputs. Write the perfect output for each, yourself. Not Claude's output. Yours. These become the reference for what "good" looks like — and they expose disagreements between team members before scoring starts.
03

Name the top three failure modes.

What are you most worried about going wrong? Hallucinated facts? Wrong format? Missed fields? Inappropriate tone? Name them specifically. These drive the rubric in Phase 3.

// Working example · two playbooks, two evals

Skills

For the meeting → action items Skill from Playbook 03: the task is "extract action items from raw meeting notes, with owner and due date." An ideal output is a clean list, one action per line, every item with an owner and a date (or an explicit OWNER: TBD). The failure modes worth flagging: fabricated owners or dates, actions buried in prose instead of listed, and risks reported as if they were actions.

MCP

For the Google Drive connector from Playbook 04: the task is "given a natural-language file request, find and return the right document." An ideal output names the correct file, shows the metadata, and links to it. The failure modes: wrong file returned (right name, wrong folder), hallucinated paths for files that don't exist, and silent failure — confidently returning nothing useful instead of saying "I couldn't find that."

Insight from the field

Writing the three ideal outputs is the moment teams realise they don't agree on what they're building. That's not a failure of the eval — it's the eval doing its job before any code runs.

// Phase 1 deliverables

→Task brief · one paragraph

→3 ideal outputs · hand-written

→Failure modes · top three named

// Phase two

Build a realistic test set.

A test set is only useful if it looks like the work. The trap is overweighting the cases that originally broke the system: that's selection bias dressed as rigour. Aim for a mix that mirrors the real distribution, with edges intentionally added rather than dominant.

04

Pull 15–20 real cases from your work.

Not invented ones. Real meeting notes. Real queries. Real inputs you'd send the system tomorrow. These are the "boring middle" — the routine workload the system needs to handle reliably.
05

Add 5 edge cases.

Cases you've seen break things. Unusually short inputs. Unusually long ones. Ambiguous wording. Inputs from a context the system wasn't designed for. Don't make these up — pull them from real history if you can.
06

Add 3 refusal cases.

Cases the system should refuse, escalate, or flag as out of scope. A request that's actually harmful. A query that needs human judgement. A file the user shouldn't have access to. These test the system's "no", which is often weaker than its "yes".
07

Store everything in one place.

One spreadsheet, one JSON file, or one folder. Pick one. Each case gets a stable ID (EVAL-001 through EVAL-030) so scores stay matched to inputs across runs. No clever filing schemes. The simplest possible store, version-controlled.
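As one possible shape for that store, here is a minimal sketch of a single JSON file with stable IDs. The field names (`id`, `category`, `input`, `expected`) and the helper names are assumptions made for illustration; the playbook does not prescribe a schema.

```python
import json
from pathlib import Path

# Hypothetical category labels, mirroring the routine / edge / refusal split.
CATEGORIES = {"routine", "edge", "refusal"}


def make_case(number: int, category: str, text: str, expected: str) -> dict:
    """Build one test case with a stable, zero-padded ID such as EVAL-007."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return {
        "id": f"EVAL-{number:03d}",
        "category": category,
        "input": text,
        "expected": expected,
    }


def save_cases(cases: list[dict], path: Path) -> None:
    """Write every case to one JSON file, refusing duplicate IDs."""
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case IDs")
    path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")


def load_cases(path: Path) -> dict[str, dict]:
    """Load the file and index cases by ID so scores can be matched across runs."""
    cases = json.loads(path.read_text(encoding="utf-8"))
    return {case["id"]: case for case in cases}
```

Committing the resulting file to version control gives you the history of the test set for free.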

// Working example · what 20 cases look like

Skills

For the meeting → action items Skill: 15 real sets of meeting notes from the last quarter (mix of recurring 1:1s, project standups, decision meetings); 3 edges (a 90-minute strategy session, a 5-line standup, a meeting in which nothing was decided); 2 refusals (notes that are mostly personal venting, and a confidential HR conversation where extraction is inappropriate). All stored in one folder, one file per case, named EVAL-001.md through EVAL-020.md.

MCP

For the Google Drive connector: 15 real file queries (mix of exact names, descriptive queries, "the deck I sent Sarah"); 3 edges (a name that matches three files, a query for a file that exists in a folder I can't access, a query in mixed case with a typo); 2 refusals (a request to delete a file, a query that resolves to a file in someone else's drive). Each case logged as a row in a spreadsheet with columns for input, expected behaviour, and notes.

Watch for

Test sets that drift toward the dramatic. When 60% of your cases are edge cases, you're not measuring the system's everyday behaviour — you're stress-testing it. That has its place, but it's not the same job. Keep the realistic distribution.
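A drift check like this can be automated in a few lines. The sketch below compares the test set's mix against the ~70/20/10 target; the 10-point tolerance and the function names are assumptions, not values from the playbook.

```python
from collections import Counter

# Target mix from the Phase 2 deliverables.
TARGET_MIX = {"routine": 0.70, "edge": 0.20, "refusal": 0.10}


def category_mix(categories: list[str]) -> dict[str, float]:
    """Return the share of the test set that falls in each category."""
    if not categories:
        raise ValueError("empty test set")
    counts = Counter(categories)
    return {name: counts.get(name, 0) / len(categories) for name in TARGET_MIX}


def drifted(categories: list[str], tolerance: float = 0.10) -> list[str]:
    """List the categories whose share is off target by more than `tolerance`."""
    mix = category_mix(categories)
    return [
        name
        for name, target in TARGET_MIX.items()
        if abs(mix[name] - target) > tolerance
    ]
```

With the 15 / 3 / 2 split from the working example, `drifted` returns an empty list. A set with 12 edge cases out of 20 gets flagged on both `routine` and `edge`.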

// Phase 2 deliverables

→Test set · 20–30 cases, stable IDs

→Distribution · ~70% routine, 20% edge, 10% refusal

→One store · spreadsheet or folder

// Phase three

Grade every output the same way.

The rubric is the eval's spine. Keep it short: two or three criteria, scored 1–5, with written descriptions for each level. The descriptions matter more than the scale. Score manually the first few rounds; automation can come later if it's worth the effort.

08

Write the rubric.

Two or three criteria, no more. Each scored 1–5. Write a one-line description for each level on each criterion. Tied directly to your failure modes from Phase 1. If a failure mode isn't represented in the rubric, the rubric won't catch it.
09

Run the system on every case.

Same prompt. Same model. Same settings. Save every output verbatim, paired with its case ID. This is the baseline run — every future run gets compared back to it.
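A baseline run can be as simple as a loop that saves each output next to its case ID. The sketch below assumes a hypothetical `run_system` callable that wraps whatever you are testing (a prompt, a Skill, an MCP-backed call); the one-file-per-case layout is also an assumption.

```python
import json
from pathlib import Path
from typing import Callable


def run_eval(
    cases: dict[str, str],
    run_system: Callable[[str], str],
    out_dir: Path,
    run_name: str,
) -> Path:
    """Run every case once and save each output verbatim, keyed by case ID."""
    run_dir = out_dir / run_name
    # exist_ok=False: fail loudly rather than overwrite an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)
    for case_id, text in sorted(cases.items()):
        output = run_system(text)
        record = {"id": case_id, "input": text, "output": output}
        (run_dir / f"{case_id}.json").write_text(
            json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return run_dir
```

Calling it with `run_name="baseline"` for the first run and a new name for each later run keeps every set of outputs available for comparison.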
10

Grade outputs in pairs.

Two graders score every output independently, then compare results side by side. Disagreements are where the rubric needs sharpening, and where the most interesting conversations happen. If one grader sees a 2 where the other sees a 4, the rubric isn't doing its job.
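Finding the disagreements worth discussing can be sketched as below. The data shape (case ID → criterion → score) and the one-point `max_gap` threshold are assumptions chosen for illustration.

```python
Scores = dict[str, dict[str, int]]  # case ID -> criterion -> score (1-5)


def disagreements(
    grader_a: Scores, grader_b: Scores, max_gap: int = 1
) -> list[tuple[str, str, int, int]]:
    """Return (case_id, criterion, score_a, score_b) wherever the two
    graders differ by more than `max_gap` points."""
    flagged = []
    # Only compare cases that both graders actually scored.
    for case_id in sorted(grader_a.keys() & grader_b.keys()):
        for criterion, score_a in grader_a[case_id].items():
            score_b = grader_b[case_id].get(criterion)
            if score_b is not None and abs(score_a - score_b) > max_gap:
                flagged.append((case_id, criterion, score_a, score_b))
    return flagged
```

The 2-versus-4 split described above would be flagged; a 4-versus-5 split would not.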
11

Look for patterns, not averages.

The average score tells you almost nothing. What matters is the distribution: which cases failed, on which criteria, in which clusters. A system with an average of 4.0 but three 1s on the refusal cases is a system with a serious problem.
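To make that concrete, the sketch below groups failing scores by case category and criterion instead of averaging them away. The data shapes, the `pass_mark` of 4, and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def overall_average(scores: dict[str, dict[str, int]]) -> float:
    """The single number that hides the problem."""
    return mean([s for per_case in scores.values() for s in per_case.values()])


def failing_clusters(
    scores: dict[str, dict[str, int]],
    categories: dict[str, str],
    pass_mark: int = 4,
) -> dict[tuple[str, str], list[str]]:
    """Group every below-pass score by (case category, criterion)."""
    clusters = defaultdict(list)
    for case_id, per_case in scores.items():
        for criterion, score in per_case.items():
            if score < pass_mark:
                clusters[(categories[case_id], criterion)].append(case_id)
    return {key: sorted(ids) for key, ids in clusters.items()}
```

A 20-case run with nine 5s, eight 4s, and three 1s averages exactly 4.0. If the three 1s all sit on refusal cases, `failing_clusters` returns a single cluster naming them, which is the finding that matters.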

// A simple rubric in action — 3 criteria × 5 levels

// Working example · two rubrics

Skills

Meeting → action items rubric: three criteria, each scored 1–5. Correctness (right actions extracted), completeness (owner + date present), and fabrication (nothing invented, so a 5 means no invented content). A pass is ≥4 on every criterion. A sample baseline run might come out at 4.2 / 3.8 / 4.4, meaning the Skill works but is missing owners and dates more often than expected.
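The pass rule and the per-criterion averages can be sketched as follows. The criterion names come from the example above; the score-sheet shape (case ID → criterion → score) is an assumption.

```python
CRITERIA = ("correctness", "completeness", "fabrication")
PASS_MARK = 4  # a pass is >= 4 on every criterion


def passes(scores: dict[str, int]) -> bool:
    """True only if every criterion is graded, in range, and at the pass mark."""
    for criterion in CRITERIA:
        score = scores[criterion]  # KeyError if a criterion was not graded
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} score out of range: {score}")
    return all(scores[criterion] >= PASS_MARK for criterion in CRITERIA)


def criterion_averages(sheet: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average each criterion across all cases, rounded to one decimal."""
    return {
        criterion: round(sum(row[criterion] for row in sheet.values()) / len(sheet), 1)
        for criterion in CRITERIA
    }
```

Note that a case scoring 5 / 3 / 5 fails even though its mean is above 4, which is the point of requiring a pass on every criterion.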

MCP

Google Drive connector rubric: pass/fail on three things — right tool chosen, right arguments passed, right answer derived from the response. No 1–5 scale needed; the binary is obvious. Sample baseline: 18/20 right tool, 16/20 right arguments, 14/20 right answer. The drop-off identifies where the breakage is.

// Phase 3 deliverables

→Rubric · 2–3 criteria, 1–5, written descriptions

→Baseline run · every case, output saved

→Score sheet · per case, per criterion

→Pattern notes · where it fails, why

// Phase four

Change one thing. Re-run everything.

This is the phase the eval was built for. Every change to the prompt, the Skill, or the MCP tool descriptions (anything at all) triggers a full re-run. Compare side by side. Only ship when the wins clearly outweigh the regressions, and the regressions you accept are ones you can live with.

12

Target the biggest failure pattern first.

From Phase 3, you'll have a sense of where the system is breaking. Pick one. Change one thing — a rule, a description, a prompt clause. Not three. One. Otherwise you can't tell what helped and what hurt.
13

Re-run the whole eval.

Every case. Not just the ones you were targeting. The point of the full re-run is to surface regressions you weren't looking for. Save the new outputs alongside the old ones, with the change documented.
14

Compare side by side.

For each case: did the score go up, down, or stay the same? Tally the deltas. Ship if improvements outweigh regressions and the regressions are ones you can defend. If a fix breaks something more important than what it fixed, revert and try again.
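The tally of deltas can be sketched as below, working per case and per criterion so that a gain on one criterion cannot mask a loss on another. The score-sheet shape and function names are assumptions.

```python
Sheet = dict[str, dict[str, int]]  # case ID -> criterion -> score


def score_deltas(baseline: Sheet, candidate: Sheet) -> dict[tuple[str, str], int]:
    """Return the non-zero score changes, keyed by (case_id, criterion)."""
    deltas = {}
    for case_id, before in baseline.items():
        after = candidate[case_id]  # KeyError if the re-run skipped a case
        for criterion, old_score in before.items():
            change = after[criterion] - old_score
            if change != 0:
                deltas[(case_id, criterion)] = change
    return deltas


def summarize(deltas: dict[tuple[str, str], int]) -> dict[str, int]:
    """Count how many scores went up and how many went down."""
    return {
        "improved": sum(1 for change in deltas.values() if change > 0),
        "regressed": sum(1 for change in deltas.values() if change < 0),
    }
```

The counts inform the ship or revert decision but do not make it: one regression on a refusal case can outweigh several small improvements elsewhere.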

// Working example · iteration in practice

Skills

You add a rule to the meeting → action items Skill: "If no date is mentioned, write DATE: TBD rather than guessing." Re-run all 20 cases. The fabrication score jumps from 4.4 to 4.9. Good. But two routine cases now show DATE: TBD where the original output had inferred a reasonable date from context. Is that a regression? Depends on whether silent inference or explicit TBD is more useful for your downstream workflow. The eval doesn't decide for you — it shows you the trade.

MCP

You update the Google Drive connector's tool description from "search files by name" to "search files by name or descriptive content". Re-run all 20 cases. The "right tool chosen" score jumps from 18 to 20. But the "right answer" score drops from 14 to 11 — because Claude is now using the descriptive search for queries where the exact-name match would have been more reliable. Net: regression. Revert the description, try a more targeted fix.

Insight from the field

The discipline isn't running the eval — it's running it before you ship, not after you discover a problem in production. The cost of running the eval is half an hour. The cost of debugging in production is the rest of your week.

// Phase 4 deliverables

→Change log · what changed, when

→Score comparison · before / after, every case

→Regression notes · what you accepted, why

→Ship / revert decision · documented

An eval doesn't make your AI work better. It makes you honest about whether it does.

03 —

// Operating rhythm

Three cadences. Per change, weekly, quarterly.

Once the eval exists, it runs on three rhythms. Each has a different job. Confusing them — running the full eval daily, or skipping the quarterly review — is how evals quietly stop being useful.

Per change · 30 min

The re-run.

Triggered by any prompt or system change
Full eval, every case
Score deltas vs. baseline
Ship / revert decision logged

Weekly · 20 min

The top-up.

Skim last week's real outputs
Add 1–2 surprises to the test set
Note any new failure modes
Retire stale cases as needed

Quarterly · 2 hr

The reset.

Re-grade old cases against current rubric
Check distribution still matches reality
Audit failure-mode list — anything new?
Decide what to retire, what to add

04 —

// Common pitfalls

Four ways the eval fails.

Most evals that get built don't get used six months later. They fail in one of four predictable ways. Watch for these in the first ninety days — easier to fix early, before the discipline lapses.

// Pitfall 01

The single-example trap.

Someone tries the prompt once, it works, they ship. The eval gets built later, in response to a production incident, by which point it's catch-up rather than discipline.

The fix: Build Phase 1 before you build the system itself. The bar should exist before the work does. If you're already mid-build, stop and write the three ideal outputs now.

// Pitfall 02

Test set tilt.

The test set ends up dominated by cases that broke the system originally. Every prompt change gets evaluated against the failure museum, not the realistic workload. Scores improve but production gets worse.

The fix: Rebalance every quarter. Aim for roughly 70% routine, 20% edge, 10% refusal. If the mix has drifted, prune the dramatic cases until the distribution looks like real life again.

// Pitfall 03

Rubric inflation.

The rubric starts at three criteria and grows to nine. Sub-criteria appear. Weighted formulas are introduced. Scoring takes an hour. People stop running the eval.

The fix: Cap the rubric at three criteria, the same ceiling set in Phase 3. To add a new one, retire an existing one. No sub-criteria. The discipline of the rubric matters more than its sophistication.

// Pitfall 04

The static suite.

The test set was built in week one and never touched. The world moves on — new user behaviours, new edge cases, new failure modes — but the eval reflects last year's reality. The scores stay high while production quality falls.

The fix: The weekly top-up is non-negotiable. Twenty minutes, one or two new cases, a retired stale one. The eval should reflect the work as it is, not as it was.

05 —

// Your starter checklist

By end of day one, you should have.

Twelve items. If you can tick them all, you have a working eval. Anything missing is debt that'll surface the first time the system changes underneath you.

✓ A one-paragraph task description (Phase 1)
✓ Three ideal outputs, hand-written (Phase 1)
✓ Top three failure modes named (Phase 1)
✓ 15–20 routine cases from real work (Phase 2)
✓ 5 edge cases pulled from history (Phase 2)
✓ 3 refusal cases the system should decline (Phase 2)
✓ All cases stored in one place with stable IDs (Phase 2)
✓ A rubric · 2–3 criteria, 1–5, level descriptions (Phase 3)
✓ A baseline run, every output saved (Phase 3)
✓ A score sheet · per case, per criterion (Phase 3)
✓ A change log ready for the first iteration (Phase 4)
✓ The weekly top-up scheduled in your calendar (Phase 4)

Using this in practice?

The framework above doesn't make your AI work better. What it does is force honesty about whether it's working — and route effort toward the failures that actually matter, instead of the ones that happen to be loud.

Most teams I've worked with build the eval in half a day and then realise it's the cheapest piece of infrastructure they own. The discipline isn't the rubric. It's the re-run. Build the habit of "every change triggers the full eval" in the first month, and the next six months take care of themselves.

This playbook pairs with Playbook 03 (Skills) and Playbook 04 (MCP) — both produce things that benefit from being measured. If you've built either, you already have the inputs to start.
