Write the bar before you measure.
Before you write a single test case, write three things by hand: a description of the task, three "ideal" outputs, and the failure modes you're most worried about. The exercise is less about producing artefacts than forcing a real decision about what success looks like.
-
01
Describe the task in one paragraph.
What is this Skill, prompt, or MCP integration meant to do? One paragraph, in plain language, no jargon. If you can't write it cleanly, the eval will inherit the confusion.
-
02
Write three ideal outputs by hand.
Take three real inputs. Write the perfect output for each, yourself. Not Claude's output. Yours. These become the reference for what "good" looks like — and they expose disagreements between team members before scoring starts.
-
03
Name the top three failure modes.
What are you most worried about going wrong? Hallucinated facts? Wrong format? Missed fields? Inappropriate tone? Name them specifically. These drive the rubric in Phase 3.
OWNER: TBD). The failure modes worth flagging: fabricated owners or dates, actions buried in prose instead of listed, and missed risks presented as actions.
Writing the three ideal outputs is the moment teams realise they don't agree on what they're building. That's not a failure of the eval; it's the eval doing its job before any code runs.
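One way to keep the three hand-written items honest is to record them in a small structure that complains when something is missing. The sketch below is an illustration only, not a prescribed format: the names `EvalBar`, `IdealOutput`, and `problems` are hypothetical, and the task description and input strings are placeholders. The three failure modes are the ones named above.

```python
from dataclasses import dataclass, field


@dataclass
class IdealOutput:
    """One real input paired with the output a person wrote for it by hand."""
    source_input: str
    handwritten_output: str


@dataclass
class EvalBar:
    """The three hand-written items, recorded before any test case exists."""
    task_description: str
    ideal_outputs: list[IdealOutput] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the gaps found; an empty list means the bar is fully written."""
        found = []
        description = self.task_description.strip()
        if not description:
            found.append("task description is empty")
        elif "\n\n" in description:
            # A blank line means the description ran past one paragraph.
            found.append("task description should be a single paragraph")
        if len(self.ideal_outputs) != 3:
            found.append(f"expected 3 ideal outputs, got {len(self.ideal_outputs)}")
        for i, ideal in enumerate(self.ideal_outputs, start=1):
            if not ideal.source_input.strip() or not ideal.handwritten_output.strip():
                found.append(f"ideal output {i} is missing its input or its output")
        named = [mode for mode in self.failure_modes if mode.strip()]
        if len(named) != 3:
            found.append(f"expected 3 named failure modes, got {len(named)}")
        return found


# Hypothetical, partly filled-in bar: only one ideal output has been written so far.
bar = EvalBar(
    task_description="Placeholder: one plain-language paragraph describing the task.",
    ideal_outputs=[IdealOutput("real input 1", "hand-written output 1")],
    failure_modes=[
        "fabricated owners or dates",
        "actions buried in prose instead of listed",
        "missed risks presented as actions",
    ],
)
print(bar.problems())  # ['expected 3 ideal outputs, got 1']
```

The check is deliberately shallow. It can confirm that three ideal outputs exist, but not that the team agrees on them; that part still has to happen in conversation.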