Tests Are an API for Agents: The Fastest Path to Reliable Diffs
At 9:18 a.m., I asked a coding agent for a tiny auth fix in a Rails app: block users from viewing invoices outside their own account. The first diff looked reasonable. The second diff, with the same prompt and same repo, moved the check to a different layer and quietly changed behavior in another endpoint. By lunch, I had code that read well, passed a few shallow checks, and still made me nervous.
That is the moment most teams start polishing prompts. I used to do that too. I would add more caveats, more context, more reminders, and watch the prompt grow into a fragile wall of words.
The real question is simpler: what actually makes agent output reliable when the repo is changing every day? My answer is direct: treat tests as the contract between you and the agent, not as a final gate at the end.
The yardstick I use for that is legibility: how quickly you can see what changed, why it changed, and whether it is safe.
When tests carry the contract, legibility goes up. When prompts carry the contract, legibility drifts because language is flexible and code is not.
Why don't prompts give agents enough certainty?
Prompts are great for direction. They are weak as long-term agreements. They age quickly because your codebase changes faster than your carefully written instructions.
Even when a prompt is detailed, the agent still has to choose where to act. In Rails alone, one behavior change can be applied in a controller, policy, model scope, service object, serializer, or UI path. Several of those choices may look correct in isolation. Only one may be correct for your local rules.
I learned this the expensive way on a Turbo flow where account permissions were split across two endpoints. The prompt said, "enforce tenant isolation across invoice views and exports." The agent did exactly what I asked in the primary endpoint, then widened access in the export path because the query object was reused differently there. The signals disagreed: green tests on one path, risky behavior on another.
Prompt quality still matters. It just cannot be the only source of truth for behavior that can harm users, data, or money.
What changes when tests become the contract?
When tests become the contract, the agent stops guessing policy and starts proving compliance. That one shift changes the entire loop.
Think of prompts as directions spoken over the phone, and tests as a signed delivery receipt. Directions help the driver move. The receipt confirms the package reached the right door. If you only keep directions, disputes are inevitable.
The practical chain looks like this:
- Name one behavior change in plain language.
- Add or update one test that fails for exactly that behavior.
- Ask the agent to make only that test pass with the smallest useful diff.
- Run the whole relevant suite to catch nearby breakage.
- Keep the passing test as the living rule for future edits.
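The chain above can be sketched in plain Ruby with Minitest from the standard library, no Rails stack required. `InvoicePolicy`, `User`, and `Invoice` here are hypothetical stand-ins for wherever your rule actually lives; the point is that the test states the policy before any implementation exists.

```ruby
require "minitest/autorun"

# Hypothetical domain objects; in a real app these would be models.
User = Struct.new(:account_id)
Invoice = Struct.new(:account_id)

# The rule lives in one named place, so the agent has a narrow target.
InvoicePolicy = Struct.new(:user) do
  # A user may only view invoices that belong to their own account.
  def can_view?(invoice)
    invoice.account_id == user.account_id
  end
end

class InvoicePolicyTest < Minitest::Test
  # Test names read like policy sentences, not screen names.
  def test_user_cannot_view_invoice_from_another_account
    refute InvoicePolicy.new(User.new(1)).can_view?(Invoice.new(2))
  end

  def test_user_can_view_invoice_from_own_account
    assert InvoicePolicy.new(User.new(1)).can_view?(Invoice.new(1))
  end
end
```

Written first, the `cannot_view` test fails for exactly one behavior, and "make only this test pass with the smallest useful diff" becomes a prompt the agent cannot misread.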
This leads to smaller diffs because the target is narrow. Smaller diffs lead to faster review because fewer assumptions are hidden. Faster review leads to fewer retries because you resolve policy questions early, before more code piles on top.
A quick comparison makes the trade-off obvious:
| If tests are only a gate | If tests are the contract |
|---|---|
| Agent writes broad code first | Agent has a narrow target first |
| Review hunts for intent mismatch | Review checks policy alignment |
| Failures arrive late in CI | Failures appear at the behavior edge |
| Retry cost grows with diff size | Retry cost stays bounded |
Where does this fail in real teams?
It fails when tests are present but vague.
One sprint, I asked for a change in trial-to-paid billing transitions. Users who reactivated after cancellation needed a different proration path. I had tests, but they were heavy setup tests named after screens, not rules. The agent made them pass by adjusting fixture data and a helper method, while the real proration logic stayed wrong for one reactivation edge.
What changed was subtle: a date boundary shifted after midnight in one timezone and moved users into the wrong branch. The confusing part was cost, not crash. Money logic stayed "working" until support tickets proved otherwise.
I fixed it by cutting the broad tests into small rule tests: one for reactivation timing, one for proration calculation, one for invoice line item output. Then I removed hidden time dependence by freezing the clock in each test. The failures became obvious in minutes instead of hours.
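One way to remove that hidden time dependence is to make the clock an explicit argument instead of reading it inside the logic. The sketch below is a simplified, hypothetical version of the reactivation rule (the grace window and the `proration_path` function are invented for illustration); the real fix followed the same shape, with the boundary pinned in UTC so the midnight edge is tested on both sides.

```ruby
require "minitest/autorun"

# Hypothetical rule: reactivation within the grace window keeps the
# original billing anchor; after it, proration restarts from reactivation.
GRACE_WINDOW_DAYS = 3

def proration_path(canceled_at:, reactivated_at:)
  days = (reactivated_at - canceled_at) / 86_400.0
  days <= GRACE_WINDOW_DAYS ? :resume_original_anchor : :restart_from_reactivation
end

class ReactivationProrationTest < Minitest::Test
  # Times are passed in explicitly, so no freeze_time helper is needed
  # and the date boundary cannot drift with the test machine's timezone.
  def test_reactivation_inside_grace_window_resumes_original_anchor
    assert_equal :resume_original_anchor,
                 proration_path(canceled_at: Time.utc(2024, 3, 1),
                                reactivated_at: Time.utc(2024, 3, 3, 23, 59, 59))
  end

  def test_reactivation_after_grace_window_restarts_proration
    assert_equal :restart_from_reactivation,
                 proration_path(canceled_at: Time.utc(2024, 3, 1),
                                reactivated_at: Time.utc(2024, 3, 4, 0, 0, 1))
  end
end
```

Each test now proves one rule at one boundary, so a failure names the branch that broke instead of a screen that looked wrong.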
Since then, I keep a standing rule: if a test name does not read like a policy sentence, it is not ready to guide an agent.
A second failure mode is flaky tests. Agents learn from feedback loops. If the loop is noisy, they start optimizing for lucky passes instead of clean behavior.
I saw this in a background job flow that sent duplicate notifications under retry pressure. The suite had intermittent failures because one spec depended on queue timing. The agent kept producing different "fixes" that passed occasionally and failed later in CI. I was paying iteration tax without moving forward.
The fix was plain and boring: make the queue adapter deterministic in test, assert idempotency directly, and split async orchestration checks from business-rule checks. Once the suite stopped wobbling, the same agent converged quickly.
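Asserting idempotency directly can look as boring as the fix itself. Here is a minimal sketch, assuming a hypothetical `Notifier` with a dedup key; the real code ran through a job queue, but the business rule, "retries send exactly one notification," is testable with no queue timing at all.

```ruby
require "minitest/autorun"
require "set"

# Hypothetical notifier: the dedup key makes retries safe, and the
# idempotency rule is asserted directly rather than via queue behavior.
class Notifier
  def initialize
    @sent = Set.new
    @deliveries = []
  end
  attr_reader :deliveries

  def notify(user_id, event)
    key = [user_id, event]
    return if @sent.include?(key) # already delivered: retry is a no-op
    @sent << key
    @deliveries << key
  end
end

class NotifierIdempotencyTest < Minitest::Test
  def test_retrying_the_same_event_sends_exactly_one_notification
    notifier = Notifier.new
    3.times { notifier.notify(42, :invoice_paid) } # simulate retry pressure
    assert_equal 1, notifier.deliveries.size
  end
end
```

Whether the queue retries once or ten times becomes an orchestration concern, checked separately with a deterministic adapter, while this test pins the rule that actually protects users.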
Now I treat flake as contract corrosion. If I would not trust the signal myself, I should not expect an agent to trust it either.
Do tests slow down delivery when agents are fast?
A fair pushback is that writing tests first adds friction, especially for small teams shipping under pressure. If the agent can draft code in minutes, why add ceremony?
That pushback is valid for low-risk edits. I do not write heavy tests for copy changes, harmless refactors, or one-line admin tweaks. Speed matters, and not every task deserves the same guardrails.
But risky behavior is different. Skipping contract tests does not remove cost; it moves cost into review churn, production surprises, and repeated agent runs. In agent-driven workflows, retries can be deceptively expensive because each retry produces another plausible diff to evaluate.
The strongest argument against "test first" is that it can turn into ritual. I agree. The answer is not maximal process. The answer is targeted discipline.
My sticky decision rule is simple:
If a behavior change can affect access, money, or data integrity, I update the test first and implementation second.
That rule is tight enough to follow on busy weeks and broad enough to catch the failures that actually hurt.
What does an agent-friendly test suite look like?
Agent-friendly tests are fast, deterministic, and written at the boundary where the rule matters. They do not need to be fancy. They need to be unmistakable.
Use this checklist when a suite feels noisy:
- Single rule: each test proves one behavior, not three.
- Clear names: test names read like policy sentences.
- Real boundary: assert at request, domain, or job boundary where harm occurs.
- Stable time: freeze clocks and remove hidden timezone drift.
- Deterministic async: no timing guesses, no race-dependent assertions.
- Actionable failures: failure output points to a rule, not a novel.
- Fast feedback: keep hot-path tests quick enough to run repeatedly.
If I can only change one thing, I choose naming. Clear names force clear thinking, and clear thinking makes better prompts, better tests, and better diffs.
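To make the naming rule concrete, here is a hypothetical spec contrasting a screen-shaped name with a policy-shaped one; the assertion message is part of the contract too, because it tells the agent which rule it violated.

```ruby
require "minitest/autorun"

class InvoiceExportPolicyTest < Minitest::Test
  # Screen-shaped name: `test_invoice_export_page` says where, not what.
  # Policy-shaped name: the method below reads as a sentence an agent can target.
  def test_exported_invoices_are_scoped_to_the_requesting_account
    requesting_account = 1
    exported = [{ account_id: 1 }, { account_id: 1 }] # stub export rows
    assert(exported.all? { |row| row[:account_id] == requesting_account },
           "export leaked rows outside account #{requesting_account}")
  end
end
```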
How should you start if your suite is already messy?
Do not pause work for a giant cleanup pass. Pick one risky flow this week and tighten that contract end to end.
A practical start looks like this: choose a behavior that already causes review debate, write one failing test at the right boundary, ask the agent for the smallest fix, and keep the new test as a permanent guardrail. Then repeat on the next risky flow.
This is how quality compounds in real repos. Not by one perfect prompt, and not by a heroic rewrite of the whole suite. By turning repeated ambiguity into executable rules that survive the next sprint.
When agents write code, your prompt starts the conversation, but your tests decide the truth. Build for that reality, and reliability stops feeling like luck.
