Compound Engineering for Codex and Rails Without the Rework
I asked Codex to add one billing guard to a Rails app before coffee. By lunch I had three clean diffs, two different assumptions about authorization, and a review thread that felt longer than the task itself. Nothing looked reckless. The problem was that every run guessed a different local rule.
That day settled the question for me. The win condition is not “get a good diff today.” The win condition is “make tomorrow’s diff easier, safer, and faster than today’s.”
So the real question is direct: how do I make compound engineering for Codex and Rails actually work inside a live repo? My answer is simple: treat the repo as operational memory, not just source code.
Operational memory is the set of decisions your repo can recall under pressure.
When that memory is thin, you can still ship. You just pay for it in retries, hand-holding, and avoidable regressions. When that memory is strong, Codex starts feeling less like a talented intern and more like a reliable teammate.
What is compound engineering for Codex and Rails?
Compound engineering means each task leaves behind something reusable. Not a giant doc nobody opens. One small artifact that prevents one repeat mistake.
In Rails work, those mistakes cluster around the same pressure points: auth boundaries, money logic, migrations, background jobs, Turbo updates, and external APIs. Rails gives you excellent defaults, but defaults do not answer local policy. If local policy is not written down, Codex fills the gap with a plausible guess.
That is why I stopped measuring only output quality. I started measuring memory quality. Did this task produce an asset that makes the next risky task cheaper?
The analogy that clicked for me came from kitchens. A busy kitchen does not rely on “remembering where things are.” It relies on mise en place: ingredients prepped, tools in known places, sequence understood. Compound engineering is mise en place for software delivery.
Why do Codex runs drift in Rails projects?
Drift usually starts with innocent requests: “make invoices account-safe,” “add a feature gate,” “backfill this column without downtime.” A human hears those and mentally supplies missing policy. An agent hears them and must choose among valid paths.
Rails offers many good paths on purpose. You can solve account scope in a policy object, in query composition, in controller filtering, or in service boundaries. You can ship a backfill as a migration, a job, or both. Without local rules, multiple answers look right in isolation and conflict in combination.
Then code review gets expensive for the wrong reason. You are not debating code style. You are reconciling unstated policy after the code already exists.
One week exposed this hard. A product flow on a Turbo screen changed twice, then changed again. Codex adapted fast each time, tests were green, and user behavior still broke at the seam between a Stimulus controller and a partial refresh path. Everything looked fine except the real click path.
The fix was not heroic:
- freeze scope for one pass
- lock the auth boundary with one request spec
- lock the UI behavior with one system spec
- write one note in docs/solutions/ about where this feature gate lives and why
Since then, I do not approve risky work without a named behavior contract. That one rule removed most “looks good, behaves wrong” failures.
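A behavior contract can be as small as one executable check. Here is a framework-free sketch of the idea; in the real app this is a request spec plus a system spec, and the FeatureGate module and its account list are hypothetical names I am using for illustration:

```ruby
# Hypothetical feature gate with its location pinned in one place.
module FeatureGate
  # The gate is keyed by account because rollout is per-account.
  # (In the real repo, that rationale lives in docs/solutions/.)
  ROLLOUT_ACCOUNTS = [10, 42].freeze

  def self.enabled?(account_id)
    ROLLOUT_ACCOUNTS.include?(account_id)
  end
end

# The contract: one explicit check per risky behavior, run on every change.
raise "gate leaked to unlisted account" if FeatureGate.enabled?(7)
raise "gate missing for rollout account" unless FeatureGate.enabled?(10)
```

The point is not the gate itself. It is that the next Codex run can read where the gate lives instead of guessing.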
How does compound engineering cut retries and regressions?
It works by moving ambiguity earlier, while changes are still cheap.
Here is the chain I use:
- Define the change in file-level terms with one risk note.
- Ask Codex to update or add the test that proves the behavior.
- Let Codex implement until that behavior passes.
- Review for policy alignment, not prose confidence.
- Save one artifact from the surprise: note, checklist edit, or test.
- Require the next run to load that artifact first.
Each step is ordinary. The compounding effect comes from sequence. If you skip the artifact step, every new task starts from scratch and you keep paying the same tax.
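To make the compounding concrete, here is a toy plain-Ruby sketch of the last two steps: every task appends one artifact, and every new task loads that memory before it starts. All names here are illustrative, not part of any real tooling:

```ruby
# Toy model of repo memory: each task saves one lesson, the next task loads
# all prior lessons before it begins.
memory = []

run_task = lambda do |name|
  loaded = memory.size            # step 6: load prior artifacts first
  memory << "lesson: #{name}"     # step 5: save one artifact from the surprise
  { task: name, lessons_loaded: loaded }
end

first  = run_task.call("billing guard")
second = run_task.call("feature gate")
# the second task starts with the first task's lesson already loaded
```

Trivial on purpose: the mechanism is an append plus a read, and the value comes entirely from doing it every time.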
Before I worked this way, I relied on long prompts and memory in my own head. I would paste caveats, hope I did not forget one, and review diffs that mixed core intent with accidental invention. After I moved to explicit repo memory, diffs got smaller, review got faster, and edge failures dropped because policy moved from my brain to files.
The bridge between old and new was small:
- keep AGENTS.md current with local decisions
- keep docs/plans/ short and task-specific
- keep docs/solutions/ sharp and searchable
Same model family, different outcomes, because the operating memory changed.
Doesn’t this add too much process for small teams?
A fair pushback is that this can become paperwork theater. If every typo fix demands ritual, the system fails. If notes are longer than the diff they protect, the tax is too high.
I agree with that pushback. The point is not to force heavy process onto every task. The point is to use extra discipline exactly where hidden risk lives.
I keep one sticky decision rule:
If failure would hurt users, money, or data, the merge needs a reusable artifact.
That rule keeps me honest. Copy edits and obvious one-line fixes move fast. Auth, billing, migrations, and async flows get explicit guardrails.
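The decision rule is simple enough to write down as a predicate. This is a sketch, and the risk list is mine; tune RISKY_AREAS to your own repo's risk map:

```ruby
# The sticky rule as code: touching a risky area means the merge must
# leave a reusable artifact behind.
RISKY_AREAS = %i[auth billing migration data_write background_job].freeze

def needs_artifact?(touched_areas)
  (touched_areas & RISKY_AREAS).any?
end

needs_artifact?([:copy_edit])       # => false, merge fast
needs_artifact?([:billing, :view])  # => true, require a note, checklist edit, or test
```

Encoding the rule this literally is optional; what matters is that the boundary between "fast path" and "guarded path" is explicit rather than vibes.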
This is a trade-off, not ideology. You give up a little speed on risky tasks to avoid days of rework later. For small teams, that trade is usually positive because context switching is the real budget killer.
What should go into the repo this week?
If your current setup feels noisy, do not rebuild everything. Add a small operating baseline that can survive busy weeks.
Use this checklist:
- Risk map: auth, billing, data writes, jobs, third-party calls
- Decision file: one source for local Rails choices
- Plan note: touched files, likely failure mode, validation path
- Contract test: one test per risky behavior change
- Review lens: check policy drift before style polish
- Surprise log: one line on what failed and why
- Carry-forward: link today’s lesson in the next task brief
If this list feels heavy, reduce it instead of abandoning it. A small repeatable habit beats an ambitious process that dies after one sprint.
What does “reality vs brochure” look like in production?
The first time this really paid off was during a denormalized backfill. The original code looked clean and passed local checks. Then write traffic rose and lock contention appeared where local testing stayed quiet.
The confusing part was that nothing looked obviously wrong in code review. The signals disagreed: tests were green, logs were noisy, latency climbed, and retries stacked. I lost time because the change looked safe on paper.
I resolved it with plain steps:
- split schema change and data move into separate deploys
- move data fill to batched jobs with idempotent guards
- set a stop condition before launch
- document rollback criteria in one runbook note
Now I keep a standing policy: no production backfill ships without a stop condition in plain language. That single policy is not flashy, but it has saved me from repeat confusion.
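The shape of that backfill is easy to sketch without Rails. This is a plain-Ruby illustration of the three guards above, not the production code; Row, backfill!, and the stop condition are names I invented for the example (the real version runs as batched jobs against the database):

```ruby
# Batched backfill with an idempotent guard and a stop condition.
Row = Struct.new(:id, :legacy_total, :total, keyword_init: true)

def backfill!(rows, batch_size: 100, stop_if: ->(failures) { failures > 3 })
  failures = 0
  rows.each_slice(batch_size) do |batch|
    return :stopped if stop_if.call(failures)  # stop condition, checked per batch
    batch.each do |row|
      next unless row.total.nil?               # idempotent guard: skip filled rows
      begin
        row.total = row.legacy_total
      rescue StandardError
        failures += 1                          # count failures, keep moving
      end
    end
  end
  :done
end

rows = [Row.new(id: 1, legacy_total: 5, total: nil),
        Row.new(id: 2, legacy_total: 7, total: 7)]
backfill!(rows, batch_size: 1)  # fills row 1, skips row 2; safe to run twice
```

The idempotent guard is what makes retries boring, and the stop condition is what makes the rollback conversation happen before launch instead of during an incident.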
A second moment came from authorization drift. A new endpoint reused an internal query object and silently widened access in one branch. The app still “worked,” and that was the trap. I added a request spec for cross-account access, moved the scope check into one consistent boundary, and added a short decision note about where account isolation must live. Since then, similar edits tend to converge fast because the repo answers the policy question before coding starts.
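The cross-account contract from that incident can also be sketched framework-free. In the real app it is an RSpec request spec; here the scoped finder and the Invoice struct are stand-ins so the shape of the check is visible:

```ruby
# One consistent boundary: the account check lives inside the finder,
# so no caller can forget it.
Invoice = Struct.new(:id, :account_id, keyword_init: true)

def find_invoice(invoices, account_id:, id:)
  invoices.find { |inv| inv.account_id == account_id && inv.id == id }
end

invoices = [Invoice.new(id: 1, account_id: 10),
            Invoice.new(id: 2, account_id: 20)]

# The contract: a cross-account lookup must come back empty, not "work".
raise "cross-account leak" unless find_invoice(invoices, account_id: 10, id: 2).nil?
```

An endpoint that "works" across accounts is exactly the failure that green unit tests will not catch, which is why this one check earns its keep.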
How do I know compound engineering is working?
You do not need fancy analytics to tell. Watch practical signals you can inspect in normal work.
Healthy signs:
- risky diffs get smaller over time
- review comments shift from policy confusion to concrete edge checks
- repeated failures show up once, then stop recurring
- onboarding discussions get shorter because decisions are in files
Unhealthy signs:
- every task starts with the same context dump
- review threads keep rediscovering old choices
- tests pass but production behavior surprises you in familiar ways
The fix for unhealthy signs is usually the same: improve operational memory, not prompt length.
If you already run Codex in Rails, this is the practical stance that tends to hold up: ask for less cleverness, demand clearer contracts, and make every risky task leave one breadcrumb the next task can follow.
Design your repo so the next correct change is easier than the last one, or you will keep paying for the same lesson twice.
