How To Keep AI Coding Agent Changes Small Enough To Review

Published · Programming

AI coding agents can produce more code than your review process can absorb.

That is the useful part and the dangerous part.

The same agent that can trace a bug across five files, update tests, adjust documentation, and clean up a few nearby rough edges can also turn a small request into a pull request nobody can review with confidence. The diff looks productive. The summary sounds reasonable. The tests may even pass. But the reviewer is now staring at a pile of changes that mixes the actual fix with formatting, refactoring, opportunistic cleanup, generated tests, dependency touches, and a few "while I was here" decisions nobody explicitly asked for.

That is not a tooling problem by itself. It is a scope-control problem.

If you use AI coding agents seriously, you need a way to keep their changes small enough to review. Not small for aesthetic reasons. Small enough that a human can understand the intent, inspect the behavior, validate the evidence, and decide whether the change should ship.

This article builds on "How to Use AI Coding Agents Without Losing Engineering Judgment," "Designing Guardrails for AI-Generated Pull Requests," and "When To Trust AI Coding Agent Refactors." Those pieces are about judgment, guardrails, and refactor risk. This one is about the practical mechanics of diff budgets, prompt constraints, patch slicing, and reviewable AI-assisted work.

Reviewability Is A Product Requirement

Engineers sometimes treat reviewability as politeness.

It is more than that. Reviewability is part of the product quality system. A change that cannot be reviewed is a change that cannot be trusted, even if the author was diligent and the agent was impressive.

A reviewable change has a few properties:

  • The intent is obvious.
  • The behavior change is bounded.
  • The touched files make sense together.
  • The tests demonstrate the risk being addressed.
  • The reviewer can separate mechanical changes from semantic changes.
  • The pull request description explains what changed and what did not.
  • The validation evidence is specific enough to reproduce.

AI-generated changes stress all of those properties because agents are very good at continuing. They do not get tired. They do not feel the social cost of a 1,400-line pull request. They can keep improving adjacent code long after the original problem was solved.

That means the human has to bring the brakes.

Start With A Diff Budget

Before asking an agent to implement anything, decide how large the first patch is allowed to be.

This does not need to be a fake universal rule like "all pull requests must be under 300 lines." Some changes are naturally large. Generated code, migrations, API renames, and dependency updates can be bigger than a normal bug fix. But the agent should know the expected shape before it starts editing.

Useful diff-budget constraints sound like this:

Implement the smallest patch that fixes the failing test.

Constraints:
- Touch at most 3 production files unless you explain why first.
- Do not reformat unrelated code.
- Do not rename public symbols.
- Do not add new dependencies.
- Add or update only the tests needed to cover this bug.
- Stop after the first complete patch and report what remains out of scope.

That prompt is not about micromanaging the model. It is about making scope a first-class requirement.
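
A budget in a prompt is easier to enforce if something checks it after the agent stops. The following is a minimal sketch, not a standard tool: it reads the output of `git diff --numstat` and reports budget violations. The `DiffBudget` class, its default limits, and the test-file naming convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiffBudget:
    # Hypothetical limits; tune them per task instead of treating them as universal.
    max_production_files: int = 3
    max_changed_lines: int = 150


def parse_numstat(numstat: str) -> list[tuple[str, int]]:
    """Parse `git diff --numstat` output into (path, changed_lines) pairs."""
    entries = []
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files report "-" for both counts; count them as zero lines.
        changed = sum(0 if field == "-" else int(field) for field in (added, deleted))
        entries.append((path, changed))
    return entries


def is_test_file(path: str) -> bool:
    # Assumption: tests live under a tests/ directory or are named test_*.py.
    name = path.rsplit("/", 1)[-1]
    return path.startswith("tests/") or "/tests/" in path or name.startswith("test_")


def check_budget(numstat: str, budget: DiffBudget) -> list[str]:
    """Return readable budget violations; an empty list means within budget."""
    entries = parse_numstat(numstat)
    production = [path for path, _ in entries if not is_test_file(path)]
    total_lines = sum(changed for _, changed in entries)
    violations = []
    if len(production) > budget.max_production_files:
        violations.append(
            f"{len(production)} production files touched "
            f"(budget {budget.max_production_files})"
        )
    if total_lines > budget.max_changed_lines:
        violations.append(
            f"{total_lines} changed lines (budget {budget.max_changed_lines})"
        )
    return violations
```

A violation is a prompt to stop and ask why, not an automatic rejection. Naturally large changes just get a larger budget up front.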

For exploratory work, use a different budget:

Inspect the code and propose a plan.

Do not edit files yet.
Return:
- the likely root cause,
- the files you would change,
- the smallest safe implementation slice,
- the validation command.

The distinction matters. Exploration and implementation are different modes. If you blur them, the agent may discover a large plan and immediately start executing it.

Make The First Patch Disposable

The first AI-generated patch should be easy to throw away.

That sounds pessimistic, but it is freeing. If the first patch is a small, focused attempt, you can evaluate it honestly. If it is wrong, you revert or redirect without emotional investment. If it is close, you ask for a narrower follow-up. If it reveals that the problem is larger than expected, you can stop and redesign the work.

The failure mode is letting the first patch become the architecture.

An agent may begin with a bug fix, notice duplication, extract a helper, update tests, change error handling, rewrite a wrapper, and leave you with a patch that is hard to reject because parts of it are genuinely useful. Now review becomes negotiation against sunk cost.

Keep the first patch disposable by asking for:

  • One failing test or reproduction first.
  • One production behavior change.
  • One validation command.
  • No opportunistic cleanup.
  • A short list of follow-up ideas outside the patch.

This is especially important when the agent is working in unfamiliar code. Let it learn, but do not let its learning process become a mixed-purpose pull request.

Separate Investigation From Editing

AI agents are excellent at investigation. They can read surrounding code, summarize patterns, search for similar implementations, and identify likely places to change.

That does not mean every investigation should end with immediate edits.

For many non-trivial tasks, use a three-step loop:

  1. Ask the agent to investigate and propose the smallest safe patch.
  2. Review the proposed scope.
  3. Ask the agent to implement only that patch.

This works well for bugs, test failures, CI problems, and unfamiliar modules. It also keeps the human in the decision path before the diff exists.

For example:

We have a bug where expired sessions are sometimes accepted.

First, inspect the session validation path and identify the smallest patch.
Do not edit files.
Call out:
- where expiration is checked,
- which behavior is ambiguous,
- what test should fail before the fix,
- which files you would touch.

After that, the implementation prompt can be much tighter:

Implement the session-expiration fix using the plan above.

Scope:
- edit only `session_validator.py` and its existing tests,
- preserve public method names,
- do not change logging or metrics,
- run `uv run pytest tests/test_session_validator.py -q`.

That is a reviewable assignment. The reviewer can check whether the patch obeyed the contract.
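
Checking the contract can be partly mechanical. As a rough sketch, assuming the changed paths come from something like `git diff --name-only`, a reviewer could compare them against the files named in the prompt. The function name, the glob patterns, and the `auth/` paths below are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical scope for the session-expiration prompt above.
allowed = ["*/session_validator.py", "tests/test_session_validator.py"]
changed = [
    "auth/session_validator.py",
    "tests/test_session_validator.py",
    "auth/logging.py",
]
print(out_of_scope(changed, allowed))  # ['auth/logging.py']
```

An unexpected path is not proof the patch is wrong. It is a signal that the agent left the agreed scope and owes an explanation.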

Ban Opportunistic Cleanup By Default

"While I was here" is where small agent tasks go to become large reviews.

Cleanup is not bad. Formatting, naming, helper extraction, test simplification, and dead-code removal are all useful. The problem is mixing cleanup into a behavior change when the cleanup is not required to make that change safe.

Default rule: no opportunistic cleanup in the first patch.

Say it explicitly:

Do not perform unrelated cleanup. If you see cleanup opportunities, list them
under "follow-ups" instead of editing them.

This one constraint removes a surprising amount of diff noise. It also gives the agent a productive outlet. It can still notice messy code. It just reports the mess instead of silently turning the pull request into a cleanup campaign.

When cleanup is necessary, isolate it. Pick one ordering and keep the pieces apart:

  • Mechanical cleanup first, then behavior change.
  • Behavior change first, then cleanup.
  • Separate commits when the repository uses meaningful commits.
  • Separate pull requests when the change crosses ownership or risk boundaries.

Reviewers are much better at approving "rename this helper everywhere" and "fix this expiration bug" as separate things than as one blended diff.
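
For Python files, one cheap way to confirm that a cleanup commit really is mechanical is to compare syntax trees before and after. This is a sketch under a narrow assumption: it only recognizes formatting-level edits (whitespace, comments, redundant parentheses, quote style). Renames and helper extraction change the tree and will not pass. `is_formatting_only` is a name invented here.

```python
import ast


def is_formatting_only(old_source: str, new_source: str) -> bool:
    """True when two Python sources parse to the same syntax tree.

    Whitespace, comments, and redundant parentheses do not appear in the
    tree, so a True result suggests the edit changed layout only.
    """
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "def expired(now, exp):\n    return now > exp\n"
reformatted = "def expired(now,   exp):\n    # compare\n    return (now > exp)\n"
behavior_fix = "def expired(now, exp):\n    return now >= exp\n"

print(is_formatting_only(old, reformatted))   # True
print(is_formatting_only(old, behavior_fix))  # False
```

If a commit labeled "formatting" fails this check, it contains something a reviewer needs to read as a semantic change.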

Use File And Concept Boundaries

A good agent assignment names both the file boundary and the concept boundary.

File boundaries are concrete:

  • Edit only files under `payments/refunds/`.
  • Do not touch generated files.
  • Do not edit build configuration.
  • Do not modify public API clients.
  • Keep test changes in the existing test file.

Concept boundaries are just as important:

  • Fix retry handling, not validation.
  • Add the new parser case, not a parser redesign.
  • Improve the error message, not the error taxonomy.
  • Update this endpoint, not the whole service layer.
  • Change cache-key construction, not cache invalidation policy.

Agents can follow local patterns, but they can also overgeneralize. If a bug appears in one endpoint, the agent may search for similar endpoints and update all of them. Sometimes that is exactly right. Sometimes it creates a broad behavior change that reviewers were not prepared to evaluate.

The prompt should say which mode you want:

Fix only the failing endpoint. If you find similar bugs elsewhere, list them as
follow-up candidates and do not edit them in this patch.

Or:

Search for the same bug pattern across this package. Before editing, report all
matching call sites and propose a patch-slicing plan.

Both are valid. The mistake is leaving the choice implicit.

Ask For A Patch Plan Before Large Mechanical Work

Mechanical changes can be large and still reviewable if the mechanism is clear.

Renaming a symbol across a typed codebase may touch many files. Updating a generated API client may produce a big diff. Moving a package may change imports everywhere. Those changes are not automatically bad, but they need a plan that reviewers can validate.

Before large mechanical work, ask for a patch plan:

Propose the patch sequence for renaming `AccountManager` to `AccountService`.

Return separate steps for:
- symbol rename,
- file move,
- import updates,
- documentation updates,
- tests or build commands.

Do not edit files yet.

Then choose the smallest useful slice.

The patch plan should also identify tooling support. A compiler-backed rename is different from a text search. A build-system query is different from guessing dependencies by filename. A generated client update should name the generator command and expected generated output.

If the agent cannot explain how the mechanical change will be validated, the change is not mechanical enough to trust casually.
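
As an illustration of validating a rename after the fact, here is a small sketch that compares token streams of a Python file before and after, treating the old identifier as if it had already been renamed. It is a text-level check and no substitute for a compiler-backed or IDE-backed rename. `is_pure_rename` is a hypothetical helper, and it assumes the new name was not already used in the file.

```python
import io
import tokenize


def _tokens(source: str, mapping: dict[str, str]) -> list[tuple[int, str]]:
    """Tokenize source, applying `mapping` to identifiers and skipping comments."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # comments may mention either name; they are not behavior
        text = tok.string
        if tok.type == tokenize.NAME:
            text = mapping.get(text, text)
        result.append((tok.type, text))
    return result


def is_pure_rename(old_source: str, new_source: str, old: str, new: str) -> bool:
    """True when the only code difference is `old` renamed to `new`."""
    return _tokens(old_source, {old: new}) == _tokens(new_source, {})


before = "class AccountManager:\n    pass\n\nmanager = AccountManager()\n"
after = "class AccountService:\n    pass\n\nmanager = AccountService()\n"
sneaky = "class AccountService:\n    retries = 3\n\nmanager = AccountService()\n"

print(is_pure_rename(before, after, "AccountManager", "AccountService"))   # True
print(is_pure_rename(before, sneaky, "AccountManager", "AccountService"))  # False
```

String literals are compared as-is on purpose: a log message or serialized name that changes along with the symbol is a behavior change worth a reviewer's attention.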

Keep Test Changes Honest

AI agents often update tests in the same patch as production code. That can be good. It can also hide the bug.

The review question is simple: would any test fail without the production change?

For bug fixes, I like this sequence:

  1. Ask for a failing test or reproduction.
  2. Run or inspect it before the fix when practical.
  3. Ask for the smallest production change.
  4. Run the focused test.
  5. Run the broader validation command.

You will not always get a perfect red-green workflow, especially in legacy code. But the agent should be able to explain the behavioral claim:

Test evidence:
- Added `test_rejects_expired_session_at_boundary`.
- This covers the bug because the old code used `>` instead of `>=`.
- Focused command: `uv run pytest tests/test_sessions.py::test_rejects_expired_session_at_boundary -q`
- Broader command: `make test-sessions`

That is much better than "added tests."
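
To make the behavioral claim concrete, here is a sketch of what such a boundary test might look like. The `is_expired` function is a hypothetical stand-in for the real validator, and the operand order (`now >= expires_at`) is an assumption about how the comparison is written.

```python
from datetime import datetime, timedelta, timezone


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Hypothetical fixed check: a session is expired at or after `expires_at`."""
    return now >= expires_at  # the buggy version would have used `>`


def test_rejects_expired_session_at_boundary():
    expires_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Exactly at the expiry instant: a `>` comparison returns False here,
    # so this test fails without the production change.
    assert is_expired(expires_at, now=expires_at)


def test_accepts_session_before_expiry():
    expires_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Guard against over-correcting: one second early is still valid.
    assert not is_expired(expires_at, now=expires_at - timedelta(seconds=1))
```

The second test matters as much as the first. It pins the behavior that must not change, which keeps the fix from drifting into a broader rewrite of expiry rules.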

For more on this specific problem, see "Reviewing AI-Written Tests Without Fooling Yourself." Tests written by an agent are still code. They deserve review, especially when they were generated to justify the same patch they accompany.

Require A Scope Report In The PR Description

The pull request description should make review easier, not merely narrate that work occurred.

For AI-assisted changes, I want a scope report:

## Scope

This PR fixes expired session validation in `SessionValidator`.

Changed:
- Adds a boundary-case test for expiration equality.
- Updates the comparison from `>` to `>=`.
- Preserves logging, metrics, public method names, and storage behavior.

Not changed:
- Session refresh behavior.
- Token parsing.
- Database schema.
- Retry behavior.

Validation:
- `uv run pytest tests/test_sessions.py -q`
- `make check`

AI assistance:
- Agent inspected related session code and drafted the patch.
- Human author selected the scope and reviewed the final diff.

That description gives reviewers handles. They can inspect the declared scope, look for accidental out-of-scope changes, and verify the stated commands.

The "not changed" section is surprisingly valuable. It forces the author to say which boundaries matter. It also makes accidental drift easier to spot.
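
If a team adopts this template, a small check can confirm that a description actually contains the sections. The sketch below assumes the exact headings from the example above; the function name and the idea of running it in CI are illustrative, not an existing tool.

```python
# Headings taken from the scope-report example; adjust to your own template.
REQUIRED_SECTIONS = ("Changed:", "Not changed:", "Validation:")


def missing_sections(description: str) -> list[str]:
    """Return required scope-report headings absent from a PR description."""
    lines = {line.strip() for line in description.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in lines]


thin_description = "Fixed the session bug.\n\nValidation:\n- `make check`\n"
print(missing_sections(thin_description))  # ['Changed:', 'Not changed:']
```

This only checks that the headings exist. Whether the "not changed" list is honest is still a question for the human reviewer.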

Watch For The Reviewability Smells

Some AI-generated changes are technically correct but review-hostile.

Common smells include:

  • The title says "fix bug" but the diff mostly refactors.
  • Tests were rewritten more heavily than production code.
  • The agent changed formatting across unrelated files.
  • The PR adds a helper with a broader abstraction than the task requires.
  • Error messages, metrics, or logs changed without being mentioned.
  • Build files changed as a side effect of a local import decision.
  • The summary is confident but does not name the validation commands.
  • The patch touches several ownership areas.
  • The agent left TODOs that are really design questions.
  • The reviewer cannot tell which lines are mechanical and which are semantic.

When you see those smells, do not try to heroically review the whole thing. Ask for a smaller patch.

That is not anti-AI. That is ordinary engineering hygiene applied to a tool that can produce code quickly.
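
A few of these smells can be approximated from file paths and line counts alone. The heuristic below is a rough sketch: the `tests` directory convention, the build-file names, and the use of top-level directories as a proxy for ownership areas are all assumptions that will not fit every repository.

```python
from pathlib import PurePosixPath

# Assumption: these file names count as build configuration in this repo.
BUILD_FILES = {"pyproject.toml", "setup.py", "Makefile", "package.json"}


def review_smells(changes: dict[str, int]) -> list[str]:
    """Flag reviewability smells from a {path: changed_line_count} mapping."""
    smells = []
    paths = {path: PurePosixPath(path) for path in changes}

    test_lines = sum(n for p, n in changes.items() if "tests" in paths[p].parts)
    production_lines = sum(changes.values()) - test_lines
    if test_lines > production_lines:
        smells.append("tests changed more than production code")

    # Top-level directories stand in for ownership areas; root files are ignored.
    areas = {
        p.parts[0] for p in paths.values() if len(p.parts) > 1 and p.parts[0] != "tests"
    }
    if len(areas) > 1:
        smells.append("touches several top-level areas: " + ", ".join(sorted(areas)))

    if any(p.name in BUILD_FILES for p in paths.values()):
        smells.append("build configuration changed")
    return smells
```

Smells like silent log changes or over-broad helpers still need a human reading the diff. The value of a script like this is only that it raises the cheap questions early.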

Use Agents To Slice Their Own Work

One of the best uses of an AI coding agent is asking it to make its own work more reviewable.

After an agent produces a large patch, ask:

Split this work into the smallest reviewable sequence.

For each proposed patch, include:
- goal,
- files touched,
- behavior risk,
- tests to run,
- whether it can ship independently.

Do not edit files yet.

Then use the answer to decide what to keep, revert, split, or redo.

Agents are often good at summarizing the structure of a diff after they create it. The human still owns the decision, but the agent can reduce the cost of understanding the blast radius.

This is also useful before a pull request exists. If the task feels broad, ask the agent to propose the sequence first. You may discover that the "one small change" is really three changes:

  • Add observability.
  • Fix behavior.
  • Clean up the abstraction that made the behavior hard to see.

Those may all be worth doing. They do not have to happen in one review.

A Practical Prompt Template

Here is the prompt shape I reach for when I want reviewable agent work:

Task:
  <specific behavior or issue>

Goal:
  Produce the smallest reviewable patch that solves this task.

Scope:
  - Touch only <files/directories> unless you stop and explain why.
  - Do not perform unrelated cleanup.
  - Do not change public APIs, logs, metrics, or data formats unless required.
  - Keep tests focused on the behavior being changed.

Workflow:
  1. Inspect the code.
  2. Report the planned files and validation command.
  3. Implement the patch.
  4. Run the focused validation if available.
  5. Report what changed, what did not change, and what remains out of scope.

Review constraints:
  - Separate mechanical changes from semantic changes.
  - List follow-up cleanup instead of editing it.
  - If the patch grows beyond the scope, stop and ask for a smaller slice.

You can make this stricter for risky code:

  • "Do not edit tests until after proposing the production fix."
  • "Do not touch authentication, authorization, billing, or persistence outside the named function."
  • "Do not add dependencies."
  • "Do not change concurrency behavior."
  • "Do not update generated files without naming the generator command."

The point is not to find a magic prompt. The point is to make reviewability part of the assignment.

The Human Still Owns The Change

The agent can draft the patch. The human owns the pull request.

That ownership includes scope. If an AI coding agent produces a broad diff, the author should not outsource responsibility to the tool with "the agent did that." The author asked, accepted, edited, and submitted the change. The author can also ask for a smaller version.

Good AI-assisted engineering is not about trusting the agent more. It is about building a workflow where the agent can be useful without overwhelming the human review system.

Keep the changes small. Keep the contracts explicit. Keep mechanical and semantic work separate. Preserve the validation evidence. Make the pull request description useful.

AI coding agents make it cheap to generate code. Senior engineering judgment is still about deciding which code deserves to survive review.

For more practical software engineering notes, visit Slaptijack.
