Reviewing AI-Written Tests Without Fooling Yourself

AI-written tests are dangerous in exactly the way good-looking tests are always dangerous: they can make you feel safer without actually reducing much risk.

That is not an argument against using AI coding agents to write tests. I use them for test scaffolding, edge-case enumeration, fixture cleanup, and regression coverage. They are especially useful when a codebase has consistent test patterns and the boring part is finding the right imports, factory helpers, mock setup, and assertion style.

But a test suite is not better because a model added 400 lines to it. A pull request is not safer because the diff has a satisfying green checkmark next to new test files. The question is sharper than that:

Would these tests fail for the bug or regression we actually care about?

If the answer is "I assume so," the review is not done.

This is the testing-focused companion to "Designing Guardrails for AI-Generated Pull Requests" and "How to Use AI Coding Agents Without Losing Engineering Judgment." Those pieces are about workflow and team controls. This one is about the very specific review discipline required when an AI agent gives you tests that look reasonable at a glance.

Start With The Behavior, Not The Diff

The first mistake is reviewing AI-written tests by reading them from top to bottom and asking, "Does this code look normal?"

That is necessary. It is not sufficient.

Before reading the test body, write down the behavior the tests are supposed to protect. A useful test has a job. It should make a specific future mistake harder to ship.

For a bug fix, the review questions are:

What was the broken behavior?
What input, state, or sequence triggered it?
What should happen instead?
What would the old code have done?
Does at least one test fail on the old code and pass on the new code?

For a feature, the questions are similar:

What user-visible or caller-visible behavior is now supported?
What contract is being established?
What inputs are valid?
What inputs are rejected?
What behavior must not change for existing callers?

AI agents are very good at producing tests that mirror the patch. They are less reliable at identifying the real behavioral contract unless you force that contract into the task. If the implementation changed from foo() to bar(), the agent may write a test proving that bar() was called. That might be useful in a narrow interaction test. It is often just an assertion that the implementation looks like itself.

Good review starts one layer above the code. What should be true after this change? Then read the tests against that answer.

Beware Tests That Assert The Implementation

Implementation-detail assertions are the most common way AI-written tests create false confidence.

You see this pattern when a test checks:

A private helper was called.
A specific method was invoked once.
An internal collection has a certain intermediate shape.
A log line contains a phrase that is not part of the contract.
A mock received exactly the same parameters the implementation just assembled.
A function returns the value the mock was already configured to return.

Some of those assertions can be legitimate. If the public behavior is "send this message to that dependency," then interaction matters. If the code is an adapter around a third-party API, checking the outbound request may be exactly the point.

The smell is different: the test would pass even if the user-visible behavior were wrong, as long as the implementation kept performing the same dance.

For example, this kind of test is often weak:

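# `mocker` is the fixture provided by the pytest-mock plugin.
def test_refresh_user_calls_repository(mocker):
    repo = mocker.Mock()
    service = UserService(repo)

    # The repository is stubbed to return exactly what the test asserts below.
    repo.fetch.return_value = {"id": 123, "name": "Ada"}

    result = service.refresh_user(123)

    repo.fetch.assert_called_once_with(123)
    assert result == {"id": 123, "name": "Ada"}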

Maybe that is fine for a thin wrapper. But if the bug was that inactive users were being refreshed, the test misses the important question. It proves that the method asks the repository for data. It does not prove the service enforces the rule.

A stronger test names the behavior:

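import pytest


# `user_factory` is assumed to be a project fixture that builds user records.
def test_refresh_user_rejects_inactive_users(user_factory):
    user = user_factory(active=False)
    service = UserService(FakeUserRepository([user]))

    # The rule under test: refreshing an inactive user must raise.
    with pytest.raises(InactiveUserError):
        service.refresh_user(user.id)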

The better test is not better because it has fewer mocks. It is better because it protects the rule someone will care about six months from now.

When reviewing AI-written tests, keep asking: if I refactor the implementation without changing behavior, do these tests still pass? If the answer is no, the test may be pinning the wrong thing.
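To make the difference concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: `UnguardedUserService` stands in for the buggy code, `GuardedUserService` for the fix, and users are plain dictionaries. It shows that the interaction-style test passes against both versions, while the behavior-level test separates them. Note that both checks use a mock, which is the point: the mock count is not what makes the second one stronger.

```python
from unittest.mock import Mock


class InactiveUserError(Exception):
    """Hypothetical error for refreshing an inactive user."""


class UnguardedUserService:
    """Assumed buggy version: returns whatever the repository has."""

    def __init__(self, repo):
        self._repo = repo

    def refresh_user(self, user_id):
        return self._repo.fetch(user_id)


class GuardedUserService(UnguardedUserService):
    """Assumed fixed version: rejects inactive users."""

    def refresh_user(self, user_id):
        user = self._repo.fetch(user_id)
        if not user["active"]:
            raise InactiveUserError(user_id)
        return user


def interaction_test_passes(service_cls):
    """The weak pattern: assert the call and echo the mock's return value."""
    repo = Mock()
    repo.fetch.return_value = {"id": 123, "name": "Ada", "active": True}
    result = service_cls(repo).refresh_user(123)
    repo.fetch.assert_called_once_with(123)
    return result == repo.fetch.return_value


def behavior_test_passes(service_cls):
    """The behavior-level check: an inactive user must be rejected."""
    repo = Mock()
    repo.fetch.return_value = {"id": 123, "name": "Ada", "active": False}
    try:
        service_cls(repo).refresh_user(123)
    except InactiveUserError:
        return True
    return False


# The interaction test cannot tell the two services apart.
assert interaction_test_passes(UnguardedUserService)
assert interaction_test_passes(GuardedUserService)

# The behavior test can.
assert not behavior_test_passes(UnguardedUserService)
assert behavior_test_passes(GuardedUserService)
```

A test that passes on both the broken and the fixed code is telling you nothing about the fix.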

Make The Test Prove It Would Have Failed

For regression tests, I want evidence that the test fails without the fix.

That does not always need to be preserved in the final commit, but the author should be able to say how they verified it. The fastest way to fool yourself is to let an agent write a test after the fix and assume it covers the original bug. It may simply encode the new implementation.

Good verification options include:

Run the new test against the old code before applying the fix.
Temporarily revert the production change and confirm the test fails.
Mutate the fixed line back to the old behavior and confirm the test fails.
Use mutation testing for high-value code paths when the project supports it.
Explain the exact failing condition in the pull request.

The review does not need ceremony. It needs a signal.

I like a short PR note:

Regression verification:
The new `test_refresh_user_rejects_inactive_users` test fails before the guard
in `refresh_user` and passes after the fix.

That one sentence is worth more than a generic "Added tests" bullet. It tells the reviewer the test is not just decorative.

AI agents can help here. Ask the agent to identify which test should fail before the fix, then verify it yourself. Do not outsource the judgment. Outsource the tedious setup.
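The "temporarily revert the production change" option can sometimes be approximated without touching the working tree. The sketch below is one way to do it, under assumptions: `UserService` here is a toy that takes a dictionary of users, and `_refresh_user_without_guard` is my guess at what the pre-fix method looked like. It swaps the old behavior in with `unittest.mock.patch.object`, runs the regression test, and reports whether the test failed.

```python
from unittest.mock import patch


class InactiveUserError(Exception):
    """Hypothetical error for refreshing an inactive user."""


class UserService:
    """Toy service (hypothetical): users are stored in a plain dict."""

    def __init__(self, users):
        self._users = users

    def refresh_user(self, user_id):
        user = self._users[user_id]
        if not user["active"]:  # the guard added by the fix
            raise InactiveUserError(user_id)
        return user


def _refresh_user_without_guard(self, user_id):
    # Assumed pre-fix behavior: no active check at all.
    return self._users[user_id]


def test_refresh_user_rejects_inactive_users():
    service = UserService({123: {"id": 123, "name": "Ada", "active": False}})
    try:
        service.refresh_user(123)
    except InactiveUserError:
        return
    raise AssertionError("inactive user was refreshed")


def fails_without_fix(test):
    """Run `test` with the old method patched in; True means it failed."""
    with patch.object(UserService, "refresh_user", _refresh_user_without_guard):
        try:
            test()
        except AssertionError:
            return True
    return False


# Passes on the fixed code, fails when the guard is removed.
test_refresh_user_rejects_inactive_users()
assert fails_without_fix(test_refresh_user_rejects_inactive_users)
```

This is a convenience, not a replacement for running the test against the real old commit. The patched method is only as accurate as your memory of the old code.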

Watch For Mock Theater

Mocks are useful. Mock theater is different.

Mock theater happens when the test creates a miniature stage production around the implementation, complete with fake collaborators, expected calls, and carefully arranged return values, but never exercises a meaningful contract. The test looks sophisticated because it has setup. It is weak because the setup is the test.

AI agents drift into mock theater because mocks are easy to generate from the code in front of them. If a function calls three collaborators, the agent can mock all three, assert all three calls, and produce something that resembles a careful unit test.

The reviewer should ask:

Could this be tested with a real value object instead of a mock?
Would a fake repository or in-memory adapter make the behavior clearer?
Is the mock hiding the bug the test is supposed to catch?
Is the test asserting call order when the order is not part of the contract?
Does the test duplicate the implementation's branching logic?

There are good reasons to mock external services, slow systems, nondeterminism, payments, email, queues, cloud APIs, and process boundaries. But if the test is mostly mocking local code that is cheap and deterministic, the AI may be protecting the shape of the current implementation instead of the behavior.

One useful review move is to ask the author or agent for the same test with fewer mocks. Sometimes the answer is worse. Often it reveals a simpler test.
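As an illustration of what "the same test with fewer mocks" can look like, here is a small sketch. The `register` function, `DuplicateEmailError`, and `InMemoryUserRepository` are all invented for this example. The first test restates the calls the implementation makes. The second uses an in-memory fake and states a rule a caller would recognize.

```python
from unittest.mock import Mock


class DuplicateEmailError(Exception):
    """Hypothetical error for registering an email twice."""


class InMemoryUserRepository:
    """Hypothetical in-memory fake for a user store."""

    def __init__(self):
        self._emails = set()

    def exists(self, email):
        return email in self._emails

    def add(self, email):
        self._emails.add(email)


def register(repo, email):
    """Toy registration: normalize the email, then reject duplicates."""
    normalized = email.strip().lower()
    if repo.exists(normalized):
        raise DuplicateEmailError(normalized)
    repo.add(normalized)
    return normalized


def test_register_mock_theater():
    # Mirrors the implementation call by call; the setup is the test.
    repo = Mock()
    repo.exists.return_value = False
    register(repo, "Ada@Example.com")
    repo.exists.assert_called_once_with("ada@example.com")
    repo.add.assert_called_once_with("ada@example.com")


def test_register_rejects_duplicate_with_different_case():
    # Exercises the rule itself: the same address in another case is a duplicate.
    repo = InMemoryUserRepository()
    register(repo, "ada@example.com")
    try:
        register(repo, " Ada@Example.com ")
    except DuplicateEmailError:
        return
    raise AssertionError("duplicate email was accepted")


test_register_mock_theater()
test_register_rejects_duplicate_with_different_case()
```

Both tests pass today. Only the second one keeps passing if `register` is refactored to check and insert in a single repository call, and only the second one fails if someone drops the normalization from the duplicate check.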

Look For Missing Negative Cases

AI-written tests often cover the happy path first and stop there.

That is understandable. The happy path is visible in the implementation. The negative paths require understanding the domain, the caller contract, and the real ways the system fails.

For most nontrivial changes, review for at least one of these:

Invalid input.
Empty input.
Missing optional data.
Permission denied.
Duplicate request.
Timeout or dependency failure.
Boundary values.
Existing data that should not be overwritten.
A path with the feature flag disabled.
A backwards compatibility path for old callers.

You do not need every edge case in every PR. You do need the cases that map to the risk of the change.

If an AI agent adds five happy-path tests that differ only in names and fixture values, I would rather keep one strong happy-path test and add a negative case that could catch a real bug.

Coverage reports can make this worse. A generated test suite may push line coverage upward while leaving the risky branch untested. Coverage is useful as a map of what was executed. It is not proof that the right assertions exist.
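A tiny sketch of the coverage trap, using an invented permission rule. Both functions below are hypothetical: `can_delete` is the intended rule, and `too_permissive_can_delete` is an assumed regression. A single happy-path test executes every line of either function, so line coverage would read the same for both. Only the permission-denied case notices the difference.

```python
def can_delete(role, is_owner):
    """Hypothetical rule: only admins and owners may delete."""
    return role == "admin" or is_owner


def too_permissive_can_delete(role, is_owner):
    """Assumed regression: editors were accidentally allowed through."""
    return role in ("admin", "editor") or is_owner


def happy_path_passes(check):
    # Executes every line of either function under test.
    return check("admin", is_owner=False) is True


def denied_path_passes(check):
    # The negative case: a non-owner editor must be refused.
    return check("editor", is_owner=False) is False


# The happy path is satisfied by both versions.
assert happy_path_passes(can_delete)
assert happy_path_passes(too_permissive_can_delete)

# The negative case catches the regression.
assert denied_path_passes(can_delete)
assert not denied_path_passes(too_permissive_can_delete)
```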

Check That Assertions Are Strong Enough

Weak assertions are another subtle failure mode.

Common examples:

assert result is not None
assert response.status_code == 200 with no body check
assert len(items) > 0
assert "error" in message
Snapshot assertions for unstable or irrelevant output
Assertions that only check the type, not the value or behavior

Those assertions are not automatically bad. Sometimes a smoke test is exactly what you want. But if the pull request is claiming behavioral coverage, the assertion should prove the behavior.

For API code, check the status code and the meaningful fields. For business logic, check the state transition. For parsing, check the parsed structure and the rejection path. For authorization, check both allowed and denied behavior. For idempotency, call the thing twice.

This is where senior review matters. The AI can generate assertion syntax. The reviewer has to decide whether the assertion earns its keep.
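"Call the thing twice" is easy to say and easy to skip, so here is a minimal sketch of the contrast. The `Ledger` class and its request-id scheme are made up for illustration. The weak test would pass against almost any implementation. The idempotency test calls `credit` twice with the same request id and checks the resulting state.

```python
class Ledger:
    """Hypothetical ledger that deduplicates credits by request id."""

    def __init__(self):
        self.balance = 0
        self._seen_request_ids = set()

    def credit(self, request_id, amount):
        # A repeated request id is a retry, not a second credit.
        if request_id in self._seen_request_ids:
            return self.balance
        self._seen_request_ids.add(request_id)
        self.balance += amount
        return self.balance


def test_credit_returns_something():
    # Weak: true for nearly any implementation, including a double-charging one.
    ledger = Ledger()
    assert ledger.credit("req-1", 50) is not None


def test_credit_is_idempotent_per_request_id():
    # Strong: checks the state after a retry and after a genuinely new request.
    ledger = Ledger()
    ledger.credit("req-1", 50)
    ledger.credit("req-1", 50)
    assert ledger.balance == 50
    ledger.credit("req-2", 25)
    assert ledger.balance == 75


test_credit_returns_something()
test_credit_is_idempotent_per_request_id()
```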

Keep Test Data Honest

Generated tests love generated data.

That can be fine, but unrealistic fixtures hide problems. A user with every optional field populated is not the same as a user from a real database row. A timestamp with no timezone issue is not the same as production time data. A perfectly clean string is not the same as user input.

When reviewing test data, ask:

Is the fixture only as large as the behavior requires?
Are optional fields intentionally present or absent?
Does the test name explain why the data matters?
Are factories hiding defaults that make the test pass accidentally?
Does the test use realistic boundary values?

I prefer boring, explicit test data for behavior that matters. Factories are great for reducing noise, but the values relevant to the behavior should be visible in the test. If a discount applies only after 10 seats, I want to see seats=10 in the test, not discover it in a factory default three files away.
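Here is a sketch of how a factory default can make a test pass by accident. The names and numbers are assumptions for illustration: `make_order` defaults to 25 seats, and I am reading the rule as "a 20 percent discount applies from 10 seats upward." Amounts are integers to keep the arithmetic exact.

```python
DISCOUNT_MIN_SEATS = 10  # assumed threshold for this example
DISCOUNT_PERCENT = 20    # assumed discount rate for this example


def make_order(seats=25, unit_price=100):
    """Hypothetical factory; note the default seat count."""
    return {"seats": seats, "unit_price": unit_price}


def order_total(order):
    subtotal = order["seats"] * order["unit_price"]
    if order["seats"] >= DISCOUNT_MIN_SEATS:
        return subtotal * (100 - DISCOUNT_PERCENT) // 100
    return subtotal


def test_discount_hidden_in_factory_default():
    # Passes, but only because make_order() happens to default to 25 seats.
    # Nothing in the test body says why 2000 is the right answer.
    assert order_total(make_order()) == 2000


def test_discount_starts_at_ten_seats():
    # The values that matter are visible, and both sides of the boundary are checked.
    assert order_total(make_order(seats=9, unit_price=100)) == 900
    assert order_total(make_order(seats=10, unit_price=100)) == 800


test_discount_hidden_in_factory_default()
test_discount_starts_at_ten_seats()
```

If someone later changes the factory default to 5 seats, the first test fails for a reason unrelated to the discount rule. The second test is unaffected.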

Do Not Let The Agent Rewrite The Whole Test Suite

AI agents are sometimes too helpful.

You ask for a regression test and get a broad cleanup of fixtures, renamed helpers, reorganized imports, and a new assertion style. Some of that may be good work. It is also a review tax.

For AI-written tests, scope control matters:

Keep regression tests close to the changed behavior.
Avoid test framework migrations in feature PRs.
Do not rename shared fixtures unless the rename is the point.
Do not update snapshots unrelated to the behavior.
Avoid broad "cleanup" unless the author can explain the payoff.

This is not about distrusting the agent. It is about keeping the review legible. When a PR changes production code, tests, fixtures, snapshots, and helper libraries all at once, the reviewer has to separate behavioral coverage from test-suite churn.

Small, boring test diffs are underrated.

A Practical Review Checklist

Here is the checklist I would use for AI-written tests in a normal code review:

The test name describes behavior, not implementation trivia.
At least one test would fail without the production change.
The important assertion checks the contract the user or caller depends on.
Mocks are used for boundaries, not as a substitute for exercising local logic.
The test includes the highest-value negative or boundary case.
Fixtures expose the data that matters to the behavior.
The test does not duplicate the implementation's control flow.
The diff does not churn unrelated test helpers, snapshots, or formatting.
The test is deterministic and does not depend on time, ordering, network, or shared state unless those are controlled.
The PR description explains what human verification was performed.

That checklist is intentionally practical. It does not require a testing philosophy dissertation. It gives reviewers a way to slow down when a generated test suite looks plausible.
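The determinism item is the one I most often see violated with time. One common way to control it, sketched below with an invented `is_expired` function, is to pass the current time in as a parameter instead of reading the clock inside the code under test. The test then picks exact instants on either side of the boundary.

```python
from datetime import datetime, timedelta, timezone


def is_expired(issued_at, ttl, now):
    """Hypothetical expiry check; `now` is injected rather than read from the clock."""
    return now - issued_at >= ttl


def test_token_expires_exactly_at_ttl():
    issued_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    ttl = timedelta(minutes=30)

    # One second before the deadline: still valid.
    just_before = issued_at + timedelta(minutes=29, seconds=59)
    assert not is_expired(issued_at, ttl, now=just_before)

    # Exactly at the deadline: expired.
    assert is_expired(issued_at, ttl, now=issued_at + ttl)


test_token_expires_exactly_at_ttl()
```

Production callers would pass `datetime.now(timezone.utc)`. The test never needs to sleep, freeze the clock, or hope the machine is fast enough.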

What To Ask The AI Agent

If you are using an agent to draft tests, give it constraints that make good tests more likely.

Instead of:

Add tests for this change.

Try:

Add the smallest regression test that fails before this fix and passes after it.
Prefer behavior-level assertions over implementation-detail assertions. Avoid
changing unrelated fixtures or snapshots. Include one negative case if the
changed behavior has a meaningful failure mode.

For review, ask:

Explain which test would fail on the old code and why. Identify any assertions
that depend on implementation details. Suggest one version with fewer mocks if
that would still test the behavior.

The point is not that the agent's answer is automatically correct. The point is that it gives the human reviewer better material to inspect.

The Human Job Does Not Go Away

AI-generated tests can be a real productivity win. They can save time, improve coverage, and make it easier to capture regressions while the context is fresh. I would much rather have engineers use agents to write better tests than use agents to write larger unreviewed patches with no tests at all.

But test review is where engineering judgment shows up.

A good reviewer is not asking whether the test file looks like the rest of the repository. A good reviewer is asking whether the test protects the behavior, whether it would have caught the bug, whether it will survive reasonable refactoring, and whether it makes the system easier to change with confidence.

Green tests are useful. Meaningful tests are better.

When an AI agent writes the first draft, treat that draft as a starting point. Make it prove something.

For more practical engineering notes, see Slaptijack.