How To Design CI Output That Humans Can Actually Debug

CI output is part of your developer experience.

That sounds obvious until you look at the average failed build. A pull request goes red, the developer opens the CI job, and the first thing they see is a scrollback landfill: dependency installation noise, folded shell wrappers, progress bars, warnings from unrelated packages, test runner chatter, retry messages, and somewhere in the middle, the one line that explains the failure.

If the person is lucky, the important line is near the bottom. If they are not, they get to play build-log archaeology while already context-switched away from the code they were trying to ship.

That is not a small annoyance. Bad CI output makes failures slower to diagnose, harder to reproduce, and easier to misinterpret. It wastes reviewer time. It punishes new engineers. It also confuses AI coding agents, because agents depend on stable diagnostic surfaces just as much as humans do.

This is the next layer after "Making Local CI Commands Boring Enough for Humans and AI Agents". That article focused on giving the repository a stable validation interface. This one focuses on what happens when validation fails. A good CI job should not merely say "red." It should make the next debugging step obvious.

Start With The Reader

CI logs are usually written by tools, but they are read by people.

More specifically, they are read by people in a hurry:

  • The author of a pull request trying to decide whether the failure is theirs.
  • A reviewer checking whether the change is safe.
  • A build cop looking for a systemic problem.
  • A release engineer deciding whether to block a rollout.
  • A new teammate who has never seen this failure before.
  • An AI coding agent trying to repair a small patch without broadening the diff.

Those readers do not need every internal detail presented with equal weight. They need a useful answer to a short set of questions:

  • What failed?
  • Where did it fail?
  • Is it likely related to this change?
  • What command reproduces it?
  • Where are the artifacts?
  • What should I inspect next?

If your CI output does not answer those questions quickly, the output is not designed. It is emitted.

Give Every Phase A Name

The first improvement is almost embarrassingly simple: name the phases.

Do not make readers infer where they are in the job from a shell prompt, a tool banner, or a half-folded YAML step. Print clear phase boundaries that match the mental model of the validation path.

For example:

==> setup: install dependencies
==> lint: ruff
==> test: python unit tests
==> test: frontend components
==> build: documentation

That is not fancy. It is useful.

The phase names should be stable enough that people can talk about them in code review:

  • "The lint: ruff phase failed."
  • "The test: python unit tests phase is flaky."
  • "The build: documentation phase needs an artifact link."

Stable names also help agents. If an agent sees the same phase labels locally and in hosted CI, it can connect the failure to the repository's validation contract instead of guessing from raw command output.
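One way to keep the labels identical on a laptop and in hosted CI is to generate them from a single list. The following is a minimal sketch, not a prescribed implementation; the PHASES table and the helper names are hypothetical and only mirror the example labels above.

```python
# Hypothetical phase registry: the one place that defines the labels used everywhere.
PHASES = [
    ("setup", "install dependencies"),
    ("lint", "ruff"),
    ("test", "python unit tests"),
    ("test", "frontend components"),
    ("build", "documentation"),
]


def phase_label(kind: str, name: str) -> str:
    """Return the stable label people quote in review, e.g. 'lint: ruff'."""
    return f"{kind}: {name}"


def phase_banner(kind: str, name: str) -> str:
    """Return the banner line printed when a phase starts."""
    return f"==> {phase_label(kind, name)}"


def print_plan() -> None:
    """Print every phase banner in order, flushing so hosted logs stay ordered."""
    for kind, name in PHASES:
        print(phase_banner(kind, name), flush=True)
```

Because both the local wrapper and the hosted job would call the same helpers, a renamed phase changes in one place instead of drifting between environments.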

Put The Failing Command Near The Failure

A CI failure without the command that failed is a bad diagnostic interface.

When a job runs a command such as "make ci", "just check", "bazel test //...", or "npm test", or a custom script, show the command. When a wrapper delegates to another tool, show the important delegated command too.

The output should make this easy to copy:

FAILED: test: python unit tests
Command:
  uv run pytest tests/unit -q

Reproduce locally:
  make test-python

That small block saves real time. It also reduces the social cost of asking for help. A developer can paste the failure into a thread and everyone starts from the same command instead of reverse-engineering CI YAML.

Be careful with secrets and tokens, obviously. Do not print credentials, signed URLs, or private environment details. But do print enough command structure to make the failure reproducible.
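A wrapper can do both jobs at once: mask secret-looking values, then print the rest of the command in a form that is safe to copy. This sketch assumes secrets arrive as --flag=value arguments whose flag name contains words like "token" or "key"; that naming rule and the function names are assumptions for illustration, and the rule is deliberately over-eager.

```python
import re
import shlex

# Assumption: flags whose names contain these words carry secret values.
# Only the --flag=value form is handled; "--flag value" and environment
# variables would need their own rules.
SECRET_FLAG = re.compile(
    r"^(--?[\w-]*(?:token|password|secret|key)[\w-]*)=(.+)$", re.IGNORECASE
)


def redact(argv: list[str]) -> list[str]:
    """Mask the value of every secret-looking --flag=value argument."""
    return [SECRET_FLAG.sub(r"\1=REDACTED", arg) for arg in argv]


def failure_block(phase: str, argv: list[str], reproduce: str) -> str:
    """Format the copyable block printed next to a failure."""
    command = shlex.join(redact(argv))  # shell-quoted, so it pastes cleanly
    return (
        f"FAILED: {phase}\n"
        f"Command:\n  {command}\n\n"
        f"Reproduce locally:\n  {reproduce}"
    )
```

Redacting before formatting means the command structure survives in the log even when a value does not.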

Separate Signal From Chatter

CI output has two jobs that often conflict:

  • Preserve enough raw output for deep diagnosis.
  • Make the common failure obvious.

The solution is not to hide everything behind a summary. Summaries can lie by omission. The solution is to create layers.

A good failure presentation has:

  • A short top-level failure summary.
  • The command and phase that failed.
  • The most relevant error excerpt.
  • Links or paths to full logs and artifacts.
  • Raw output available for deeper inspection.

For example:

CI FAILED

Phase: test: python unit tests
Command: uv run pytest tests/unit -q
Failure: tests/unit/test_parser.py::test_rejects_empty_input

Short error:
  AssertionError: expected ParseError, got None

Artifacts:
  junit: artifacts/junit/unit-tests.xml
  full log: artifacts/logs/unit-tests.log

That is the shape you want. The summary gets the reader moving. The artifact keeps the evidence available.

What you do not want is a 12,000-line log where the failure summary appears only because the test runner happened to print one before exiting.
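The layers can be produced by a small amount of glue rather than by hoping the test runner prints a summary. This sketch pulls a short excerpt out of a long log and renders the top-level block; the marker strings and function names are assumptions, and a real job would tune the markers to its own tools.

```python
def error_excerpt(
    log_text: str,
    markers: tuple[str, ...] = ("Error", "FAILED"),
    context: int = 2,
) -> list[str]:
    """Return the last line containing a marker, plus a few lines before it."""
    lines = log_text.splitlines()
    for index in range(len(lines) - 1, -1, -1):
        if any(marker in lines[index] for marker in markers):
            return lines[max(0, index - context) : index + 1]
    # No marker found: fall back to the tail of the log.
    return lines[-(context + 1):]


def layered_summary(
    phase: str, command: str, excerpt: list[str], artifacts: dict[str, str]
) -> str:
    """Render the short summary; the full log stays in the artifacts."""
    parts = ["CI FAILED", "", f"Phase: {phase}", f"Command: {command}", "", "Short error:"]
    parts += [f"  {line}" for line in excerpt]
    parts += ["", "Artifacts:"]
    parts += [f"  {label}: {path}" for label, path in artifacts.items()]
    return "\n".join(parts)
```

The excerpt is a heuristic and can pick the wrong line, which is why the summary always points at the full log instead of replacing it.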

Use Artifacts Deliberately

Artifacts are where CI output grows up.

A console log is a poor home for everything. Test reports, coverage output, screenshots, browser traces, build scans, benchmark results, and generated diagnostics usually belong in artifacts with predictable names.

Useful artifact names are boring:

  • artifacts/junit/unit-tests.xml
  • artifacts/logs/unit-tests.log
  • artifacts/screenshots/playwright/
  • artifacts/coverage/index.html
  • artifacts/bazel/execution-log.json
  • artifacts/build/repro.txt

The key is predictability. If every CI job invents its own artifact layout, readers still have to hunt.

For test failures, publish machine-readable reports when the tool supports it. JUnit XML is not glamorous, but many CI systems understand it. For browser tests, screenshots and traces are often worth more than another thousand lines of console output. For build-system failures, a captured command, effective configuration, and selected diagnostic logs can turn "CI is broken" into an actual investigation.

This matters even more for failures that are hard to reproduce. If the job failed in a particular container image, shard, platform, or dependency state, the artifact should preserve enough context to keep the evidence from evaporating.
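A machine-readable report also makes the short summary cheap to produce. The sketch below lists failed test cases from a JUnit-style XML report using only the standard library. JUnit XML has no single formal schema, so the classname, name, and message attributes used here are an assumption based on the common layout that pytest and many other runners emit.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> list[tuple[str, str]]:
    """Return (test id, first line of message) for each failed or errored case."""
    root = ET.fromstring(junit_xml)
    results = []
    # iter() finds <testcase> at any depth, with or without a <testsuites> root.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            problem = case.find(kind)
            if problem is None:
                continue
            test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
            message = (problem.get("message") or problem.text or "").strip()
            results.append((test_id, message.splitlines()[0] if message else ""))
            break  # report each test case once
    return results
```

Fed a report such as artifacts/junit/unit-tests.xml, the result is enough to fill in the "Failure" and "Short error" lines of a summary without scraping console output.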

Make Reproduction A First-Class Output

A failed CI job should tell the developer how to reproduce the failure locally when local reproduction is realistic.

That does not mean every hosted CI job must be perfectly reproducible on a laptop. Some jobs depend on deployment credentials, large services, remote execution, specialized hardware, or production-like infrastructure. Fine. Say that clearly.

For the common path, include a reproduction block:

Reproduce:
  git fetch origin pull/123/head:pr-123
  git switch pr-123
  make check

Focused:
  uv run pytest tests/unit/test_parser.py::test_rejects_empty_input -q

The focused command is especially valuable. It tells the developer where to start without pretending the focused command is the whole validation story.

For Bazel, include the target:

Reproduce:
  bazel test //services/parser:parser_test --test_output=errors

For frontend browser tests, include the project, browser, and trace location if those matter:

Reproduce:
  npm run test:e2e -- --project=chromium tests/login.spec.ts

The point is not to turn CI into a tutorial. The point is to remove avoidable friction from the first debugging move.
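Reproduction blocks are easy to generate once the failing test identifiers are known. This sketch assumes pytest node IDs and the "uv run pytest" invocation used in the examples above; the function name is hypothetical.

```python
import shlex


def reproduce_block(full_command: list[str], failing_node_ids: list[str]) -> str:
    """Render a 'Reproduce' block, plus a 'Focused' block when tests are known."""
    lines = ["Reproduce:", f"  {shlex.join(full_command)}"]
    if failing_node_ids:
        lines += ["", "Focused:"]
        for node_id in failing_node_ids:
            # shlex.join quotes IDs containing spaces or brackets (for example
            # parametrized tests), so the line can be pasted into a shell as-is.
            focused = ["uv", "run", "pytest", node_id, "-q"]
            lines.append(f"  {shlex.join(focused)}")
    return "\n".join(lines)
```

Printing the full command first keeps the focused command in its place: a starting point, not the whole validation story.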

Preserve Exit Codes And Stop Lying

CI scripts should be boring about failure semantics.

If a required command fails, the job should fail. If a non-required command is allowed to fail, the output should say so. If a job continues after failures to collect more results, the final summary should still make the failed phases unmissable.

Common traps:

  • Shell pipelines that lose the original command's exit code.
  • Scripts that print "failed" but exit 0.
  • Test wrappers that continue past a required failure without a final summary.
  • Retry loops that hide the first useful error.
  • Cleanup steps that overwrite the meaningful failure status.

Use shell strictness where appropriate, but do not treat set -e as a complete CI design. Be explicit around pipes and cleanup:

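#!/bin/sh
# -e stops at the first failing command; -u treats unset variables as errors.
set -eu

echo "==> test: python unit tests"
# No pipe and no cleanup step follows, so pytest's exit status is the script's.
uv run pytest tests/unit -q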

If you need tee, make sure your shell preserves the failing command's status. In Bash, that often means set -o pipefail. In POSIX sh, it may mean a slightly more deliberate wrapper. The detail matters because a green CI job with a hidden failure is worse than a loud red one.
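When the wrapper is not a shell script, the same goal can be met without a pipeline at all. This sketch uses Python's standard subprocess module to copy a command's output to both the console and a log file, then returns the command's own exit code. The function name and the log path in the usage note are assumptions.

```python
import subprocess
import sys


def run_and_tee(argv: list[str], log_path: str) -> int:
    """Run argv, copy its output to the console and to log_path, return its exit code."""
    with open(log_path, "w", encoding="utf-8") as log:
        process = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr so the log reads in order
            text=True,
        )
        for line in process.stdout:
            sys.stdout.write(line)
            log.write(line)
        # wait() returns the child's own status; no later pipeline stage can replace it.
        return process.wait()
```

A job would typically end with something like sys.exit(run_and_tee(["uv", "run", "pytest", "tests/unit", "-q"], "unit-tests.log")), so the log capture never changes whether the job is red or green.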

Distinguish Product Failures From Infrastructure Failures

Not all red builds are the same.

A unit test assertion failure is different from a package registry outage. A lint error is different from a worker running out of disk. A browser test failure is different from a CI image failing to pull. If all of those appear as "job failed," your readers have to classify the failure themselves.

When possible, label the failure type:

Failure Type        | Example                                           | Reader's First Move
--------------------+---------------------------------------------------+-------------------------------------------
Code failure        | Test assertion, type error, lint finding          | Inspect the change and reproduce locally.
Test instability    | Retry passes, timeout, nondeterministic ordering  | Check recent flake history and isolate.
Environment failure | Missing dependency, disk full, image pull failure | Inspect CI platform or image change.
External dependency | Package registry, SaaS API, network outage        | Confirm service status and retry policy.
Configuration drift | Local command differs from CI path                | Compare wrapper, flags, and environment.

You will not classify every failure perfectly. That is fine. Even a rough classification helps the reader avoid the wrong first move.

This is also where CI design and build reproducibility meet. If failures often cannot be classified, that is a smell. It usually means the job does too many things at once, hides tool output behind wrappers, or depends on ambient state that nobody has named.
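A rough classifier can be as simple as an ordered list of patterns. The sketch below is illustrative only: the patterns are hypothetical examples, not a catalog of real tool messages, and configuration drift is left out because it rarely shows up in log text alone.

```python
import re

# Hypothetical patterns; tune them to the messages your own tools emit.
# Infrastructure rules come first because an infrastructure problem often
# produces secondary assertion noise further down the log.
RULES = [
    ("Environment failure", re.compile(
        r"no space left on device|command not found|failed to pull image", re.IGNORECASE)),
    ("External dependency", re.compile(
        r"connection (?:reset|refused|timed out)|temporary failure in name resolution"
        r"|503 service unavailable", re.IGNORECASE)),
    ("Code failure", re.compile(r"AssertionError|TypeError|SyntaxError")),
]


def classify(log_text: str, passed_on_retry: bool = False) -> str:
    """Return a rough failure type; 'Unclassified' is a valid, honest answer."""
    if passed_on_retry:
        return "Test instability"
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    return "Unclassified"
```

Tracking how often the result is "Unclassified" gives a crude measure of the smell described above.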

Be Careful With Retries

Retries are useful. Retries are also dangerous.

A retry can separate a transient network failure from a deterministic test failure. It can also teach a team to ignore red builds until the machine gets lucky.

If a job retries, print the retry policy:

Retrying test: frontend components
Attempt: 2 of 3
Reason: previous attempt timed out after 120s

Then keep the earlier attempt's evidence. The first failure is often the most useful one. If the final attempt passes, record that the phase passed after a retry. A "green" job with hidden retries is still carrying information about system health.

For flaky tests, the output should make the flake visible:

FLAKY: tests/login.spec.ts::password reset
Passed on retry 2 of 3.
Trace: artifacts/playwright/login-password-reset-retry1.zip

That kind of output lets teams track the problem instead of letting retries launder it away.
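Keeping retry evidence mostly means refusing to throw away what each attempt reported. In this sketch, attempt_fn is a hypothetical callable that runs one attempt and returns whether it passed along with a short detail string; the result object keeps every failed attempt's detail and exposes a flaky flag.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PhaseResult:
    phase: str
    passed: bool
    attempts: int
    failures: list[str] = field(default_factory=list)  # detail from every failed attempt

    @property
    def flaky(self) -> bool:
        """A phase that passed only after a retry is recorded as flaky, not clean."""
        return self.passed and self.attempts > 1


def run_with_retries(
    phase: str,
    attempt_fn: Callable[[int], tuple[bool, str]],
    max_attempts: int = 3,
) -> PhaseResult:
    """Run attempt_fn up to max_attempts times, printing the retry policy each time."""
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        if attempt > 1:
            print(f"Retrying {phase}")
            print(f"Attempt: {attempt} of {max_attempts}")
            print(f"Reason: {failures[-1]}")
        passed, detail = attempt_fn(attempt)
        if passed:
            return PhaseResult(phase, True, attempt, failures)
        failures.append(detail)
    return PhaseResult(phase, False, max_attempts, failures)
```

A final summary can then print a FLAKY line whenever result.flaky is true, with result.failures[0] pointing at the first and often most useful failure.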

Design For Review, Not Just Execution

CI output should support code review.

Reviewers do not need to watch the whole job run. They need to understand whether the change passed the right checks and, when it failed, whether the failure changes their review decision.

That means the pull request surface should show:

  • Which required checks passed.
  • Which optional checks failed.
  • Which failures are new versus known flaky behavior.
  • Links to focused artifacts.
  • Enough summary to avoid opening five tabs for a routine failure.

Do not make the PR page pretend all checks are equal. A formatting failure, a unit test failure, a security scan finding, and a deployment dry-run failure should not require the same mental parsing.

If your CI provider supports annotations, use them carefully. Inline test or lint annotations can be excellent when they point to the relevant file and line. They become noise when they flood the review with generated files, duplicate findings, or low-value warnings.
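As one concrete case, GitHub Actions turns lines of the form "::error file=...,line=...::message" printed to the job log into inline annotations. The sketch below assumes that provider; the generated/ prefix and the limit of ten are hypothetical policy choices. It drops duplicates and generated files and caps the total so annotations stay actionable.

```python
def _escape_data(value: str) -> str:
    """Escape a message for a workflow command (percent, CR, LF)."""
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(value: str) -> str:
    """Escape a property value; ':' and ',' also delimit properties."""
    return _escape_data(value).replace(":", "%3A").replace(",", "%2C")


def annotations(
    findings: list[tuple[str, int, str]],
    skip_prefixes: tuple[str, ...] = ("generated/",),
    limit: int = 10,
) -> list[str]:
    """Return error annotation lines: deduplicated, filtered, and capped."""
    seen = set()
    lines = []
    for path, line, message in findings:
        key = (path, line, message)
        if key in seen or path.startswith(skip_prefixes):
            continue
        seen.add(key)
        lines.append(
            f"::error file={_escape_property(path)},line={line}::{_escape_data(message)}"
        )
        if len(lines) == limit:
            break
    return lines
```

Printing each returned line to standard output inside the job is what makes the annotation appear; findings beyond the cap still belong in the full report artifact.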

Make Output Friendly To AI Agents Without Making It Weird

You do not need an "AI log format." You need good logs.

The same properties that help a tired human help an agent:

  • Stable phase names.
  • Explicit commands.
  • Clear failure summaries.
  • Predictable artifact paths.
  • Focused reproduction instructions.
  • Machine-readable reports where appropriate.
  • No giant undifferentiated walls of output.

If you want to go one step further, add a small ci-summary.txt artifact:

status: failed
failed_phase: test: python unit tests
command: uv run pytest tests/unit -q
primary_failure: tests/unit/test_parser.py::test_rejects_empty_input
reproduce: uv run pytest tests/unit/test_parser.py::test_rejects_empty_input -q
artifacts:
  junit: artifacts/junit/unit-tests.xml
  log: artifacts/logs/unit-tests.log

That file is useful for humans, bots, and agents. It is also easy to attach to support requests or paste into issue comments.
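Generating the file takes a few lines. Note that the format above looks like YAML but is not valid YAML as written, because values such as "test: python unit tests" contain ": ". This sketch therefore treats it as plain "key: value" text and reads it by splitting on the first ": " only. The function names are hypothetical.

```python
def render_summary(
    status: str,
    failed_phase: str,
    command: str,
    primary_failure: str,
    reproduce: str,
    artifacts: dict[str, str],
) -> str:
    """Render the ci-summary.txt body in the layout shown above."""
    lines = [
        f"status: {status}",
        f"failed_phase: {failed_phase}",
        f"command: {command}",
        f"primary_failure: {primary_failure}",
        f"reproduce: {reproduce}",
        "artifacts:",
    ]
    lines += [f"  {label}: {path}" for label, path in artifacts.items()]
    return "\n".join(lines) + "\n"


def read_top_level(text: str) -> dict[str, str]:
    """Read unindented 'key: value' lines, splitting on the first ': ' only."""
    fields = {}
    for line in text.splitlines():
        if line and not line.startswith(" ") and ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields
```

Writing the rendered text with pathlib.Path("artifacts/ci-summary.txt").write_text(...) at the end of the job is enough to publish it alongside the other artifacts.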

The trap is overengineering the summary while neglecting the underlying validation. A clean summary of a sloppy job is just a nicer wrapper around a bad system.

A Practical CI Output Checklist

If you want to improve CI output without turning it into a platform project, start here:

  • Give each major phase a stable human-readable name.
  • Print the command that failed, with secrets removed.
  • Preserve the failing tool's exit code.
  • Put a short failure summary near the end of the job.
  • Publish full logs and reports as artifacts.
  • Use predictable artifact names and paths.
  • Include a local reproduction command when possible.
  • Label infrastructure failures differently from code failures.
  • Keep retry evidence visible.
  • Avoid dumping unrelated warnings into the primary failure path.
  • Use annotations only when they point to actionable code.
  • Make CI call the same repository commands developers use locally.

That last point is worth repeating. CI output gets much easier to understand when CI is not its own special universe. If the repository's local interface is make check, the hosted job should call make check unless there is a concrete reason not to. Then the output vocabulary stays consistent across laptops, review, and automation.

The Standard Is Debuggability

The goal is not pretty logs.

Pretty logs can still be useless. The goal is debuggable output: output that helps a competent reader move from "the build failed" to "I know what to inspect next" with minimal ceremony.

That is a craft problem. It sits somewhere between build engineering, developer productivity, observability, and plain respect for other people's time.

Good CI output does not eliminate failures. It makes failures cheaper. It keeps the team from rediscovering the same diagnostic path every week. It gives new engineers a map. It gives reviewers better evidence. It gives AI coding agents the structure they need to make smaller, more reviewable fixes.

And it sends a useful cultural signal: when the system says no, it should also help you understand why.

For more practical engineering notes, visit Slaptijack.