
[Vibecoding]

AI-Generated Tests: A Quality Audit

Test generation is one of the most-cited applications of AI coding tools. The output looks comprehensive. The reality, when actually audited, is more nuanced.

Jyme Newsroom · October 13, 2025

One of the most consistent claims for AI coding tools through 2025 is that they generate tests well. Vendors cite test generation as a high-value, low-risk use case. Engineering managers cite it as one of the safest places to start adopting AI assistance. The output, on first inspection, looks comprehensive: a function turns into ten test cases covering the obvious branches, plus a few edge cases the human author had not thought of.

The interesting question is what happens when the AI-generated tests are audited carefully against the standards a senior engineer would apply to hand-written tests. The pattern, across teams that have done this audit, is more nuanced than the marketing suggests, and the implications for how tests should actually be generated and reviewed are worth understanding.

What AI-generated tests get right

The first observation from auditing AI-generated test suites is that the surface coverage is genuinely good. A coding agent asked to generate tests for a function will reliably cover the obvious cases: the happy path, the empty input, the boundary conditions, the basic error cases. For functions with simple, well-defined behavior, the resulting tests are often comparable to what a competent engineer would write.

The agents are also good at generating the boilerplate that humans tend to skip when they are bored: setting up test fixtures, mocking dependencies cleanly, structuring the test file in a way that follows the project's existing conventions. These are real wins, and for teams that previously had thin test coverage, an aggressive sweep of AI-generated tests can be transformative.

Anthropic's and Cursor's recommendations for using their tools for test generation emphasize this surface-level value, and the recommendations are honest about it. The tools work well for the cases they describe.

What AI-generated tests get wrong

The deeper audit reveals a few consistent failure modes that surface across teams using these tools at scale.

The first is over-mocking. AI-generated tests tend to mock more aggressively than is ideal, with the result that the tests pass even when the real implementation has subtle bugs. A test that mocks out the database call and asserts that the function returns the mocked value tells the reader almost nothing about whether the function is correct. The pattern is depressingly common in AI-generated test suites.
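The pattern is easiest to see in code. The sketch below uses a hypothetical `apply_discount` function (the names are illustrative, not from any particular codebase): the first test asserts only that the mocked database was called, so it keeps passing even if the discount math is wrong; the second pins the actual computed value.

```python
from unittest.mock import Mock

# Hypothetical function: applies a percentage discount and persists the result.
def apply_discount(price, percent, db):
    discounted = price * (1 - percent / 100)
    db.save(discounted)
    return discounted

# Over-mocked: asserts only that the mock was handed *something*.
# This test still passes if the discount arithmetic is wrong.
def test_over_mocked():
    db = Mock()
    apply_discount(100, 20, db)
    db.save.assert_called_once()  # says nothing about the value

# Behavior-checking: pins the value the caller actually observes.
def test_checks_value():
    db = Mock()
    result = apply_discount(100, 20, db)
    assert result == 80.0
    db.save.assert_called_once_with(80.0)
```

The mock itself is not the problem; the problem is that the first test's assertion is satisfiable by any implementation that calls `save` at all.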

The second is asserting on implementation details rather than behavior. AI tools often generate tests that check that specific functions were called with specific arguments, rather than checking that the observable behavior matches the expectation. Tests written this way break whenever the implementation is refactored, even if the behavior is unchanged, which is the opposite of what good tests should do.
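A minimal contrast, using a made-up helper for illustration: the first test spies on *how* the function sorts and breaks the moment the implementation switches strategies; the second asserts the observable contract and survives any behavior-preserving refactor.

```python
from unittest.mock import patch

# Hypothetical helper: deduplicates and orders a list of tags.
def unique_sorted_tags(tags):
    return sorted(set(tags))

# Implementation-detail test: fails if the function is refactored to
# order its output another way, even though the behavior is identical.
def test_calls_sorted():
    with patch("builtins.sorted", wraps=sorted) as spy:
        unique_sorted_tags(["b", "a", "b"])
        spy.assert_called_once()

# Behavior test: asserts only the contract the caller relies on.
def test_output_is_unique_and_ordered():
    assert unique_sorted_tags(["b", "a", "b"]) == ["a", "b"]
```

The rule of thumb is that a test should fail when behavior changes and only when behavior changes; call-assertion tests fail on refactors too.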

The third is missed edge cases that require domain knowledge. The AI knows about generic edge cases (empty strings, zero values, large inputs) but does not know about the domain-specific cases that matter for the actual application: the specific patterns of bad data the production system tends to receive, the specific timing edge cases that have caused incidents in the past, the specific business rules that have been refined over time. These are the cases that hand-written tests often catch and AI-generated tests usually miss.

The fourth is false comprehensiveness. A test file with twenty tests looks more thorough than one with five, but if the twenty tests are mostly redundant variations of the same basic case while the five are well-chosen probes of distinct behaviors, the smaller suite is better. AI tools tend toward the larger, redundant version.
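The difference shows up clearly in a table-driven style. In the sketch below (the validator and its rules are invented for illustration), the three near-identical tests all probe the same passing branch, while the five-row table probes five distinct behaviors in less code.

```python
# Hypothetical validator used to illustrate redundancy vs. coverage.
def valid_username(name):
    return 3 <= len(name) <= 20 and name.isalnum()

# Redundant style: three variations of the same happy path.
def test_alice(): assert valid_username("alice")
def test_bob(): assert valid_username("bob123")
def test_carol(): assert valid_username("carol")

# Targeted style: one table, each row probing a distinct behavior.
CASES = [
    ("alice", True),      # happy path
    ("ab", False),        # below minimum length
    ("a" * 21, False),    # above maximum length
    ("bad name", False),  # non-alphanumeric character
    ("", False),          # empty input
]

def test_table():
    for name, expected in CASES:
        assert valid_username(name) is expected
```

Counting distinct behaviors probed, rather than test functions, is the more honest measure of a suite's thoroughness.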

The reviewer's problem

The pattern these failure modes share is that they are hard to catch by reading the test file in isolation. A reviewer looking at twenty tests with reasonable assertions and clear setup will tend to approve the file unless they specifically look for the problems above. The tests run, they pass, the coverage metric goes up, and the team feels good about the change.

The issues only surface later, in two ways. First, when a real bug ships despite the test coverage, and post-mortem investigation reveals that the relevant case was either over-mocked or untested. Second, when a refactor breaks a large number of tests in ways that turn out to have nothing to do with behavior changes, costing engineering time to fix the brittle tests.

Both of these costs are easy to attribute to the original AI generation if the team is paying attention. They are easier to miss if the team is not.

What good AI test generation looks like

The teams that get the most value from AI test generation have learned a few practices that mitigate the failure modes.

First, they treat AI-generated tests as a draft to review carefully, not as production-ready code. This sounds obvious but is often skipped, because the output looks plausible enough that the temptation to nod and merge is high. The same review standards that apply to AI-generated implementation code should apply to AI-generated tests.

Second, they write the test specifications themselves before asking the AI to generate the tests. A prompt that says "generate tests for this function" produces generic output. A prompt that says "generate tests covering these specific scenarios with these specific edge cases" produces output that is actually targeted at the team's domain. The skill of writing the test spec is the load-bearing skill, and it cannot be delegated to the AI.
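One lightweight way to keep the spec load-bearing is to write it as data the team owns, then have the AI (or a human) fill in the mechanics around it. The sketch below assumes a hypothetical `parse_price` function and a domain feed format; the `SPEC` rows are the part a generic prompt could never produce.

```python
# Hand-written spec, supplied to the AI instead of "generate tests for
# parse_price". Each row encodes a domain case the team actually sees
# in production (the cases here are illustrative assumptions).
SPEC = [
    ("$1,299.00", 129900),  # US format with thousands separator
    ("1299", 129900),       # bare integer dollars from a legacy feed
    ("$0.99", 99),          # sub-dollar price
]

def parse_price(raw):
    """Parse a price string into integer cents."""
    cleaned = raw.replace("$", "").replace(",", "")
    return round(float(cleaned) * 100)

def test_spec():
    for raw, expected_cents in SPEC:
        assert parse_price(raw) == expected_cents
```

The spec rows survive any rewrite of `parse_price`, which is exactly the property the team wants its institutional knowledge to have.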

Third, they avoid the pattern of generating tests after the fact for code that already shipped. Tests written this way tend to test what the code does rather than what it should do, locking in any bugs that exist. The better pattern is to generate tests before or alongside the implementation, which forces the team to think about expected behavior rather than observed behavior.

Fourth, they audit their test suites periodically for the failure modes above, treating brittle or over-mocked tests as a real cost worth paying down. This is exactly the kind of unsexy maintenance work that gets deferred forever in busy teams, and the teams that prioritize it have meaningfully healthier test suites.

The broader implication

The AI-generated test story illustrates a broader pattern in AI-assisted engineering. The tools are genuinely useful for the surface form of a task: producing something that looks like the artifact a human would have produced. They are less reliable for the deeper purpose of the task: producing the artifact that actually serves its intended function. The gap between surface form and deep purpose is exactly where review judgment matters, and where the discipline of the engineering team determines whether AI assistance is a net positive or a net negative.

This is the same pattern visible in AI-generated implementation code, AI-generated documentation, and AI-generated review comments. The form is reliable; the substance requires human attention. Teams that internalize this calibration get value out of AI tools. Teams that do not tend to ship a lot of artifacts that look good and do not work.

The summary

AI test generation is a real productivity gain when used carefully and a real source of hidden technical debt when used carelessly. The discipline that separates the two is treating the AI as a strong but inexperienced colleague whose work needs the same review standards a junior engineer's would receive.

The vendor positioning around test generation tends to emphasize the upside without acknowledging the failure modes. The practitioner perspective, visible across Hacker News and engineering manager peer groups, is more measured. The deeper point: AI test generation is an IDE-tier discipline question for teams maintaining hand-written codebases. The growing population of builders shipping finished products from prompts — Lovable for web, Orbie for native mobile — operates at a layer where the test-suite-quality conversation is replaced by a synthesis-quality conversation. That is the more interesting frontier, and the one that determines who actually ships.
