
We Audited Our Own AI-Written Codebase. Here's What We Found.

We ran a systematic audit on Koalr's own AI-assisted codebase after a viral Product Hunt post about AI code quality. 13 issues found across 5 categories. Here's exactly what we fixed.

Andrew McCarron·March 29, 2026·12 min read

A founder posted on Product Hunt this week: "We let Claude write 100% of our code for 7 days. Here's what broke first." We read it. Then we looked at each other. Koalr is also AI-assisted — a lot of our codebase was written with Claude Code. So we spent a day running the same audit on ourselves. We found 13 issues. Here's exactly what they were and how we fixed them.

The Hypothesis

AI coding assistants are excellent at writing code that's correct for a single, well-specified call. They're less reliable at writing code that stays correct at scale — under concurrent requests, against large datasets, across multiple background workers running simultaneously.

The PH post framed this as "AI bugs." We think it's more precise than that. These are patterns that feel right locally but fail at production load. They don't show up in tests. They don't throw exceptions. They just quietly degrade.

We found five categories of them.

Category 1: N+1 Query Patterns

The hotspot coverage page — 16 queries per request

Our coverage hotspot detection was fetching an 8-week trend. The "obvious" implementation: loop over weeks, query the database inside each iteration.

// Before: 2 queries per week × 8 weeks = 16 serial round-trips
for (const week of weeks) {
  const snapshot = await this.prisma.coverageSnapshot.findFirst({
    where: { repositoryId, capturedAt: { gte: week.start, lte: week.end } },
    orderBy: { capturedAt: 'desc' },
  });
  const churn = await this.prisma.pullRequest.count({
    where: { repositoryId, mergedAt: { gte: week.start, lte: week.end } },
  });
}

Fix: 4 parallel queries for the full window, weekly bucketing in memory.

// After: 4 queries total, regardless of window length
const [allSnapshots, allPrs, failureRateMap, allTimeChurn] = await Promise.all([
  this.prisma.coverageSnapshot.findMany({
    where: { repositoryId, capturedAt: { gte: from, lte: to } },
  }),
  this.prisma.pullRequest.findMany({
    where: { repositoryId, mergedAt: { gte: from, lte: to } },
  }),
  // ...
]);
// bucket in memory
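
The "bucket in memory" step can be sketched as a pure function. This is a hedged sketch, not our exact service code: `bucketWeekly`, the `Snapshot` shape, and the field names are illustrative.

```typescript
// Hypothetical sketch of the in-memory bucketing step. The Snapshot shape
// and field names are illustrative, not the exact Prisma types above.
interface Snapshot {
  capturedAt: Date;
  coverage: number;
}

interface Week {
  start: Date;
  end: Date;
}

// For each week, pick the latest snapshot captured inside that window —
// the same result the per-week findFirst produced, but with zero queries.
function bucketWeekly(snapshots: Snapshot[], weeks: Week[]): (Snapshot | null)[] {
  return weeks.map((week) => {
    let latest: Snapshot | null = null;
    for (const s of snapshots) {
      if (s.capturedAt >= week.start && s.capturedAt <= week.end) {
        if (!latest || s.capturedAt > latest.capturedAt) latest = s;
      }
    }
    return latest;
  });
}
```

The scan is O(weeks × snapshots), which is trivial at an 8-week window — the expensive part was never the comparison, it was the serial round-trips.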

Team member import — N upserts

When importing teams from GitHub or Jira, we were running one upsert per team member in a loop. For a team of 30, that's 30 sequential database writes.

// Before: one round-trip per member
for (const userId of userIds) {
  await this.prisma.teamMember.upsert({ ... });
}
// After: one INSERT ... ON CONFLICT DO NOTHING
await this.prisma.teamMember.createMany({
  data: userIds.map((userId) => ({ teamId, userId })),
  skipDuplicates: true,
});
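
For very large imports, even a single createMany can bump into driver parameter limits. A simple chunking helper keeps each batch bounded — `chunk` is a hypothetical helper and the batch size is illustrative, not a tuned value.

```typescript
// Hypothetical helper: split a large id list into fixed-size batches so
// each bulk insert stays well under the driver's parameter limits.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch (assumes a Prisma client and teamId in scope):
// for (const batch of chunk(userIds, 1_000)) {
//   await prisma.teamMember.createMany({
//     data: batch.map((userId) => ({ teamId, userId })),
//     skipDuplicates: true,
//   });
// }
```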

Category 2: Silent Exception Catches

The pattern that makes dashboards show zeros

This one is the most insidious because nothing visibly breaks. You just get incorrect data silently.

// Before: silently returns empty on any failure
try {
  const result = await externalApiCall(params);
  return processResult(result);
} catch {
  return [];
}

When a GitHub API call hit a rate limit, we'd catch the error and return an empty array. The dashboard would show zero contributors, zero PRs, zero everything — and no error in any log. The customer would see stale data and assume the sync hadn't run yet.

The fix: log the error, rethrow it so BullMQ's retry logic kicks in, and let the job fail visibly rather than silently succeeding with wrong data.

// After: logs, rethrows, lets retry logic handle it
try {
  const result = await externalApiCall(params);
  return processResult(result);
} catch (err) {
  this.logger.error(`Sync failed for ${context}: ${(err as Error).message}`);
  throw err;
}

Silent catches are the hardest class to find. They don't appear in error logs. They don't cause alerts. They only show up when a customer notices their data is wrong — which may be days after the underlying failure. If you're auditing an AI-written codebase, search for catch { and catch (e) followed immediately by return [] or return null.

Category 3: Unbounded Queries

Loading the entire queue to get a count

Our weekly report was fetching all open pull requests to compute queue depth — loading every row, counting the array in memory.

// Before: fetches every open PR just to count them
const openQueue = await this.prisma.pullRequest.findMany({
  where: { organizationId, state: 'OPEN' },
  take: 2_000, // even this "limit" is too high
});
const queueDepth = openQueue.length;

Fix: three parallel aggregation queries that move a few scalars instead of thousands of rows.

// After: 3 parallel aggregations, no row transfer
const [totalOpen, compliantCount, oldestOpenPr] = await Promise.all([
  this.prisma.pullRequest.count({ where: openPrWhere }),
  this.prisma.pullRequest.count({
    where: { ...openPrWhere, openedAt: { gte: slaWindowStart } },
  }),
  this.prisma.pullRequest.findFirst({
    where: openPrWhere,
    orderBy: { openedAt: 'asc' },
    select: { openedAt: true },
  }),
]);
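
The three aggregates then feed the report. A hedged sketch of that last step — `summarize` and the `QueueSummary` fields are ours for illustration, not the real report code — mainly to show the empty-queue guard:

```typescript
// Hypothetical summary shape built from the three aggregates above.
interface QueueSummary {
  queueDepth: number;
  slaCompliancePct: number; // 0–100
  oldestOpenDays: number | null;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Guard the empty-queue case: with zero open PRs, compliance is 100%,
// not NaN from a 0/0 division.
function summarize(
  totalOpen: number,
  compliantCount: number,
  oldestOpenedAt: Date | null,
  now: Date,
): QueueSummary {
  return {
    queueDepth: totalOpen,
    slaCompliancePct:
      totalOpen === 0 ? 100 : Math.round((compliantCount / totalOpen) * 100),
    oldestOpenDays: oldestOpenedAt
      ? Math.floor((now.getTime() - oldestOpenedAt.getTime()) / MS_PER_DAY)
      : null,
  };
}
```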

Category 4: Missing Database Indexes

The most embarrassing category, because the fix is a one-liner.

Our most common query pattern across DORA, flow, and PR summary endpoints is:

WHERE organization_id = ? AND state = 'MERGED' AND merged_at BETWEEN ? AND ?

We had separate indexes on organization_id, on state, and on merged_at — but not a composite index covering all three. Postgres was doing a full organization scan on every metric request and filtering down from there.

Four composite indexes added:

@@index([organizationId, state, mergedAt])
@@index([organizationId, state, openedAt])
@@index([organizationId, repositoryId, mergedAt])
@@index([organizationId, authorGithubLogin, mergedAt])

Why AI misses this: AI generates queries that are logically correct in isolation. It doesn't simulate the query planner against a multi-million-row table. The missing index is invisible until you run EXPLAIN ANALYZE on a production-sized dataset — which AI has no access to during generation.
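
The intuition for why three single-column indexes don't add up to one composite index is the leftmost-prefix rule. The sketch below is a deliberately simplified, hypothetical model — real planners can do more (bitmap index scans, index-only scans) — but the prefix rule is the right first approximation when deciding what to add to the schema.

```typescript
// Simplified model of the leftmost-prefix rule: a composite index serves a
// query when the query's equality columns form a prefix of the index, and
// the range column (if any) comes immediately after that prefix. Real
// planners are more capable than this; it's intuition, not a simulator.
function indexCovers(
  index: string[],
  equalityCols: string[],
  rangeCol?: string,
): boolean {
  const eq = new Set(equalityCols);
  let i = 0;
  // Consume the equality columns as an index prefix, in any order.
  while (i < index.length && eq.has(index[i])) {
    eq.delete(index[i]);
    i++;
  }
  if (eq.size > 0) return false; // an equality column isn't in the prefix
  if (!rangeCol) return true;
  return index[i] === rangeCol; // the range column must follow directly
}
```

Under this model, `[organizationId, state, mergedAt]` covers `WHERE organization_id = ? AND state = ? AND merged_at BETWEEN ? AND ?`, while a lone index on `organization_id` does not.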

Category 5: Race Conditions in Background Workers

Two sync jobs fighting over the same data

Our sync worker had no protection against concurrent runs. If a user triggered a manual "sync now" while a scheduled job was already running for the same organization and provider, both jobs would proceed — both would write to the same rows, both would compute metrics from partially-updated state.

The fix: a Redis distributed lock acquired at job start, released on completion, with a 35-minute TTL as a dead-man switch for crashed processes.

const lockKey = `sync:lock:${organizationId}:${provider}`;
const acquired = await redis.set(lockKey, jobId, 'EX', 2_100, 'NX');
if (!acquired) {
  throw new Error(`Sync already in progress — BullMQ will retry`);
}
try {
  await this.route(type, organizationId, integrationId, options);
} finally {
  await redis.del(lockKey);
}
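
One subtlety worth flagging in this pattern: an unconditional del in the finally block can release a lock the job no longer owns, if the job outlives the TTL and another worker re-acquires the key. The model below is an illustrative in-memory stand-in for the Redis semantics — with real Redis the owner-checked delete must be atomic (typically a small Lua script), which this sketch does not attempt.

```typescript
// In-memory model of the SET NX EX + owner-checked DEL pattern. LockModel
// is illustrative only; it shows why the owner check matters, not how to
// do it atomically against a real Redis.
class LockModel {
  private store = new Map<string, string>();

  acquire(key: string, owner: string): boolean {
    if (this.store.has(key)) return false; // NX: only set if absent
    this.store.set(key, owner);
    return true;
  }

  expire(key: string): void {
    this.store.delete(key); // stands in for the EX TTL firing
  }

  release(key: string, owner: string): boolean {
    // Only the job that currently holds the lock may delete it.
    if (this.store.get(key) !== owner) return false;
    this.store.delete(key);
    return true;
  }
}
```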

The Count

Category                   Issues Found   Commits to Fix
N+1 query patterns         4              3
Silent exception catches   3              2
Unbounded queries          2              1
Missing DB indexes         4              1
Race conditions            1              1
Total                      13             9

What This Means

None of these are "AI wrote wrong code." The logic was correct. The data structures were correct. The test coverage was solid — our test suite has 3,200+ tests and they all passed before and after.

What AI doesn't naturally optimize for is what happens at the boundary: when a loop runs 8 times instead of once. When an exception handler returns silently at 2 AM. When two jobs start within seconds of each other on the same org. When your fastest query becomes your slowest query because your dataset grew by 10×.

This is a calibration problem, not a quality problem. AI assistants write for the present call. Production code needs to be written for the 10,000th call, the worst-case dataset, the concurrent request, the partially-degraded dependency.

The pattern to audit: Search for any loop that contains an await. Search for any findMany without a take limit on a table that grows with usage. Search for any catch block that returns a default value without logging. These three searches will surface the most impactful issues in most AI-assisted codebases.
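
The three searches above can be sketched as rough regexes over file contents. `auditSource` is hypothetical and deliberately crude — a real pass would work on an AST (an ESLint rule, for instance), and these greps will have both false positives (some awaits in loops are intentional) and misses (multi-line call arguments) — but it surfaces candidates quickly.

```typescript
// Heuristic sketch of the three audit greps. Not a parser: expect false
// positives and misses; treat every hit as a candidate for human review.
function auditSource(source: string): string[] {
  const findings: string[] = [];

  // 1. A loop whose body (up to the first closing brace) contains an await.
  if (/\b(?:for|while)\s*\([^)]*\)\s*\{[^}]*\bawait\b/.test(source)) {
    findings.push('await inside a loop');
  }

  // 2. A findMany call with no `take:` in the 200 characters that follow.
  let pos = source.indexOf('.findMany(');
  while (pos !== -1) {
    const window = source.slice(pos, pos + 200);
    if (!/\btake\s*:/.test(window)) {
      findings.push('findMany without a take limit');
      break;
    }
    pos = source.indexOf('.findMany(', pos + 1);
  }

  // 3. A catch block that immediately returns a default value, silently.
  if (/catch\s*(?:\([^)]*\))?\s*\{\s*return\s+(?:\[\]|null|\{\})/.test(source)) {
    findings.push('silent catch returning a default');
  }

  return findings;
}
```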

The Product Angle

We build Koalr to surface patterns in engineering data. Deployment frequency that drops quietly. Error rates that climb slowly. Lead time that stretches in a specific part of the pipeline.

Running this audit made us realize: the patterns we found aren't unique to AI codebases. They're universal. But they're more common in AI-assisted codebases because the generation is faster than the review. Code that would have taken 2 hours to write and 30 minutes to review now takes 5 minutes to generate and — if you're not careful — 30 seconds to merge.

The right response isn't to slow down. It's to make the review layer better. That's what we're building.

Running an AI-assisted codebase?

If you're running an AI-assisted codebase and want to understand your deployment risk profile, Koalr connects to your GitHub in 2 minutes. No agents, no integrations beyond your git provider.

Try Koalr free — no credit card →