I built an open-source code review runbook with 900+ checks, tiered by project complexity. It looked solid on paper. Then I used it on an actual advisory engagement — reviewing a security startup’s full-stack product, Python/FastAPI backend and Next.js/TypeScript frontend — and the real education began.
The checklist did its job. But most of what I learned was about the process of conducting code reviews, not about the checklist itself. Some of these lessons seem obvious in retrospect. None of them were obvious in the moment.
We reviewed the wrong branch
Our first mistake was assuming main was the right target. We ran the full review — all 15 categories, ~150 checks (the project scored Tier 2, Standard) — produced detailed reports, generated PDFs, and shared them with the CEO and head of engineering.
Then the engineering lead told us they wanted the develop branch reviewed. Their develop branch was hundreds of commits ahead of main on both repos. The team used a git-flow model where main only gets updated on releases. We’d reviewed a stale snapshot.
The fix is simple: ask which branch represents the current working state before starting. Don’t assume main is current. Check the commit dates. If the most recent commit on main is weeks old and there’s an active branch with hundreds of commits ahead, you’re probably looking at the wrong target.
This is now a pre-flight step in the runbook. It should have been from the start.
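The pre-flight check can be largely automated. A minimal sketch using only the Python standard library and git plumbing commands (the branch names and the "hundreds of commits" threshold are assumptions to adjust per engagement):

```python
import subprocess
from datetime import datetime

def last_commit_date(repo: str, branch: str) -> datetime:
    """Date of the most recent commit on `branch`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%cI", branch],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromisoformat(out)

def commits_ahead(repo: str, base: str, candidate: str) -> int:
    """How many commits `candidate` has that `base` does not."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--count", f"{base}..{candidate}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

# Example pre-flight (threshold is a judgment call, not a rule):
# if commits_ahead(".", "main", "develop") > 100:
#     print("main looks like a stale snapshot; confirm the review target")
```

This only flags the symptom; the actual fix is still a conversation with the engineering lead about which branch represents the current working state.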
The mistake turned into the most valuable part
We initially treated the wrong-branch review as a setback. We’d wasted time and had to re-run the review against develop. But when we actually did the second review, having the first review as a baseline transformed the output.
Instead of just listing findings, we could track each one: fixed, partially fixed, still present, worse, or new.
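The mechanical part of that tracking is a set comparison over finding IDs. A minimal sketch (the IDs and descriptions are hypothetical; "partially fixed" and "worse" still require human judgment, so this only automates the first pass that seeds the delta report):

```python
def delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Map each finding ID to a status, given {id: description} per review."""
    statuses = {}
    for fid in previous:
        # A finding carried over from the baseline either persists or is gone.
        statuses[fid] = "still present" if fid in current else "fixed"
    for fid in current:
        if fid not in previous:
            statuses[fid] = "new"
    return statuses

baseline = {"SEC-01": "JWT secret fallback", "SEC-02": "MFA uses random module"}
latest = {"SEC-01": "JWT secret fallback", "SEC-09": "API key in workflow comment"}
# delta(baseline, latest) ->
# {"SEC-01": "still present", "SEC-02": "fixed", "SEC-09": "new"}
```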
The backend team had fixed nearly half the previous findings and added serious security infrastructure — CSRF middleware, structured audit logging, a token management overhaul. The frontend team had not fully fixed any of theirs, had partially addressed a few, and their largest component file had actually grown since the first review.
A standalone review says “here are your problems.” A delta review says “here’s your trajectory.” The second framing is far more useful for engineering leadership because it shows direction, not just current state. “Still present” means the team hasn’t gotten to it. “Worse” means the codebase is actively moving in the wrong direction. Leadership needs that distinction because the response is different — deferred work can wait; accelerating problems can’t.
If I could only give one piece of advice about code reviews, it would be this: structure them as comparisons against a prior baseline whenever possible. If no prior review exists, the first one becomes your baseline. Date-stamp your reports in folders and you’ll thank yourself when the next review happens.
What the systematic approach catches
The most obvious value of a checklist is that it’s systematic. You check things you wouldn’t think to check manually. Some of what the runbook caught on this engagement:
random.choices() for MFA codes. The backend used Python’s random module — which isn’t cryptographically secure — to generate multi-factor authentication codes. The fix is secrets.choice(). A human reviewer focused on architecture or logic flow would likely not catch this. The runbook’s security section explicitly checks for it.
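The vulnerable pattern and its fix, side by side (an illustrative sketch, not the client's actual code):

```python
import random
import secrets
import string

DIGITS = string.digits

# Vulnerable: random is a Mersenne Twister PRNG. Its output is predictable
# to an attacker who can observe or reconstruct its internal state.
def mfa_code_insecure(length: int = 6) -> str:
    return "".join(random.choices(DIGITS, k=length))

# Fix: the secrets module draws from the OS CSPRNG.
def mfa_code(length: int = 6) -> str:
    return "".join(secrets.choice(DIGITS) for _ in range(length))
```

The two functions produce indistinguishable-looking codes, which is exactly why this bug survives manual review: nothing misbehaves until someone attacks it.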
AI coding tool configuration files committed to git. Both repos had .cursorrules files tracked in version control. These files, created by Cursor to help the AI understand the codebase, contained a complete infrastructure inventory: server IPs across three environments, EC2 instance IDs, security group IDs, CloudFront distribution IDs, SSH key paths, deployment directory structures. One file included a developer’s full local filesystem path.
This is a new category of security finding that didn’t exist two years ago. Developers create these context files to help their AI tools work better. Rich context means better AI output. But rich context also means sensitive context, and if it’s committed to git, it’s an attacker’s complete reconnaissance dossier. Traditional security checklists don’t cover this. Ours does now.
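Checking for this is cheap once you know to look. A sketch that flags AI tool configuration files tracked in git (the filename list is an illustrative sample, not exhaustive, and will go stale as tools evolve):

```python
import subprocess

# Filenames created by common AI coding tools -- a sample, not a complete list.
AI_CONFIG_NAMES = {".cursorrules", ".aider.conf.yml", "CLAUDE.md", ".windsurfrules"}

def tracked_ai_config_files(repo: str) -> list[str]:
    """Return git-tracked paths containing a known AI tool config name."""
    tracked = subprocess.run(
        ["git", "-C", repo, "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Match any path component, so nested config directories are caught too.
    return [p for p in tracked if AI_CONFIG_NAMES & set(p.split("/"))]
```

Flagged files aren't automatically findings; the follow-up is reading them for infrastructure details and secrets, then deciding whether they belong in `.gitignore`.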
Tests that verified imports, not behavior. Both repos had test suites that looked comprehensive at first glance — dozens of test files, structured directories. But reading the actual test code revealed that many frontend tests only verified modules were importable. API route tests mocked fetch globally, effectively testing the mock infrastructure rather than the route handler. Backend tests depended on pre-seeded database data and silently skipped if the data was missing.
“Tests exist” and “tests pass” are insufficient checks. You have to read a few tests and ask: would this catch a real regression? The runbook now includes a specific check: read 3-5 test files and verify they test actual behavior, not just infrastructure.
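The difference is easiest to see side by side. A contrived example (the `RateLimiter` is hypothetical, standing in for any piece of real behavior):

```python
# Hypothetical component under test.
class RateLimiter:
    def __init__(self, limit: int):
        self.limit, self.seen = limit, {}

    def allow(self, key: str) -> bool:
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.limit

# Shallow: passes even if allow() is completely broken.
def test_imports():
    assert RateLimiter is not None

# Behavioral: would catch a real regression in the limiting logic.
def test_blocks_after_limit():
    rl = RateLimiter(limit=2)
    assert rl.allow("ip1") and rl.allow("ip1")
    assert not rl.allow("ip1")       # third request is rejected
    assert rl.allow("ip2")           # independent keys are unaffected
```

Both tests pass today and both count toward "dozens of test files," but only the second one is doing any work.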
Cross-repo patterns tell a bigger story
When you review multiple repos from the same team, patterns emerge that individual reviews miss.
On this engagement, several findings appeared in both the backend and frontend repos: hardcoded JWT secret fallbacks, infrastructure IP addresses in source code, default database credentials in docker-compose.production.yaml, missing LICENSE files, deploy workflows that skip CI checks.
When the same issue appears across repos, it’s not a one-off mistake by one developer. It’s a team practice or a missing process. The remediation should target the process, not just the individual code fixes. We framed these as process recommendations in the combined executive report, which landed differently than isolated repo-level findings.
We also noticed something subtler. The backend team fixed several critical findings between reviews and added substantial security infrastructure. The frontend team’s critical issues remained, and some got worse. That asymmetry wasn’t a criticism — the team may have deliberately prioritized backend security. But making the asymmetry visible gave leadership data to make a deliberate resource allocation decision rather than drifting.
Security debt plays whack-a-mole
Here’s something that surprised me. The backend team fixed several critical and high findings from the first review. They did real work — CSRF middleware, audit logging, token management overhaul. But they also introduced new critical findings: API keys for an observability service committed in plaintext in workflow comments, and the .cursorrules file with the full infrastructure inventory.
The net number of critical findings barely changed.
The root cause: the team addressed known patterns (JWT secrets, MFA codes) but had no systematic process for preventing new secrets from entering the codebase. Every new feature is an opportunity for a developer to paste a credential into a config file, a comment, or a CI workflow.
Fixing existing secrets is necessary but insufficient. What matters is whether a prevention mechanism exists — pre-commit secret scanning like gitleaks or truffleHog, or GitHub’s built-in secret scanning. The runbook now checks for prevention, not just absence.
And a related finding I hadn’t anticipated: commented-out code containing real secrets. A deploy workflow had a commented-out section documenting a planned migration. The comments included actual production API keys. Developers treat comments as “not real code” and relax their caution about what goes there. But git tracks comments the same as executable code. The runbook now explicitly notes that comments, TODOs, and commented-out blocks are in scope for credential scanning.
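The comment-aware scanning point can be sketched in a few lines. The patterns below are illustrative only; a real scanner such as gitleaks or TruffleHog ships hundreds of tuned rules and should be used instead. The point is what the sketch does not do: it has no "skip comments" logic, because git tracks comments the same as executable code.

```python
import re

# Illustrative patterns only -- real scanners ship far more, with tuning.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]

def scan(text: str) -> list[tuple[int, str]]:
    """Flag (line number, line) for every suspicious line, comments included."""
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append((n, line.strip()))
    return hits

workflow = """\
deploy:
  # old migration step, kept for reference:
  #   API_KEY = "sk_live_0123456789abcdef0123"
  run: ./deploy.sh
"""
# scan(workflow) flags line 3 even though it is commented out.
```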
How you deliver findings matters
This sounds like a soft skill, but it affected how the review was received.
We deliberately structured the feedback: strengths first, then critical issues, then high-priority items, then structural improvements. This wasn’t diplomatic padding. Because we first demonstrated that we understood and appreciated the team’s good work — the multi-tenant architecture, the assessment pipeline design, the CI/CD discipline — the critical findings were received as constructive guidance rather than as an attack.
A review that only lists problems gets dismissed defensively. A review that first shows deep understanding of what the team built well, and then identifies specific gaps, gets treated as expert input. I’ve seen this pattern in every advisory engagement where I’ve had the option to choose the structure. It works.
Format matters too. We initially produced markdown files. The CEO immediately asked for “nicely formatted PDFs I can share via Slack.” Markdown reads as internal engineering artifact. A well-formatted PDF communicates professional advisory output. When the audience includes non-engineers, invest in the presentation. Same content, different reception.
What this changed about the runbook
These lessons aren’t theoretical. Each one fed back into the runbook as a specific improvement:
- Branch selection pre-flight check
- Delta review mode with five-point status tracking (fixed, partially fixed, still present, worse, new)
- AI coding tool configuration file check
- Test quality verification (read actual tests, not just check they exist)
- Comment-aware credential scanning
- Strengths-first output format for all tiers
- Pre-commit secret scanning prevention check
- Longitudinal folder structure for deliverables
The runbook is better because it was used in anger on a real codebase. I expect it will keep improving with every engagement. That’s the point of making it open source — every team that uses it and reports what’s missing makes it better for the next team.
This article is part of the code review series. See also: Code review that scales (introducing the runbook) and When a code review runbook becomes an upgrade playbook (using the runbook to fix, not just find).
Rajiv Pant is President of Flatiron Software and Snapshot AI, where he leads organizational growth and AI innovation. He is former Chief Product & Technology Officer at The Wall Street Journal, The New York Times, and Hearst Magazines. Earlier in his career, he headed technology for Conde Nast’s brands including Reddit. Rajiv coined the terms “synthesis engineering” and “synthesis coding” to describe the systematic integration of human expertise with AI capabilities in professional software development. Connect with him on LinkedIn or read more at rajiv.com.