Mastering Code Reviews | Article 4: The 3:00 AM Test — Reviewing for Production

I have a simple rule: When I open a Pull Request, I imagine I’m the one getting paged at 3:00 AM because this specific code just crashed production.

Junior devs review for Features (Does it work?). As a seasoned engineer, I review for Operations (Can we support it?). But let’s be honest: you don’t have time to do a deep dive into every single line. Being a Staff Engineer is about Risk Management.

I use these 7 Pillars of Production as a mental filter. I don’t check every pillar for every PR, but the higher the risk, the more pillars I apply. Below, I’ll show you not just the principles behind each pillar but also how you can practically apply them during a code review.

1. Observability: The “Traceability” Test

When the system fails, you only have the artifacts it left behind.

The Check:

  • Look at the logging strategy: Are logs meaningful and contextual? If there’s a failure, can you easily trace it back to the user or transaction?
  • Ensure that user identifiers (like user_id, order_id, or trace_id) are included in key log messages.
  • Check log levels: Make sure that INFO, WARN, and ERROR are used appropriately.
  • Look for existing monitoring: Does the PR integrate with tools like Prometheus, Grafana, or Sentry?

The Question:

“If this fails in a burst of 10,000 transactions, can I find the needle in the haystack without redeploying?”

Practical Steps:

  • Search the PR for logger, log, or trace to verify that logs are being used in key places.
  • Ensure that logs include contextual data (e.g., user_id, trace_id) for failure scenarios.
  • Look for structured logging formats (JSON, for example), which are easier to query in production.
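
For example, this is roughly the shape I hope to find when I search a PR for logging. A minimal sketch using SLF4J’s MDC; the class, field names, and messages are purely illustrative:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class PaymentService {

        private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

        public void charge(String userId, String orderId, long amountCents) {
            // Put identifiers into the MDC so every log line in this flow carries them.
            MDC.put("user_id", userId);
            MDC.put("order_id", orderId);
            try {
                log.info("Charging order, amount_cents={}", amountCents);
                // ... call the payment gateway ...
            } catch (RuntimeException e) {
                // ERROR level, with context and the stack trace -- not a silent swallow.
                log.error("Charge failed, order_id={} user_id={}", orderId, userId, e);
                throw e;
            } finally {
                MDC.clear();
            }
        }
    }

Pair this with a JSON log encoder and “find the needle in 10,000 transactions” becomes a single query on trace_id or order_id.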

2. Resilience: The “Blast Radius” Test

In a distributed Java system, failure is guaranteed.

The Check:

  • Assume failure: Check for failure-handling logic. Does the system handle timeouts, retries, or fallbacks properly?
  • Look for isolation—does a failure in one feature take down the entire system?
  • Examine try-catch blocks: Are exceptions handled and logged, or are they just swallowed?

The Question:

“What is the ‘Plan B’? Does the code fail fast, or does it hang until the whole system stalls?”

Practical Steps:

  • If the code interacts with external systems, check for timeout settings and circuit breakers.
  • Look for resilience patterns like retries, circuit breakers (e.g., Spring’s @Retryable and @CircuitBreaker annotations), or graceful degradation mechanisms.
  • Look for timeouts in API calls, database queries, and external services.
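
When the PR calls out to another service, this is the kind of wrapper I want to see somewhere in the call path. A minimal sketch using Resilience4j’s Spring annotations; the client, method names, and policy names are hypothetical, and the actual retry/breaker settings live in configuration:

    import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
    import io.github.resilience4j.retry.annotation.Retry;
    import org.springframework.stereotype.Service;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    @Service
    public class InventoryClient {

        private final Map<String, Integer> lastKnownStock = new ConcurrentHashMap<>();

        // Trip the breaker and fall back instead of letting every caller hang.
        @CircuitBreaker(name = "inventory", fallbackMethod = "cachedStock")
        @Retry(name = "inventory")
        public int stockFor(String sku) {
            int stock = remoteLookup(sku);
            lastKnownStock.put(sku, stock);
            return stock;
        }

        // Same signature plus a Throwable: a degraded answer, not a cascading failure.
        private int cachedStock(String sku, Throwable cause) {
            return lastKnownStock.getOrDefault(sku, 0);
        }

        private int remoteLookup(String sku) {
            // Placeholder for the real HTTP call; the client itself should have
            // explicit connect and read timeouts configured.
            return 0;
        }
    }

The fallback is the “Plan B” from the question above: the feature degrades, but the blast radius stays small.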

3. Reliability: The “Negative” Test Case

Code coverage is a vanity metric. 100% coverage on the “Happy Path” is useless in a crisis.

The Check:

  • Test scenarios should cover edge cases, like:
    • High latency (simulated timeouts or delays)
    • Invalid inputs (e.g., empty strings, null values)
    • Expired or incorrect tokens in authentication code

The Question:

“Do these tests prove the code is safe when the environment is hostile, or only when everything is perfect?”

Practical Steps:

  • Check the unit tests: Do they cover the failure paths? Look for tests that simulate timeouts, 500 errors, or malformed data.
  • If the code interacts with a service, ensure that tests simulate service failures, like delayed responses or unreachable endpoints.
  • If you don’t see tests for failure scenarios, flag it and encourage adding them.
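
If the failure-path tests are missing, I often sketch one in the review comment. A minimal example with JUnit 5 and Mockito; all of the types here are hypothetical and defined inline just to keep the sketch self-contained:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import java.io.IOException;
    import java.net.SocketTimeoutException;
    import org.junit.jupiter.api.Test;

    class NegativePathTest {

        // Hypothetical collaborators, kept tiny so the test compiles on its own.
        interface InventoryGateway {
            int stockFor(String sku) throws IOException;
        }

        static class InventoryService {
            private final InventoryGateway gateway;
            InventoryService(InventoryGateway gateway) { this.gateway = gateway; }

            int availableStock(String sku) {
                try {
                    return gateway.stockFor(sku);
                } catch (IOException e) {
                    return 0; // degrade to "none available" rather than blowing up the caller
                }
            }
        }

        @Test
        void fallsBackWhenTheDependencyTimesOut() throws Exception {
            InventoryGateway gateway = mock(InventoryGateway.class);
            when(gateway.stockFor("SKU-42")).thenThrow(new SocketTimeoutException("read timed out"));

            // The hostile path: the dependency is down, and we still return a sane answer.
            assertEquals(0, new InventoryService(gateway).availableStock("SKU-42"));
        }
    }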

4. Resource Economy: The “Silent Killer” Test

In Java, we are always at the mercy of the Garbage Collector and the Connection Pool.

The Check:

  • Look for resource leaks: Check if resources (e.g., files, database connections) are properly closed.
  • Look for unbounded collections: Are lists or maps growing indefinitely without clearing?
  • Watch for N+1 queries in Hibernate or JPA, which can quietly degrade performance.

The Question:

“Are we slowly ‘bleeding’ resources that will force a reboot in three days?”

Practical Steps:

  • Review code that opens resources (e.g., file streams, network connections) and ensure it uses try-with-resources.
  • Look for large collections (lists, maps) that aren’t capped, or check if there’s a strategy for pagination in queries.
  • Check for N+1 queries by reviewing ORM (Hibernate/JPA) fetch strategies (e.g., using JOIN FETCH instead of lazy loading).
  • Use tools like YourKit or VisualVM to spot memory leaks and track resource usage.
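
Two of the quickest things to verify, sketched out below. The report-reading class is hypothetical, and the JPQL in the comment assumes an Order entity with a lazy items collection:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ReportReader {

        // try-with-resources: the reader is closed even if a later line throws,
        // so we don't slowly bleed file handles under load.
        public List<String> nonEmptyLines(Path file) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                return reader.lines()
                             .filter(line -> !line.isBlank())
                             .collect(Collectors.toList());
            }
        }

        // N+1 check (JPQL): fetch the association in one round trip instead of
        // one query per parent row, e.g.
        //   select o from Order o join fetch o.items where o.status = :status
    }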

5. Security: The “Least Privilege” Test

In 2026, the most expensive engineering mistakes are security vulnerabilities.

The Check:

  • Look for input validation: Are user inputs being sanitized and validated properly to avoid SQL injection, XSS, or other vulnerabilities?
  • Check if sensitive data (like passwords or credit card info) is properly hashed or encrypted.
  • Ensure that no sensitive data is being logged or exposed in error messages.

The Question:

“Could an attacker exploit this method to compromise our data or cause system failures?”

Practical Steps:

  • Ensure user inputs are sanitized with parameterized queries or ORM validation.
  • Review password handling: Check that passwords are hashed with a secure algorithm (e.g., bcrypt).
  • Look through the code for any accidental logging of sensitive information, like stack traces or SQL queries with user data.
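
For the injection and password points, this is the pattern I’m scanning for. A minimal sketch with plain JDBC and Spring Security’s BCryptPasswordEncoder; the table and class names are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
    import org.springframework.security.crypto.password.PasswordEncoder;

    public class UserLookup {

        private final PasswordEncoder encoder = new BCryptPasswordEncoder();

        // Parameterized query: the email is bound as a value, never concatenated
        // into the SQL string, so it cannot alter the query's structure.
        public boolean emailExists(Connection conn, String email) throws SQLException {
            String sql = "SELECT 1 FROM users WHERE email = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, email);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }

        // Store only the bcrypt hash, and never log the raw password or the hash.
        public String hashForStorage(String rawPassword) {
            return encoder.encode(rawPassword);
        }
    }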

6. Change Safety: The “Rollback” Test

A feature isn’t “done” until it is running successfully in production and can be removed safely if it has to be.

The Check:

  • Backward compatibility: If we revert this change five minutes after deployment, can the old code still read the new data in the database?
  • Look for proper versioning of APIs, and schema migrations that can be rolled back safely.

The Question:

“Is this a ‘One-Way Door’? If we roll back, will we corrupt the system?”

Practical Steps:

  • If the PR modifies database schemas, ensure that migrations are properly versioned and backward-compatible.
  • Check for rollback mechanisms—is there a way to undo a schema change or feature flag if something goes wrong in production?
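
On the rollback question specifically, one pattern worth looking for is the new path shipped behind a flag while the old path stays intact. A minimal sketch; the FeatureFlags interface stands in for whatever flag system or config mechanism the team actually uses:

    public class CheckoutService {

        public interface PricingEngine {
            long priceCents(String cartId);
        }

        public interface FeatureFlags {
            boolean isEnabled(String name);
        }

        private final PricingEngine legacyEngine;
        private final PricingEngine newEngine;
        private final FeatureFlags flags;

        public CheckoutService(PricingEngine legacyEngine, PricingEngine newEngine, FeatureFlags flags) {
            this.legacyEngine = legacyEngine;
            this.newEngine = newEngine;
            this.flags = flags;
        }

        // The legacy path stays in place until the new one has proven itself in
        // production. "Rolling back" is then a flag flip, not a redeploy or a
        // destructive schema revert.
        public long priceCents(String cartId) {
            return flags.isEnabled("new-pricing-engine")
                    ? newEngine.priceCents(cartId)
                    : legacyEngine.priceCents(cartId);
        }
    }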

7. Cognitive Load: The “Maintenance” Test

If the code is too “clever” for the team to fix while you’re on vacation, it’s a failure.

The Check:

  • Is the code over-engineered? Does it introduce complex abstractions, unnecessary patterns, or magic reflection that would confuse future developers?
  • Simple, clear code is preferred. Are the naming conventions and code structure intuitive?

The Question:

“Can a mid-level dev understand this in five minutes? Is this code a gift or a curse to the next person who touches it?”

Practical Steps:

  • Review the code for over-engineering: Is there a simpler way to implement this functionality? If so, suggest it.
  • Look for magic or overly complex patterns that will make the code hard for anyone else to maintain.
  • Ensure the code is well-documented and has clear comments where needed.
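
A contrived but recognizable illustration of the difference:

    import java.util.Arrays;

    public class CleverVsBoring {

        static class Profile {
            String name;
            String email;
        }

        // "Clever": reflection copies whatever fields happen to exist. Nobody can
        // tell at a glance what gets copied, and failures vanish silently.
        static void copyWithReflection(Profile source, Profile target) {
            Arrays.stream(Profile.class.getDeclaredFields()).forEach(field -> {
                try {
                    field.setAccessible(true);
                    field.set(target, field.get(source));
                } catch (IllegalAccessException ignored) {
                    // swallowed -- exactly the kind of surprise you find at 3:00 AM
                }
            });
        }

        // Boring: two explicit assignments. Obvious, greppable, easy to change.
        static void copyExplicitly(Profile source, Profile target) {
            target.name = source.name;
            target.email = source.email;
        }
    }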

Why This Saves Your Life

Reviewing for production isn’t about being “extra.” It’s about self-preservation.

  • Better logs mean shorter outages and less time playing detective.
  • Better resilience means fewer pages at night because the system healed itself.
  • Better safety means you don’t have to stay up late doing stressful, manual data cleanups after a failed deployment.

The Bottom Line

When you review, don’t just ask “Does it work?” Ask:
“If this breaks, how will I know, and how fast can I fix it?”

If you can’t answer those questions in five seconds, the code isn’t ready for production.

In the next article, we’ll talk about the “Hardest” part of being a senior: Picking your Battles. I’ll show you how I decide when to let “ugly” code go and when to stop the line.
