Mastering Code Reviews | Article 4: The 3:00 AM Test — Reviewing for Production

I have a simple rule: When I open a Pull Request, I imagine I’m the one getting paged at 3:00 AM because this specific code just crashed production.

Junior devs review for Features (Does it work?). As a seasoned engineer, I review for Operations (Can we support it?). But let’s be honest: you don’t have time to do a deep dive into every single line. Being a Staff Engineer is about Risk Management.

I use these 7 Pillars of Production as a mental filter. I don’t check every pillar for every PR, but the higher the risk, the more pillars I apply. Below, I’ll show you not just the principles behind each pillar but also how you can practically apply them during a code review.

1. Observability: The “Traceability” Test

When the system fails, you only have the artifacts it left behind.

The Check:

  • Look at the logging strategy: Are logs meaningful and contextual? If there’s a failure, can you easily trace it back to the user or transaction?
  • Ensure that user identifiers (like user_id, order_id, or trace_id) are included in key log messages.
  • Check log levels: Make sure that INFO, WARN, and ERROR are used appropriately.
  • Look for existing monitoring: Does the PR integrate with tools like Prometheus, Grafana, or Sentry?

The Question:

“If this fails in a burst of 10,000 transactions, can I find the needle in the haystack without redeploying?”

Practical Steps:

  • Search the PR for logger, log, or trace to verify that logs are being used in key places.
  • Ensure that logs include contextual data (e.g., user_id, trace_id) for failure scenarios.
  • Look for structured logging formats (JSON, for example), which are easier to query in production.
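
For example, this is roughly the shape I hope to find when I search a PR for logging. A minimal sketch using SLF4J’s MDC; the class, field names, and messages are purely illustrative:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class PaymentService {

        private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

        public void charge(String userId, String orderId, long amountCents) {
            // Put identifiers into the MDC so every log line in this flow carries them.
            MDC.put("user_id", userId);
            MDC.put("order_id", orderId);
            try {
                log.info("Charging order, amount_cents={}", amountCents);
                // ... call the payment gateway ...
            } catch (RuntimeException e) {
                // ERROR level, with context and the stack trace -- not a silent swallow.
                log.error("Charge failed, order_id={} user_id={}", orderId, userId, e);
                throw e;
            } finally {
                MDC.clear();
            }
        }
    }

Pair this with a JSON log encoder and “find the needle in 10,000 transactions” becomes a single query on trace_id or order_id.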

2. Resilience: The “Blast Radius” Test

In a distributed Java system, failure is guaranteed.

The Check:

  • Assume failure: Check for failure-handling logic. Does the system handle timeouts, retries, or fallbacks properly?
  • Look for isolation—does a failure in one feature take down the entire system?
  • Examine try-catch blocks: Are exceptions handled and logged, or are they just swallowed?

The Question:

“What is the ‘Plan B’? Does the code fail fast, or does it hang until the whole system stalls?”

Practical Steps:

  • If the code interacts with external systems, check for timeout settings and circuit breakers.
  • Look for resilience patterns like retries, circuit breakers (e.g., Spring’s @Retryable and @CircuitBreaker annotations), or graceful degradation mechanisms.
  • Look for timeouts in API calls, database queries, and external services.
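
When the PR calls out to another service, this is the kind of wrapper I want to see somewhere in the call path. A minimal sketch using Resilience4j’s Spring annotations; the client, method names, and policy names are hypothetical, and the actual retry/breaker settings live in configuration:

    import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
    import io.github.resilience4j.retry.annotation.Retry;
    import org.springframework.stereotype.Service;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    @Service
    public class InventoryClient {

        private final Map<String, Integer> lastKnownStock = new ConcurrentHashMap<>();

        // Trip the breaker and fall back instead of letting every caller hang.
        @CircuitBreaker(name = "inventory", fallbackMethod = "cachedStock")
        @Retry(name = "inventory")
        public int stockFor(String sku) {
            int stock = remoteLookup(sku);
            lastKnownStock.put(sku, stock);
            return stock;
        }

        // Same signature plus a Throwable: a degraded answer, not a cascading failure.
        private int cachedStock(String sku, Throwable cause) {
            return lastKnownStock.getOrDefault(sku, 0);
        }

        private int remoteLookup(String sku) {
            // Placeholder for the real HTTP call; the client itself should have
            // explicit connect and read timeouts configured.
            return 0;
        }
    }

The fallback is the “Plan B” from the question above: the feature degrades, but the blast radius stays small.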

3. Reliability: The “Negative” Test Case

Code coverage is a vanity metric. 100% coverage on the “Happy Path” is useless in a crisis.

The Check:

  • Test scenarios should cover edge cases, like:
    • High latency (simulated timeouts or delays)
    • Invalid inputs (e.g., empty strings, null values)
    • Expired or incorrect tokens in authentication code

The Question:

“Do these tests prove the code is safe when the environment is hostile, or only when everything is perfect?”

Practical Steps:

  • Check the unit tests: Do they cover the failure paths? Look for tests that simulate timeouts, 500 errors, or malformed data.
  • If the code interacts with a service, ensure that tests simulate service failures, like delayed responses or unreachable endpoints.
  • If you don’t see tests for failure scenarios, flag it and encourage adding them.
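
If the failure-path tests are missing, I often sketch one in the review comment. A minimal example with JUnit 5 and Mockito; all of the types here are hypothetical and defined inline just to keep the sketch self-contained:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import java.io.IOException;
    import java.net.SocketTimeoutException;
    import org.junit.jupiter.api.Test;

    class NegativePathTest {

        // Hypothetical collaborators, kept tiny so the test compiles on its own.
        interface InventoryGateway {
            int stockFor(String sku) throws IOException;
        }

        static class InventoryService {
            private final InventoryGateway gateway;
            InventoryService(InventoryGateway gateway) { this.gateway = gateway; }

            int availableStock(String sku) {
                try {
                    return gateway.stockFor(sku);
                } catch (IOException e) {
                    return 0; // degrade to "none available" rather than blowing up the caller
                }
            }
        }

        @Test
        void fallsBackWhenTheDependencyTimesOut() throws Exception {
            InventoryGateway gateway = mock(InventoryGateway.class);
            when(gateway.stockFor("SKU-42")).thenThrow(new SocketTimeoutException("read timed out"));

            // The hostile path: the dependency is down, and we still return a sane answer.
            assertEquals(0, new InventoryService(gateway).availableStock("SKU-42"));
        }
    }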

4. Resource Economy: The “Silent Killer” Test

In Java, we are always at the mercy of the Garbage Collector and the Connection Pool.

The Check:

  • Look for resource leaks: Check if resources (e.g., files, database connections) are properly closed.
  • Look for unbounded collections: Are lists or maps growing indefinitely without clearing?
  • Watch for N+1 queries in Hibernate or JPA, which can quietly degrade performance.

The Question:

“Are we slowly ‘bleeding’ resources that will force a reboot in three days?”

Practical Steps:

  • Review code that opens resources (e.g., file streams, network connections) and ensure it uses try-with-resources.
  • Look for large collections (lists, maps) that aren’t capped, or check if there’s a strategy for pagination in queries.
  • Check for N+1 queries by reviewing ORM (Hibernate/JPA) fetch strategies (e.g., using JOIN FETCH instead of lazy loading).
  • Use tools like YourKit or VisualVM to spot memory leaks and track resource usage.
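
Two of the quickest things to verify, sketched out below. The report-reading class is hypothetical, and the JPQL in the comment assumes an Order entity with a lazy items collection:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ReportReader {

        // try-with-resources: the reader is closed even if a later line throws,
        // so we don't slowly bleed file handles under load.
        public List<String> nonEmptyLines(Path file) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                return reader.lines()
                             .filter(line -> !line.isBlank())
                             .collect(Collectors.toList());
            }
        }

        // N+1 check (JPQL): fetch the association in one round trip instead of
        // one query per parent row, e.g.
        //   select o from Order o join fetch o.items where o.status = :status
    }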

5. Security: The “Least Privilege” Test

In 2026, the most expensive engineering mistakes are security vulnerabilities.

The Check:

  • Look for input validation: Are user inputs being sanitized and validated properly to avoid SQL injection, XSS, or other vulnerabilities?
  • Check if sensitive data (like passwords or credit card info) is properly hashed or encrypted.
  • Ensure that no sensitive data is being logged or exposed in error messages.

The Question:

“Could an attacker exploit this method to compromise our data or cause system failures?”

Practical Steps:

  • Ensure user inputs are sanitized with parameterized queries or ORM validation.
  • Review password handling: Check that passwords are hashed with a secure algorithm (e.g., bcrypt).
  • Look through the code for any accidental logging of sensitive information, like stack traces or SQL queries with user data.
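
For the injection and password points, this is the pattern I’m scanning for. A minimal sketch with plain JDBC and Spring Security’s BCryptPasswordEncoder; the table and class names are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
    import org.springframework.security.crypto.password.PasswordEncoder;

    public class UserLookup {

        private final PasswordEncoder encoder = new BCryptPasswordEncoder();

        // Parameterized query: the email is bound as a value, never concatenated
        // into the SQL string, so it cannot alter the query's structure.
        public boolean emailExists(Connection conn, String email) throws SQLException {
            String sql = "SELECT 1 FROM users WHERE email = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, email);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }

        // Store only the bcrypt hash, and never log the raw password or the hash.
        public String hashForStorage(String rawPassword) {
            return encoder.encode(rawPassword);
        }
    }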

6. Change Safety: The “Rollback” Test

A feature isn’t “done” until it is running successfully in production and can be removed safely if it has to be.

The Check:

  • Backward compatibility: If we revert this change five minutes after deployment, can the old code still read the new data in the database?
  • Look for proper versioning of APIs, and schema migrations that can be rolled back safely.

The Question:

“Is this a ‘One-Way Door’? If we roll back, will we corrupt the system?”

Practical Steps:

  • If the PR modifies database schemas, ensure that migrations are properly versioned and backward-compatible.
  • Check for rollback mechanisms—is there a way to undo a schema change or feature flag if something goes wrong in production?
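
On the rollback question specifically, one pattern worth looking for is the new path shipped behind a flag while the old path stays intact. A minimal sketch; the FeatureFlags interface stands in for whatever flag system or config mechanism the team actually uses:

    public class CheckoutService {

        public interface PricingEngine {
            long priceCents(String cartId);
        }

        public interface FeatureFlags {
            boolean isEnabled(String name);
        }

        private final PricingEngine legacyEngine;
        private final PricingEngine newEngine;
        private final FeatureFlags flags;

        public CheckoutService(PricingEngine legacyEngine, PricingEngine newEngine, FeatureFlags flags) {
            this.legacyEngine = legacyEngine;
            this.newEngine = newEngine;
            this.flags = flags;
        }

        // The legacy path stays in place until the new one has proven itself in
        // production. "Rolling back" is then a flag flip, not a redeploy or a
        // destructive schema revert.
        public long priceCents(String cartId) {
            return flags.isEnabled("new-pricing-engine")
                    ? newEngine.priceCents(cartId)
                    : legacyEngine.priceCents(cartId);
        }
    }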

7. Cognitive Load: The “Maintenance” Test

If the code is too “clever” for the team to fix while you’re on vacation, it’s a failure.

The Check:

  • Is the code over-engineered? Does it introduce complex abstractions, unnecessary patterns, or magic reflection that would confuse future developers?
  • Simple, clear code is preferred. Are the naming conventions and code structure intuitive?

The Question:

“Can a mid-level dev understand this in five minutes? Is this code a gift or a curse to the next person who touches it?”

Practical Steps:

  • Review the code for over-engineering: Is there a simpler way to implement this functionality? If so, suggest it.
  • Look for magic or overly complex patterns that will make the code hard for anyone else to maintain.
  • Ensure the code is well-documented and has clear comments where needed.
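
A contrived but recognizable illustration of the difference:

    import java.util.Arrays;

    public class CleverVsBoring {

        static class Profile {
            String name;
            String email;
        }

        // "Clever": reflection copies whatever fields happen to exist. Nobody can
        // tell at a glance what gets copied, and failures vanish silently.
        static void copyWithReflection(Profile source, Profile target) {
            Arrays.stream(Profile.class.getDeclaredFields()).forEach(field -> {
                try {
                    field.setAccessible(true);
                    field.set(target, field.get(source));
                } catch (IllegalAccessException ignored) {
                    // swallowed -- exactly the kind of surprise you find at 3:00 AM
                }
            });
        }

        // Boring: two explicit assignments. Obvious, greppable, easy to change.
        static void copyExplicitly(Profile source, Profile target) {
            target.name = source.name;
            target.email = source.email;
        }
    }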

Why This Saves Your Life

Reviewing for production isn’t about being “extra.” It’s about self-preservation.

  • Better logs mean shorter outages and less time playing detective.
  • Better resilience means fewer pages at night because the system healed itself.
  • Better safety means you don’t have to stay up late doing stressful, manual data cleanups after a failed deployment.

The Bottom Line

When you review, don’t just ask “Does it work?” Ask:
“If this breaks, how will I know, and how fast can I fix it?”

If you can’t answer those questions in five seconds, the code isn’t ready for production.

In the next article, we’ll talk about the “Hardest” part of being a senior: Picking your Battles. I’ll show you how I decide when to let “ugly” code go and when to stop the line.
