Mastering Code Reviews | Article 4: The 3:00 AM Test — Reviewing for Production
I have a simple rule: When I open a Pull Request, I imagine I’m the one getting paged at 3:00 AM because this specific code just crashed production.
Junior devs review for Features (Does it work?). As a seasoned engineer, I review for Operations (Can we support it?). But let’s be honest: you don’t have time to do a deep dive into every single line. Being a Staff Engineer is about Risk Management.
I use these 7 Pillars of Production as a mental filter. I don’t check every pillar for every PR, but the higher the risk, the more pillars I apply. Below, I’ll show you not just the principles behind each pillar but also how you can practically apply them during a code review.
1. Observability: The “Traceability” Test
When the system fails, you only have the artifacts it left behind.
The Check:
- Look at the logging strategy: Are logs meaningful and contextual? If there’s a failure, can you easily trace it back to the user or transaction?
- Ensure that user identifiers (like user_id, order_id, or trace_id) are included in key log messages.
- Check log levels: Make sure that INFO, WARN, and ERROR are used appropriately.
- Look for existing monitoring: Does the PR integrate with tools like Prometheus, Grafana, or Sentry?
The Question:
“If this fails in a burst of 10,000 transactions, can I find the needle in the haystack without redeploying?”
Practical Steps:
- Search the PR for logger, log, or trace to verify that logs are being used in key places.
- Ensure that logs include contextual data (e.g., user_id, trace_id) for failure scenarios.
- Look for structured logging formats (JSON, for example), which are easier to query in production.
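To make this concrete, here is a minimal sketch of the logging shape I look for, using SLF4J's MDC to attach identifiers to every line in a request. The PaymentService class and the field names are placeholders, and a JSON log encoder (e.g., logstash-logback-encoder) is assumed to be configured separately:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(String userId, String orderId) {
        // Attach identifiers to the logging context so every log line in this
        // request carries user_id and order_id without repeating them by hand.
        MDC.put("user_id", userId);
        MDC.put("order_id", orderId);
        try {
            log.info("Payment processing started");
            // ... business logic ...
        } catch (RuntimeException e) {
            // ERROR, not WARN: this is the line you grep for at 3:00 AM.
            log.error("Payment processing failed", e);
            throw e;
        } finally {
            MDC.clear(); // don't leak context into the next request on this thread
        }
    }
}
```

With a structured encoder, user_id and order_id become queryable fields, which is exactly what lets you find the needle in a burst of 10,000 transactions.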
2. Resilience: The “Blast Radius” Test
In a distributed Java system, failure is guaranteed.
The Check:
- Assume failure: Check for failure-handling logic. Does the system handle timeouts, retries, or fallbacks properly?
- Look for isolation—does a failure in one feature take down the entire system?
- Examine try-catch blocks: Are exceptions handled and logged, or are they just swallowed?
The Question:
“What is the ‘Plan B’? Does the code fail fast, or does it hang until the whole system stalls?”
Practical Steps:
- If the code interacts with external systems, check for timeout settings and circuit breakers.
- Look for resilience patterns like retries, circuit breakers (e.g., Spring Retry’s @Retryable or Resilience4j’s @CircuitBreaker annotations), or graceful degradation mechanisms.
- Look for timeouts in API calls, database queries, and external services.
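Here is a minimal sketch of what a “Plan B” can look like on a Spring stack with Resilience4j. The InventoryClient class, the service URL, and the "inventory" breaker name are placeholders, and the breaker thresholds are assumed to live in application.yml:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class InventoryClient {

    private final RestTemplate restTemplate;

    public InventoryClient() {
        // Explicit timeouts: a slow dependency must not hang our thread pool.
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(2_000);
        factory.setReadTimeout(3_000);
        this.restTemplate = new RestTemplate(factory);
    }

    // When the inventory service keeps failing, the breaker opens and calls go
    // straight to the fallback instead of piling up waiting threads.
    @CircuitBreaker(name = "inventory", fallbackMethod = "cachedStock")
    public int stockFor(String sku) {
        return restTemplate.getForObject("http://inventory/stock/{sku}", Integer.class, sku);
    }

    // "Plan B": degrade gracefully with a conservative default instead of failing the page.
    public int cachedStock(String sku, Throwable cause) {
        return 0;
    }
}
```

The exact tool matters less than the shape: a bounded wait, an explicit fallback, and a failure that stays inside this feature instead of spreading.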
3. Reliability: The “Negative” Test Case
Code coverage is a vanity metric. 100% coverage on the “Happy Path” is useless in a crisis.
The Check:
- Test scenarios should cover edge cases, like:
- High latency (simulated timeouts or delays)
- Invalid inputs (e.g., empty strings, null values)
- Expired or incorrect tokens in authentication code
The Question:
“Do these tests prove the code is safe when the environment is hostile, or only when everything is perfect?”
Practical Steps:
- Check the unit tests: Do they cover the failure paths? Look for tests that simulate timeouts, 500 errors, or malformed data.
- If the code interacts with a service, ensure that tests simulate service failures, like delayed responses or unreachable endpoints.
- If you don’t see tests for failure scenarios, flag it and encourage adding them.
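As an example, a negative test in JUnit 5 and Mockito might look like the sketch below. It reuses the InventoryClient idea from the resilience section; CheckoutService and CheckoutUnavailableException are hypothetical, and the point is simply that the test exercises a timeout rather than the happy path:

```java
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.net.SocketTimeoutException;
import org.junit.jupiter.api.Test;

class CheckoutServiceTest {

    @Test
    void surfacesControlledErrorWhenInventoryTimesOut() {
        // Hostile environment: the dependency times out instead of answering.
        InventoryClient inventory = mock(InventoryClient.class);
        when(inventory.stockFor(anyString()))
                .thenThrow(new RuntimeException(new SocketTimeoutException("read timed out")));

        CheckoutService checkout = new CheckoutService(inventory);

        // The failure must end in a controlled, user-safe outcome,
        // not a raw stack trace or a half-written order.
        assertThrows(CheckoutUnavailableException.class,
                () -> checkout.placeOrder("sku-42", 1));
    }
}
```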
4. Resource Economy: The “Silent Killer” Test
In Java, we are always at the mercy of the Garbage Collector and the Connection Pool.
The Check:
- Look for resource leaks: Check if resources (e.g., files, database connections) are properly closed.
- Look for unbounded collections: Are lists or maps growing indefinitely without clearing?
- Examine N+1 queries in Hibernate or JPA, which can degrade performance.
The Question:
“Are we slowly ‘bleeding’ resources that will force a reboot in three days?”
Practical Steps:
- Review code that opens resources (e.g., file streams, network connections) and ensure it uses try-with-resources.
- Look for large collections (lists, maps) that aren’t capped, or check if there’s a strategy for pagination in queries.
- Check for N+1 queries by reviewing ORM (Hibernate/JPA) fetch strategies (e.g., using JOIN FETCH instead of lazy loading).
- Use tools like YourKit or VisualVM to spot memory leaks and resource usage.
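Here is a small sketch of code that passes this check: the query is bounded, and everything that touches the connection pool is closed by try-with-resources. OrderExporter, the orders table, and the column names are placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

public class OrderExporter {

    private final DataSource dataSource;

    public OrderExporter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // try-with-resources returns the connection, statement, and result set to the
    // pool even when an exception is thrown halfway through.
    public List<String> recentOrderIds(int limit) throws Exception {
        List<String> ids = new ArrayList<>();
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id FROM orders ORDER BY created_at DESC LIMIT ?")) {
            ps.setInt(1, limit); // bounded result set: never "load the whole table into a List"
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("id"));
                }
            }
        }
        return ids;
    }
}
```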
5. Security: The “Least Privilege” Test
In 2026, the most expensive engineering mistakes are security vulnerabilities.
The Check:
- Look for input validation: Are user inputs being sanitized and validated properly to avoid SQL injection, XSS, or other vulnerabilities?
- Check if sensitive data (like passwords or credit card info) is properly hashed or encrypted.
- Ensure that no sensitive data is being logged or exposed in error messages.
The Question:
“Could an attacker exploit this method to compromise our data or cause system failures?”
Practical Steps:
- Ensure user inputs are sanitized with parameterized queries or ORM validation.
- Review password handling: Check that passwords are hashed with a secure algorithm (e.g., bcrypt).
- Look through the code for any accidental logging of sensitive information, like stack traces or SQL queries with user data.
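As an illustration, here is a minimal sketch of an authentication check that covers all three points: parameterized SQL, a bcrypt comparison via spring-security-crypto, and nothing sensitive leaking into logs or error messages. The users table and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;

public class UserAuthDao {

    private final BCryptPasswordEncoder encoder = new BCryptPasswordEncoder();

    // The email is bound as a parameter, never concatenated into the SQL string.
    public boolean checkCredentials(Connection conn, String email, String rawPassword) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT password_hash FROM users WHERE email = ?")) {
            ps.setString(1, email);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return false; // same outcome for "no user" and "bad password": don't leak which one
                }
                // Only the bcrypt hash is stored and compared; the raw password never hits a log.
                return encoder.matches(rawPassword, rs.getString("password_hash"));
            }
        }
    }
}
```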
6. Change Safety: The “Rollback” Test
A feature isn’t “done” when it is running in production; it is only “done” when you can also back it out safely.
The Check:
- Backward compatibility: If we revert this change five minutes after deployment, can the old code still read the new data in the database?
- Look for proper versioning of APIs, and schema migrations that can be rolled back safely.
The Question:
“Is this a ‘One-Way Door’? If we roll back, will we corrupt the system?”
Practical Steps:
- If the PR modifies database schemas, ensure that migrations are properly versioned and backward-compatible.
- Check for rollback mechanisms—is there a way to undo a schema change or feature flag if something goes wrong in production?
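To make the idea concrete, here is a toy sketch of the expand/contract pattern. The in-memory map stands in for the orders table and the boolean stands in for a real feature-flag service; the point is that the new code keeps writing the legacy column, so rolling back five minutes after deployment still leaves data the old code can read:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AddressWriter {

    // Hypothetical in-memory stand-in for the orders table.
    private final Map<Long, Map<String, Object>> ordersTable = new ConcurrentHashMap<>();

    // A real system would read this from configuration or a feature-flag service.
    private final boolean structuredAddressEnabled;

    public AddressWriter(boolean structuredAddressEnabled) {
        this.structuredAddressEnabled = structuredAddressEnabled;
    }

    public void saveShippingAddress(long orderId, String street, String city, String zip) {
        Map<String, Object> row = ordersTable.computeIfAbsent(orderId, id -> new ConcurrentHashMap<>());
        // Keep writing the legacy column during the migration window.
        row.put("shipping_address", street + ", " + city + " " + zip);
        if (structuredAddressEnabled) {
            // The new columns are additive and ignored by the old reader, so this stays a "two-way door".
            row.put("shipping_street", street);
            row.put("shipping_city", city);
            row.put("shipping_zip", zip);
        }
    }
}
```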
7. Cognitive Load: The “Maintenance” Test
If the code is too “clever” for the team to fix while you’re on vacation, it’s a failure.
The Check:
- Is the code over-engineered? Does it introduce complex abstractions, unnecessary patterns, or magic reflection that would confuse future developers?
- Simple, clear code is preferred. Are the naming conventions and code structure intuitive?
The Question:
“Can a mid-level dev understand this in five minutes? Is this code a gift or a curse to the next person who touches it?”
Practical Steps:
- Review the code for over-engineering: Is there a simpler way to implement this functionality? If so, suggest it.
- Look for magic or complex patterns that will make it hard for anyone to maintain.
- Ensure the code is well-documented and has clear comments where needed.
Why This Saves Your Life
Reviewing for production isn’t about being “extra.” It’s about self-preservation.
- Better logs mean shorter outages and less time playing detective.
- Better resilience means fewer pages at night because the system healed itself.
- Better safety means you don’t have to stay up late doing stressful, manual data cleanups after a failed deployment.
The Bottom Line
When you review, don’t just ask “Does it work?” Ask:
“If this breaks, how will I know, and how fast can I fix it?”
If you can’t answer those questions in five seconds, the code isn’t ready for production.
In the next article, we’ll talk about the “Hardest” part of being a senior: Picking your Battles. I’ll show you how I decide when to let “ugly” code go and when to stop the line.