By now, everyone has read the book that defines this concept. It has become mandatory for any developer, because someday you'll receive a review saying you didn't follow principle X of Clean Code™ and should refactor that solution that's been sitting in the backlog for weeks.
Unfortunately, discussing the topic is complex: after all, no one wants a repository with "unclean" code, and discussions seem to revolve around abstract concepts of what is "clean," rather than what actually belongs to the Clean Code™ methodology. The book covers dozens of topics: SOLID, formatting, classes, concurrency, emergent design, and more. For this analysis, these topics were grouped into the general principles that represent them. For example, "Small and Focused Functions" encompasses the Single Responsibility Principle (SRP); "Polymorphism" encompasses discussions about inheritance, interfaces, and dependency inversion; "Clean Tests" covers both test writing and testability-oriented design (TDD). The principles used are:
- Meaningful Names
- Small and Focused Functions
- Minimal and Precise Comments
- Error Handling
- Clean Tests
- Polymorphism
These groupings are deliberate simplifications to enable the analysis. SRP, for example, is treated within "Small and Focused Functions" because, in the book's practice, both converge to the same recommendation: each unit of code should do one thing. Similarly, inheritance, interfaces, and dependency inversion are grouped under "Polymorphism" because they share the central mechanism of delegating behavior through abstraction. I recognize that SOLID purists may disagree; the goal here is not to redefine these concepts, but to organize them into analyzable categories.
Given a framework without clear empirical validation, the most direct way to evaluate the trade-off is to measure what it costs in practice. If Clean Code™ charges a performance price, what's the size of that bill? To make this discussion concrete, it's worth reproducing an experiment that puts the numbers on the table.
Two Groups, Two Purposes
We can categorize these principles into two major groups.
The first group is dedicated exclusively to code maintainability: meaningful names, minimal and precise comments, and clean tests. None of these categories run in production; they exist to make the developer's life easier. The second group brings together what actually impacts the running application: small and focused functions, error handling, and polymorphism.
At its core, the idea of Clean Code™ is to place the developer at the center of the software development process, facilitating and increasing team productivity.
Group 1: Maintainability
The first group truly follows this ethos. They are general guides to improve the lives of developers, concepts that don't directly affect the code running the application; they merely help to read, understand, and iterate on a repository. Given this, the only trade-off in adopting this part of the framework is developer adaptability and how much these principles actually increase production velocity within a team.
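To make the distinction concrete, here is a hypothetical snippet (not part of the benchmark) showing a Group 1 change: renaming for meaningful names improves the reader's experience while leaving the executed logic untouched.

```python
# Hypothetical illustration: a "meaningful names" refactor changes nothing at runtime.

def f(a, b):
    # Before: opaque names force the reader to reverse-engineer the intent.
    return a * b * 0.5

def triangle_area(base, height):
    # After: the same computation, now self-describing.
    return base * height * 0.5

# Both produce identical results; only the human reader benefits.
print(f(4.0, 3.0), triangle_area(4.0, 3.0))  # 6.0 6.0
```

Because only the identifiers changed, CPython even compiles both functions to byte-identical code objects; this is the sense in which Group 1 principles are "free" at runtime.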
Group 2: Code Design and Structure
The second group presents choices that determine a development pattern with direct impact on the application. By opting for polymorphism, specific function styles, and error handling, you are conditioning a trade-off: production speed now weighs more than application performance.
This is a totally reasonable choice, and often a necessary one when producing software. However, like all data-based decisions, you need to see the numbers to understand where the limit lies: how much performance are you willing to lose to gain developer productivity?
And here's the point: if Clean Code™ were a framework with proven results, there would be solid data to support it. But there is no clear definition for what "quality code" is; what you'll find are ideas that developers themselves don't agree on. You can try using definitions like readability and structure, but these are vague concepts that change meaning depending on the team, language, and domain. Worse: the traditional metrics that should capture this quality (cyclomatic complexity, coupling, lines of code) simply don't reflect what developers perceive as improvement. Clean Code™ proposes to solve a problem that the academic community hasn't even been able to define properly.
The Experiment
This trade-off was previously tested by Casey Muratori in a C++ setup, with code extracted from the book itself; by ignoring some of the book's concepts, he obtained code 10 to 25 times faster.
Adapting that experiment to Python, I used the following setup for the same evaluation.
Clean Code™ vs. Performance: Area Calculation Benchmark
Casey's debate questions whether Clean Code™ patterns, especially polymorphism, carry unnecessary performance costs. This experiment reproduces the central idea in Python, comparing three approaches to calculating the total area of a collection of geometric shapes.
The Approaches
- OOP (Polymorphism): each shape is a class with its own `area()` method. The loop calls `shape.area()` and the runtime resolves which method to execute.
- Procedural (If/Elif): all shapes live in a flat struct with type, width, and height. A function uses `if/elif` to choose the correct formula.
- Lookup Table: all formulas share the same structure, coefficient × width × height. A coefficient table `[1.0, 1.0, 0.5, π]` eliminates all branching.
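The lookup-table trick works because all four formulas collapse into the same multiplication pattern, provided square and circle store the same value in both dimensions. A minimal sketch of the idea (function and variable names are illustrative; the full benchmark follows below):

```python
import math

# Coefficients indexed by shape type: 0=square, 1=rectangle, 2=triangle, 3=circle.
COEFF = [1.0, 1.0, 0.5, math.pi]

def area(shape_type, width, height):
    # One multiplication pattern covers all four formulas, with no branching.
    return COEFF[shape_type] * width * height

# Square of side 2, triangle base 4 / height 3, circle of radius 1:
print(area(0, 2.0, 2.0), area(2, 4.0, 3.0), area(3, 1.0, 1.0))
```

For the square and the circle, width and height hold the same value (side and radius, respectively), which is exactly how the benchmark's data generation populates them.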
Setup
GCP instance e2-standard-2 (2 vCPUs, 8 GB RAM), Ubuntu 22.04 LTS. Process pinned to one core (taskset -c 0). 10 million randomly generated shapes (uniform distribution among 4 types, dimensions between 0 and 1). Fixed seed (42) for reproducibility. 20 runs per approach, 2 warmups discarded. Measurement with time.perf_counter().
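The core pinning described above (`taskset -c 0`) can also be done from inside the script on Linux via `os.sched_setaffinity`; this is an equivalent sketch, not the exact mechanism used in the runs:

```python
import os

# Pin this process to CPU core 0 (Linux only), equivalent to `taskset -c 0`.
os.sched_setaffinity(0, {0})

# Confirm the effective affinity mask.
print(os.sched_getaffinity(0))  # {0}
```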
Benchmark Code
```python
import math
import time
import random
import statistics


def bench(fn, runs=20):
    fn(); fn()  # 2 warmup runs, discarded
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times


def print_results(results, baseline_key):
    baseline = statistics.mean(results[baseline_key])
    print(f"\n{'Approach':<35} {'Mean (s)':>10} {'StdDev':>10} {'Min (s)':>10} {'vs base':>8}")
    print("-" * 75)
    for name, times in results.items():
        avg = statistics.mean(times)
        std = statistics.stdev(times) if len(times) > 1 else 0
        mn = min(times)
        speedup = baseline / avg
        print(f"{name:<35} {avg:>10.5f} {std:>10.5f} {mn:>10.5f} {speedup:>7.2f}x")


NUM_SHAPES = 10_000_000
NUM_RUNS = 20

random.seed(42)
types_list, widths_list, heights_list = [], [], []
for _ in range(NUM_SHAPES):
    t = random.randint(0, 3)
    types_list.append(t)
    if t in (0, 3):  # square and circle: one dimension, stored in both slots
        v = random.random()
        widths_list.append(v); heights_list.append(v)
    else:
        widths_list.append(random.random()); heights_list.append(random.random())


class ShapeBase:
    def area(self): pass

class Square(ShapeBase):
    __slots__ = ('side',)
    def __init__(self, s): self.side = s
    def area(self): return self.side * self.side

class Rectangle(ShapeBase):
    __slots__ = ('width', 'height')
    def __init__(self, w, h): self.width = w; self.height = h
    def area(self): return self.width * self.height

class Triangle(ShapeBase):
    __slots__ = ('base', 'height')
    def __init__(self, b, h): self.base = b; self.height = h
    def area(self): return 0.5 * self.base * self.height

class Circle(ShapeBase):
    __slots__ = ('radius',)
    def __init__(self, r): self.radius = r
    def area(self): return math.pi * self.radius * self.radius


class ShapeUnion:
    __slots__ = ('type', 'width', 'height')
    def __init__(self, t, w, h): self.type = t; self.width = w; self.height = h


COEFF = [1.0, 1.0, 0.5, math.pi]

clean_shapes, union_shapes = [], []
for i in range(NUM_SHAPES):
    t, w, h = types_list[i], widths_list[i], heights_list[i]
    if t == 0: clean_shapes.append(Square(w))
    elif t == 1: clean_shapes.append(Rectangle(w, h))
    elif t == 2: clean_shapes.append(Triangle(w, h))
    elif t == 3: clean_shapes.append(Circle(w))
    union_shapes.append(ShapeUnion(t, w, h))


def py_oop():
    accum = 0.0
    for s in clean_shapes: accum += s.area()
    return accum


def py_switch():
    accum = 0.0
    for s in union_shapes:
        t = s.type
        if t == 0: accum += s.width * s.width
        elif t == 1: accum += s.width * s.height
        elif t == 2: accum += 0.5 * s.width * s.height
        elif t == 3: accum += math.pi * s.width * s.width
    return accum


def py_table():
    accum = 0.0
    c = COEFF
    for s in union_shapes: accum += c[s.type] * s.width * s.height
    return accum


def main():
    print(f"{'='*60}")
    print(f" Benchmark: {NUM_SHAPES:,} shapes, {NUM_RUNS} runs each")
    print(f"{'='*60}")
    results = {}
    results["OOP (polymorphism)"] = bench(py_oop, NUM_RUNS)
    results["Procedural (if/elif)"] = bench(py_switch, NUM_RUNS)
    results["Lookup Table"] = bench(py_table, NUM_RUNS)
    print_results(results, "OOP (polymorphism)")
    print()


if __name__ == "__main__":
    main()
```
It's important to note that Python is an interpreted language, where interpreter overhead dominates execution time; compiler optimizations like inlining or branch elimination simply don't exist. Even in this scenario, where performance is not the language's primary goal, the lookup-table approach was 2x faster than polymorphism. If we already see this gain in an environment that naturally flattens the differences, in compiled languages (where the compiler could optimize, but polymorphism prevents it) the gap tends to be even wider, as Casey Muratori demonstrated in C++ (10–25x), where he also explored costs beyond polymorphic dispatch, such as the impact of excessively small functions.
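The dispatch overhead can also be observed in isolation with a micro-sketch like the one below (a simplified illustration, not part of the benchmark; absolute numbers vary by machine):

```python
import math
import timeit

class Circle:
    __slots__ = ('radius',)
    def __init__(self, r): self.radius = r
    def area(self): return math.pi * self.radius * self.radius

c = Circle(1.5)

# Per-call cost of a bound-method dispatch vs. the inlined formula.
t_method = timeit.timeit(lambda: c.area(), number=200_000)
t_inline = timeit.timeit(lambda: math.pi * c.radius * c.radius, number=200_000)
print(f"method: {t_method:.4f}s  inline: {t_inline:.4f}s")
```

Even here the interpreter's own loop and lambda overhead dilute the difference, which is precisely the point: Python's baseline cost masks effects that a compiler would otherwise expose.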
The AI Factor in the Equation
A central premise of Clean Code™ is that code should be optimized for human readability. In 2008, this made unrestricted sense: humans were the only readers and writers of code. In 2025, the landscape has changed. Tools like GitHub Copilot, Claude Code, and Cursor act as co-authors and reviewers, and LLMs navigate convoluted code nearly as easily as "clean" code. This doesn't invalidate readability as a value, but it changes the calculation: if an AI can understand and refactor a 50-line procedural function as well as six 8-line polymorphic functions, part of Clean Code™'s benefit becomes redundant. The performance trade-off discussed in this article gains weight when the main argument on the other side, human readability, now has a partial replacement.
To illustrate, consider the two area calculation implementations used in this article. The Clean Code™ version distributes the logic across four classes with individual area() methods; the procedural version concentrates everything in a single function with if/elif. If you ask an LLM like Claude to "explain what this code does," both versions produce equally accurate answers. If you ask to "add support for a hexagon," the LLM generates the new polymorphic class or the new procedural branch with equal ease. Human readability was the central argument for justifying polymorphism's indirection; when an AI can navigate that indirection instantly, the remaining benefit on the Clean side of the scale loses weight, and the performance cost on the other side remains intact.
But Not Everything Needs to Be Fast
In practice, most production systems (web APIs, CRUDs, data pipelines) have their bottlenecks in I/O, network, and database access, not in method dispatch. This benchmark's scenario (a hot loop iterating over 10 million objects in memory) is real but restricted to domains like signal processing, game engines, and numerical simulations. For the rest, polymorphism's cost will rarely be the limiting factor.
The benefit claimed for Clean Code™ is greater feature-delivery capability and better overall comprehension of the repository. However, that claim still lacks scientific backing. Research shows that most professionals recognize Clean Code™ principles as good practices, yet few consider them essential in their daily work; many report preferring to iterate quickly with "dirty" code and refactor later, when (and if) necessary. And when the impact of refactoring on code-quality metrics (modifiability and analyzability) is analyzed, no statistically significant difference is found after applying refactoring techniques aligned with Clean Code™. Both studies have limited scope (small samples, specific contexts), but they point in the same direction: the practical effectiveness of Clean Code™ as a productivity framework still lacks empirical validation.
And perhaps the most fundamental problem is prior to the performance debate: no one agrees on what "quality code" really means. When researchers ask experienced developers what defines good code, answers gravitate around "readability" and "structure," terms that each team interprets differently. In practice, I've seen teams spend entire sprints refactoring functional code to "be Clean," without bug reduction, without delivery speed gains, just to satisfy a convention that no one could justify with data. Clean Code™ presents itself as a universal answer, but what is "clean" for a Go backend team is not the same as for a Python data science team. When the framework doesn't recognize this, it doesn't standardize: it rigidifies. Standardization that works emerges from the team's context, language, and domain, not from a book written in 2008 for Java.
This article showed three things. First, that Clean Code™ principles divide into two groups with different natures: maintainability (which don't cost performance) and design (which do). Second, that the performance cost is measurable: even in Python, the lookup table approach was 2x faster than polymorphism, and in compiled languages this difference amplifies. Third, that the promised benefit on the other side of the scale—improved quality and productivity—still lacks empirical validation.
Seeking to standardize a repository, making it reproducible and understandable, is a daily challenge in developers' lives. This quest for clean code cannot be confused with making code Clean, because the latter carries clear trade-offs: it is a widely known concept that makes the application slower and has no scientific basis for improving the attributes it claims to improve. The trade-off of a new option would be unknown, and the unknown isn't always better. But the question remains: clean code doesn't need to be Clean Code™.
References
- MURATORI, Casey. Clean Code, Horrible Performance. Computer Enhance, 2023. Available at: https://www.computerenhance.com/p/clean-code-horrible-performance
- MURATORI, Casey. misc. GitHub, 2023. Available at: https://github.com/cmuratori/misc/tree/main
- LJUNG, K.; GONZALEZ-HUERTA, J. "To Clean Code or Not to Clean Code": A Survey Among Practitioners. In: INTERNATIONAL CONFERENCE ON PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT (PROFES), 23., 2022, Jyväskylä. Lecture Notes in Computer Science, v. 13709, p. 298–315. Springer, 2022. DOI: 10.1007/978-3-031-21388-5_21
- KANNANGARA, S. H.; WIJAYANAYAKE, W. M. J. I. An Empirical Evaluation of Impact of Refactoring on Internal and External Measures of Code Quality. International Journal of Software Engineering & Applications (IJSEA), v. 6, n. 1, p. 51–67, Jan. 2015. DOI: 10.5121/IJSEA.2015.6105
- KOLLER, H. G. Effects of Clean Code on Understandability: An Experiment and Analysis. 2016. Thesis (Master's in Informatics) — Department of Informatics, University of Oslo, Oslo, 2016.
- RACHOW, P.; SCHRÖDER, S.; RIEBISCH, M. Missing Clean Code Acceptance and Support in Practice — An Empirical Study. In: AUSTRALASIAN SOFTWARE ENGINEERING CONFERENCE (ASWEC), 25., 2018, Adelaide. Proceedings [...]. IEEE, 2018. p. 131–140. DOI: 10.1109/ASWEC.2018.00026
- BÖRSTLER, J. et al. Developers talking about code quality. Empirical Software Engineering, v. 28, n. 6, art. 128, p. 1–31, 2023. DOI: 10.1007/s10664-023-10381-0