Comparing the Effects of Programming Practices on Code Quality

Written by Annika Diener

Index

Introduction

Goals and Criteria

The goal of this text is to investigate the effects that certain programming concepts have on code bases. "Paradigms" are discussed indirectly, since the concepts are the building blocks that make up the paradigms and are not mutually exclusive between paradigms.

Unless stated otherwise, counting is used as the basic metric for evaluation. Simply counting work has already been shown to be an accurate enough first indicator in the context of work planning.

The criteria by which each technique gets measured are:

This work tries to evaluate each concept holistically, since much common programming advice focuses only on one area, often to the detriment of others. For example, "clean code" completely neglects performance.

Performance

To enable comparison of the different techniques, each technique was implemented once in C and compiled with the same compiler (gcc) and the same flags (-O3 & -O0) on the same machine using unity builds. This enables a basic comparison without any technique-specific compilation optimisations.

hyperfine was used to take the measurements with the following configuration: -w 5 -r 100 -u millisecond -N. A full table of the taken measurements is provided for every language that has been used.

In addition to the time measured on the test system, the number of assembly instructions was recorded for each language that offers this feature on Compiler Explorer.

Editability

Editability is defined by the number of edits that have to be made to a given project to perform a change. Edits are changes, additions, or removals of tokens, as defined by the lexer of the language's compiler. A token is considered changed if it still belongs to the same class (an integer stays an integer, a variable stays a variable, ...). Otherwise it is counted as one removal of the old token and one addition of the new token. The higher the number of edits, the worse the editability. The best editability is "1".

Measurements will be given both including and excluding newlines, since some might consider newlines cosmetic. A newline is treated as "1" edit when included. The tables will present the values in the format <without newlines>/<with newlines>.

// Old
x = a + 1;
// New
x = a + 2;
The code example above has an editability of 1/1, since a single token got changed.
// Old
x = a + 1;
// New
x = 2 * (a + 1);
The code example above has an editability of 6/6, since 2 tokens got changed and 4 tokens got added.
// Old
x = vec.x * (vec.y + vec.z);
// New
x = sqrt(
    sum_vec(
        vec
    )
);
The final code example above has an editability of 14/18, since no token got changed, 4 tokens got added, and 10 tokens got removed. There were 4 newlines introduced, which puts the second editability score at 18.

Coupling

If a change in one part of the code requires a change in a different part of the code, the two pieces of code are coupled. This contrasts with editability insofar as editability counts the total number of edits needed, while coupling is merely concerned with the fact that a change needed to happen in the first place.

The coupling relations form a graph. To capture not only the coupling relations themselves but also the strength of the coupling, the coupling score κ for a particular change C is the sum, over all related changes Κ_C, of their distance d plus the coupling scores κ of the changes they triggered in turn.

\kappa(C) = \sum_{x \in K_C} \left( d(x) + \sum_{e \in K_x} \kappa(e) \right)

The following are some example dependency graphs and their resulting coupling. Since coupling will be measured on the level of functions, files/classes, and modules, feel free to interpret each box as any of those. The X marks the element that changed. The other elements that had to change because of the change in X are annotated with their respective distance. Where there was no change, the element receives a 0. In cases where the coupling score cannot be read off as a single sum, the series of graphs depicts the intermediate summations until the final score can be calculated.

Example for a coupling of 0
Example for a coupling of 1
Example for a coupling of 2
Example for a coupling of 4 and its reductions
Example for a coupling of 32 and its reductions
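
To make the scoring concrete, the following minimal sketch computes κ recursively over a small dependency graph. It assumes the reconstruction of the formula above; the struct, its field names, and the example graph are illustrative only and not taken from the measured code base.

#include <stdio.h>

#define MAX_DEPS 4

// One element (function, file/class, or module) that had to change.
typedef struct Change {
    int distance;                  // d(x): the distance annotated on this element
    int dep_count;                 // number of changes this change triggered in turn
    struct Change *deps[MAX_DEPS]; // K_x: the changes triggered by this change
} Change;

// kappa(C) = sum over x in K_C of ( d(x) + sum over e in K_x of kappa(e) )
int kappa(const Change *c) {
    int score = 0;
    for (int i = 0; i < c->dep_count; i++) {   // x in K_C
        const Change *x = c->deps[i];
        score += x->distance;                  // d(x)
        for (int j = 0; j < x->dep_count; j++) // e in K_x
            score += kappa(x->deps[j]);        // kappa(e)
    }
    return score;
}

int main(void) {
    // Hypothetical graph: X changed, which forced two neighbouring elements
    // (each at distance 1) to change as well.
    Change a = { .distance = 1 };
    Change b = { .distance = 1 };
    Change x_marked = { .distance = 0, .dep_count = 2, .deps = { &a, &b } };

    printf("coupling score: %d\n", kappa(&x_marked)); // prints 2
    return 0;
}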

Cohesion

Cohesion describes the proximity of changes, if changes take place. In this text it is measured by the average and the maximum of the distance in lines. A change in another file/class counts as 100 lines.

I'll be following the common advice to have each class (if needed) in its own file. Hence the distinction between classes and files is not needed when analyzing coupling.

Metrics as presented in "Class Cohesion Metrics for Software Engineering" are not suited for this analysis, since many examples can be accomplished in one file. This results in identical cohesion metrics between the contrasting implementations in many cases.
Additionally, some of the mentioned metrics are only applicable insofar as they warn the programmer about low cohesion, but can't assess cohesion meaningfully after a certain threshold has been reached. Many of the metrics also seem to measure coupling rather than cohesion.

Let's consider the following example, in which a constant that is used across multiple files gets renamed.

// inside definitions.h
const Vec ZERO_VEC = {0};
// inside file1.c
Vec a = ZERO_VEC;
a.x += 10;
// ...
// inside file2.c
Vec b = ZERO_VEC;
b.y += 20;
// ...

If ZERO_VEC in the example above were renamed to VEC_ZERO, the resulting change would span 2 files, not counting the definition. The average distance is hence (100+100)/2 = 100. The maximum distance is 100. The resulting cohesion score is therefore 100/100.

// inside file1.c
const Vec ZERO_VEC = {0};
// ... 20 other lines ...
Vec a = ZERO_VEC;
a.x += 10;
// ... 20 more lines ...
Vec b = ZERO_VEC;
b.y += 20;
// ...

In this example, the change is contained in one file. The changes happen on relative lines 21 and 42. The average distance is hence (21+42)/2 = 31.5. The maximum distance is 42. This results in a cohesion score of 31.5/42.

// inside definitions.h
const Vec ZERO_VEC = {0};
// inside file1.c
Vec a = ZERO_VEC;
a.x += 10;
// ...
// inside file2.c
Vec b = ZERO_VEC;
b.y += 20;
// ... 20 more lines ...
Vec c = ZERO_VEC;
c.z += 30;
// ...

This last example combines changes within one file and the distance across files. The average distance is hence (100+100+121)/3 = 107. The maximum distance is 121. This results in a cohesion score of 107/121.
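
As a minimal sketch of how these scores are derived, the helper below computes the <average>/<maximum> pair from a list of change distances. It is illustrative only and assumes the distances have already been collected, with 100 standing in for a change in another file/class.

#include <stdio.h>

#define FILE_CHANGE 100 // a change in another file/class counts as 100 lines

// Prints the cohesion score in the format <average>/<maximum>.
void cohesion(const int distances[], int count) {
    int max = 0;
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += distances[i];
        if (distances[i] > max)
            max = distances[i];
    }
    printf("cohesion score: %g/%d\n", (double)sum / count, max);
}

int main(void) {
    // Distances from the last example: two changes in other files and one
    // change 21 lines further down in the second file.
    int distances[] = { FILE_CHANGE, FILE_CHANGE, FILE_CHANGE + 21 };
    cohesion(distances, 3); // prints "cohesion score: 107/121"
    return 0;
}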

Complexity

Complexity is not formulated in this text using metrics like Cyclomatic Complexity. Instead, complexity is measured with the goal of capturing the degree to which elements of the program are intertwined (complected).

Each operation (function call, arithmetic operation, comparison, ...) can be represented as a function that takes |I| inputs, where I is the set of inputs, and returns |O| outputs, where O is the set of outputs. Each element x has its own complexity number c(x), determined by the inputs used in its construction and the number of places it is used in.

c(x) = \left( \sum_{i \in I_x} c(i) \right) \cdot \left( \sum_{o \in O_x} o \right)

Consider the following example:

int x = a + b * c;
int y = x + 1;
int z = x - 1; 
Complexity Graph of Program X
c(x) = \left( \sum_{i \in I_x} c(i) \right) \cdot \left( \sum_{o \in O_x} o \right)
     = \left( \sum_{i \in I_x} c(i) \right) \cdot 2
     = (c(a) + c(b) + c(c)) \cdot 2
     = (1 + 1 + 1) \cdot 2
     = 3 \cdot 2 = 6
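
The same derivation can be reproduced mechanically. The following minimal sketch is illustrative only; it assumes that leaf values have a complexity of 1 and that every use contributes a factor of 1, so the second factor becomes the number of use sites.

#include <stdio.h>

#define MAX_INPUTS 4

// One value in the data-flow graph of the program.
typedef struct Node {
    int input_count;
    const struct Node *inputs[MAX_INPUTS]; // I_x: values used to construct this one
    int uses;                              // number of places this value is used in
} Node;

// c(x) = (sum over i in I_x of c(i)) * (number of uses); leaves count as 1.
int complexity(const Node *x) {
    if (x->input_count == 0)
        return 1;
    int sum = 0;
    for (int i = 0; i < x->input_count; i++)
        sum += complexity(x->inputs[i]);
    return sum * x->uses;
}

int main(void) {
    // int x = a + b * c;  int y = x + 1;  int z = x - 1;
    Node a = { .uses = 1 };
    Node b = { .uses = 1 };
    Node c = { .uses = 1 };
    Node x = { .input_count = 3, .inputs = { &a, &b, &c }, .uses = 2 };

    printf("c(x) = %d\n", complexity(&x)); // prints "c(x) = 6"
    return 0;
}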

This method of measuring complexity also works for functional concepts like partial function application. It also accounts for value complecting through control structures like for loops:

int sum = 0;
for (int i = 0; i < 3; i++) {
    sum += i;
}
Complexity Graph of a for loop

For brevity's sake, the outputs will be ignored in the formulas, since they are "1" throughout this example.

c(\mathrm{sum}) = \sum_{i \in I_{\mathrm{sum}}} c(i)
              = c(\mathrm{sum}) + 1
              = \left( \sum_{i \in I_{\mathrm{sum}}} c(i) \right) + 1
              = (c(\mathrm{sum}) + 1) + 1
              = \left( \left( \sum_{i \in I_{\mathrm{sum}}} c(i) \right) + 1 \right) + 1
              = ((c(\mathrm{sum}) + 1) + 1) + 1
              = ((1 + 1) + 1) + 1 = 4

As apparent in this example, the complexity of code that works on arbitrary lengths of input has to be expressed in relation to the length of the input and cannot be resolved to a single number. This also applies to concepts like recursion.

Mental Load

Mental Load is defined by the amount of context that a programmer needs to remember to actively work with a piece of code.

The load is formed by summing, over all items x in the set X, the changes N_x that item goes through. An item is a named element which is not itself a literal (for example: a function, variable, constant, ...).

\sum_{x \in X} \left( \sum_{n \in N_x} n \right)

As an example, consider the mental load for the following function:

int my_fun(int a, int b, int c) {
    a = a * a + b * b + c * c;
    int ans = a / 2;
    ans = a / (c == 0 ? 1 : c);
    return ans;
} 

The mental load would be:

\sum_{x \in \mathrm{my\_fun}} \left( \sum_{n \in N_x} n \right) = \left( \sum_{n \in N_a} n \right) + \left( \sum_{n \in N_b} n \right) + \left( \sum_{n \in N_c} n \right) + \left( \sum_{n \in N_{\mathrm{ans}}} n \right) = 2 + 1 + 1 + 2 = 6

Testability

Testability is measured by the amount of work that has to be done to test the functionality of a piece of code. This involves counting the steps in each test.

Each line contains only one action. So even if my_function(1, a++); could be written in one line, it will be separated into two: my_function(1, a); a += 1;. This enables comparisons for languages like C, where initialization shorthands are only available with a separate initialization.

The lower the score, the better. The best possible score is 3 in manually memory-managed languages: one line of setup, one line of test, and one line of teardown. The best possible score is 2 in automatically memory-managed languages: one line of setup and one line of test.

char is_zero(const char* x) { return 0 == *x; }
charDa da_res = new_charDa(0);
charDa da = da_res.result;
test_char(all_charDa(da, is_zero), 1, "all succeeds for the empty case");
free_Da(&da);

The C test above for the function all_charDa has setup on lines 1, 2, and 3 and one assertion (test_char) on line 4, followed by one line of teardown. To test this function, 3 lines/actions of setup and 1 line of teardown are required. Together with the assertion, the testability of this function is hence 5.

Abilities of Large Language Models

Advances in the field of Artificial Intelligence (AI) have created Large Language Models (LLMs) of high fidelity. The applicability of these models to performing an analysis along the outlined criteria is therefore of interest, as is their ability to suggest improvements to code that improve the metrics above.

For the evaluation, models that offer a free tier were used on that tier. The models used are [TODO: model list].

I hope that this analysis can further illuminate the findings by [TODO: insert github code quality LLM studies].

Side-Effects & Pure Code

A function is considered pure if it doesn't affect anything on the outside and doesn't rely upon information obtained from the outside. This means a pure function's scope is defined solely by the function signature.

A side-effect is an action performed by a function that is visible from outside the function's own scope. The most common side-effects are IO operations.

While side-effects and pure functions are not opposites of each other, they are mutually exclusive, since any function with a side-effect can't be pure and any pure function can't have any side-effects. Consider the following three implementations of an addition function as an illustration of both concepts.

// pure function
int add(int x, int y) {
    return x + y;
}
// impure function without side-effect
const int x = 4;
const int y = 2;
int add() {
    return x + y;
}
// function with side-effect
int x = 4;
int y = 2;
void add() {
    x += y;
}

The performance of all 3 examples above is the same. This can easily be concluded without looking at the assembly: to perform the calculation, both values as well as the operation have to be loaded into CPU registers. The best case in all 3 examples is that all values are already in cache. The worst case is that all values have to be fetched. The order of operations does not impact performance in this trivial case. Most compilers in most languages will inline this function for this exact reason.

The C Problem

This analysis is only partially possible in C and requires some assumptions to be generalized. Arbitrary memory allocation and de-allocation is hard to get right in C.

Hence it became standard for C code to pass the memory used for operations as an argument into the functions. This is a side-effect by design.
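
A typical shape of this convention looks roughly like the following sketch; the function and parameter names are illustrative and not taken from the measured code base.

#include <stddef.h>

// The caller owns and provides `out`; writing to it is the side-effect by design.
void vec_add(const double *a, const double *b, double *out, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[i] = a[i] + b[i];
}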

As an alternative, "handlers" to pieces of memory can be passed around as values. While the memory is still managed by a function scope further up in the call stack, the handlers get treated as if they were a full representation of the memory. This handler-based approach enables us to write "pure" functions while maintaining the advantages of centralized memory management, at the price of (often) tripling the stack memory footprint of the original pointer.
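
A minimal sketch of such a handler, assuming a plain pointer/length/capacity triple (the names are illustrative); the three fields are also where the roughly threefold stack footprint compared to a bare pointer comes from.

#include <stddef.h>

// A "handler": passed around by value, while the memory it refers to is
// allocated and freed by a scope further up the call stack.
typedef struct {
    double *data; // backing memory, owned elsewhere
    size_t  len;  // number of elements in use
    size_t  cap;  // number of elements the backing memory can hold
} VecHandle;

// "Pure" with respect to the handler: it only reads its input and returns a value.
double vec_sum(VecHandle v) {
    double sum = 0.0;
    for (size_t i = 0; i < v.len; i++)
        sum += v.data[i];
    return sum;
}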

Code

The code examples displayed in the normal document flow are shortened to preserve the readability of the sections. The full code examples used are linked to or can be found in this project's GitLab.

While the "pure" example looks reasonable for many, the side effect version doesn't. While I want to examine both extremes to illustrate my point, I'll add a second version that is more contracted, since this will closer resemble the coding style of some programmers.

Analysis

Performance

The performance of all 3 versions is similar enough that no significant difference could be measured. The increased assembly count for the pure version in C is explained by the function call overhead. Inlining the function calls reduced the assembly to 199 lines (-31 lines).

Editability

The following scenarios were considered:

Coupling

Cohesion

Complexity

Mental Load

Testability

Values & References

Global Constants

With the examples it becomes apparent that there is an adversarial relationship between cohesion and coupling on one side and editability on the other. It is common wisdom to factor repeated usages of constant literals into constant variables. While this improves editability for the case that the value changes, the increased coupling and lowered cohesion will result in a larger amount of work if the constant itself changes due to a rename or removal.

LLMs and Quality

Conclusion

Discussion

The metric of counting might give a first indication of the usefulness of the discussed techniques, but could be improved in future work. One approach to finding a better metric could be relating each aspect to time. This would require the analysis of many commits and the time spent per commit on a large enough code base.

Categories like "Writeability" were excluded in this analysis, since they are highly dependant on tooling, and preparation on the programmers part. [todo: cite?] A programmer with extensive experience in a given language, will have set up shortcuts and snippets, drastically improving the speed in which he can write code. This means that measurements like "amount of characters to type" become meaningless, since whole paragraphs will be filled in at once. Future work might analyze the best possible cases with optimal snippets and key binding, to come to a conclusion on this metric. This was outside the scope of this work.

Likewise, the metric of "Agreeableness" of code is outside the scope of this analysis.

One flaw of the analysis and comparison between programming languages and techniques is the reliance on libraries and external modules. Due to time constraints, I decided based on personal experience when the usage of libraries was appropriate and when not.
As a further improvement on this work, additional implementations under similar constraints would be recommended.

Bibliography

  1. Robert W. Floyd, "The Paradigms of Programming" 1978 ACM Turing Award Lecture. Available: https://dl.acm.org/doi/pdf/10.1145/1283920.1283934 [Accessed May 25, 2025].
  2. Vasco Duarte, "NO ESTIMATES". Oikosofy Series, 2015 [E-book]. Available: https://oikosofyseries.com/
  3. Kitware, Inc. and Contributors, "CMake UNITY_BUILD" cmake.org, 2025. [Online]. Available: https://cmake.org/cmake/help/latest/prop_tgt/UNITY_BUILD.html . [Accessed May 25, 2025].
  4. Matt Godbolt, "Compiler Explorer" godbolt.org, 2025. [Online]. Available: https://godbolt.org/ . [Accessed May 25, 2025].
  5. "GCC, the GNU Compiler Collection 13.3.0", [Software]. Free Software Foundation, Inc. 2025. Available: https://gcc.gnu.org/ .
  6. "hyperfine 1.19.0", [Software]. David Peter. 2025. Available: https://github.com/sharkdp/hyperfine .
  7. Rich Hickey. Presentation, Title: "Simple Made Easy" Strange Loop, 2011. Available: https://youtu.be/SxdOUGdseq4?si=9Fdck6Y7jIblNJ_d [Accessed June 13, 2025].
  8. Habib Izadkhah, Maryam Hooshyar, "Class Cohesion Metrics for Software Engineering: A Critical Review" 2017 Computer Science Journal of Moldova, vol.25. Available: https://ibn.idsi.md/sites/default/files/imag_file/44_74_Class%20Cohesion%20Metrics%20for%20Software%20Engineering_A%20Critical%20Review.pdf [Accessed June 5, 2025].
  9. Casey Muratori. Presentation, Title: "Where Does Bad Code Come From?" Molly Rocket, November 2 2021. Available: https://youtu.be/7YpFGkG-u1w?si=JvlKWZUOxfdjo1t3 [Accessed June 16, 2025].
  10. Casey Muratori. Presentation, Title: "'Clean' Code, Horrible Performance" Molly Rocket, February 28 2023. Available: https://youtu.be/tD5NrevFtbU?si=ky3lhyKeTbMxmP-z Transcript available: https://www.computerenhance.com/p/clean-code-horrible-performance [Accessed June 16, 2025].
  11. Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship. Pearson, August 2008. Available: https://www.oreilly.com/library/view/clean-code-a/9780136083238/
  12. T. J. McCabe, "A Complexity Measure" 1976 IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308-320, doi: 10.1109/TSE.1976.233837, Available: https://ieeexplore.ieee.org/abstract/document/1702388 [Accessed August 5, 2025].