Primer on Software Development#

Why Software Development Principles?#

Most research code starts as a quick script: load data, run analysis, plot results. That script grows. Parameters get hardcoded, functions get copy-pasted, and before long you have a tangled mess that only works on your machine, with your data, on a good day.

The principles in this section are not abstract theory. They are practical guidelines that help you write code you can actually maintain, share, and trust. We focus on three ideas:

  1. Good coding practices: making code readable and consistent.

  2. Orthogonality: keeping unrelated parts of your code independent.

  3. DRY: avoiding unnecessary repetition.

None of these require advanced programming skills. They are about discipline and awareness, and they pay off immediately.

Good Coding Practices#

Stick to Language-Specific Standards#

Every programming language has its own idioms and best practices. Beyond the general principles below, familiarize yourself with the specific conventions of your language.

In Python, docstrings are standard. They are string literals (typically """...""") placed as the first statement in a module, function, class, or method. Unlike comments, they are accessible at runtime and can be used to generate documentation automatically. See PEP 257 for the official convention.
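A minimal sketch (the function name `mean_speed` and its contents are invented for illustration):

```python
def mean_speed(distances, durations):
    """Compute mean speed in m/s from paired distance/duration samples.

    Both arguments are plain sequences of floats; no validation is performed.
    """
    return sum(distances) / sum(durations)

# Unlike a comment, the docstring is accessible at runtime:
print(mean_speed.__doc__)
```

Documentation generators such as Sphinx read these same strings, so a well-written docstring serves both the interactive user and the rendered documentation.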

Python also uses strict naming conventions: CamelCase for classes, snake_case for functions and variables, ALL_CAPS for constants. These are not enforced by the interpreter, but following them makes your code immediately recognizable to any Python developer.
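All three conventions in one hypothetical snippet (the names are made up for illustration):

```python
MAX_ITERATIONS = 100                         # constant: ALL_CAPS

class SignalProcessor:                       # class: CamelCase
    def apply_filter(self, raw_samples):     # function and variables: snake_case
        filtered_samples = [s * 0.5 for s in raw_samples]
        return filtered_samples
```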

Try running this in a Python shell:

import this

The “Zen of Python” captures the core philosophy behind idiomatic Python. Some of it is tongue-in-cheek, but the principles are sound.

Make Code Readable#

Use Descriptive Names#

Names are the primary way we understand code. A function called proc tells you nothing; compute_speed tells you everything. If a variable name requires a comment to explain what it holds, the name is wrong.

Nondescriptive names force the reader to memorize arbitrary mappings. Comments are a poor patch for bad naming. Fix the name instead.
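A before/after sketch of the same function (both names are illustrative):

```python
# Nondescriptive: the reader must reverse-engineer what d and t mean
def proc(d, t):
    return d / t

# Descriptive: the signature documents itself, no comment needed
def compute_speed(distance_m, duration_s):
    return distance_m / duration_s
```

The units in the parameter names (`distance_m`, `duration_s`) carry information a comment would otherwise have to supply.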

Use Comments Appropriately#

Comments serve three purposes:

  • why: Justify a non-obvious design choice or workaround.

  • what: Provide a high-level overview of a complex block.

  • how: Show usage examples, especially for public APIs.

Play-by-play comments that repeat what the code does line-by-line are clutter. If a line needs a comment to explain what it does, it is probably too complex or too cleverly written ;-).

Comments that explain why a specific approach was taken are always valuable: that context cannot be recovered from the code alone.
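All three comment purposes in one hedged sketch (the retry helper and its backoff constants are invented for illustration):

```python
import time

# how: retry_request(fetch) calls `fetch` and retries transient failures.
def retry_request(fetch, attempts=3):
    # what: loop until `fetch` succeeds or the attempts are exhausted.
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # why: exponential backoff avoids hammering a struggling server.
            time.sleep(0.1 * 2 ** attempt)
```

Note what is absent: no comment says "increment attempt" or "return the result". The code already says that.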

Don’t Fixate on Line Count#

Equating code quality with brevity is a common trap. Condensing logic into cryptic one-liners makes debugging painful. A slightly longer function with clear, step-by-step logic is always preferable.

Readability trumps brevity. Code is read far more often than it is written.

Embrace Consistency#

A consistent style (same indentation, same naming logic, same file structure) reduces cognitive load. When style varies across a codebase, the reader constantly has to adjust. Follow the established style guide (e.g., PEP 8 for Python) and stick to it.

Orthogonal Code#

The Concept#

In linear algebra, two vectors are orthogonal if their inner product is zero. Movement along one produces no movement along the other. Software development borrows this idea: two pieces of code are orthogonal if changes in one do not affect the behavior of the other.

Your database code should be orthogonal to your user interface code. Changing the color of a button should not break a database query. Changing the database schema should not change the color of a button.

Benefits#

  • Isolated execution: Run specific parts of your analysis without re-running everything. No need to redo a heavy computation just to tweak a plot style.

  • Localized bugs: If a component fails, the issue is likely within that component, not a side effect from somewhere else.

  • Reusability: A plotting function that knows nothing about how the data was computed can be used for any data.

  • Testability: Write unit tests for one module without mocking the entire system.

Trade-offs#

  • Upfront effort: Orthogonal code requires designing interfaces and data structures to pass information between decoupled components.

  • Harder to trace: The logic jumps between independent functions and modules rather than flowing linearly through a single script.

  • Fragmentation risk: Taken too far, you end up with hundreds of tiny micro-functions that are tedious to navigate.

Example#

Consider a script that performs K-means clustering on the Iris dataset. A non-orthogonal version mixes computation and visualization in the same loop:

from itertools import combinations
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                              # 150 samples, 4 features
pairs = list(combinations(range(X.shape[1]), 2))  # all 6 feature pairs
fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.ravel()

for i, (x_index, y_index) in enumerate(pairs):
    # Computation
    kmeans = KMeans(n_clusters=3, random_state=42)
    kmeans.fit(X[:, [x_index, y_index]])
    labels = kmeans.labels_

    # Visualization, tightly coupled to the computation
    axs[i].scatter(X[:, x_index], X[:, y_index], c=labels, cmap='viridis',
                   edgecolor='k', s=100)

Want to change the plot color? You re-run the clustering. Want different clusters? You re-draw everything.

The orthogonal version separates the two concerns. Computation runs first and stores its results:

def perform_kmeans(X, x_index, y_index, n_clusters=3):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X[:, [x_index, y_index]])
    return kmeans.labels_

all_labels = [perform_kmeans(X, x, y) for x, y in pairs]

Visualization uses the stored results independently:

def plot_clusters(X, labels, x_index, y_index, title, ax):
    ax.scatter(X[:, x_index], X[:, y_index], c=labels, cmap='viridis',
               edgecolor='k', s=100)
    ax.set_title(title)

for i, (x_index, y_index) in enumerate(pairs):
    plot_clusters(X, all_labels[i], x_index, y_index, titles[i], axs[i])

The code is longer, but the payoff is that if the computation takes three days on a cluster, you run it once, save all_labels, and tweak your plots locally as many times as you want.

The DRY Principle#

The Idea#

DRY stands for Don’t Repeat Yourself. The principle states that every piece of knowledge should have a single, unambiguous representation in a system. If you perform the same logical operation in multiple places, define it once and reuse it.

Benefits#

  • Fewer bugs: Duplicated logic means a fix in one place might be missed in another.

  • Easier maintenance: Centralized logic means updates happen in one location and propagate everywhere.

  • Better quality: Components that get reused also get tested more thoroughly.

  • Cleaner architecture: DRY naturally pushes you toward modular, decoupled design.

When DRY Backfires#

DRY is not a dogma. Pushed too far, it creates problems of its own:

  • Over-abstraction: Code so generic it loses clarity. Functions that handle every edge case become convoluted.

  • Coupling: Forcing two slightly different use cases into one function produces “parameter soup”: functions with complex configuration to cover both cases.

  • Fragility: If two unrelated parts of an application share code by accident (not because the logic is the same), changing the shared code for one part can break the other.

The goal is balance. As a practical rule: duplication is far cheaper than the wrong abstraction. How far to take DRY is a judgment call that comes with experience.
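The "parameter soup" failure mode can be sketched with a hypothetical function (the name `summarize` and its flags are invented for illustration). Two callers with slightly different needs were merged into one function, and each new need added a flag:

```python
def summarize(values, as_percent=False, round_digits=None, skip_zeros=False):
    # Each keyword exists for exactly one caller; no caller uses them all.
    if skip_zeros:
        values = [v for v in values if v != 0]
    mean = sum(values) / len(values)
    if as_percent:
        mean *= 100
    if round_digits is not None:
        mean = round(mean, round_digits)
    return mean
```

Two separate, simpler functions would likely serve both callers better; the apparent duplication here is cheaper than the coupling.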

Single Source of Truth#

Closely related to DRY is the concept of Single Source of Truth (SSOT). While DRY usually refers to logic, SSOT refers to data: every data element should be defined in exactly one place.

In scientific computing, the boundary between data and code often blurs. Column names in a DataFrame are a good example: are they data (strings in a CSV) or code (identifiers in your script)? Either way, defining them once is always better.

Example#

A common source of duplication in data analysis is “magic strings”. These are column names repeated as string literals throughout a script.

The WET version (Write Everything Twice):

print(fighter_df['Name'])
print(fighter_df['Age'])
# 'Name' and 'Age' appear as strings in many places

If the CSV column is renamed from "Name" to "Vorname", you have to find and replace every occurrence. Miss one, and the script fails silently or crashes far from where the actual problem is.

The DRY version:

from types import SimpleNamespace

fighter_cn = SimpleNamespace()
fighter_cn.name = 'Name'
fighter_cn.age = 'Age'
fighter_cn.city = 'City'

print(fighter_df[fighter_cn.name])
print(fighter_df[fighter_cn.age])

This seems like overhead for a short script, but in a real project it gives you:

  1. Trivial refactoring: change the column name in one place.

  2. IDE autocompletion: fighter_cn. triggers suggestions; "Na..." does not.

  3. Immediate error detection: fighter_cn.naame raises AttributeError right away, whereas assigning to fighter_df["Naame"] silently creates a new column instead of updating the existing one.

Single Source of Truth (SSOT)#

The Idea#

The Single Source of Truth (SSOT) principle dictates that every distinct piece of information within a system must be defined in exactly one location. Once defined, this information is only referred to or derived from that primary definition. While the DRY principle typically governs procedural logic and behavior, SSOT governs state, data, and metadata.

Project-Wide Application#

The necessity of a single source of truth extends across the entirety of a computational project:

  • Data Entities: Raw data must be treated as immutable and stored in a definitive location. Intermediate or cleaned datasets must be derived programmatically from this source via reproducible scripts, rather than being manually copied or modified.

  • Metadata: Project-level information, such as version numbers, author details, and license declarations, must be centralized. Defining a version number in a single standard configuration file (e.g., pyproject.toml) prevents conflicting version reports across environments and documentation.

  • Configuration Parameters: Algorithmic thresholds, hyperparameters, and file paths must be defined in a dedicated configuration file and passed into the execution environment. Hardcoding these parameters throughout the source code creates hidden, fragmented states.

Benefits and Integration with DRY#

  • Elimination of Conflicting States: When information is duplicated, synchronization is continuously required. SSOT eliminates the possibility of components operating on conflicting definitions.

  • Guaranteed Reproducibility: A project’s execution can only be reliably reproduced if its parameters and inputs are definitively known and uniquely identifiable.

  • Synergy with DRY: SSOT and DRY are complementary architectural patterns. DRY ensures that the method for processing data is written once, while SSOT ensures that the data or state being processed is defined once.

Example: Project Versioning#

A pervasive violation of SSOT in software projects is the manual duplication of project metadata.

The Duplicated State approach:

# pyproject.toml
version = "1.2.0"

# src/my_package/__init__.py
__version__ = "1.2.0"

# docs/conf.py
release = "1.2.0"

If a release is prepared, the version string must be manually updated in three discrete locations. If one update is overlooked, the project enters an inconsistent state, potentially causing deployment failures or documentation mismatches.

The SSOT approach:

The version is defined exclusively in the standardized project configuration file.

# pyproject.toml
[project]
name = "my_package"
version = "1.2.0"

All other components derive this information dynamically at runtime using standard library utilities.

# src/my_package/__init__.py
from importlib.metadata import version

# The version is derived dynamically from the installed package metadata
__version__ = version("my_package")

This architectural pattern guarantees that a version bump requires only a single modification. The change propagates automatically, ensuring the entire project remains strictly synchronized and unambiguous.

Sources:
Hunt, A., & Thomas, D. (2019). The Pragmatic Programmer: Your journey to mastery. Addison-Wesley.