Code Structure#
Codebase Architecture#
A reproducible and scalable computational project requires a structural foundation that enforces the Separation of Concerns (SoC). The codebase must be organized so that distinct operational domains, like code, configuration parameters, environment variables, infrastructure definitions, and data, are as isolated from one another as possible.
By designing the directory structure to reflect these boundaries, the codebase naturally prevents the entanglement of logic and configuration. Furthermore, the structural separation guarantees that core analytical logic can be reused across multiple execution contexts without duplicating code.
Directory Roles#
To achieve this separation, a standard repository utilizes distinct top-level directories, each serving a singular purpose:
- `src/` (Core Logic): Contains the generalized, reusable Python package. Functions and classes defined here do not execute independently; they wait to be imported, and they remain agnostic to external configurations and environment states.
- `scripts/` (Operational Execution): Contains the procedural entry points (e.g., `train_model.py`, `download_data.py`). These scripts act as the orchestrators: they load the environment variables (`.env`), parse the configuration parameters, and apply them to the reusable logic imported from the `src/` directory.
- `notebooks/` (Exploration and Prototyping): Reserved exclusively for Jupyter notebooks used for Exploratory Data Analysis (EDA), prototyping, and interactive visualization. Notebooks import the core logic from `src/` but are never imported themselves.
- `containers/` (Infrastructure Manifests): Houses the declarative environment blueprints (e.g., `Dockerfile`, `Apptainer.def`). Isolating these manifests keeps the repository root clean and strictly decouples system-level dependencies from the Python application logic.
- `data/` (Datasets): Contains raw and processed datasets, isolated from the execution logic; large datasets are typically excluded from Git and, where versioning is needed, tracked with dedicated data-versioning tools.
By structuring the project in this manner, an analysis pipeline can be run identically via an automated script in the scripts/ folder or interactively within the notebooks/ folder, simply by importing the same underlying function from src/, all while executing within a reproducible system environment defined in containers/.
The src/ Layout#
The src directory is central to a well-engineered Python project. Standard tools like pip, uv, and setuptools are designed to natively recognize this structural pattern (often referred to as the “src-layout”).
In Python, “installing” a package essentially means directing the interpreter to the code’s location so it can be globally imported, regardless of the active working directory. By placing the codebase inside a dedicated folder like src/mypkgs, the core analysis logic is physically isolated from runtime scripts and configuration files.
When a subdirectory (e.g., mypkgs) contains an __init__.py file, Python treats it as a package (see the official modules tutorial). Any .py file inside becomes a submodule that can be imported, identically to third-party libraries such as pandas or numpy.
Example Implementation#
Consider the following file structure, which strictly separates the reusable package (mypkgs) from the operational execution environment (scripts):
```
.
├── scripts/
│   └── run_hello.py
├── src/
│   └── mypkgs/
│       ├── __init__.py
│       └── hello.py
└── pyproject.toml
```
If hello.py contains a simple function:
```python
# src/mypkgs/hello.py
def say_hello(name: str) -> None:
    """Greets a name"""
    print(f"Hello {name}")
```
Editable Installations#
Assuming a valid pyproject.toml configuration is present at the repository root, this local package can be integrated into the active virtual environment in “editable” (or development) mode. Rather than copying files, an editable install places a link to the src/ directory on the interpreter's path, meaning any modifications to the underlying Python files are immediately reflected without requiring reinstallation.
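As a sketch, a minimal pyproject.toml supporting this src-layout might look like the following (the setuptools backend is one possible choice; any PEP 660-capable backend supports editable installs):

```toml
[build-system]
requires = ["setuptools>=64"]  # >=64 is required for editable installs (PEP 660)
build-backend = "setuptools.build_meta"

[project]
name = "mypkgs"
version = "0.1.0"

# Tell setuptools that the package lives under src/
[tool.setuptools.packages.find]
where = ["src"]
```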
1. Install the project
Utilizing modern tooling like uv, synchronization automatically installs the local package defined in pyproject.toml in editable mode:
```shell
uv sync
```
Alternatively, utilizing the standard pip installer:
```shell
pip install -e .
```
2. Import and execute

Once installed, the functions can be imported into any operational script, notebook, or shell operating within that virtual environment. The script acts purely as the execution trigger, importing the logic from the isolated package:
```python
# scripts/run_hello.py
from mypkgs.hello import say_hello

if __name__ == "__main__":
    # Executes the logic defined in src/mypkgs/hello.py
    say_hello(name="Bob")
```
Testing and Documentation#
Beyond the core analytical logic and execution scripts, a robust codebase requires dedicated infrastructure for functional validation and communication. According to the principle of Separation of Concerns, these auxiliary tasks must be structurally isolated into their own independent locations.
Testing Isolation and Scope#
The tests/ directory is dedicated to verifying the installable codebase (i.e., the code in src/).
A testing framework (such as pytest) operates by importing the pure functions and classes from src/ and asserting that they produce the expected outputs given controlled inputs.
Crucially, the testing framework must strictly ignore the scripts/ and notebooks/ directories.
Operational scripts and interactive notebooks are execution endpoints.
They are inherently stateful, dependent on specific environment variables, and heavily coupled to external data files or network connections.
Because they lack modularity, they are practically impossible to unit test reliably.
This structural rule enforces codebase hygiene: if a procedural script or a Jupyter notebook contains complex data-processing logic that requires validation, that logic is in the wrong place.
It must be extracted, generalized, and moved into the src/ directory.
Once in src/, it can be safely imported by the test suite for validation, and subsequently imported back into the script for execution.
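As a minimal sketch of this extract-and-test workflow, suppose src/ contained a hypothetical pure function normalize (both the function and the file names below are illustrative, not part of the example package above):

```python
# A hypothetical pure function that would live in src/ (e.g. src/mypkgs/stats.py)
def normalize(values: list[float]) -> list[float]:
    """Scale values so that they sum to 1.0."""
    total = sum(values)
    return [v / total for v in values]


# tests/test_stats.py would import normalize from the installed package
# and assert on controlled inputs, with no environment or data dependency:
def test_normalize_sums_to_one():
    result = normalize([2.0, 2.0, 4.0])
    assert result == [0.25, 0.25, 0.5]
    assert abs(sum(result) - 1.0) < 1e-12
```

Because the function is pure, the test needs no environment variables, data files, or network access; pytest simply collects and runs it.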
Dependency Separation via pyproject.toml#
Because testing and documentation are independent concerns, their software dependencies must remain isolated from the primary analysis environment.
Historically, this isolation was achieved by placing distinct requirements.txt files inside the tests/ and docs/ directories. However, modern Python packaging consolidates these declarations within the central pyproject.toml file utilizing optional dependencies (or modern dependency groups).
This approach maintains strict environment isolation while centralizing dependency management. It delineates the operating environments into distinct categories:
- Core Dependencies: Required to execute the operational data pipeline (e.g., `pandas`, `scikit-learn`).
- Test Dependencies: Required exclusively for functional validation (e.g., `pytest`, `pytest-cov`).
- Documentation Dependencies: Required exclusively for generating static sites (e.g., `sphinx`, `mkdocs`).
Example pyproject.toml Configuration#
By defining auxiliary domains as optional extras, the primary execution environment remains lightweight and free of unnecessary bloat. Only the core dependencies are installed by default:
```toml
[project]
name = "mypkgs"
version = "0.1.0"
# Core dependencies installed by default
dependencies = [
    "pandas>=2.0.0",
    "numpy>=1.24.0",
]

[project.optional-dependencies]
# Auxiliary dependencies installed only upon request
test = [
    "pytest>=7.0",
    "pytest-cov",
]
docs = [
    "sphinx",
    "mkdocs",
]
```
Explicit Installation#
When provisioning the environment for a specific task (e.g., running the CI/CD testing pipeline), the auxiliary dependencies are explicitly invoked. The core analysis environment remains unaffected unless these flags are passed.
Utilizing modern uv synchronization:
```shell
# Installs core dependencies + testing tools
uv sync --extra test
```
Utilizing the standard pip installer:
```shell
# Installs the local package in editable mode + testing tools
# (quoted so shells like zsh do not expand the brackets)
pip install -e ".[test]"
```
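For comparison, the newer dependency-groups mechanism (PEP 735, supported by uv) expresses the same separation via groups that are never published as package extras; a sketch:

```toml
[dependency-groups]
test = ["pytest>=7.0", "pytest-cov"]
docs = ["sphinx", "mkdocs"]
```

These groups would then be installed on demand with `uv sync --group test`.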
Note on Notebooks and Documentation#
While tools like Quarto or Jupyter Book allow documents to be authored directly using .ipynb notebooks, these source notebooks should generally remain within the notebooks/ directory.
The docs/ folder is strictly reserved for the structural configuration, styling templates, and static assets required to compile those source notebooks and the src/ package docstrings into a final, deployable website or book.
Container Blueprints#
To maintain a clean repository root and strictly adhere to the Separation of Concerns, environment definitions (container blueprints) should be physically separated from operational logic (scripts) and core analytical code (the src/ directory).
Furthermore, local environment variables (such as those stored in a .env file) must remain strictly decoupled from the immutable container image. They must be excluded during the build phase and injected dynamically at runtime.
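On the code side, this pattern means scripts read their configuration from the process environment at startup rather than from values baked into the image. A minimal sketch (the variable names are illustrative):

```python
import os


def load_runtime_config() -> dict[str, str]:
    """Read configuration injected at container start,
    e.g. via `docker run --env-file .env`."""
    return {
        # Non-secret settings may carry safe defaults
        "data_dir": os.environ.get("DATA_DIR", "data/"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
```

The same image can thus run unchanged across laptops, CI, and clusters, with only the injected environment differing.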
Directory Structure#
By utilizing a containers/ directory, the codebase clearly delineates infrastructure from analytical logic:
```
.
├── containers/
│   ├── pipeline.Dockerfile
│   └── pipeline.def
├── scripts/
│   └── drafts/
│       └── hello.py
├── src/
│   └── mypkgs/
├── .dockerignore   # Prevents secrets from being copied into the image
├── .env            # Local secrets (ignored by version control)
├── pyproject.toml
└── README.md
```
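As an illustrative sketch, the .dockerignore above might exclude secrets and heavyweight artifacts from the build context (comments in .dockerignore must sit on their own lines):

```
# Never copy secrets into the image
.env
# Large datasets are mounted or fetched at runtime, not baked in
data/
# Version-control and local-environment artifacts
.git/
.venv/
notebooks/
```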