Separation of Concerns (SoC)

The separation of logic, parameters, environment, and data

my_data_science_project/
│
├── config/            # Centralized parameterizations
├── data/              # Input data
└── code/              # Code and scripts

Key Takeaways:

  • Foundational separation: Code is physically isolated from configuration parameters and datasets.

  • Portability: Hardcoded paths and parameters are explicitly eliminated from the source code.

  • Clarity: The core domains of the project are immediately identifiable to collaborators.

my_data_science_project/
│
├── config/            # Parameters (YAML/TOML)
├── .env               # Environment/State (paths, secrets)
│
├── scripts/           # Logic (execution)
│
├── data/              # Data (inputs)
└── results/           # Data (outputs/deliverables)

Key Takeaways:

  • Environment isolation: Local paths and secrets are extracted to .env files.

  • Security: The .env file must never be committed to version control.

  • Dynamic loading: Paths are fetched programmatically at runtime rather than hardcoded:

import os
from dotenv import load_dotenv  # requires the python-dotenv package

# Load variables from the local .env file into the process environment
load_dotenv()

# Fetch environment-specific paths instead of hardcoding them
output_dir = os.getenv("OUTPUT_DIR")

my_data_science_project/
│
├── data/
│   ├── raw/           # Immutable data dumps
│   ├── interim/       # Intermediary data
│   └── final/         # Cleaned, tidy data
│
├── src/               # Reusable logic (Python package)
├── scripts/           # Executable logic (batch scripts)
│
├── config/            # Parameters
├── .env               # Environment state
└── results/           # Outputs

Key Takeaways:

  • Logic division: Reusable modules (src/) are strictly separated from executable routines (scripts/).

  • Data lineage: Data is divided into discrete stages to document processing steps and transformations.

  • Immutability: Raw data is strictly preserved and never overwritten by analytical scripts.
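The lineage and immutability rules can be sketched as a pipeline step that reads from data/raw/ and writes only to data/interim/. The file names and the clean() transformation below are placeholders; the raw dump is written here only so the sketch runs standalone — in practice it would be delivered, never produced by analysis code.

```python
from pathlib import Path

RAW = Path("data/raw")        # immutable inputs: read-only by convention
INTERIM = Path("data/interim")
for d in (RAW, INTERIM):
    d.mkdir(parents=True, exist_ok=True)

def clean(text: str) -> str:
    """Placeholder transformation: collapse runs of whitespace."""
    return " ".join(text.split())

# Stand-in for an immutable raw dump (normally delivered, not written by code)
(RAW / "dump.txt").write_text("  messy   input  ")

# The script reads from raw/ and writes its result to interim/,
# so the original dump is never overwritten
cleaned = clean((RAW / "dump.txt").read_text())
(INTERIM / "dump_clean.txt").write_text(cleaned)
```

Each stage directory thus documents one transformation: anything in interim/ can be regenerated from raw/ by re-running the script.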

my_data_science_project/
│
├── data/              # (raw, interim, final)
├── src/               # (reusable logic)
├── scripts/           # (executable logic)
├── results/           # (outputs)
├── config/            # (parameters)
│
├── docs/              # Documentation source files
├── pyproject.toml     # Project metadata and dependencies
├── README.md          # Project overview
├── LICENSE            # Usage rights
├── .env.example       # Environment template
└── .gitignore         # Version control exclusions

Key Takeaways:

  • Self-description: Essential context and usage instructions are provided at the root level.

  • Dependency management: Required packages and metadata are defined centrally (e.g., pyproject.toml).

  • Safe onboarding: Templates (.env.example) are provided so collaborators can safely configure their local environments without sharing secrets.
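As a sketch, a minimal pyproject.toml for this layout might look like the following; the project name, version, and dependency list are placeholders, not prescribed values:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my_data_science_project"   # placeholder metadata
version = "0.1.0"
dependencies = [
    "python-dotenv",               # used for .env loading above
]

[project.optional-dependencies]
test = ["pytest"]
```

With this file at the root, a collaborator can install the project and its dependencies with a single `pip install -e .`.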

Quality and Control

my_data_science_project/
│
├── tests/             # Automated unit tests
├── benchmark/         # Performance tracking
│
├── .github/
│   └── workflows/     # GitHub CI/CD pipelines
│
├── .gitlab-ci.yml     # GitLab CI/CD pipeline

Key Takeaways:

  • Automated assurance: Code correctness and computational efficiency are systematically verified through dedicated tests/ and benchmark/ suites.

  • Continuous Integration (CI): Automated pipelines (.github/workflows/ or .gitlab-ci.yml) execute tests immediately whenever the codebase is modified.

  • Reliability: Bugs and performance regressions are caught prior to publication, ensuring that analytical results remain stable and reproducible.
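A GitHub-based pipeline for this layout could be sketched as the workflow below; the workflow, job, and step names are illustrative, and the install step assumes a pyproject.toml defining a "test" extra:

```yaml
# .github/workflows/ci.yml — minimal sketch, not a prescribed configuration
name: CI
on: [push, pull_request]

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/
```

Every push and pull request then runs the test suite automatically, so regressions surface before results are published.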