Managing Data#

Linking Data to a Project#

While code defines the execution of a computational process, data constitutes the actual entity being processed. Data must be linked directly to the project repository to allow for reproducibility and to facilitate the environment setup process.

data/ Directory#

A single entry point for all datasets facilitates orientation and streamlines project setup. This entry point is typically the data/ directory at the root of a repository. To maintain pipeline integrity and prevent data corruption, this directory is subdivided into three distinct stages:

  • data/raw/:
    The original, immutable data. Files in this directory must not be edited and are to be treated as strictly read-only. Data cleaning or correction is performed via scripts that read from raw/ and output to intermediate directories.

  • data/interim/:
    Intermediate data that has been transformed or cleaned but is not yet formatted for the final modeling or analysis phase.

  • data/final/:
    The canonical datasets utilized directly for modeling, reporting, or publication.
    This folder can be subdivided further, e.g., into data/final/results, a directory that contains only the synthesized results needed for visualization and reporting.
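The stages above yield a layout like the following (the results/ subdirectory being optional):

```text
data/
├── raw/         # immutable originals (read-only)
├── interim/     # cleaned or transformed intermediates
└── final/       # canonical datasets for modeling and reporting
    └── results/ # synthesized results for visualization
```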

The data/README.md#

The root README.md should remain short and concise and is therefore restricted to stating data location and ownership. All further explanations and metadata belong in data/README.md.

This file should contain:

  • Data dictionaries (definitions for each column or variable).

  • Detailed provenance (exact URLs or database queries utilized to fetch the raw data).

  • A technical summary of the scripts and transformations applied to move data from raw/ to interim/ and ultimately to final/.
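A minimal skeleton for data/README.md might look like this; the section names and script paths are illustrative, not prescriptive:

```markdown
# Data

## Data Dictionary
| Column | Type | Description |
|--------|------|-------------|
| ...    | ...  | ...         |

## Provenance
Raw data was fetched from <exact URL or database query>.

## Processing
- raw/ -> interim/: src/clean.py (removes invalid rows)
- interim/ -> final/: src/aggregate.py (builds the modeling tables)
```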

Infrastructure & Performance Considerations#

The storage location and access methodology of data significantly impact computational performance, particularly within cloud and cluster infrastructures.

  • I/O Latency: Reading numerous small files directly from a network drive on a cluster introduces high latency and can degrade overall network performance.

  • Data Locality: Transferring data to the compute node’s local scratch space (e.g., /tmp or $SCRATCH) prior to script execution is often required to achieve optimal I/O efficiency.

The data/ directory structure and the associated codebase must account for these infrastructure variations. Simply copying all data into the data/ folder is not a viable option for projects with non-trivial datasets (i.e., > 100 MB).

Scripts must be configurable (e.g., via environment variables in .env or JSON files in config/) to read from different data paths depending on the execution environment, such as a local workstation versus a high-performance compute node.
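A minimal sketch of such environment-aware path resolution; the variable name DATA_DIR, the helper name, and the fallback "data" are illustrative assumptions, not a prescribed convention:

```python
import os
from pathlib import Path

def resolve_data_dir(default: str = "data") -> Path:
    """Return the data root, preferring the DATA_DIR environment variable."""
    return Path(os.getenv("DATA_DIR", default))

# On a workstation DATA_DIR is typically unset, so the repository-local
# data/ folder is used; on a compute node, DATA_DIR can point at scratch
# space, e.g. DATA_DIR=/scratch/$USER/project-data.
raw_dir = resolve_data_dir() / "raw"
```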

Versioning Data 1.0#

Git LFS & Submodules

Git is not designed to handle large datasets and binary files. Committing large files, such as multi-gigabyte CSVs, increases repository size, slows down operations, and frequently results in errors due to file size limits imposed by hosting providers.

Git Large File Storage (LFS) in combination with Git Submodules is one way to mitigate this problem. Together, they allow versioned data to be linked into the primary code repository without increasing its footprint.

Git Large File Storage (LFS)#

Git LFS operates by replacing large files in the repository with small text pointer files, while the actual file contents are stored on a remote server.

Workflow for creating a standalone data repository:

# Initialize LFS in a dedicated data repository (e.g., project-data-repo)
git lfs install

# Configure LFS to track specific file extensions
git lfs track "*.csv"
git lfs track "*.h5"

# Commit and push changes (large files are routed to LFS storage, not Git history)
git add .gitattributes ./dataset.csv
git commit -m "Add raw dataset"
git push origin main
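After tracking, the file checked into Git history is only a small text pointer; its contents look roughly like this (the hash and size are placeholders):

```text
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character-hex-digest>
size 2147483648
```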

Git Submodules & “Lazy Loading”#

Rather than storing LFS data directly within the primary code repository, the recommended approach is to isolate the data in an independent repository and link it to the code repository using a Git Submodule.

A submodule functions as a strict reference to a specific commit of an external repository.

Linking the data to the project:

# Execute within the primary code repository
git submodule add https://github.com/<owner>/<data-repo>.git data/raw/
git commit -m "Link raw data repository as a submodule"
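The command above records the link in a .gitmodules file at the repository root, roughly:

```ini
[submodule "data/raw"]
	path = data/raw
	url = https://github.com/<owner>/<data-repo>.git
```

Note that the pinned commit itself is stored in the parent repository's tree as a gitlink entry, not in .gitmodules.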

Lazy Loading Mechanism

When a repository that contains a submodule is cloned, the submodule directory (e.g., data/raw/) is uninitialized and empty by default. Large datasets are not downloaded automatically.

This mechanism allows local development to proceed using minimal or synthetic data, bypassing unnecessary data transfers. When the codebase is deployed to a compute node on a cluster, the required data is explicitly pulled:

# Execute on the compute node where the data is required
git submodule update --init --recursive

This strategy conserves network bandwidth and local disk space while ensuring that the compute environment retrieves the exact required version of the data.

Versioning Data 2.0#

Data Version Control (DVC)

For projects involving complex data pipelines, large-scale datasets, or reliance on cloud storage infrastructure (e.g., AWS S3, Google Cloud Storage, or institutional SSH servers), Data Version Control (DVC) is frequently implemented.

DVC functions similarly to Git but is optimized for large data artifacts and machine learning models. Data is tracked using lightweight .dvc metadata files committed to the Git repository, while the actual data payloads are pushed to a specified remote storage backend.

Implementation: DVC & Git#

The following steps outline the initialization and versioning process when utilizing DVC alongside Git.

Initialization and Configuration#

# Execute within an initialized Git repository
dvc init

# Configure a remote storage backend (e.g., an AWS S3 bucket)
dvc remote add -d myremote s3://my-lab-bucket/project-data

Adding Data to DVC#

Large files or directories are tracked using dvc add rather than git add.

# Track the raw data directory with DVC
dvc add data/raw/

# DVC generates a 'data/raw.dvc' metadata tracking file. 
# This metadata file is then committed to Git.
git add data/raw.dvc data/.gitignore
git commit -m "Track raw data with DVC"
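The generated data/raw.dvc file is a small YAML record pointing at the content hash; it looks roughly like this (hash and sizes are placeholders):

```yaml
outs:
- md5: <directory-hash>.dir
  size: 104857600
  nfiles: 42
  path: raw
```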

Pushing Data and Code#

Code is pushed to the Git remote, while data is pushed to the DVC remote.

git push origin main
dvc push

Retrieving Data#

When the Git repository is cloned, only the codebase and the .dvc tracking files are initially retrieved. To download the actual datasets linked to a specific Git commit, dvc pull must be executed:

git clone https://github.com/<owner>/<repo-name>.git
cd repo-name

# Retrieves the exact data versions specified by the current Git commit
dvc pull

Storage Agnosticism and Pipeline Tracking

  • Storage Agnosticism: DVC permits the use of various storage backends (e.g., institutional SSH servers, AWS S3) rather than relying exclusively on Git hosting platforms for large file storage.

  • Pipeline Tracking: DVC possesses capabilities to track the computational steps that generate data, facilitating the reproduction of transformations from data/raw/ to data/final/.
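Pipeline stages are declared in a dvc.yaml file; a minimal sketch with hypothetical script names:

```yaml
stages:
  clean:
    cmd: python src/clean.py
    deps:
      - data/raw
      - src/clean.py
    outs:
      - data/interim
  finalize:
    cmd: python src/aggregate.py
    deps:
      - data/interim
      - src/aggregate.py
    outs:
      - data/final
```

Running dvc repro then re-executes only the stages whose dependencies have changed.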

Accessing Object Storage#

In modern research environments, massive datasets are rarely stored on traditional network drives. Instead, they are hosted on Object Storage systems like AWS S3, Google Cloud Storage, or institutional OpenStack Swift clusters.

Unlike traditional file systems, object storage does not have a real directory hierarchy. Data is stored as discrete “objects” inside a flat “bucket” and accessed via HTTP APIs.

Integration with Git LFS and DVC#

You do not have to abandon version control just because your data is on S3:

  • DVC (Data Version Control):
    DVC natively uses object storage. When you run dvc push, it uploads your datasets directly to your configured S3 or Swift bucket.

  • Git LFS:
    By default, Git LFS uploads your large files to the internal servers of your Git provider (like GitHub or GitLab). However, if you are running out of GitHub quota, Git LFS can be configured to use S3 as its backend by using custom transfer agents (like lfs-s3).

Documentation & Security#

When your code pulls data directly from S3 or Swift, it requires authentication. Never hardcode your Access Keys or Secret Keys into your scripts. Instead, document the required connection parameters in your data/README.md and provide a .env.example file in the root of your repository so users know what credentials they need to supply.

Example .env.example:

# Object Storage Configuration
OS_ENDPOINT_URL=https://s3.your-institution.edu
OS_BUCKET_NAME=my-research-data-bucket
OS_ACCESS_KEY_ID=your_access_key_here
OS_SECRET_ACCESS_KEY=your_secret_key_here

In your documentation, clearly state how a user can obtain these credentials (e.g., “To run this pipeline, request an S3 access token from the institutional IT helpdesk and place it in your local .env file”).
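To surface misconfiguration immediately rather than mid-pipeline, scripts can fail fast on missing credentials. A minimal sketch; require_env is a hypothetical helper, and the variable names match the .env.example above:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing required environment variable {name}; "
            "see data/README.md for how to obtain credentials."
        )
    return value
```

Calling, e.g., require_env("OS_ACCESS_KEY_ID") at startup points users directly at the documentation instead of producing an opaque authentication error later.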

Patterns & Tooling#

Use the data/README.md file to explicitly document how the data should be brought into the compute environment. There are two primary access patterns for object storage, and the choice greatly affects performance on a cluster:

Bulk Downloading to Local Scratch#

For deep learning or processes that read the same files multiple times, it is usually best to download the data to the compute node’s $SCRATCH or $TMPDIR space before the Python script runs.

A commonly used tool for interacting with S3-compatible storage is the MinIO Client (mc). Document the exact commands users need to run to sync the data:

# ./data/README.md

# 1. Configure the connection (using environment variables)
mc alias set my-s3 $OS_ENDPOINT_URL $OS_ACCESS_KEY_ID $OS_SECRET_ACCESS_KEY

# 2. Mirror the remote bucket to the compute node's fast local scratch space
mc mirror my-s3/my-research-data-bucket/raw_images/ /tmp/scratch/raw_images/

Alternatively, if programmatic downloading within the analysis logic is preferred, a library such as boto3 can be used. This approach is highly effective for fetching specific datasets dynamically and storing them on the local disk prior to processing.

Because the script remains environment-agnostic, connection credentials are fetched directly from the system environment, bypassing the need for hardcoded keys.

# src/data.py
import json
import os
import boto3

def download_from_object_store(endpoint_url, access_key, secret_key, bucket_name, object_key, local_path):
    """
    Downloads a specific object from an S3-compatible store (e.g., OpenStack Swift) 
    to a designated local path. 
    """
    s3_client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key
    )

    print(f"Downloading {object_key} from {bucket_name} to {local_path}...")
    s3_client.download_file(bucket_name, object_key, local_path)
    print("Download complete.")


if __name__ == "__main__":
    # 1. Fetch secure credentials from the execution environment
    env_endpoint = os.getenv("OS_ENDPOINT_URL")
    env_access_key = os.getenv("OS_ACCESS_KEY_ID")
    env_secret_key = os.getenv("OS_SECRET_ACCESS_KEY")
    config_file_path = os.getenv("OS_CONFIG_FILE")
    
    # 2. Load operational parameters from the designated configuration file
    with open(config_file_path, "r") as f:
        config = json.load(f)

    # 3. Execute the strictly scoped function
    download_from_object_store(
        endpoint_url=env_endpoint,
        access_key=env_access_key,
        secret_key=env_secret_key,
        bucket_name=config["bucket_name"],
        object_key=config["object_key"],
        local_path=config["local_path"]
    )
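The configuration file referenced by OS_CONFIG_FILE could look like this; the keys match what the script reads, while the values are illustrative:

```json
{
  "bucket_name": "my-research-data-bucket",
  "object_key": "raw_images/batch_001.tar",
  "local_path": "/tmp/scratch/batch_001.tar"
}
```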

Streaming Directly into Memory#

For processing large tabular datasets (like Parquet or CSV files) that only need to be read once, you can stream the data directly from object storage into Python using libraries like pandas with s3fs (or fsspec). This bypasses the local hard drive entirely.

Provide a minimal code snippet in your documentation, e.g.:

import pandas as pd
import os

# Pandas uses fsspec under the hood to stream directly from S3
storage_options = {
    "key": os.getenv("OS_ACCESS_KEY_ID"),
    "secret": os.getenv("OS_SECRET_ACCESS_KEY"),
    "client_kwargs": {"endpoint_url": os.getenv("OS_ENDPOINT_URL")}
}

# No local file is ever created; data goes straight to RAM
df = pd.read_parquet(
    "s3://my-research-data-bucket/interim/processed_records.parquet", 
    storage_options=storage_options
)

Performance Tips

  • Consider cluster limits: Streaming saves disk space; downloading speeds up repeated reads.

  • Avoid tiny-file streaming: Streaming thousands of tiny files directly from object storage during a training loop incurs per-request network overhead that will bottleneck a GPU. Therefore, clearly state whether a pipeline expects the data to be pre-downloaded or streamed on-the-fly.

Managing Hugging Face Models#

In modern machine learning, pre-trained models (like LLMs or vision models) are essentially massive, complex data artifacts. Just like raw datasets, the models should be strictly versioned and efficiently managed — especially when running on cluster infrastructures.

The Hugging Face Hub is the de facto standard for hosting these models. However, naively downloading models can quickly exhaust storage and ruin reproducibility.

Referring to Models#

A common reproducibility failure is vaguely stating the model used (e.g., “We used Mistral 7B”). Model repositories are updated frequently to fix bugs or remove toxic data.

To ensure exact reproducibility, define:

  1. The exact Model ID: e.g., mistralai/Mistral-7B-v0.1

  2. The exact Revision: A Git commit SHA or branch tag (e.g., revision="26bca36b...").

By pinning the model ID and the commit hash in configuration files (e.g., config.yaml), anyone running the code will pull identical weights.
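Such a pin could live in config.yaml; a sketch, where the revision value is a placeholder for a full commit SHA:

```yaml
model:
  repo_id: mistralai/Mistral-7B-v0.1
  revision: <full-commit-sha>
```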

Downloading and Storing Snapshots#

By default, Hugging Face libraries download massive files to a hidden cache folder in your home directory (~/.cache/huggingface/). On a High-Performance Computing (HPC) cluster, this is often a disaster, as home directories have strict quotas and slow network drives.

Best Practice 1: Environment Variables

When deploying to a cluster, always set the HF_HOME environment variable to point to your fast, high-capacity scratch space.

export HF_HOME=/scratch/your_username/hf_cache
uv run python scripts/train.py

Best Practice 2: Explicit Local Snapshots

Alternatively, use the huggingface_hub Python library to explicitly download the required files directly into your project’s data/models/ directory before running heavy computations:

from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    revision="26bca36b...", # Pin the version!
    local_dir="data/models/mistral_7b", # Save locally
    ignore_patterns=["*.msgpack", "*.h5"] # Only download PyTorch/Safetensors
)

Storing Adaptations (Fine-Tuning)#

If your project involves fine-tuning a model, you face a storage problem: fine-tuning a 15GB model normally results in a brand-new 15GB model. Saving multiple checkpoints will instantly fill your hard drive.

Use Parameter-Efficient Fine-Tuning (PEFT)

Instead of training the whole model, use techniques like LoRA (Low-Rank Adaptation). LoRA freezes the original base model and trains only a tiny set of new adapter weights.

  • The base model remains untouched (and is still downloaded via Hugging Face).

  • Your fine-tuned adaptation is saved as a tiny folder containing a few megabytes of adapter weights (e.g., adapter_config.json and adapter_model.safetensors).

Versioning Adapters

Because LoRA adapters are usually under 100MB, you can often version them directly in Git alongside your code (e.g., in data/models/my_lora_v1/), or easily track them via DVC without needing massive cloud storage quotas.