Versioning Data 1.0

Git LFS & Submodules

Limitations of standard Git for data tracking

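  • Git keeps every committed version of a file in history, so large data files steadily inflate the repository and can run into hosting limits (GitHub, for example, rejects files larger than 100 MiB).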

  • Approach: Combine Git LFS (for the data repository) with Git Submodules (to link data to the code repository).

Git Large File Storage (LFS)

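Git LFS replaces large files in the repository with small text pointers. The actual file contents are stored on a separate LFS server and downloaded on checkout.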
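To make the pointer idea concrete, here is a minimal shell sketch that checks whether a file on disk is still an LFS pointer rather than the real data. It assumes the pointer layout from the Git LFS v1 specification (a `version` line followed by `oid` and `size` lines). The function name `is_lfs_pointer` and the path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the file begins with the
# Git LFS v1 pointer header, i.e. the real content has not been downloaded.
is_lfs_pointer() {
    # Inspect only the first bytes; pointer files are tiny text files.
    first_line=$(head -c 100 "$1" | head -n 1)
    case "$first_line" in
        "version https://git-lfs.github.com/spec/v1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use (path is an assumption):
# is_lfs_pointer data/raw/dataset.csv && echo "still a pointer; run 'git lfs pull'"
```

A check like this can help catch the case where a job reads a roughly 130-byte pointer file instead of the dataset it expected.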

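# Initialize LFS in a dedicated data repository (registers the LFS filters and hooks)
git lfs install
# Track CSV files; the pattern is written to .gitattributes
git lfs track "*.csv"
# Commit .gitattributes along with the data so the tracking rule is versioned too
git add .gitattributes ./dataset.csv
git commit -m "Add raw dataset"

Git Submodules & “Lazy Loading”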

Linking data repositories via Submodules

# Register the data repository as a submodule checked out at data/raw/
git submodule add https://github.com/<owner>/<data-repo>.git data/raw/

Lazy Loading Initialization:

# A plain `git clone` of the code repository leaves data/raw/ empty;
# the data is pulled to the compute node only upon explicit initialization
git submodule update --init --recursive
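
As a sketch of how a job script might decide whether the data still needs to be loaded, the helper below classifies one line of `git submodule status` output. It relies on the documented prefix characters (`-` for not initialized, `+` when the checked-out commit differs from the recorded one, `U` for merge conflicts). The function name `submodule_state` is made up for illustration, and the path `data/raw` is taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper: map one line of `git submodule status` output
# to a readable state, based on its first character.
submodule_state() {
    case "$1" in
        "")  echo "unknown" ;;         # no status line was given
        -*)  echo "uninitialized" ;;   # data has not been fetched yet
        +*)  echo "modified" ;;        # checked-out commit differs from the recorded one
        U*)  echo "conflict" ;;        # merge conflicts in the submodule
        *)   echo "initialized" ;;     # leading space: matches the recorded commit
    esac
}

# Possible use on a compute node:
# state=$(submodule_state "$(git submodule status data/raw)")
# [ "$state" = "uninitialized" ] && git submodule update --init --recursive
```

Guarding the update this way keeps the download explicit: nodes that never need the raw data never fetch it.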