Versioning Data 1.0
Git LFS & Submodules
Limitations of standard Git for data tracking
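Committing large files directly to Git inflates the repository, since every version of each file stays in history, and soon runs into hosting limits (GitHub, for example, blocks files larger than 100 MiB).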
Approach: Combine Git LFS (for the data repository) with Git Submodules (to link data to the code repository).
Git Large File Storage (LFS)
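Git LFS replaces large files in the repository with small text pointers; the file contents themselves are stored on a separate LFS server and downloaded on checkout.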
# Initialize LFS in a dedicated data repository
git lfs install
git lfs track "*.csv"
git add .gitattributes ./dataset.csv
git commit -m "Add raw dataset"
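To make the pointer idea concrete, here is a minimal sketch that builds by hand the kind of text Git stores in place of an LFS-tracked file. The three-field layout follows the Git LFS pointer format; the helper name `make_pointer`, the sample file, and the use of `sha256sum` (GNU coreutils) are assumptions for illustration. In practice Git LFS generates the pointer itself.

```shell
#!/bin/sh
set -eu

# make_pointer (hypothetical helper): print an LFS-style pointer for a file.
# Assumes sha256sum is available; on macOS, `shasum -a 256` is the usual substitute.
make_pointer() {
    oid=$(sha256sum "$1" | cut -d ' ' -f 1)   # content hash identifies the object
    size=$(wc -c < "$1" | tr -d ' ')          # size in bytes, padding stripped
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$oid"
    printf 'size %s\n' "$size"
}

# Demo on a throwaway file standing in for dataset.csv.
sample=$(mktemp)
printf 'id,value\n1,3.14\n' > "$sample"
make_pointer "$sample"
rm -f "$sample"
```

The output is three short lines no matter how large the file is, which is why the history of the data repository stays small. The `git lfs track "*.csv"` step records the pattern in `.gitattributes` as `*.csv filter=lfs diff=lfs merge=lfs -text`.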
Git Submodules & “Lazy Loading”
Linking data repositories via Submodules
git submodule add https://github.com/<owner>/<data-repo>.git data/raw/
Lazy Loading Initialization:
# A plain `git clone` leaves data/raw/ empty; the data is pulled to the
# compute node only when the submodule is explicitly initialized
git submodule update --init --recursive
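The lazy-loading behaviour can be tried end to end without a remote. The sketch below is a self-contained demo under stated assumptions: two throwaway local repositories stand in for the data and code repositories, the wrapper `g` and all paths are hypothetical, and Git LFS is left out so that only the submodule mechanics are shown.

```shell
#!/bin/sh
set -eu

# Scratch area for the demo; remove it with `rm -rf "$work"` when done.
work=$(mktemp -d)

# Wrapper (hypothetical): recent Git versions block local file-based submodule
# URLs by default, so the demo allows them explicitly. A real HTTPS remote
# would not need this. A demo identity is set so commits succeed anywhere.
g() {
    git -c protocol.file.allow=always \
        -c user.name=demo -c user.email=demo@example.com "$@"
}

# 1. Stand-in data repository with a single file.
g init -q "$work/data-repo"
printf 'id,value\n' > "$work/data-repo/dataset.csv"
g -C "$work/data-repo" add dataset.csv
g -C "$work/data-repo" commit -q -m "Add raw dataset"

# 2. Code repository that links the data repository as a submodule.
g init -q "$work/code-repo"
g -C "$work/code-repo" submodule --quiet add "$work/data-repo" data/raw
g -C "$work/code-repo" commit -q -m "Link data repository"

# 3. A plain clone records the submodule but does not fetch its contents.
g clone -q "$work/code-repo" "$work/clone"
before_count=$(ls -A "$work/clone/data/raw" | wc -l | tr -d ' ')
status_before=$(g -C "$work/clone" submodule status)
echo "files before init: $before_count"
echo "status before init: $status_before"   # a leading '-' marks an uninitialized submodule

# 4. Explicit initialization pulls the data.
g -C "$work/clone" submodule --quiet update --init --recursive
ls "$work/clone/data/raw"                   # lists dataset.csv
```

Cloning with `git clone --recurse-submodules` would fetch the data immediately; leaving that flag out is what makes the loading lazy.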