Versioning Data 2.0
Data Version Control (DVC)
Data tracking for machine learning pipelines
Designed specifically for dataset and model versioning.
Git tracks only the small .dvc metadata files, not the data itself.
DVC manages data transfer to remote storage (e.g., AWS S3, GCS, SSH).
Features a storage-agnostic architecture.
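The split between small metadata files in Git and large data elsewhere can be illustrated with a conceptual sketch in plain shell. This is not DVC's actual implementation or on-disk format: the `cache` directory, the `.ptr` pointer file, and the sample CSV are hypothetical names chosen for illustration, and the script assumes GNU coreutils `md5sum` is available.

```shell
#!/bin/sh
# Conceptual sketch only: mimics the pointer-file idea, NOT DVC's real format.
# Assumes GNU coreutils (md5sum); all file and directory names are hypothetical.
set -eu

workdir=$(mktemp -d)   # throwaway directory for the demo
cd "$workdir"
mkdir -p cache data

# A stand-in for a large dataset.
printf 'id,label\n1,cat\n2,dog\n' > data/raw.csv

# Hash the content; the hash becomes the file's name in the cache.
hash=$(md5sum data/raw.csv | cut -d ' ' -f 1)
cp data/raw.csv "cache/$hash"

# The small pointer file is the only thing that would be committed to Git.
printf 'md5: %s\npath: raw.csv\n' "$hash" > data/raw.csv.ptr

# Simulate a fresh checkout: the data is gone, only the pointer remains.
rm data/raw.csv

# Restore the exact version named by the pointer.
wanted=$(sed -n 's/^md5: //p' data/raw.csv.ptr)
cp "cache/$wanted" data/raw.csv

restored=$(md5sum data/raw.csv | cut -d ' ' -f 1)
echo "pointer=$wanted restored=$restored"
```

Because the pointer records a content hash rather than a location, the cache could live on any backend, which is roughly what a storage-agnostic design buys.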
Implementation: DVC & Git
1. Initialization & Remote Configuration:
dvc init
dvc remote add -d myremote s3://my-lab-bucket/data
git add .dvc/config
git commit -m "Initialize DVC and configure remote"
2. Tracking & Committing Data:
dvc add data/raw/
git add data/raw.dvc data/.gitignore
git commit -m "Track raw data with DVC"
3. Push Code & Data:
git push origin main
dvc push
4. Retrieve Data:
git clone https://github.com/<owner>/<repo-name>.git
cd <repo-name>
dvc pull # Retrieves exact data versions linked to this commit
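Since the data version is tied to the Git commit, returning to an earlier dataset should amount to checking out the older commit and pulling again. A minimal sketch, assuming a DVC-initialized repository with a configured remote; the function name `restore_data_version` is hypothetical and not part of DVC or Git:

```shell
# Hypothetical helper: restore the data that belongs to a given Git revision.
# Assumes it is run inside a Git repository that uses DVC with a default remote.
restore_data_version() {
    rev=$1
    # Move the .dvc metadata files to the requested revision.
    git checkout "$rev" || return 1
    # Fetch from the remote and check out the data those files point to.
    dvc pull
}
```

It could be called as `restore_data_version <commit-or-tag>`. Running `git checkout` alone changes only the metadata; the workspace data is updated by the following `dvc pull`.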