Accessing Object Storage
“Cloud-native” data management
Object Storage (S3/Swift): Stores data as flat objects in “buckets” rather than nested folders.
DVC natively supports S3 remotes: dvc push uploads your heavy data directly to object storage.
Git LFS: Uses GitHub/GitLab storage by default, but an LFS server can be configured to store objects in S3.
Security Rule #1: Never commit credentials!
Add .env to your .gitignore!
Document what environment variables are needed. Provide a .env.example file.
Specify the Endpoint URL and Bucket Name in the data/README.md.
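A minimal sketch of how a script could check the documented variables at startup, so a missing entry fails early with a pointer to .env.example. The names S3_ENDPOINT_URL and S3_BUCKET are hypothetical; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the names boto3 reads by default.

```python
import os

# Hypothetical list; keep it in sync with your own .env.example.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_ENDPOINT_URL",  # assumed name for the endpoint URL
    "S3_BUCKET",        # assumed name for the bucket
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def require_env_vars(environ=None, required=REQUIRED_VARS):
    """Raise a readable error if any required variable is missing."""
    missing = missing_env_vars(environ, required)
    if missing:
        raise RuntimeError(
            "Missing environment variables: "
            + ", ".join(missing)
            + ". See .env.example."
        )
```

Calling require_env_vars() at the top of a data-loading script keeps the credentials out of the code while still making the requirements explicit.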
How to fetch the data:
Downloading to local scratch: Use e.g. the MinIO CLI Client (mc) or boto3 to quickly mirror S3 buckets to local scratch space.
Python Native: Use pandas with s3fs/fsspec to stream directly into memory.
Provide credentials via environment variables (e.g. dotenv run -- python ... or UV_ENV_FILE=.env uv run python ...)
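The two fetch routes above could look roughly like this in Python. This is a sketch under assumptions: S3_ENDPOINT_URL is a hypothetical variable name, the bucket and key arguments are placeholders, and pandas, s3fs and boto3 must be installed for the respective functions.

```python
import os


def storage_options_from_env(environ=None):
    """Build s3fs/fsspec options from environment variables.

    S3_ENDPOINT_URL is an assumed name; the AWS_* names are the ones
    boto3 also reads by default.
    """
    environ = os.environ if environ is None else environ
    return {
        "key": environ["AWS_ACCESS_KEY_ID"],
        "secret": environ["AWS_SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": environ["S3_ENDPOINT_URL"]},
    }


def read_csv_from_bucket(bucket, key, environ=None):
    """Stream one CSV object straight into memory (needs pandas and s3fs)."""
    import pandas as pd  # lazy import: the helper above has no heavy deps

    return pd.read_csv(
        f"s3://{bucket}/{key}",
        storage_options=storage_options_from_env(environ),
    )


def mirror_prefix(bucket, prefix, scratch_dir, environ=None):
    """Download every object under a prefix to local scratch (needs boto3)."""
    import boto3  # lazy import for the same reason

    environ = os.environ if environ is None else environ
    client = boto3.client(
        "s3",
        endpoint_url=environ["S3_ENDPOINT_URL"],
        aws_access_key_id=environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=environ["AWS_SECRET_ACCESS_KEY"],
    )
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip directory-marker objects
            target = os.path.join(scratch_dir, obj["Key"])
            os.makedirs(os.path.dirname(target), exist_ok=True)
            client.download_file(bucket, obj["Key"], target)
```

Run either function through dotenv run or uv run with the env file so the credentials come from the environment and never appear in the script.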
Performance Tips
Consider cluster limits: Streaming saves disk space; downloading speeds up repeated reads.
Streaming thousands of tiny files from object storage will bottleneck a GPU.
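One way to get the benefit of repeated reads is to fetch each object to scratch once and reuse the local copy afterwards. The helper below is an illustrative sketch, not a specific library's API: cached_fetch and the download callback are hypothetical names, and the callback could wrap a boto3 download_file call.

```python
from pathlib import Path


def cached_fetch(key, scratch_dir, download):
    """Return a local path for `key`, downloading only on a cache miss.

    `download(key, path)` is a caller-supplied function that writes the
    object to `path`.
    """
    target = Path(scratch_dir) / key
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        tmp = target.with_name(target.name + ".part")
        download(key, tmp)
        tmp.replace(target)
    return target
```

The first epoch pays the download cost; later epochs read from local disk, which keeps the data loader from waiting on object storage.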