Accessing Object Storage

“Cloud-native” data management

  • Object Storage (S3/Swift): Stores data as flat objects in “buckets” rather than nested folders.

  • DVC supports S3 natively: It pushes your heavy data directly to an object-storage remote, while Git tracks only small pointer files.

  • Git LFS: Uses the storage of your Git host (GitHub/GitLab) by default, but can be configured to use S3-backed storage.

Documentation & Security

Security Rule #1: Never commit credentials!

  • Add .env to your .gitignore!

  • Document what environment variables are needed.

  • Provide a .env.example file.

  • Specify the endpoint URL and bucket name in data/README.md.
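
The documented variables can also be checked at start-up, so that a missing credential fails with a readable message rather than an opaque authentication error. The following is a minimal sketch, not a prescribed setup: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the names boto3 and s3fs conventionally read, while `S3_ENDPOINT_URL`, `S3_BUCKET`, and the helper `load_settings` are hypothetical names chosen for illustration.

```python
import os

# Variable names this hypothetical project lists in its .env.example.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_ENDPOINT_URL",  # hypothetical name, assumed for this sketch
    "S3_BUCKET",        # hypothetical name, assumed for this sketch
)


def load_settings(env=None):
    """Return the required settings, or raise listing every missing variable."""
    env = os.environ if env is None else env
    # Treat unset and empty values alike: both mean "not configured".
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: "
            + ", ".join(missing)
            + " (see .env.example)"
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` once at the top of a script keeps the values out of the source code and points new users at `.env.example` when something is missing.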

Patterns & Tooling

How to fetch the data:

  • Downloading to local scratch: Use, for example, the MinIO Client (mc) or boto3 to quickly mirror S3 buckets to local scratch space.

  • Python-native: Use pandas with s3fs / fsspec to stream objects directly into memory.

  • Credentials: Provide them via environment variables (e.g. dotenv run -- python ... or UV_ENV_FILE=.env uv run python ...).
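
The two fetch patterns above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the variable name `S3_ENDPOINT_URL` and the helper names are hypothetical, the data is assumed to be CSV, and pandas, s3fs, and boto3 are assumed to be installed for the functions that use them.

```python
import os
from pathlib import Path


def s3_storage_options(env=None):
    """Build fsspec/s3fs storage options from environment variables."""
    env = os.environ if env is None else env
    return {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
        # S3_ENDPOINT_URL is a hypothetical variable name for this sketch.
        "client_kwargs": {"endpoint_url": env["S3_ENDPOINT_URL"]},
    }


def read_csv_streaming(bucket, key, env=None):
    """Stream one CSV object straight into a DataFrame (needs pandas and s3fs)."""
    import pandas as pd  # imported lazily so the helper above works without it

    return pd.read_csv(
        f"s3://{bucket}/{key}",
        storage_options=s3_storage_options(env),
    )


def download_to_scratch(bucket, key, scratch_dir, env=None):
    """Copy one object to local scratch with boto3 and return the local path."""
    import boto3  # imported lazily for the same reason

    env = os.environ if env is None else env
    target = Path(scratch_dir) / key
    target.parent.mkdir(parents=True, exist_ok=True)
    client = boto3.client(
        "s3",
        endpoint_url=env["S3_ENDPOINT_URL"],
        aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
    )
    client.download_file(bucket, key, str(target))
    return target
```

Both functions read credentials only from the environment, so the same script runs unchanged whether the variables come from a `.env` file or from the cluster's job configuration.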

Performance Tips

  • Consider cluster limits: Streaming saves disk space; downloading speeds up repeated reads.

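  • Streaming thousands of tiny files from object storage will bottleneck a GPU: each object costs a separate request, so per-request latency dominates and the GPU sits idle waiting for data.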
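
A back-of-the-envelope estimate shows why tiny files hurt. The numbers below (20 ms per request, a 1 GB/s link, 10 kB files) are assumptions picked for illustration, not measurements, and the model ignores parallel requests and caching:

```python
def sequential_fetch_seconds(n_files, latency_s, file_bytes, bandwidth_bytes_per_s):
    """Estimate wall time to fetch n_files one after another."""
    per_file = latency_s + file_bytes / bandwidth_bytes_per_s
    return n_files * per_file


# Assumed scenario: 1 GB of data stored as 100,000 objects of 10 kB each.
tiny = sequential_fetch_seconds(100_000, 0.020, 10_000, 1e9)

# The same 1 GB stored as 10 objects of 100 MB each.
large = sequential_fetch_seconds(10, 0.020, 100_000_000, 1e9)

print(f"tiny files: {tiny:.0f} s, large objects: {large:.1f} s")
# prints "tiny files: 2001 s, large objects: 1.2 s"
```

Under these assumptions nearly all of the time for the tiny files is request latency rather than data transfer, which is the bottleneck the tip describes.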