Large File Storage (LFS)#

Why Use LFS?#

Effective data management is crucial in scientific research, yet it’s often overlooked, leading to data loss, duplication, or errors.

Tracking large files with Git can be challenging. As your repository grows, operations like cloning, fetching, and pushing can slow down because Git stores the entire repository’s history locally. To address this, Git Large File Storage (LFS) provides a way to manage large files in your repositories.

Git LFS is a Git extension that replaces large files in your repository with lightweight text pointers, while the actual file contents are stored on a remote server. This allows you to work with large files without degrading the performance of everyday Git operations.

Git vs. LFS: A Comparison#

Git#

When you commit changes with Git, it creates objects to represent the state of your files at that point in time, which are stored in the .git/objects directory. Each object is a snapshot of the file contents, and Git uses pointers (hashes) to reference these objects.

Git is optimized for handling text files. Since text files typically change in small, incremental steps, Git can efficiently store only the differences (deltas) between versions.

However, changes in binary files (e.g., images, videos, datasets) are not as easily represented as deltas. When a binary file is modified, Git often stores the entire file again, leading to bloated repositories and slower performance over time.
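
The snapshot behaviour described above can be observed in a throwaway repository. The following is a minimal sketch, not part of the original guide: the temporary directory, the file name notes.txt, and the demo user name and e-mail are placeholders chosen for illustration.

```shell
# Work in a temporary directory so no real repository is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Commit a first version of a small file (identity values are placeholders).
printf 'hello\n' > notes.txt
git add notes.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit --quiet -m "Add notes"
blob_id="$(git rev-parse HEAD:notes.txt)"

# Ask Git what kind of object the hash refers to.
obj_type="$(git cat-file -t "$blob_id")"
echo "$blob_id is a $obj_type"

# Commit a second version; Git records a new object with a different hash.
printf 'hello again\n' > notes.txt
git add notes.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit --quiet -m "Update notes"
new_blob_id="$(git rev-parse HEAD:notes.txt)"
echo "new version: $new_blob_id"

# The first version is still retrievable from the object store.
git cat-file -p "$blob_id"
```

Both versions remain in the object store, which is why repeatedly committing large binary files makes a repository grow quickly.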

LFS#

LFS replaces large files with small pointer files that reference the actual content stored outside the main repository.

The large files themselves are stored in a separate location (e.g., a remote server), which keeps the main repository lightweight and efficient. When you clone a repository that uses LFS, the repository history contains only the pointer files, not the large files. When you check out a file, LFS automatically downloads the actual file content. Similarly, when you commit a large file, LFS replaces it with a pointer file in the repository and uploads the actual content to the external storage when you push.
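
To make the idea of a pointer file concrete, the sketch below builds one by hand for a small stand-in file. It assumes the pointer layout of the Git LFS v1 specification (a version line, a SHA-256 object ID, and the size in bytes) and the GNU coreutils `sha256sum` command. The file names are hypothetical, and in normal use Git LFS generates pointers itself, so this is only an illustration.

```shell
# Work in a temporary directory.
cd "$(mktemp -d)"

# A small stand-in for a large binary file (hypothetical name).
printf 'example payload\n' > data.bin

# The pointer records the SHA-256 hash and the size in bytes of the real content.
oid="$(sha256sum data.bin | cut -d' ' -f1)"
size="$(wc -c < data.bin | tr -d ' ')"

# Write the three-line pointer text to a separate file and display it.
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
    "$oid" "$size" > data.bin.pointer
cat data.bin.pointer
```

The pointer stays a few lines long regardless of how large the real file is, which is what keeps the repository history small.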

Data Model#

Let’s dive deeper into the technical details of how Git and LFS handle files.

How to use LFS#

To use LFS, you need to install the LFS client on your local machine.

1. Install Git LFS:#

  • Download and install Git LFS from git-lfs.github.com.

  • Initialize Git LFS by running the following command in your repository directory:

    git lfs install
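
Before initializing, it can help to confirm that the LFS client is actually available on your machine. This is a small sketch rather than an official procedure; the variable name `lfs_status` is just illustrative.

```shell
# `git lfs version` only succeeds if the git-lfs client is installed and on the PATH.
if git lfs version >/dev/null 2>&1; then
    lfs_status="installed"
    git lfs version
else
    lfs_status="missing"
    echo "git-lfs not found; install it before running 'git lfs install'" >&2
fi
echo "Git LFS status: $lfs_status"
```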
    

2. Track Large Files:#

  • Specify the file types to be tracked by LFS. For example, to track all .pdf files, run:

    git lfs track "*.pdf"
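
Tracking a pattern adds a line to the .gitattributes file. To check which paths a pattern covers, you can ask Git directly with `git check-attr`. The sketch below writes the attribute line by hand so that it also runs on a machine without the LFS client; the file names report.pdf and notes.txt are hypothetical and do not need to exist.

```shell
# Work in a throwaway repository.
cd "$(mktemp -d)"
git init --quiet

# The same line that `git lfs track "*.pdf"` appends to .gitattributes.
printf '*.pdf filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# Ask Git which filter applies to a given path.
pdf_filter="$(git check-attr filter -- report.pdf)"
txt_filter="$(git check-attr filter -- notes.txt)"
echo "$pdf_filter"   # report.pdf: filter: lfs
echo "$txt_filter"   # notes.txt: filter: unspecified
```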
    

3. Commit and Push:#

  • Add the .gitattributes file:

    The .gitattributes file is a configuration file used by Git to manage attributes for paths in your repository. It lets you define how certain files should be handled by Git, for example how line endings are treated or which files are marked as binary to prevent text-based operations on them. When working with LFS, this file specifies which files are managed by LFS.

    git add .gitattributes
    

    Example: If you want to use Git LFS for large media files like images and videos, your .gitattributes file might look like this:

    # Use LFS for image files
    *.jpg filter=lfs diff=lfs merge=lfs -text
    *.png filter=lfs diff=lfs merge=lfs -text
    
    # Use LFS for video files
    *.mp4 filter=lfs diff=lfs merge=lfs -text
    
  • Add, commit, and push large files:

    git add <large-file>
    git commit -m "Add large file"
    git push
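
After committing, one way to confirm that a file went through LFS is to inspect what Git itself stored for it: an LFS-managed file appears in the history as a short pointer text rather than the real content. The helper `is_lfs_pointer` below is a hypothetical function written for this sketch, not part of Git or Git LFS. The demo commits a hand-written pointer using a placeholder object ID (the SHA-256 of empty content) so that it runs without the LFS client.

```shell
# Hypothetical helper: succeeds if the blob committed at HEAD:<path> starts
# with the Git LFS v1 pointer header.
is_lfs_pointer() {
    first_line="$(git cat-file -p "HEAD:$1" | head -n 1)"
    [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Demo repository with one hand-written pointer and one ordinary file.
cd "$(mktemp -d)"
git init --quiet
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize 0\n' \
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" > big.bin
printf 'plain text\n' > readme.txt
git add big.bin readme.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit --quiet -m "Demo commit"

if is_lfs_pointer big.bin; then big_kind="pointer"; else big_kind="regular"; fi
if is_lfs_pointer readme.txt; then txt_kind="pointer"; else txt_kind="regular"; fi
echo "big.bin: $big_kind"
echo "readme.txt: $txt_kind"
```

In a repository where the LFS client is installed, `git lfs ls-files` gives a similar overview by listing the files that LFS manages.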
    

LFS Availability at UZH#

Popular hosting services like GitHub, GitLab, Bitbucket, and Azure DevOps have built-in support for LFS.

For self-hosted servers, it is important to ensure that LFS support is enabled. This may require additional installation and configuration, because LFS stores large files in a separate storage location that the server has to manage (e.g., storage, authentication, bandwidth handling).

At the IMATH, LFS is NOT supported on the IMATH GitLab instance.

However, at the University of Zurich (UZH), LFS is supported on the UZH GitLab instance, with a size limit of 15 GB per project (this includes all parts of a project, i.e. repository, LFS, etc.). The data is stored in the Switch Cloud, which is hosted outside of UZH but remains within Switzerland. The UZH data protection regulations generally still apply.

Please note that GitLab is primarily intended for collaborative software development, not simply for data storage. UZH’s GitLab has limited disk space, and all data is deleted after 12 months of user inactivity. For long-term data storage, it is recommended to use OneDrive or SwitchDrive.