Git Large File Storage (Git LFS)#

Why Use Git LFS?#

Effective data management is crucial in scientific research, yet it’s often overlooked, leading to data loss, duplication, or errors.

Tracking large files with Git can be challenging. As the size of your repository grows, operations like cloning, fetching, and pushing can slow down because Git stores the entire repository’s history locally. To address this, Git Large File Storage (Git LFS) provides a solution for managing large files in Git repositories.

Git LFS is a Git extension that replaces large files in your repository with lightweight text pointers, while the actual file contents are stored on a remote server. This allows you to work with large files without affecting performance.

Git vs. Git LFS: A Comparison#

Git#

When you commit changes with Git, it creates objects to represent the state of your files at that point in time, which are stored in the .git/objects directory. Each object is a snapshot of the file contents, and Git uses pointers (hashes) to reference these objects.

Git is optimized for handling text files. Since text files typically have small, incremental changes, Git can efficiently store only the differences (deltas) between versions.

However, changes in binary files (e.g., images, videos, datasets) are not as easily represented as deltas. When a binary file is modified, Git often stores the entire file again, leading to bloated repositories and slower performance over time.

Git LFS#

Git LFS replaces large files with small pointer files that reference the actual content stored outside the main repository.

The large files themselves are stored in a separate location (e.g., a remote server), which keeps the main repository lightweight and efficient. When you clone a repository that uses Git LFS, only the pointer files are part of the repository history, not the large files. When you check out a file, Git LFS automatically downloads the actual file content. Similarly, when you commit a large file, Git LFS replaces it with a pointer file in the repository and uploads the actual content to the external storage when you push.
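
For example, if you want to clone a repository without downloading any large file content right away, Git LFS honors the GIT_LFS_SKIP_SMUDGE environment variable (the repository URL below is a placeholder):

    # Clone only the Git history and the LFS pointer files
    GIT_LFS_SKIP_SMUDGE=1 git clone https://example.com/group/project.git
    cd project

    # Download the actual file contents later, when needed
    git lfs pull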

Data Model#

Let’s dive deeper into the technical details of how Git and Git LFS handle files.

How does Git Track Files?#

To understand how Git LFS works, it’s essential to first understand how Git tracks files.

One of the core functionalities of Git is its ability to track files and their changes efficiently. But what do the underlying mechanisms and data structures look like?

Git’s Data Model

Git’s data model is based on three main concepts:

Blobs (Binary Large Objects) are used to store the contents of files.

Each blob is identified by an SHA-1 hash of its content, ensuring that identical files are stored only once. Blobs do not contain any metadata about the file, such as its name or permissions.

Trees represent directories and contain pointers to blobs (files) and other trees (subdirectories).

Each tree object includes the file names, permissions, and the SHA-1 hashes of the blobs or trees it contains.

Commits are snapshots of the entire repository at a given point in time.

A commit object contains a pointer to a tree object (representing the state of the repository), metadata (such as the author, committer, and commit message), and pointers to parent commits.
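
All three object types can be inspected with Git’s plumbing commands; a minimal sketch (the output, hashes, and file names are illustrative):

    # Compute the SHA-1 Git assigns to this content; identical content
    # always produces the same blob hash
    echo "hello" | git hash-object --stdin
    # ce013625030ba8dba906f756967f9e9ca394464a

    # Show the raw contents of the latest commit object
    git cat-file -p HEAD
    # tree 9fceb02d0ae598e95dc970b74767f19372d61af8
    # parent 83baae61804e65cc73a7201a7252750c76066a30
    # author Jane Doe <jane@example.com> 1717000000 +0200
    # ...

    # Show the tree that the commit points to
    git cat-file -p 'HEAD^{tree}'
    # 100644 blob ce013625030ba8dba906f756967f9e9ca394464a    hello.txt
    # 040000 tree d564d0bc3dd917926892c55e3706cc116d5b165e    src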

Tracking Changes

Git tracks changes to files through a series of stages:

Working Directory is where you make changes to your files (recall that Git does not watch your files in the background).

Git does not track these changes until you stage them.

Staging Area is a file (usually located in .git/index) that stores information about what will go into your next commit.

When you stage a file using git add, Git calculates the SHA-1 hash of the file’s content, creates a blob object if it doesn’t already exist, and updates the index with the blob’s hash and the file’s metadata (see the sketch after this list).

Repository (History) is where your committed snapshots live. When you commit changes, Git creates a new commit object that points to the current state of the staging area.

The commit object is added to the repository’s history, forming a chain of commits that represent the project’s evolution over time.
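
A minimal sketch of these stages, using a hypothetical file name:

    # Working directory: create or edit a file (Git does not record this yet)
    echo "hello" > notes.txt

    # Staging area: git add creates a blob and records it in the index
    git add notes.txt
    git ls-files --stage notes.txt
    # 100644 ce013625030ba8dba906f756967f9e9ca394464a 0       notes.txt

    # Repository: git commit turns the staged snapshot into a commit object
    git commit -m "Add notes"
    git log --pretty=format:'%h <- %p  %s'   # each commit lists its parent(s)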

Efficient Storage

Git uses several techniques to efficiently store and manage file changes:

Delta Compression is used by Git to store the differences between file versions.

Delta compression is particularly effective for text files, where changes are often small and incremental (e.g., adding a line of code). When Git packs its objects (for example, during a push or garbage collection), it calculates the delta (difference) between the new version and the previous one by comparing the blobs’ content (see the sketch after this list).

Packfiles are used by Git to store objects in a compressed format.

Packfiles are compressed files that contain multiple objects (blobs, trees, commits). When you push changes to a remote repository, Git creates packfiles to transfer the objects efficiently.
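
You can trigger packing and look inside the resulting packfile yourself; a minimal sketch:

    # Pack loose objects into a packfile (Git also does this automatically)
    git gc

    # List the packed objects; deltified entries also name the base
    # object they are stored as a delta against
    git verify-pack -v .git/objects/pack/pack-*.idx | head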

In summary, Git’s ability to track files and manage changes efficiently is a result of its robust data model and storage mechanisms. By using blobs, trees, and commits, Git ensures that file contents are stored uniquely and changes are tracked accurately. The staging area and efficient storage techniques like delta compression and packfiles contribute to Git’s performance and reliability as a version control system.

How Git LFS Tracks Large Files#

Git LFS (Large File Storage) is an extension to Git that improves the handling of large files. It replaces large files in your repository with lightweight references, while storing the actual file content on a remote server.

How Git LFS Works

Pointer Files: When you add a large file to a repository using Git LFS, the file is replaced with a small pointer file. This pointer file contains metadata about the large file, such as its size and a unique identifier (OID, Object ID), which is a SHA-256 hash of the file’s content (a sample pointer file is shown after this list). The pointer file is committed to the repository, while the actual large file is stored separately.

Storing Large Files: The actual content of the large file is stored on a remote LFS server. This can be a dedicated LFS server, a cloud storage service, or any other storage solution that supports the LFS API. When you push your changes to the remote repository, Git LFS uploads the large files to the LFS server, while the pointer files travel with the normal Git history.

Fetching Large Files: When you clone or pull a repository that uses Git LFS, the pointer files are downloaded as part of the repository. Git LFS then automatically fetches the actual large files from the LFS server based on the pointers. This ensures that the large files are available locally without bloating the repository itself.
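
Concretely, a pointer file committed to the repository is just a few lines of plain text; a typical one looks like this (hash and size illustrative):

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 133792

If the content has not been downloaded yet, git lfs fetch retrieves the objects into the local LFS cache, and git lfs pull additionally replaces the pointers in your working copy with the actual files.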

Tracking Changes

Git LFS tracks changes to large files in a way that integrates seamlessly with Git:

Adding Files: When you start tracking a large file using git lfs track, Git LFS creates a .gitattributes file (if it doesn’t already exist) and adds an entry for the file type or specific file. This tells Git LFS to handle these files. When the large file is then added to the staging area, a pointer file is created in its place and committed to the repository.

Committing Changes: When you commit changes, the pointer file is included in the commit. The actual large file content is managed separately by Git LFS: the pointer file records the OID of the content, and the content itself is uploaded to the LFS server when you push. This keeps the repository lightweight and ensures that large files do not slow down repository operations.
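
You can check at any time which files in the current checkout are managed by Git LFS, together with their abbreviated OIDs (output illustrative):

    # List LFS-managed files with their OIDs
    git lfs ls-files
    # 4d7a214614 * video.mp4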

Efficient Storage

Git LFS optimizes storage and performance through several mechanisms:

Deduplication: Git LFS stores each unique version of a large file only once on the LFS server. If the same file is added multiple times, only one copy is stored, saving space.

Pointer Files: By committing only small pointer files and storing the large file content separately, Git LFS keeps the repository itself small. This keeps the repository lightweight and improves performance.

Bandwidth Optimization: By storing large files separately and only downloading them when needed, Git LFS reduces the amount of data transferred during repository operations. This is particularly beneficial for teams working with large assets like media files or datasets.

Conclusion

Git LFS enhances Git’s capabilities by efficiently managing large files. By using pointer files and storing large file content separately, Git LFS keeps repositories lightweight and performant. The integration with Git’s version control features ensures that large files are tracked and versioned seamlessly, making Git LFS a valuable tool for developers working with large assets.

How to use Git LFS#

To use Git LFS, you need to install the Git LFS client on your local machine.

1. Install Git LFS:#

  • Download and install from git-lfs.github.com.

  • Initialize Git LFS by running the following command in your repository directory:

    git lfs install
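    # Expected confirmation (exact wording may vary by version):
    # Updated Git hooks.
    # Git LFS initialized.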
    

Adding already tracked files

Files that are already tracked by Git need to be untracked first, before they can be tracked with Git LFS:

git rm --cached "data.csv"    # remove the file from the index (keeps it on disk)
git lfs track "data.csv"      # tell Git LFS to manage this file
git add "data.csv"            # re-stage it; Git LFS now stores a pointer

2. Track Large Files:#

  • Specify the file types to be tracked by Git LFS. For example, to track all .pdf files, run:

    git lfs track "*.pdf"
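    # Typical confirmation output:
    # Tracking "*.pdf"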
    

3. Commit and Push:#

  • Add the .gitattributes file:

    The .gitattributes file is a configuration file used by Git to manage attributes for paths in your repository. It allows you to define how certain files should be handled by Git, including aspects like line endings, marking files as binary to prevent text-based operations on them, and more. This file is particularly useful when working with Git LFS to specify which files should be managed by LFS.

    git add .gitattributes
    

    Example: If you want to use Git LFS for large media files like images and videos, your .gitattributes file might look like this:

    # Use LFS for image files
    *.jpg filter=lfs diff=lfs merge=lfs -text
    *.png filter=lfs diff=lfs merge=lfs -text
    
    # Use LFS for video files
    *.mp4 filter=lfs diff=lfs merge=lfs -text
    
  • Add, commit, and push large files:

    git add <large-file>
    git commit -m "Add large file"
    git push
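    # During the push, Git LFS uploads the file content first, e.g.:
    # Uploading LFS objects: 100% (1/1), 131 KB | 0 B/s, done.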
    

Git LFS Availability at UZH#

Popular hosting services like GitHub, GitLab, Bitbucket, and Azure DevOps have built-in support for Git LFS.

For self-hosted servers, it is important to ensure that LFS support is enabled. This may require additional installation and configuration, because LFS stores large files in a separate storage location, which requires extra server-side management (e.g., storage, authentication, bandwidth handling).

At the IMATH, Git LFS is supported on the IMATH GitLab instance.

On the University of Zurich (UZH) GitLab instance, Git LFS is supported, with a size limit of 15 GB per project (this includes all parts of a project, i.e., repository, LFS storage, etc.). The data is stored in the Switch Cloud, which is hosted outside UZH but within Switzerland; the UZH data protection regulations generally still apply.

Please note that GitLab is primarily intended for collaborative software development, not simply for data storage. UZH’s GitLab has limited disk space, and all data is deleted after 12 months of user inactivity. For long-term data storage, it is recommended to use OneDrive or SwitchDrive.