Submodules#

What are Submodules?#

A submodule is essentially a repository embedded within another repository. It allows you to include and manage external repositories within your main project. This is particularly useful for integrating third-party libraries or dependencies that are maintained separately.

Note

Once a submodule is initialized and updated, its content appears as a regular folder inside the parent repository.

Why are Submodules Useful?#

When working on a project, you may need to incorporate another project, such as a library developed by someone else or a tool you’re building for use in multiple projects. A common challenge in these situations is keeping the two projects separate while still being able to use one from within the other.

For example, imagine you’re conducting research and want to use a data analysis tool you previously developed for another project. You have two options: you could copy the code from the old project into your new one, or you could include it from a shared source. The problem with copying the code is that if you make any changes, it can be difficult to merge those changes back into the original tool later. Conversely, including it from a shared source may limit your ability to customize it, and ensuring that all collaborators have access to it can be challenging.

Git addresses this issue with submodules. Submodules allow you to keep a repository as a subfolder within another repository. This setup enables you to include specific versions of external repositories in your main project. This is particularly useful for incorporating your own tools or third-party resources that are maintained separately, while allowing you to keep your changes separate.

Features of Submodules#

  • Separation: Submodules remain independent repositories, so their versioning history is separate from the parent repository. This allows for better modularity and organization of code, as each submodule can evolve independently.

  • Pinning: Submodules are usually pinned to specific commits, ensuring reproducibility by locking them to a particular state. This means that when you clone the parent repository, you get the exact version of the submodule that was used at the time of the last commit.

  • Updates: Submodules can be updated independently or synchronized with the parent repository. You can choose to pull the latest changes from the submodule’s repository without affecting the parent repository, or you can update the submodule reference in the parent repository to point to a new commit.
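The pinning described above can be observed directly. The following is a minimal sketch that assumes Git is installed and uses throwaway repositories in a temporary folder; the names lib, parent, and tool.txt are purely illustrative. It shows that the parent stores the submodule as a special tree entry (mode 160000) whose object id is the pinned commit.

```shell
set -eu
# Throwaway playground; every name below is hypothetical.
work=$(mktemp -d)
# Wrapper: sets an identity for commits and allows local-path submodules,
# which recent Git versions block by default.
g() {
  git -c protocol.file.allow=always \
      -c user.name=Demo -c user.email=demo@example.com "$@"
}

# A tiny "library" repository with one commit.
g init -q "$work/lib"
echo 'v1' > "$work/lib/tool.txt"
g -C "$work/lib" add tool.txt
g -C "$work/lib" commit -q -m "lib: first version"
lib_commit=$(g -C "$work/lib" rev-parse HEAD)

# A parent repository that embeds the library as a submodule.
g init -q "$work/parent"
g -C "$work/parent" submodule --quiet add "$work/lib" lib
g -C "$work/parent" commit -q -m "Add lib as a submodule"

# The parent's tree holds a gitlink entry that names the pinned commit.
entry=$(g -C "$work/parent" ls-tree HEAD lib)
echo "$entry"
pinned=$(g -C "$work/parent" rev-parse HEAD:lib)
```

The parent never stores the library's files, only this one commit id, which is why a given parent commit always refers to the same submodule state.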

Benefits of Submodules for Reproducibility#

  • Pinning Submodules: By pinning submodules to specific commits, you ensure the same version of an external library, dataset (remember LFS!), or tool is always used, which is crucial for reproducibility in complex projects. This helps avoid issues that arise from changes in dependencies.

  • Independent Versioning: Each submodule has its own versioning history, allowing it to be maintained separately from the parent project. This means that updates or changes in a submodule do not directly impact the parent repository unless explicitly updated.

  • Flexibility: Submodules can be easily updated or switched to different versions without affecting the rest of the project. This flexibility allows developers to experiment with new features or fixes in a submodule while keeping the main project stable.

Use Cases for Submodules#

Third-party Libraries#

If you are developing a project that relies on third-party libraries, you can use submodules to include these libraries in your project without merging them directly into your project. This makes it easier to manage updates and changes. For example, you might use a submodule to include the code of a research paper you want to integrate into your analysis.

Shared Code#

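If you have a common codebase that is used across multiple projects, submodules can help you centralize and streamline updates. By creating a submodule for this shared codebase, you maintain it in one place, and each dependent project pulls in a new version by updating its submodule reference. Changes do not propagate automatically: every project stays pinned to a commit until the submodule is updated (for example with git submodule update --remote) and the new reference is committed. For example, if you have a library of functions that are used in multiple projects, a submodule allows you to update it in one place and then roll those changes out to the projects that depend on it.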
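As a rough illustration of how a shared library and a dependent project interact, the sketch below uses throwaway local repositories (lib, parent, and tool.txt are hypothetical names) and assumes Git is installed. The library gains a new commit, the dependent project keeps its old pin, and only an explicit update plus a commit moves the pin forward.

```shell
set -eu
work=$(mktemp -d)   # throwaway playground; all names are illustrative
g() {
  git -c protocol.file.allow=always \
      -c user.name=Demo -c user.email=demo@example.com "$@"
}

# Shared library and a project that depends on it.
g init -q "$work/lib"
echo 'v1' > "$work/lib/tool.txt"
g -C "$work/lib" add tool.txt
g -C "$work/lib" commit -q -m "lib: first version"
g init -q "$work/parent"
g -C "$work/parent" submodule --quiet add "$work/lib" lib
g -C "$work/parent" commit -q -m "Add lib as a submodule"

# The shared library moves on ...
echo 'v2' > "$work/lib/tool.txt"
g -C "$work/lib" commit -q -am "lib: second version"
upstream=$(g -C "$work/lib" rev-parse HEAD)

# ... but the dependent project still points at the old commit.
old_pin=$(g -C "$work/parent" rev-parse HEAD:lib)

# Pull the change in and commit the new reference.
g -C "$work/parent" submodule --quiet update --remote lib
g -C "$work/parent" add lib
g -C "$work/parent" commit -q -m "Bump lib to the latest commit"
new_pin=$(g -C "$work/parent" rev-parse HEAD:lib)
```

Each dependent project would repeat the last three commands when it is ready to adopt the new version.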

Separate Repositories#

For projects that consist of multiple repositories, submodules allow you to link these repositories together while maintaining separate version control for each. This allows for modular project management while still working on them cohesively. For example, this course is organized into several repositories, each containing different sections of the course material.

using-git-in-accademia
├── ci-cd-workflows
├── git-and-science
├── git-and-its-remotes
└── working-with-git

Essential Commands for Submodules#

Here’s a simple overview of the basic commands for working with submodules:

Command                               Description
------------------------------------  ----------------------------------------------------------------
git submodule add <repo-url> [path]   Add a new submodule to the project.
git submodule update --init --remote  Initialize and update, fetching the latest changes from the remote repo.
git submodule status [path]           Check the status of the submodules (in [path]).
git submodule update --remote [path]  Update the submodule to the latest commit on the tracked branch.
git rm --cached [path-to-submodule]   Stop tracking a submodule in the parent repository (its folder stays on disk).
git submodule init                    Initialize submodules in a cloned repository.
git submodule update                  Check out the submodules at the commit specified in the parent repo.
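To see a few of these commands in context, here is a small sketch on throwaway local repositories (the names lib, parent, clone, and tool.txt are assumptions made for the example, and Git must be installed). It clones a parent without its submodules, inspects the status, and then initializes and updates.

```shell
set -eu
work=$(mktemp -d)   # throwaway playground; all names are illustrative
g() {
  git -c protocol.file.allow=always \
      -c user.name=Demo -c user.email=demo@example.com "$@"
}

# Setup: a library and a parent that embeds it as a submodule.
g init -q "$work/lib"
echo 'v1' > "$work/lib/tool.txt"
g -C "$work/lib" add tool.txt
g -C "$work/lib" commit -q -m "lib: first version"
g init -q "$work/parent"
g -C "$work/parent" submodule --quiet add "$work/lib" lib
g -C "$work/parent" commit -q -m "Add lib as a submodule"

# A plain clone leaves the submodule folder empty.
g clone -q "$work/parent" "$work/clone"
before=$(g -C "$work/clone" submodule status)
echo "$before"   # line starts with '-': not initialized

# Initialize and check out the commit recorded in the parent.
g -C "$work/clone" submodule --quiet update --init
after=$(g -C "$work/clone" submodule status)
echo "$after"    # line now starts with a space: in sync with the parent
```

The first character of each status line is the quickest health check: '-' means not initialized, and '+' means the checked-out commit differs from the one recorded in the parent.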

Working with Submodules#

Remember that a submodule usually does not track a branch, so before you start working in a submodule, check out the branch you want to work on!

Once the submodule is initialized, you can work inside the submodule folder as if it were an ordinary repository.

After making changes in a submodule, commit them inside the submodule first. Then add the path to the submodule to a commit in the parent repository to update the commit that the parent repository tracks.
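The workflow above might look like the following sketch. It runs on throwaway local repositories and assumes Git is installed; lib, parent, tool.txt, and the branch name feature are hypothetical.

```shell
set -eu
work=$(mktemp -d)   # throwaway playground; all names are illustrative
g() {
  git -c protocol.file.allow=always \
      -c user.name=Demo -c user.email=demo@example.com "$@"
}

# Setup: a library and a parent that embeds it as a submodule.
g init -q "$work/lib"
echo 'v1' > "$work/lib/tool.txt"
g -C "$work/lib" add tool.txt
g -C "$work/lib" commit -q -m "lib: first version"
g init -q "$work/parent"
g -C "$work/parent" submodule --quiet add "$work/lib" lib
g -C "$work/parent" commit -q -m "Add lib as a submodule"

# Work inside the submodule: switch to a branch, then commit there.
sub="$work/parent/lib"
g -C "$sub" checkout -q -b feature
echo 'v2' > "$sub/tool.txt"
g -C "$sub" commit -q -am "lib: second version"
new_commit=$(g -C "$sub" rev-parse HEAD)

# The parent now reports the submodule as changed ('+' prefix).
g -C "$work/parent" submodule status

# Record the new commit in the parent by adding the submodule path.
g -C "$work/parent" add lib
g -C "$work/parent" commit -q -m "Update lib to the second version"
pinned=$(g -C "$work/parent" rev-parse HEAD:lib)
```

In a real project you would also push the submodule's commit to its own remote before pushing the parent, otherwise collaborators cannot fetch the commit the parent points to.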

Gotchas for Submodules#

  1. Submodules Do Not Update Automatically ⚠️: When you clone a repository that contains submodules, the submodule folders are created but left empty, and by default later pulls in the parent do not move them either. You need to run git submodule update --init or use the --recurse-submodules option when cloning to ensure they are initialized and checked out at the recorded commit.

  2. Repository Resides in the .git Folder of the Parent Repo 🔒: The repository data of a submodule (objects, refs, and configuration) is stored in the parent repository’s .git folder, under .git/modules/.

This means that the submodule folder holds only a small .git file pointing to that location instead of its own .git directory, which can lead to confusion. Be cautious: providing access to the parent repository’s .git folder grants access to the history of all its submodules!

  3. Submodule Commits Are Detached 🤔: Submodules are designed to be pinned to a specific commit and do not track a branch by default.

When the parent checks out a submodule, it is usually in a “detached HEAD” state, meaning it is not on any branch. Commits made in that state are easy to lose, so switch to or create a branch in the submodule before making changes.

Tip

You can set up a submodule to track a branch with the -b option:

git submodule add -b <bname> https://gitlab.com/...

Alternatively, navigate into the directory of an existing submodule (e.g., mySub) and run:

git checkout bname
git branch --set-upstream-to=origin/bname
cd ../  # you leave the submodule
git submodule set-branch --branch bname mySub  # record the branch in .gitmodules (Git 2.22+)
git add .gitmodules mySub
git commit -m "Tracking branch bname in mySub"

  4. Submodule URLs Can Change 🔗: If the URL of a submodule repository changes, you must update the .gitmodules file in the parent repository and then run git submodule sync so that existing clones pick up the new URL. Failing to do so can lead to broken links when trying to update or clone the submodule.

  5. Cloning with Submodules Requires Extra Steps 🛠️: When cloning a repository with submodules, you need to use the --recurse-submodules option or run git submodule init and git submodule update afterward. Forgetting these steps can lead to missing submodule content.

  6. Submodules Can Increase Complexity 🌀: Using submodules can add complexity to your project structure. If not managed properly, it can lead to confusion about which version of a submodule is being used and how it relates to the parent repository.
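Two of these gotchas, the location of the submodule's repository and the detached HEAD, can be checked by hand. The sketch below is one way to do so on throwaway local repositories; it assumes Git is installed, and lib, parent, clone, and tool.txt are hypothetical names.

```shell
set -eu
work=$(mktemp -d)   # throwaway playground; all names are illustrative
g() {
  git -c protocol.file.allow=always \
      -c user.name=Demo -c user.email=demo@example.com "$@"
}

# Setup: a library and a parent that embeds it as a submodule.
g init -q "$work/lib"
echo 'v1' > "$work/lib/tool.txt"
g -C "$work/lib" add tool.txt
g -C "$work/lib" commit -q -m "lib: first version"
g init -q "$work/parent"
g -C "$work/parent" submodule --quiet add "$work/lib" lib
g -C "$work/parent" commit -q -m "Add lib as a submodule"

# Clone the parent together with its submodules.
g clone -q --recurse-submodules "$work/parent" "$work/clone"

# The submodule's .git is a small file with a "gitdir:" pointer ...
cat "$work/clone/lib/.git"
# ... and the actual repository lives inside the parent's .git folder.
ls "$work/clone/.git/modules"

# The submodule is checked out at a commit, not on a branch.
if g -C "$work/clone/lib" symbolic-ref -q HEAD >/dev/null; then
  head_state=branch
else
  head_state=detached
fi
echo "$head_state"
```

Running git symbolic-ref -q HEAD inside a submodule is a quick way to find out whether you are on a branch before you start committing.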

Handling Submodules in CI/CD Pipelines#

When using submodules in your GitHub Workflows or GitLab Pipelines, you need to ensure that the submodules are properly initialized and updated on the runner.

GitHub Workflow Example: Handling Submodules#

In GitHub Actions, you can use the actions/checkout action with the submodules option set to true (or recursive for nested submodules) to ensure submodules are cloned and updated as part of the workflow.

Example: GitHub Workflow (.github/workflows/ci.yml)

name: CI with Submodules

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository with submodules
        uses: actions/checkout@v4
        with:
          submodules: true   # Initialize and update submodules
          fetch-depth: 0     # Ensure full history is fetched

      - name: Build Project
        run: |
          # Example build command
          make build

This ensures that the submodules are checked out and updated as part of your CI workflow.

GitLab Pipeline Example: Handling Submodules#

In GitLab CI, you can set the GIT_SUBMODULE_STRATEGY variable to ensure submodules are fetched during the CI pipeline.

Example: GitLab Pipeline (.gitlab-ci.yml)

variables:
  GIT_SUBMODULE_STRATEGY: recursive  # Or 'normal'
  GIT_SUBMODULE_FORCE_HTTPS: "true"  # Rewrite url to use HTTPS
  GIT_SUBMODULE_DEPTH: 0             # Fetch full history

stages:
  - build

build_project:
  stage: build
  script:
    - make build

Only the GIT_SUBMODULE_STRATEGY variable is required for the submodules to be fetched; the other two variables are optional.