Submodules#
What are Submodules?#
A submodule is essentially a repository embedded within another repository. It allows you to include and manage external repositories within your main project. This is particularly useful for integrating third-party libraries or dependencies that are maintained separately.
Note
Once a submodule is initialized and updated, its content appears as a regular folder inside the parent repository.
Why are Submodules Useful?#
When working on a project, you may need to incorporate another project, such as a library developed by someone else or a tool you’re building for use in multiple projects. A common challenge in these situations is keeping the two projects separate while still being able to use one from within the other.
For example, imagine you’re conducting research and want to use a data analysis tool you previously developed for another project. You have two options: you could copy the code from the old project into your new one, or you could include it from a shared source. The problem with copying the code is that if you make any changes, it can be difficult to merge those changes back into the original tool later. Conversely, including it from a shared source may limit your ability to customize it, and ensuring that all collaborators have access to it can be challenging.
Git addresses this issue with submodules. A submodule lets you keep a repository as a subfolder within another repository, so you can include a specific version of an external repository in your main project. This is particularly useful for incorporating your own tools or third-party resources that are maintained separately, while keeping your changes to them separate from the main project.
Features of Submodules#
Separation: Submodules remain independent repositories, so their versioning history is separate from the parent repository. This allows for better modularity and organization of code, as each submodule can evolve independently.
Pinning: Submodules are usually pinned to specific commits, ensuring reproducibility by locking them to a particular state. This means that when you clone the parent repository and update its submodules, you get the exact version of each submodule that is recorded in the commit you checked out.
Updates: Submodules can be updated independently or synchronized with the parent repository. You can choose to pull the latest changes from the submodule’s repository without affecting the parent repository, or you can update the submodule reference in the parent repository to point to a new commit.
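Pinning can be made visible by looking at what the parent repository actually stores for a submodule. The following is a minimal sketch, not a recipe from the course: it assumes throwaway local repositories created under a temporary directory, and the names `tool`, `parent`, `libs/tool` and `tool.txt` are purely illustrative. The `protocol.file.allow` setting is only needed because a local path stands in for a remote URL.

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical external project, created locally so the sketch runs offline.
git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"

# Parent project that embeds the tool under libs/tool.
git init -q "$WORK/parent"
cd "$WORK/parent"
git symbolic-ref HEAD refs/heads/main
# protocol.file.allow is only required because the "remote" is a local path.
git -c protocol.file.allow=always submodule --quiet add "$WORK/tool" libs/tool
git commit -q -m "Add tool as submodule"

# The parent stores none of the tool's files, only a "gitlink" entry
# (mode 160000) that holds the id of the pinned commit.
entry=$(git ls-tree HEAD libs/tool)
pinned=$(git rev-parse HEAD:libs/tool)
checked_out=$(git -C libs/tool rev-parse HEAD)
echo "$entry"
echo "pinned:      $pinned"
echo "checked out: $checked_out"
```

The commit id printed for the parent's entry matches the commit checked out inside `libs/tool`, which is the separation and pinning described in the list above.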
Benefits for Reproducibility#
Pinning Submodules: By pinning submodules to specific commits, you ensure the same version of an external library, dataset (remember LFS!), or tool is always used, which is crucial for reproducibility in complex projects. This helps avoid issues that arise from changes in dependencies.
Independent Versioning: Each submodule has its own versioning history, allowing it to be maintained separately from the parent project. This means that updates or changes in a submodule do not directly impact the parent repository unless explicitly updated.
Flexibility: Submodules can be easily updated or switched to different versions without affecting the rest of the project. This flexibility allows developers to experiment with new features or fixes in a submodule while keeping the main project stable.
Use Cases#
Third-party Libraries#
If you are developing a project that relies on third-party libraries, you can use submodules to include these libraries in your project without merging them directly into your project. This makes it easier to manage updates and changes. For example, you might use a submodule to include the code of a research paper you want to integrate into your analysis.
Separate Repositories#
For projects that consist of multiple repositories, submodules allow you to link these repositories together while maintaining separate version control for each. This allows for modular project management while still working on them cohesively. For example, this course is organized into several repositories, each containing different sections of the course material.
using-git-in-academia
├── ci-cd-workflows
├── git-and-science
├── git-and-its-remotes
└── working-with-git
Essential Commands#
Here’s a simple overview of the basic commands for working with submodules:
| Command | Description |
|---|---|
| | Add a new submodule to the project. |
| | Initialize & update, fetching the latest changes from remote repo. |
| | Update the submodules to the commit specified in the parent repo. |
| | Update the submodule to the latest commit on the tracked branch. |
| | Check the status of the submodules (in [path]). |
| | Remove a submodule from the parent repository. |
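Several of the operations described in the table can be tried out end to end with standard `git submodule` subcommands. The sketch below is one possible way to spell them, under stated assumptions: the repositories are throwaway local ones, `libs/tool` and `tool.txt` are illustrative names, and the `sgit` wrapper is a hypothetical helper that only exists because local paths stand in for remote URLs.

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Local paths stand in for remote URLs; recent Git versions refuse that
# for submodules unless protocol.file.allow is set, hence this wrapper.
sgit() { git -c protocol.file.allow=always "$@"; }

git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"

git init -q "$WORK/parent"
cd "$WORK/parent"
git symbolic-ref HEAD refs/heads/main

# Add a new submodule and commit the new entry.
sgit submodule --quiet add "$WORK/tool" libs/tool
git commit -q -m "Add libs/tool"

# A plain clone leaves the submodule folder empty ...
git clone -q "$WORK/parent" "$WORK/clone"
cd "$WORK/clone"
before=$(git submodule status libs/tool)   # line starts with "-"
# ... until the submodule is initialized and updated to the pinned commit.
sgit submodule --quiet update --init
after=$(git submodule status libs/tool)    # line starts with " "

# Remove the submodule again: unregister it, then drop it from the index.
git submodule --quiet deinit -f libs/tool
git rm -q -f libs/tool
git commit -q -m "Remove libs/tool"
remaining=$(git submodule status)

printf 'before init: %s\nafter init:  %s\n' "$before" "$after"
```

The first character of each `git submodule status` line encodes the state: `-` for not initialized, a space for "at the pinned commit", and `+` when the checked-out commit differs from the pinned one.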
Working with Submodules#
Remember that a submodule usually does not track a branch, so before you start working in a submodule, check out the branch you want to work on!
Once the submodule is initialized, you can work inside the submodule folder as if it were an ordinary repository.
After committing changes in a submodule, stage the submodule path in the parent repository (e.g. git add <path>) and commit it. This updates the submodule commit that the parent repository records.
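The three steps above (check out a branch, commit inside the submodule, record the new commit in the parent) can be sketched as follows. This is an illustration under assumptions, not the course's own setup: all repositories are throwaway local ones, `libs/tool` and `tool.txt` are made-up names, and `sgit` is a hypothetical wrapper needed only because a local path is used as the submodule URL.

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sgit() { git -c protocol.file.allow=always "$@"; }   # allow local-path URLs

git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"

git init -q "$WORK/parent"
git -C "$WORK/parent" symbolic-ref HEAD refs/heads/main
sgit -C "$WORK/parent" submodule --quiet add "$WORK/tool" libs/tool
git -C "$WORK/parent" commit -q -m "Add libs/tool"

# A collaborator's clone with the submodule initialized.
git clone -q "$WORK/parent" "$WORK/clone"
cd "$WORK/clone"
sgit submodule --quiet update --init
old_pin=$(git rev-parse HEAD:libs/tool)

# 1. The fresh submodule checkout is detached, so switch to a branch first.
cd libs/tool
git checkout -q main
# 2. Work and commit inside the submodule like in any other repository.
echo "v2" >> tool.txt
git commit -q -am "tool v2"
# (In a real project you would now push this commit to the submodule's remote.)
cd ../..

# 3. The parent sees the submodule as modified; stage and commit the new pin.
status=$(git status --short libs/tool)
git add libs/tool
git commit -q -m "Update libs/tool to v2"
new_pin=$(git rev-parse HEAD:libs/tool)

echo "$status"
echo "$old_pin -> $new_pin"
```

Pushing the submodule commit before pushing the parent matters in practice: otherwise the parent would point at a commit that collaborators cannot fetch.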
Gotchas for Submodules#
Submodules Do Not Update Automatically ⚠️: When you clone a repository that contains Submodules, their folders stay empty: the Submodules are neither initialized nor checked out automatically. You need to run git submodule update --init or use the --recurse-submodules option when cloning to ensure they are initialized and updated.
Repository Resides in the .git Folder of the Parent Repo 🔒: The metadata for Submodules is stored in the parent repository’s .git folder.
This means that the actual repository of a Submodule lives under .git/modules/ in the parent rather than in a .git folder of its own (the Submodule folder only contains a small .git file pointing there), which can lead to confusion. Be aware that providing access to the parent repository’s .git folder grants access to the history of all its Submodules!
Submodule Commits Are Detached: Submodules are designed to be pinned to a specific commit and, by default, do not track a branch.
When a Submodule is checked out, it is usually in a “detached HEAD” state, meaning it is not on any branch. Commits made in this state do not belong to a branch and are easily lost, so check out or create a branch before making changes in the Submodule.
Tip
You can set up a Submodule to track a branch with the -b option:
git submodule add -b <bname> https://gitlab.com/...
Alternatively, navigate into the directory of an existing Submodule (e.g., mySub) and run:
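git checkout bname
git branch --set-upstream-to=origin/bname # link the local branch to its remote counterpart
cd ../ # leave the submodule
git add mySub # record the submodule's current commit in the parent
git commit -m "Tracking branch bname in mySub"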
Submodule URLs Can Change 🔗: If the URL of a Submodule repository changes, you must update the .gitmodules file in the parent repository and run git submodule sync so that already initialized Submodules pick up the new URL. Failing to do so can lead to broken links when trying to update or clone the Submodule.
Cloning with Submodules Requires Extra Steps 🛠️: When cloning a repository with submodules, you need to use the --recurse-submodules option or run git submodule init and git submodule update afterward. Forgetting these steps can lead to missing Submodule content.
Submodules Can Increase Complexity 🌀: Using submodules can add complexity to your project structure. If not managed properly, it can lead to confusion about which version of a Submodule is being used and how it relates to the parent repository.
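Two of the gotchas above, the detached state and a changed URL, can be reproduced and resolved in a scratch setup. The sketch below makes several assumptions: the repositories are throwaway local ones, the upstream "move" is simulated by renaming a folder, the names `libs/tool`, `tool.txt` and `tool-moved` are illustrative, and `sgit` is a hypothetical wrapper needed only because local paths serve as URLs. Writing the branch into `.gitmodules` with `git config -f` is shown as one alternative to passing `-b` when adding the submodule.

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sgit() { git -c protocol.file.allow=always "$@"; }   # allow local-path URLs

git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"

git init -q "$WORK/parent"
git -C "$WORK/parent" symbolic-ref HEAD refs/heads/main
sgit -C "$WORK/parent" submodule --quiet add "$WORK/tool" libs/tool
git -C "$WORK/parent" commit -q -m "Add libs/tool"

git clone -q "$WORK/parent" "$WORK/clone"
cd "$WORK/clone"
sgit submodule --quiet update --init

# Gotcha: after "update --init" the submodule sits on a commit, not a branch.
if git -C libs/tool symbolic-ref -q HEAD >/dev/null; then
  state="on a branch"
else
  state="detached"
fi

# Record in .gitmodules which branch "update --remote" should follow.
git config -f .gitmodules submodule.libs/tool.branch main
tracked=$(git config -f .gitmodules submodule.libs/tool.branch)

# Gotcha: the upstream repository moves (simulated by renaming the folder).
mv "$WORK/tool" "$WORK/tool-moved"
git config -f .gitmodules submodule.libs/tool.url "$WORK/tool-moved"
git submodule --quiet sync          # copy the new URL into the local config
git add .gitmodules
git commit -q -m "Track main and update URL of libs/tool"

new_url=$(git -C libs/tool remote get-url origin)
git -C libs/tool fetch -q origin    # works again with the new location
echo "state: $state, tracked branch: $tracked"
echo "origin: $new_url"
```

Editing `.gitmodules` alone only changes what new clones see; `git submodule sync` is what updates the remote of a Submodule that has already been initialized.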
Handling Submodules in CI/CD Pipelines#
When using submodules in your GitHub Workflows or GitLab Pipelines, you need to ensure that the submodules are properly initialized and updated on the runner.
GitHub Workflow Example#
In GitHub Actions, you can use the actions/checkout action with the submodules option set to true (or recursive for nested submodules) to ensure submodules are cloned and updated as part of the workflow.
Example: GitHub Workflow (.github/workflows/ci.yml)
name: CI with Submodules
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository with submodules
        uses: actions/checkout@v4
        with:
          submodules: true  # Initialize and update submodules
          fetch-depth: 0    # Ensure full history is fetched
      - name: Build Project
        run: |
          # Example build command
          make build
This ensures that the submodules are checked out and updated as part of your CI workflow.
GitLab Pipeline Example#
In GitLab CI, you can add the GIT_SUBMODULE_STRATEGY option to ensure submodules are fetched during the CI pipeline.
Example: GitLab Pipeline (.gitlab-ci.yml)
variables:
  GIT_SUBMODULE_STRATEGY: recursive      # Or 'normal'
  GIT_SUBMODULE_FORCE_HTTPS: "true"      # Rewrite url to use HTTPS
  GIT_SUBMODULE_DEPTH: 0                 # Fetch full history
  GIT_SUBMODULE_UPDATE_FLAGS: --remote   # Checkout latest commit on
                                         # specified branch
stages:
  - build
build_project:
  stage: build
  script:
    - make build
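Only the GIT_SUBMODULE_STRATEGY variable is required for the submodules to be fetched; the other variables are optional. Note that passing --remote via GIT_SUBMODULE_UPDATE_FLAGS makes the pipeline use the latest commit of the tracked branch instead of the pinned one.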
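Whichever platform is used, a pipeline can guard against a forgotten submodule setting with a small check before the build step. The following is a hypothetical sketch, not part of either platform: `check_submodules` is a made-up helper that relies on `git submodule status` prefixing uninitialized entries with `-`, and the repositories are throwaway local ones (hence the `sgit` wrapper for local-path URLs).

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sgit() { git -c protocol.file.allow=always "$@"; }   # allow local-path URLs

# Hypothetical guard for a CI job: fail when any submodule is uninitialized.
check_submodules() {
  if git -C "$1" submodule status --recursive | grep '^-'; then
    echo "error: uninitialized submodules in $1" >&2
    return 1
  fi
}

# Scratch project with one submodule, standing in for the real repository.
git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"
git init -q "$WORK/parent"
git -C "$WORK/parent" symbolic-ref HEAD refs/heads/main
sgit -C "$WORK/parent" submodule --quiet add "$WORK/tool" libs/tool
git -C "$WORK/parent" commit -q -m "Add libs/tool"

# A checkout without submodules, as a misconfigured runner would produce.
git clone -q "$WORK/parent" "$WORK/checkout"
if check_submodules "$WORK/checkout"; then first=ok; else first=missing; fi

# After initializing the submodules the guard passes.
sgit -C "$WORK/checkout" submodule --quiet update --init --recursive
if check_submodules "$WORK/checkout"; then second=ok; else second=missing; fi

echo "before init: $first, after init: $second"
```

In a real job only the function and a call such as `check_submodules .` would be needed; the rest of the script just builds a scenario to exercise it.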
Exercise: The Course Content Structure#
Now that we know about Git submodules, let’s take a closer look at how the material of this course is actually structured and how the web-content is built.
Each of the 4 parts of this course resides in its own Repository:
In this main course Repository (t4d-gmbh/using-git-in-academia) all the content is aggregated, compiled into html, and published as a GitHub page.
However, the content of the four parts does not reside directly in the main course t4d-gmbh/using-git-in-academia Repository itself:
How are the four parts included into the main course repository (t4d-gmbh/using-git-in-academia)?
What are the advantages of such a setup?
What are potential drawbacks?
Have a look at the source/content folder in the main course repository (t4d-gmbh/using-git-in-academia).
The source/index.md file declares what is included in the html content.
How the 4 parts are included
The four parts are handled as submodules and added in the source/content/ directory.
This is specified in the .gitmodules file that has entries in the form:
[submodule "source/content/working-with-git"]
    path = source/content/working-with-git
    url = https://github.com/t4d-gmbh/working-with-git.git
    branch = main
With this setup we can fetch the content of the submodules (e.g. with git submodule update --remote), which “unpacks” the content of the repositories into the specified paths.
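Because .gitmodules uses Git's config file format, its entries can be inspected with `git config -f`. The sketch below recreates the entry shown above in a scratch directory (an assumption made so that it runs without cloning the course repository) and reads the values back.

```shell
set -eu
WORK=$(mktemp -d)
cd "$WORK"

# Recreate the entry from the text in a scratch .gitmodules file.
cat > .gitmodules <<'EOF'
[submodule "source/content/working-with-git"]
	path = source/content/working-with-git
	url = https://github.com/t4d-gmbh/working-with-git.git
	branch = main
EOF

name="source/content/working-with-git"
# List every submodule path declared in the file.
paths=$(git config -f .gitmodules --get-regexp '^submodule\..*\.path$')
# Look up individual settings of one submodule by its name.
url=$(git config -f .gitmodules "submodule.$name.url")
branch=$(git config -f .gitmodules "submodule.$name.branch")

echo "$paths"
echo "url=$url branch=$branch"
```

The same commands work inside a real checkout, where they give a quick overview of which repositories and branches a project depends on.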
The actual import of the content is then initiated in the source/index.md file with an import block like this (simplified):
\```{toctree}
:caption: Part 1: Working with Git
:maxdepth: 1
:numbered:
:hidden:
content/working-with-git/source/content/index
\```
Finally, the process is automated in the pages.yml workflow, which fetches the four submodules, builds the pages and the slides, and deploys the generated html content to a static page.
Advantages of such a setup
With submodules, we can decouple the content of each part from the other parts and from the main course (t4d-gmbh/using-git-in-academia). This allows for independent development in each part without affecting the others or the combined content.
In addition, each version of the combined content (i.e. each commit in t4d-gmbh/using-git-in-academia) specifies the exact version (i.e. the commit hash) of each of the four parts included.
Thus, whenever we “checkout” a specific commit in the using-git-in-academia repository, we will also get the specific versions of the four parts.
Furthermore, we can decide for each part individually which version we want to use. We simply set the submodule to the state we want, commit that state in the main repository and create a new tag.
Finally, the parts can be viewed on their own (see e.g. t4d-gmbh.github.io/working-with-git/) or recombined differently.
Potential disadvantages
Using Git submodules generally adds a layer of complexity to a project, which makes it more difficult to handle, especially for collaborators providing smaller contributions.
Splitting up the content into four different repositories and serving the actual html content from a fifth repository that simply aggregates the four parts obscures the mapping from the individual html pages to the markdown files that define them.
An additional drawback of submodules can be the decoupling of versions:
A change in any of the submodules is not automatically picked up when running git submodule update in the main repository.
By default, a git submodule always stays at the specified commit hash.
This is a big advantage when it comes to consistency. However, in some cases this behaviour is not what we want, and we might accidentally display outdated content if we forget to update the state of a submodule.
One approach to address this issue is to specify a branch for each submodule (in the .gitmodules file) and to update them with git submodule update --remote.
The --remote option will ensure that the latest commit of the specified branch is used and not the currently registered one.
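The difference between a plain update and an update with --remote can be observed in a scratch setup. The sketch below is an illustration under assumptions: the repositories are throwaway local ones, `libs/tool` and `tool.txt` are made-up names, the submodule is added with `-b main` so that a branch is recorded in .gitmodules, and `sgit` is a hypothetical wrapper needed only because a local path serves as the URL.

```shell
set -eu
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sgit() { git -c protocol.file.allow=always "$@"; }   # allow local-path URLs

git init -q "$WORK/tool"
git -C "$WORK/tool" symbolic-ref HEAD refs/heads/main
echo "v1" > "$WORK/tool/tool.txt"
git -C "$WORK/tool" add tool.txt
git -C "$WORK/tool" commit -q -m "tool v1"

git init -q "$WORK/parent"
git -C "$WORK/parent" symbolic-ref HEAD refs/heads/main
# -b main writes "branch = main" into .gitmodules.
sgit -C "$WORK/parent" submodule --quiet add -b main "$WORK/tool" libs/tool
git -C "$WORK/parent" commit -q -m "Add libs/tool"

git clone -q "$WORK/parent" "$WORK/clone"
cd "$WORK/clone"
sgit submodule --quiet update --init
pinned=$(git rev-parse HEAD:libs/tool)

# Upstream gains a new commit after the parent pinned the old one.
echo "v2" >> "$WORK/tool/tool.txt"
git -C "$WORK/tool" commit -q -am "tool v2"
latest=$(git -C "$WORK/tool" rev-parse HEAD)

# Plain update: stays on the commit recorded in the parent.
sgit submodule --quiet update
after_plain=$(git -C libs/tool rev-parse HEAD)

# --remote: fetches and checks out the tip of the branch from .gitmodules.
sgit submodule --quiet update --remote
after_remote=$(git -C libs/tool rev-parse HEAD)

# The parent only records the new version once it is staged and committed.
git add libs/tool
git commit -q -m "Bump libs/tool"
new_pin=$(git rev-parse HEAD:libs/tool)

echo "pinned:       $pinned"
echo "after update: $after_plain"
echo "after remote: $after_remote"
echo "new pin:      $new_pin"
```

Note that `--remote` changes only the working copy of the submodule; without the final `git add` and `git commit`, the parent would still point at the old commit.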