Versioning ⚡️ Reproducibility#
Git is an excellent tool for version control, allowing you to track changes in code and facilitating collaborative development. However, achieving scientific reproducibility requires more than just using Git.
Reproducibility encompasses not only code management but also the versioning of data, tracking the computational environment, ensuring consistent execution, and thoroughly documenting the workflow. This comprehensive approach enables others to reliably reproduce your results.
Reproducibility#
Before we begin, it is important to clarify the definition of “reproducibility” as opposed to “replicability”.
In a scientific context, these terms are often used interchangeably, but they can also refer to different concepts.
In this course, we adopt the widely accepted interpretation that reproducibility is achieved when an analysis can be repeated using the same data, yielding the same results.
Replicability, on the other hand, refers to the situation where an analysis conducted with different data on the same study subject leads to the same conclusions.
Note
For an in-depth exploration of this subject, we recommend reading Understanding Reproducibility and Replicability (In: Reproducibility and Replicability in Science, National Academies of Sciences, 2019).
Based on these definitions, our focus here is on a narrow interpretation of reproducibility. More specifically, we consider reproducibility achieved when the same implementation of a method, applied to the exact same data, produces the same result, as is commonly adopted in computer science. We will demonstrate how Git and related remote services can be utilized to enhance the reproducibility of computational studies in this sense.
Attention
Some products require strict certification processes, such as the ASME VVUQ standard. For example, medical devices must comply with the ASME V&V 40 standard, where reproducibility is part of the verification and validation process.
What’s Missing?#
If you implement a computational analysis using an existing dataset in Python (or another programming language) and track your project with Git, are there aspects of your analysis that remain uncontrolled?
Where Git and its Remote Services Can Help#
Comprehensive documentation of the entire project, including the rationale behind decisions, methodologies used, and any challenges encountered, to facilitate understanding and reproducibility.
How?
- Use the `README.md` file as a central location for documenting the content; a minimal outline is sketched below.
- Make use of Issues and Merge/Pull Requests to declare what you are doing (Issue) and how you are doing it (Merge/Pull Request).
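For orientation, a minimal `README.md` could be structured like this; the section names are suggestions, not a fixed scheme:

```markdown
# Project Title

## Purpose
One paragraph on the research question and the rationale behind key decisions.

## Data
Where the data comes from (e.g., a DOI), how it is versioned, and any preprocessing applied.

## Installation
How to set up the environment, e.g., `pip install -r requirements.txt`.

## Running the Analysis
The entry point (e.g., `./run_analysis.sh`) and the expected outputs.
```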
Data needs to be accessible and its usage and preprocessing properly documented.
How?
- Version Control: Use Git to track changes in your data. Manage large datasets efficiently with Git Large File Storage (example commands are sketched below).
- Publish Data if possible: Use platforms like Zenodo for data sharing. Share the Digital Object Identifier (DOI) and links to increase visibility.
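For illustration, tracking large files with Git LFS typically looks like this (the `*.csv` pattern and the file path are placeholders):

```bash
# One-time setup of the Git LFS extension in this repository
git lfs install

# Track all CSV files via LFS instead of storing them directly in Git
git lfs track "*.csv"

# The tracking rules live in .gitattributes, which must be committed
git add .gitattributes data/measurements.csv
git commit -m "Track dataset with Git LFS"
```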
Clearly document the steps taken in the analysis to allow others to understand and reproduce the process.
How?
- Document the execution workflow in an automation script, e.g., with a simple `run_analysis.sh` (a sketch follows below), a workflow management tool like snakemake or nextflow, or GitHub Actions.
- Clarify how the versions of the analysis scripts, as well as the dataset, are linked to the execution of the analysis.
- Specify how the execution environment is built.
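A `run_analysis.sh` can stay very small; in this sketch, `analyze.py` and the `results/` layout are hypothetical placeholders:

```bash
#!/usr/bin/env bash
# Abort on errors, undefined variables, and failed pipes
set -euo pipefail

mkdir -p results

# Record the exact commit of the scripts (and pinned submodules) behind this run
git rev-parse HEAD > results/analysis_commit.txt

# Execute the analysis with its tracked configuration file
python analyze.py --config config.yaml
```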
Configuration involves documenting the parameters and settings that guide the analysis, which is essential for reproducing the same results.
Note
This includes randomness control, i.e., the use and specification of seeds whenever possible.
How?
- Declare all configuration parameters separately from your analysis scripts!
- Use simple and readable formats (e.g., `.yaml` or `.json`); a sketch follows after the hint below.
- Track the configuration files in the same repository as your analysis scripts.
Hint
If you use multiple machines, maintain separate configuration files for each. This way, you avoid manually updating settings in your scripts when switching machines.
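Such a configuration file might look like the following sketch; all parameter names and values are purely illustrative:

```yaml
# config.yaml -- every setting that influences the analysis lives here
random_seed: 42                    # randomness control for reproducible runs
input_data: data/measurements.csv  # dataset path (version-tracked via Git LFS)
n_iterations: 1000
learning_rate: 0.01
```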
Dependencies refer to the libraries and frameworks used in the analysis. Specifying exact versions helps control for changes that could affect results.
How?
- Include dependency declarations in your repository, e.g., `requirements.txt` or `environment.yml` (a sketch follows below).
- Pin dependencies rather than declaring minimal requirements, e.g., `numpy==1.19.2` instead of `numpy>=1.19.2`.
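A pinned `requirements.txt` could then look like this; the listed packages and versions are merely examples:

```text
# requirements.txt -- direct dependencies, pinned to exact versions
numpy==1.19.2
pandas==1.1.3
matplotlib==3.3.2
pyyaml==5.3.1
```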
Address the indirect dependencies required by your primary libraries. Managing these ensures that all necessary components are accounted for.
How?
- Utilize isolated, temporary environments to execute your analysis, e.g., virtualenv, renv, or conda (see the sketch below).
- Prefer declarative systems such as NixOS, or at the very least, use Docker.
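With virtualenv, for instance, an isolated environment covering both direct and transitive dependencies can be set up like this:

```bash
# Create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate

# Install the pinned direct dependencies; pip resolves the transitive ones
pip install -r requirements.txt

# Freeze the fully resolved environment, transitive packages included
pip freeze > requirements.lock.txt
```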
The execution environment covers the operating system, hardware, and any relevant settings that could influence the analysis; documenting these ensures that the environment is replicable.
How?
- Report hardware and hardware configuration, e.g., with `lshw` or `inxi`, as sketched below.
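For example, a hardware snapshot can be stored next to the results (the output path is an assumption):

```bash
# Compact hardware summary (lshw typically needs root for full detail)
sudo lshw -short > results/hardware.txt

# Alternative: inxi with full (-F) and extra verbosity (-x)
inxi -Fx >> results/hardware.txt
```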
How Git Can Enhance Reproducibility#
By using Git, you can effectively track and version the scripts associated with your computational analysis.
You can easily add a simple text file (such as `renv.lock` or `pyproject.toml`) that contains pinned direct dependencies to your repository, allowing them to be tracked alongside the rest of your code.
This establishes a solid foundation for managing dependencies directly with Git.
Utilizing a standard `README.md` file enables you to document the installation process for declared dependencies, provide instructions for running the analysis script, and offer insights into how the analysis workflow is structured and should be executed.
Moreover, incorporating a parameterization file (e.g., a `.yaml` or `.json` file) into the repository allows you to outline the parameters utilized in the analysis.
Ideally, you should modify the analysis script to automatically load all necessary parameters from this file.
This approach facilitates documenting and tracking the configuration settings employed in your analysis, including random number generator seeds, hyperparameters, and more.
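A sketch of such a loading step in Python, assuming the illustrative `config.yaml` from above and the PyYAML package:

```python
import random

import numpy as np
import yaml  # provided by the pyyaml package

# Load every parameter of the analysis from the tracked configuration file
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Randomness control: seed all random number generators the analysis uses
random.seed(config["random_seed"])
np.random.seed(config["random_seed"])

print(f"Analysing {config['input_data']} with "
      f"{config['n_iterations']} iterations")
```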
Git Can Do More!#
While tracking large binary files is not Git's strong suit, there is an extension called Git LFS that efficiently makes up for this limitation.
With Git LFS, you can track and version larger datasets, thereby contributing to the availability of analysed data by publishing your findings alongside the exact version of the dataset used in your analysis.
… and More!#
To effectively track both your analysis scripts and the analyzed data (using Git LFS), it’s essential to establish a connection between these two repositories. While it’s possible to combine scripts and data into a single repository, this approach may not be ideal, as both the dataset and the analysis scripts can evolve independently. This is especially true for datasets, which may be utilized in various studies.
Fortunately, Git provides `git submodule`, a feature specifically designed for linking multiple repositories.
By using submodules, you can seamlessly connect different repositories, allowing you to specify the exact version of the data used in each version of your analysis.
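Linking a separately versioned data repository could look like this; the repository URL and the `data/` path are placeholders:

```bash
# Link the separately versioned data repository into ./data
git submodule add https://example.org/my-group/dataset.git data
git commit -m "Pin dataset version via submodule"

# After cloning the analysis repository, fetch the pinned data version
git submodule update --init
```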
How Remote Services Can Enhance Reproducibility#
If you’ve explored git-and-its-remotes and ci-cd-workflows, you may have gained insights into how remote services like GitHub and GitLab can improve the reproducibility of scientific analyses.
One of the most significant contributions of these platforms is improved accessibility. Researchers can easily share their work, making it more available to others in the scientific community.
Collaboration tools, such as Issues and Merge/Pull Requests, play a significant role in enhancing the documentation of an analysis. These tools are particularly effective for formalizing features and tracking their implementation.
Automation is another important aspect of remote services, allowing automated processes to be triggered by various events within a repository or project. Continuous Deployment (CD) is especially relevant to the reproducibility of scientific analyses:
Automation scripts can do more than just deploy a website; they can specify which scripts to run and the corresponding versions of the data used. Since automation scripts define the conditions for their execution and are part of the repository, they can comprehensively document and declaratively specify an analysis, detailing everything from the data utilized to the analysis scripts, their dependencies, and the specific environment in which the scripts were executed.
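As one possible sketch, a GitHub Actions workflow along these lines would re-run the analysis on every push; the file names reuse the hypothetical examples from above:

```yaml
# .github/workflows/analysis.yml
name: Run analysis
on: [push]

jobs:
  analysis:
    runs-on: ubuntu-latest
    steps:
      # Fetch the scripts plus the pinned data submodule and LFS objects
      - uses: actions/checkout@v4
        with:
          submodules: true
          lfs: true

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      # Install pinned dependencies and run the documented workflow
      - run: pip install -r requirements.txt
      - run: ./run_analysis.sh
```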
Bringing It All Together: Enhancing Reproducibility#
To go from basic version control to full reproducibility, you need:
- Documentation: Include thorough documentation in your repository using a `README.md` file or a dedicated `docs/` directory. This ensures that users can easily understand your project.
- Data Availability: Publish your data! Use Git LFS for effective versioning of large datasets.
- Workflow Documentation: Leverage submodules and automation scripts to comprehensively document the full analysis workflow, providing clarity on how to execute your project.
- Dependencies: Clearly specify direct dependencies in your project. This helps users install the necessary libraries or tools to run your analysis.
- Transitive Dependencies: Define isolated execution environments to manage transitive dependencies effectively, e.g., to ensure that all required packages are available without conflicts.
- Environment Tracking: Use isolation tools like Docker or ✨NixOS✨ to track the execution environment, guaranteeing consistency across different systems.
- Configuration Settings: Declare and load configuration settings to manage parameters used in your analysis, making it easier for others to reproduce your work.