Linking Data to a Project

Linking data to a project is a prerequisite for the project's reproducibility.

data/ Directory

The data/ directory is the single entry point for all project data, with a strict separation by processing stage:

  • data/raw/: Immutable, original data (read-only).

  • data/interim/: Cleaned or transformed intermediate data.

  • data/final/: Synthesized output data and canonical datasets ready for modeling.
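A minimal sketch of how this layout could be set up and how data/raw/ could be protected against accidental edits. The helper names init_data_dirs and make_read_only are hypothetical, and clearing write bits is only one possible way to enforce "read-only" (it assumes a file system that honors POSIX-style permissions).

```python
import stat
from pathlib import Path

# Stage names taken from the layout described above.
STAGES = ("raw", "interim", "final")


def init_data_dirs(project_root: Path) -> dict[str, Path]:
    """Create data/raw, data/interim and data/final under project_root."""
    dirs = {}
    for stage in STAGES:
        path = Path(project_root) / "data" / stage
        path.mkdir(parents=True, exist_ok=True)
        dirs[stage] = path
    return dirs


def make_read_only(raw_dir: Path) -> None:
    """Clear the write bits on every regular file below raw_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for file in Path(raw_dir).rglob("*"):
        if file.is_file():
            file.chmod(file.stat().st_mode & ~write_bits)
```

Calling make_read_only once after the original files have been placed in data/raw/ makes later overwrites fail loudly instead of silently changing the source data.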

The data/README.md

A project's README.md and data/README.md must provide all necessary information about the data used, including:

  • Data dictionaries (definitions of columns/variables).

  • Provenance (exact URLs, SQL queries).

  • Summary of transformations (raw/ -> interim/ -> final/).
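As an illustration of the data dictionary item, the sketch below renders column definitions as a Markdown table that could be pasted into data/README.md. The function name render_data_dictionary and the (type, description) layout are assumptions made for this example, not a prescribed format.

```python
def render_data_dictionary(columns: dict[str, tuple[str, str]]) -> str:
    """Render {column: (type, description)} as a Markdown table."""
    lines = ["| Column | Type | Description |", "| --- | --- | --- |"]
    for name, (dtype, description) in columns.items():
        lines.append(f"| `{name}` | {dtype} | {description} |")
    return "\n".join(lines)
```

Keeping the definitions in one dictionary means the same source can feed both the README table and any validation code the project adds later.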

Infrastructure & Performance Considerations

Impact of data storage on performance

  • I/O Latency: Network drives exhibit high latency when accessing numerous small files.

  • Data Locality: Data is typically transferred to local scratch space (/tmp, $SCRATCH) on compute nodes prior to execution.

  • Configurability: Codebases must handle variable paths based on the execution environment.
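The data locality and configurability points can be sketched together as follows. This is one possible approach, not a required one: the environment variable PROJECT_DATA_DIR and the function names are hypothetical, and SCRATCH is assumed to be set by the cluster environment (the system temporary directory is used as a fallback).

```python
import os
import shutil
import tempfile
from pathlib import Path


def resolve_data_root(default: Path = Path("data")) -> Path:
    """Return $PROJECT_DATA_DIR if set (hypothetical variable), else data/."""
    return Path(os.environ.get("PROJECT_DATA_DIR", default))


def resolve_scratch() -> Path:
    """Return $SCRATCH if set, else the system temporary directory."""
    return Path(os.environ.get("SCRATCH", tempfile.gettempdir()))


def stage_to_scratch(source: Path, job_name: str) -> Path:
    """Copy a data directory to <scratch>/<job_name>/<name> and return the copy."""
    target = resolve_scratch() / job_name / Path(source).name
    # copytree creates missing parent directories; dirs_exist_ok needs Python 3.8+.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

A job script could then call stage_to_scratch(resolve_data_root() / "final", "my_job") once at startup and read only from the returned local path, so the code itself never hard-codes a storage location.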

Simply copying all data to data/ won't cut it!