Linking Data to a Project
Linking data is a mandatory condition for a project's reproducibility.
data/ Directory
Single entry point with strict structural separation:
data/raw/: Immutable, original data. (Read-only)
data/interim/: Cleaned or transformed intermediate data.
data/final/: Synthesized output data and canonical datasets ready for modeling.
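The layout above can be set up with a few lines of Python. This is a minimal sketch, not a prescribed tool: the function names `init_data_dirs` and `make_read_only` are hypothetical, and dropping the write permission bits is only one possible way to treat `data/raw/` as read-only.

```python
import stat
from pathlib import Path

# Subdirectory names taken from the layout described above.
SUBDIRS = ("raw", "interim", "final")


def init_data_dirs(project_root: Path) -> dict:
    """Create data/raw, data/interim and data/final under project_root."""
    data_root = project_root / "data"
    paths = {}
    for name in SUBDIRS:
        path = data_root / name
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


def make_read_only(raw_dir: Path) -> None:
    """Clear the write permission bits on every file below raw_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for file in raw_dir.rglob("*"):
        if file.is_file():
            file.chmod(file.stat().st_mode & ~write_bits)
```

Calling `make_read_only` once after the original data has been placed in `data/raw/` makes accidental overwrites fail loudly on POSIX systems, instead of silently changing the inputs.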
data/README.md
The data documentation location:
A project's README.md and data/README.md must provide all necessary information about the data used.
Data dictionaries (definitions of columns/variables).
Provenance (exact URLs, SQL queries).
Summary of transformations (raw/ -> interim/ -> final/).
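Whether a `data/README.md` covers these three topics can be checked automatically. The sketch below is only an illustration: the section titles in `REQUIRED_SECTIONS` and the function name `missing_sections` are assumptions, since the text names the topics but does not prescribe headings.

```python
from pathlib import Path

# Hypothetical section titles; adapt them to the headings a project actually uses.
REQUIRED_SECTIONS = ("Data dictionary", "Provenance", "Transformations")


def missing_sections(readme: Path) -> list:
    """Return the required section titles that do not appear in the README."""
    text = readme.read_text(encoding="utf-8").lower()
    return [title for title in REQUIRED_SECTIONS if title.lower() not in text]
```

A check like this could run in a test suite or pre-commit hook, so that incomplete data documentation is noticed early.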
Infrastructure & Performance Considerations
Impact of data storage on performance
I/O Latency: Network drives exhibit high latency when accessing numerous small files.
Data Locality: Data is typically transferred to local scratch space (/tmp, $SCRATCH) on compute nodes prior to execution.
Configurability: Codebases must handle variable paths based on the execution environment.
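The data locality and configurability points can be combined in a small helper. This is a sketch under stated assumptions: the environment variable `PROJECT_DATA_DIR` and the function names are hypothetical, `SCRATCH` is read as named in the text, and the system temporary directory (typically `/tmp` on Linux) serves as the fallback.

```python
import os
import shutil
import tempfile
from pathlib import Path


def resolve_data_root(default: Path = Path("data")) -> Path:
    """Return the data root, preferring PROJECT_DATA_DIR (hypothetical name) if set."""
    override = os.environ.get("PROJECT_DATA_DIR")
    return Path(override) if override else default


def stage_to_scratch(source: Path, job_name: str) -> Path:
    """Copy a data directory to local scratch space and return the local path."""
    # Use $SCRATCH when the cluster defines it, else the system temp directory.
    scratch_root = Path(os.environ.get("SCRATCH", tempfile.gettempdir()))
    target = scratch_root / job_name / source.name
    if not target.exists():
        # copytree creates missing parent directories of the target.
        shutil.copytree(source, target)
    return target
```

A job script would then call something like `stage_to_scratch(resolve_data_root() / "final", "my_job")` once at startup and read from the returned path, so the code itself never hard-codes a storage location.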
Simply copying all data to data/ won't cut it!