Loughborough University
Leicestershire, UK
LE11 3TU
+44 (0)1509 222222

IT Services : High Performance Computing

Provenance


Introduction

Provenance is the ability to track how an item was created. There are various ways to do this, one of which is to attach metadata to a result stating what its antecedents are. If those antecedents also carry metadata, the provenance can be traced back, step by step, through the intermediate files and processes to the original data.
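As a rough illustration of tracing antecedents back to the original data, the sketch below walks a small metadata table. The file names and the METADATA layout are invented for this example and are not a format used by any particular tool.

```python
from typing import Dict, List

# Hypothetical metadata: each item lists the items it was directly created from.
# An empty list marks an item with no recorded antecedents (original data or code).
METADATA: Dict[str, List[str]] = {
    "figure.png": ["results.csv", "plot.py"],
    "results.csv": ["raw_data.csv", "analyse.py"],
    "raw_data.csv": [],
    "analyse.py": [],
    "plot.py": [],
}


def trace_provenance(item: str, metadata: Dict[str, List[str]]) -> List[str]:
    """Return every direct and indirect antecedent of item, nearest first."""
    seen: List[str] = []
    queue = list(metadata.get(item, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited via another route
        seen.append(current)
        queue.extend(metadata.get(current, []))
    return seen


def original_sources(item: str, metadata: Dict[str, List[str]]) -> List[str]:
    """Return the antecedents of item that have no antecedents of their own."""
    return [a for a in trace_provenance(item, metadata) if not metadata.get(a)]


print(trace_provenance("figure.png", METADATA))
print(original_sources("figure.png", METADATA))
```

Note that the trace stops wherever the metadata stops: an item with no recorded antecedents is treated as original, whether or not it really is.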

For Reproducibility

Provenance is useful if data, programs or processes are later found to be faulty, because it records how each piece of data was produced and so allows it to be recreated by rerunning the analyses with corrected data, programs or processes.

As Part of Research Data Management

Provenance is also useful in Research Data Management (RDM), as it allows you to determine what contributed to the results you have published in a paper, and to ensure that all data, programs, source code and other details are captured (provided the metadata for these is sufficient) and, where necessary, stored in the RDM system.

For example, your analysis might take an external, public data set D, produce intermediate results R1 using program P1 and arguments A1, and then produce final results R2 using program P2 and arguments A2. Tracking provenance allows you to store the method by which D was turned into R2 (including the arguments and perhaps the source code of P1 and P2). Since D is a public data set, it need not be stored in the RDM system as long as its version and hash are retained. Similarly, if R1 can be reliably reproduced from D using P1 and A1, only its metadata need be stored, unless R1 takes a long time to create from D.
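The D, R1, R2 example could be recorded along the following lines. This is only a sketch: the record fields are invented for illustration, the byte strings stand in for real file contents, and it assumes that R2 is produced from R1. SHA-256 from Python's standard hashlib module is used as the hash.

```python
import hashlib
import json


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of content as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


def provenance_record(name, content, program=None, arguments=None, inputs=()):
    """Build a metadata record (hypothetical layout) for one item.

    Each input is referenced by name and hash only, so the record stays
    valid even if the input itself is not kept in the RDM system.
    """
    return {
        "name": name,
        "sha256": sha256_hex(content),
        "program": program,
        "arguments": list(arguments or []),
        "inputs": [{"name": r["name"], "sha256": r["sha256"]} for r in inputs],
    }


# Stand-in contents; in practice these would be read from the real files.
d = provenance_record("D", b"public data set, version 1.0")
r1 = provenance_record("R1", b"intermediate results",
                       program="P1", arguments=["A1"], inputs=[d])
# Assumption: R2 is derived from R1.
r2 = provenance_record("R2", b"final results",
                       program="P2", arguments=["A2"], inputs=[r1])

print(json.dumps(r2, indent=2))
```

With records like these, a regenerated R1 can be checked against the stored hash before it is used to rebuild R2, which is what makes it reasonable to keep only R1's metadata.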