Avoid changing timestamps in outputs #2065

Open
3 tasks
donkirkby opened this issue Sep 11, 2024 · 0 comments

Steps to reproduce:

  1. Write a container with an app that generates a PDF with content from the inputs. For example, something similar to the basic examples of the Python libraries, reportlab and matplotlib.
  2. Upload the container into Kive, and launch a run.
  3. When the run is finished, rerun it.

Expected behaviour: the outputs should match.

Actual behaviour: PDF outputs usually don't match.

Analysis

Most libraries that write PDF files embed a creation timestamp in the file content. That means identical inputs won't generate byte-for-byte identical outputs if the two runs happen at different times.

Good news, though. It looks like both matplotlib and reportlab support the SOURCE_DATE_EPOCH environment variable that is intended to help make outputs reproducible. If we took the timestamp of the container and passed it to each run in the SOURCE_DATE_EPOCH environment variable, that would probably avoid the problems with PDFs not matching after reruns.
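As a rough illustration of the first option, here is a minimal sketch of building the environment for a run. The function name `build_run_env` and its parameters are hypothetical and not taken from Kive's code. It also assumes Singularity's convention that a host variable prefixed with `SINGULARITYENV_` is set inside the container, which would matter if the run is launched with a cleaned environment.

```python
import os
from datetime import datetime
from typing import Mapping, Optional


def build_run_env(container_created: datetime,
                  base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with SOURCE_DATE_EPOCH set.

    container_created should be timezone-aware; a naive datetime would be
    interpreted in the local time zone by timestamp().
    """
    env = dict(os.environ if base_env is None else base_env)
    # SOURCE_DATE_EPOCH is an integer number of seconds since the Unix epoch.
    epoch = str(int(container_created.timestamp()))
    env["SOURCE_DATE_EPOCH"] = epoch
    # Assumption: Singularity copies SINGULARITYENV_X into the container as X.
    env["SINGULARITYENV_SOURCE_DATE_EPOCH"] = epoch
    return env
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run()` when launching the container.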

Another option is to take the latest date from the container and all the input datasets. I think that would be easier to understand, but I suspect it would be unreliable. For example, if we rerun two nested runs where the output of one is the input of the other, and we have to recreate that intermediate output, then the recreated dataset would have a different timestamp from the original one, and the second run would see a different date.
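The nested-run problem can be shown with made-up numbers. In this sketch, `epoch_from_latest` is a hypothetical helper and the timestamps are arbitrary seconds, not real Kive data.

```python
def epoch_from_latest(container_time: int, input_times: list) -> int:
    """Second option: use the latest timestamp of the container and inputs."""
    return max([container_time, *input_times])


container_time = 100  # hypothetical container timestamp

# Original execution: the upstream run wrote its output at t=200, and the
# downstream run used that output as its input.
original_epoch = epoch_from_latest(container_time, [200])

# Rerun: the intermediate output had to be recreated, this time at t=300.
rerun_epoch = epoch_from_latest(container_time, [300])

print(original_epoch, rerun_epoch)  # 200 300
```

The downstream run gets a different value on the rerun, so its PDF would still change. Using only the container timestamp gives 100 both times.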

If a pipeline wants to generate PDFs with the current date as the creation timestamp, it could unset the SOURCE_DATE_EPOCH environment variable before calling the library code.
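For a Python pipeline, opting out might look like the sketch below. The function name is hypothetical; the only real call is `os.environ.pop`, which removes the variable for the current process and any child processes it starts afterwards.

```python
import os


def use_current_time_for_timestamps() -> None:
    """Remove SOURCE_DATE_EPOCH so libraries fall back to the current time.

    Call this before the PDF library writes the file; a library that reads
    the variable at save time will then not see it.
    """
    os.environ.pop("SOURCE_DATE_EPOCH", None)
```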

  • Test these two libraries to make sure they support the environment variable.
  • Set the environment variable when launching Singularity.
  • Update the pipeline developer documentation to explain what the environment variable does, and how to disable it.
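One possible shape for the first task is a small helper that writes the same file twice with the variable set and compares digests. This is only a sketch using the standard library: `is_reproducible` and `write_file` are hypothetical names, and the callable passed in would wrap the real reportlab or matplotlib call under test.

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_reproducible(write_file: Callable[[Path], None],
                    epoch: int = 0,
                    pause_seconds: float = 0.0) -> bool:
    """Write a file twice with SOURCE_DATE_EPOCH set and compare the bytes."""
    old_value = os.environ.get("SOURCE_DATE_EPOCH")
    os.environ["SOURCE_DATE_EPOCH"] = str(epoch)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            first = Path(tmp) / "first.pdf"
            second = Path(tmp) / "second.pdf"
            write_file(first)
            # Timestamps usually have one-second resolution, so a pause makes
            # it more likely that an ignored variable shows up as a mismatch.
            time.sleep(pause_seconds)
            write_file(second)
            return file_digest(first) == file_digest(second)
    finally:
        # Restore the caller's environment.
        if old_value is None:
            os.environ.pop("SOURCE_DATE_EPOCH", None)
        else:
            os.environ["SOURCE_DATE_EPOCH"] = old_value
```

With the default `pause_seconds=0.0`, two writes in the same second could match even if the library ignores the variable, so a real test would probably pause for a second or more.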
@donkirkby donkirkby added this to the 0.17 - Collate data sets milestone Sep 11, 2024