Steps to reproduce:
1. Write a container with an app that generates a PDF with content from the inputs. For example, something similar to the basic examples of the Python libraries reportlab and matplotlib.
2. Upload the container into Kive, and launch a run.
3. When the run is finished, rerun it.
Expected behaviour: the outputs should match.
Actual behaviour: PDF outputs usually don't match.
Analysis
Most libraries write a timestamp into the content of the PDF files they create. That means that the exact same data inputs won't generate byte-identical outputs if the two runs happen at different times.
Good news, though. It looks like both matplotlib and reportlab support the SOURCE_DATE_EPOCH environment variable that is intended to help make outputs reproducible. If we took the timestamp of the container and passed it to each run in the SOURCE_DATE_EPOCH environment variable, that would probably avoid the problems with PDFs not matching after reruns.
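A minimal sketch of that idea, assuming "the timestamp of the container" means the modification time of the container file on disk (Kive may well record a different date, such as an upload time in its database). The helper name `build_run_env` is hypothetical, and how the environment actually reaches the Singularity process is left out:

```python
import os
from pathlib import Path


def build_run_env(container_path, base_env=None):
    """Copy the environment and add SOURCE_DATE_EPOCH for one run.

    Assumption: the container's timestamp is its file modification time,
    truncated to whole seconds since the Unix epoch.
    """
    env = dict(os.environ if base_env is None else base_env)
    container_mtime = int(Path(container_path).stat().st_mtime)
    env["SOURCE_DATE_EPOCH"] = str(container_mtime)
    return env
```

The result could then be passed to whatever launches the run, for example `subprocess.run(command, env=build_run_env(container_path), check=True)`. Because the value depends only on the container, a rerun with the same container gets the same value.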
Another option is to take the latest date from the container and all the input datasets. I think that would be easier to understand, but I suspect that would be unreliable. For example, if we rerun two nested runs where the output of one is the input of the other, and we have to recreate that output, then it would have a different timestamp from the first output.
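To illustrate why this option looks unreliable, here is a small sketch with made-up epoch values. The function name `latest_epoch` and all the numbers are hypothetical; the only point is that a recreated intermediate output changes the result:

```python
def latest_epoch(container_epoch, input_epochs):
    """Return the latest timestamp among the container and its inputs."""
    return max([container_epoch, *input_epochs])


CONTAINER_EPOCH = 1000  # hypothetical container timestamp

# Original nested runs: the inner run's output was created at time 3000
# and then used as the input of the outer run.
original = latest_epoch(CONTAINER_EPOCH, [3000])

# Rerun: the inner output had to be recreated, so its dataset now has a
# newer creation time, and the outer run sees a different value.
rerun = latest_epoch(CONTAINER_EPOCH, [5000])

print(original, rerun)
```

With the container-only approach, both runs would have used 1000.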
If a pipeline wants to generate PDFs with the current date as the creation timestamp, it could unset the SOURCE_DATE_EPOCH environment variable before calling the library code.
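In a Python pipeline, that opt-out might look like the sketch below. The function name is hypothetical, and it assumes the PDF library reads the variable after this code has run, so calling it at the very start of the script is the safest choice:

```python
import os


def use_current_time_for_pdfs():
    """Remove SOURCE_DATE_EPOCH from this process's environment.

    Returns the removed value, or None if the variable was not set.
    """
    return os.environ.pop("SOURCE_DATE_EPOCH", None)
```

This only affects the current process and any child processes it starts afterwards.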
Test these two libraries to make sure they support the environment variable.
Set the environment variable when launching Singularity.
Update the pipeline developer documentation to explain what the environment variable does, and how to disable it.
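For the first task, one possible way to check a library is sketched below. The helper names are hypothetical, and `generate` is any function of your own that builds a PDF and returns its bytes. The check passes only if two outputs with the same epoch are identical and an output with a different epoch differs, which avoids a false pass when two quick runs happen to fall in the same second:

```python
import hashlib
import os


def _digest_with_epoch(generate, epoch):
    """Hash the bytes returned by generate() with SOURCE_DATE_EPOCH set."""
    saved = os.environ.get("SOURCE_DATE_EPOCH")
    os.environ["SOURCE_DATE_EPOCH"] = str(epoch)
    try:
        return hashlib.sha256(generate()).hexdigest()
    finally:
        # Restore the caller's environment, whether or not generate() failed.
        if saved is None:
            os.environ.pop("SOURCE_DATE_EPOCH", None)
        else:
            os.environ["SOURCE_DATE_EPOCH"] = saved


def honours_source_date_epoch(generate):
    """Return True if generate() appears to respect SOURCE_DATE_EPOCH."""
    first = _digest_with_epoch(generate, 1500000000)
    second = _digest_with_epoch(generate, 1500000000)
    other = _digest_with_epoch(generate, 1600000000)
    return first == second and first != other
```

For matplotlib, `generate` could draw a figure, save it to an `io.BytesIO` buffer with `savefig(buffer, format="pdf")`, and return `buffer.getvalue()`. Something similar should work for reportlab.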