Detect when files are being staged #5905

Open
rcannood opened this issue Mar 20, 2025 · 7 comments · May be fixed by #5907

Comments

@rcannood
Contributor

rcannood commented Mar 20, 2025

New feature

It'd be great to be able to get notified (using the TraceObserver) when files are being staged.

Use case

I'm currently building a provenance plugin nf-lamin for logging workflow executions in LaminDB.

It's easy to detect when files are being published (i.e. the outputs of the workflow) using the TraceObserver. However, it's not as easy to detect when files are staged (i.e. the inputs of the workflow).

I notice that the nf-prov plugin uses nextflow.prov.util.ProvHelper.getWorkflowInputs() to detect workflow inputs. However, this only allows me to observe the paths after the files have already been staged. That is, I only get to see work/stage-2478a5a8-2313-49c9-8cfe-92ef6483859b/91/ec7b0d3c79c84f8f7e16e07d823a7e/samplesheet-2-0.csv while I actually need https://github.com/nf-core/test-datasets/raw/scrnaseq/samplesheet-2-0.csv.

Suggested implementation

We could modify the TraceObserver. Note that I would add onFileStage and onFileStaging so we could know whether a file is yet to be staged, or whether it has been staged.

trait TraceObserver {
    void onFilePublish(Path destination, Path source)
    void onFilePublishing(Path destination, Path source)
    void onFileStage(Path destination, Path source)
    void onFileStaging(Path destination, Path source)
}

Alternatively, an enum could be added to reduce the number of different functions.

enum FileTransferStatus {
    STARTED,
    FINISHED
}
trait TraceObserver {
    // existing two-argument callback, kept so current observers still work
    void onFilePublish(Path destination, Path source) {}
    void onFilePublish(Path destination, Path source, FileTransferStatus status) {
        // preserve backward compatibility: forward only completed transfers
        if (status == FileTransferStatus.FINISHED) {
            onFilePublish(destination, source)
        }
    }
    void onFileStage(Path destination, Path source, FileTransferStatus status) {}
}
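
To illustrate how a plugin might consume such a hook, here is a minimal sketch. It assumes the simpler two-argument onFileStage(destination, source) form; StageObserver and StageRecorder are hypothetical names made up for this example, not part of Nextflow, and plain local paths stand in for remote sources.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical stand-in for the proposed hook; NOT the real Nextflow TraceObserver.
trait StageObserver {
    void onFileStage(Path destination, Path source) {}
}

// Hypothetical observer that remembers which source each staged copy came from.
class StageRecorder implements StageObserver {
    final Map<Path, Path> sources = [:]

    void onFileStage(Path destination, Path source) {
        sources[destination] = source
    }
}

// Simulate the runtime notifying the observer about one staged file.
def recorder = new StageRecorder()
def staged = Paths.get('work', 'stage-example', 'samplesheet.csv')
def original = Paths.get('data', 'samplesheet.csv')
recorder.onFileStage(staged, original)
```

With a mapping like this, a provenance plugin could look up the original location for any staged path it encounters later.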

Happy to send a PR if it helps! :)

@bentsherman
Member

This would also help me solve a problem in nf-prov described here

I would propose just adding this method:

    void onFileStage(Path destination, Path source)

@pditommaso
Member

You can determine whether a file is staged by checking whether the file path scheme != work dir scheme.
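
A rough sketch of that scheme comparison, assuming locations are available as URI strings and that a missing scheme means a local file. schemeOf and isForeign are hypothetical helpers for illustration (the latter is not the executor.isForeignFile method in Nextflow), and Windows-style paths are not handled.

```groovy
// Assumption: a location without a scheme is a local file.
String schemeOf(String location) {
    def scheme = URI.create(location).scheme
    return scheme ?: 'file'
}

// A file is treated as foreign when its scheme differs from the work dir's scheme.
boolean isForeign(String file, String workDir) {
    return schemeOf(file) != schemeOf(workDir)
}

def remote = 'https://github.com/nf-core/test-datasets/raw/scrnaseq/samplesheet-2-0.csv'
def remoteIsForeign = isForeign(remote, '/home/user/work')
def localIsForeign = isForeign('/data/samplesheet.csv', '/home/user/work')
def sameBucketIsForeign = isForeign('s3://bucket/input.csv', 's3://bucket/work')
```

This check needs the original location; once only the staged work-directory path is visible, both schemes match and the comparison says nothing.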

@bentsherman
Member

I think the problem is that when a foreign file is added to the task inputs, the FileHolder does not include the original path:

def path = normalizeToPath(item)
def target = executor.isForeignFile(path) ? batch.addToForeign(path) : path
def holder = new FileHolder(target)
files << holder

I think if we changed it to new FileHolder(path, target), then you could recover the original path in the onProcessPending event
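
A minimal sketch of that idea, using a hypothetical HolderSketch class in place of Nextflow's FileHolder (its field names and the isStaged helper are assumptions for illustration, and local paths stand in for the remote source):

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical stand-in for FileHolder that keeps both paths.
class HolderSketch {
    final Path sourcePath   // where the input originally lived
    final Path storePath    // where the task reads it from (the staged copy for foreign files)

    HolderSketch(Path sourcePath, Path storePath) {
        this.sourcePath = sourcePath
        this.storePath = storePath
    }

    // The input was staged if the task reads it from somewhere other than its origin.
    boolean isStaged() { sourcePath != storePath }
}

def original = Paths.get('remote', 'samplesheet-2-0.csv')
def staged = Paths.get('work', 'stage-example', 'samplesheet-2-0.csv')
def holders = [new HolderSketch(original, staged), new HolderSketch(original, original)]

// An observer walking the task inputs could recover the original locations of staged files.
def stagedSources = holders.findAll { it.isStaged() }.collect { it.sourcePath }
```

Keeping both paths on the holder means no new observer event is needed for this case; the existing task-level events would already carry the information.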

@rcannood
Contributor Author

I think the problem is that when a foreign file is added to the task inputs, the FileHolder does not include the original path:

Precisely ^^

@pditommaso
Member

Note that input files != staged files, so I don't understand why you want to capture only the latter for provenance purposes.

@rcannood
Contributor Author

I see -- If I understand correctly, you think it would make more sense to notify whenever a file is used as input?

@bentsherman
Member

For my part, the problem is that right now I can only track the staged file, when I really want to track the original remote file for provenance

3 participants