Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC for CID store annotations and workflow outputs structure #5885

Open
wants to merge 8 commits into
base: cid-store
Choose a base branch
from

Conversation

jorgee
Copy link
Contributor

@jorgee jorgee commented Mar 13, 2025

This PR is a PoC for adding the annotations to metadata entries in the CID and restructure the workflow outputs in the way defined in output DSL

It is currently getting annotations from tags

Tested with a variation of the e2e test with a small modification to include tags.

output {
  samples {
    path { sample ->
      ...
      }
    }
    tags (experimentId: "value", experimentDate: "date")
    index {
      ...
    }
  }
}

An example of is the generated WorkflowResults. publishedData is a list of all files published and outputs are how they are defined as records in the ouputsDsl

{
    "type": "WorkflowResults",
    "run": "cid://835f10672ae237225406de48672493a0",
    "outputs": {
        "samples": [
            {
                "id": "delta",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/delta"
            },
            {
                "id": "beta",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/beta.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/beta"
            },
            {
                "id": "alpha",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/alpha.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/alpha"
            }
        ]
    },
    "publishedData": [
        "cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/delta",
        "cid://835f10672ae237225406de48672493a0/fastqc/beta.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/beta",
        "cid://835f10672ae237225406de48672493a0/fastqc/alpha.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/alpha",
        "cid://835f10672ae237225406de48672493a0/samples.csv"
    ]
}

Workflow output files are annotated with the provided tags

$ nextflow cid show cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log
{
    "type": "WorkflowOutput",
    "path": "/home/jorgee/nextflow_tests/provenance-test/results/fastqc/delta.fastqc.log",
    "checksum": {
        "value": "f8406b93427367bd50bc2bdf34659aa3",
        "algorithm": "nextflow",
        "mode": "standard"
    },
    "source": "cid://a786558405845a775fc6218fa6aa7b03/delta.fastqc.log",
    "size": 6,
    "createdAt": 1741870059179,
    "modifiedAt": 1741870059179,
    "annotations": {
        "experimentId": "value",
        "experimentDate": "date"
    }
}

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Copy link

netlify bot commented Mar 13, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 853a9c7
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67d2dbe06d34d10008e6b5cb
😎 Deploy Preview https://deploy-preview-5885--nextflow-docs-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@jorgee jorgee changed the base branch from master to cid-store March 13, 2025 13:21
@bentsherman bentsherman self-requested a review March 13, 2025 13:27
Copy link
Member

@bentsherman bentsherman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far. Left a few minor suggestions. I will try to play with it when I have time

jorgee and others added 2 commits March 14, 2025 12:45
Co-authored-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee
Copy link
Contributor Author

jorgee commented Mar 14, 2025

I have updated the code with annotations field in the output. It can be Map or a closure. The closure is evaluated per sample. So, we could support the case of adding sample information as annotation such as the sampleId

output {
  samples {
    path { sample ->... }

    annotations { sample ->
        return [experimentId: params.experimentId , sampleId : sample.id]
    }
    index {
      path 'samples.csv'
      header true
      sep ','
    }
  }
}

Comment on lines 118 to 119
// emit workflow publish event
session.notifyWorkflowPublish(name, normalized)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I've been thinking about saving the index metadata to the workflow result. Currently it will always be a list because of how this publish event works. If the source channel is a value channel, it will be saved as a JSON array with a single element rather than just a JSON object

So I would like to refactor this event to instead happen in the onComplete of the publish op, and include either (1) the one record as an object if the source was a value channel or (2) the entire list of index records if the source was a queue channel.

If you think you can implement that, feel free, otherwise I can do it myself

Copy link
Contributor Author

@jorgee jorgee Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do it.
The other doubt I had in this part is about normalized paths. Originally, it was calling notifyWorkflowPublish per value without normalizing the path. It was just normalized for the index file. For testing purposes I moved the notification to this place. But I think it should notify values with normalized paths in all the cases, even if the index file is not set. What do you think about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't sure originally whether to normalize the paths here or not, since a trace observer could use the onFilePublish events to recover the source-target mapping.

But as we integrate the index file with the cid store, I think it makes sense to normalize them in the index file.

And yes, I think we should do the onWorkflowPublish regardless of whether the index file is specified. The only complication is the index "mapper", but I plan to remove it anyway in the third preview, so you can remove it here if you want

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the workflow outputs branch based on our discussion. So you can just copy the changes from that branch if you want

pditommaso and others added 3 commits March 15, 2025 11:16
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Comment on lines +174 to +180
void annotations(Map value) {
setOption('annotations', value)
}

void annotations(Closure value) {
setOption('annotations', value)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: decide whether to use tags or phase it out in favor of annotations

@jorgee
Copy link
Contributor Author

jorgee commented Mar 18, 2025

Last changes:

  • Implemented support for retreiving output descriptions Channel.fromPath("cid://<workflow_run_cid>/outputs")
  • Workflow output cid files are added within <workflow_run_cid>/outputs to avoid the ambiguity of having an outputs with publish dir outputs
  • Outputs file is managed as an extension of CidPath that creates an InputStream or ByteChannel from the metadata description.
  • When passed to a task it is treated as a foreign file because it keeps the cid scheme and it will be serialized to a real file as a staging file
  • Added a new field in WorkflowResults creationTime to avoid several serializalizations when staging it from different tasks

jorgee added 2 commits March 19, 2025 10:00
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@pditommaso
Copy link
Member

@jorgee also some conflicts to solve here 🙏

@jorgee jorgee mentioned this pull request Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants