How to semantically describe a workflow #86

Open
samwaseda opened this issue Jan 2, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@samwaseda
Member

I talked a lot with ChatGPT over Christmas about how to semantically describe workflow steps, because in this issue I didn't address anything about nodes. I sort of came to the conclusion that what makes the most sense is to define triples for inputs and outputs for workflow nodes. Let's take the same example, where the workflow consists of two steps: structure creation for a given element, and the calculation of its energy:

def create_structure(element) -> Atoms:
    ...
    return structure

def calculate_energy(structure) -> float:
    ...
    return energy

My suggestion is to define:

  • create_structure - hasInput - create_structure.input.element
  • create_structure - hasOutput - create_structure.output.structure
  • calculate_energy - hasInput - calculate_energy.input.structure
  • calculate_energy - hasOutput - calculate_energy.output.energy
  • create_structure.output.structure - equalTo - calculate_energy.input.structure

And I think there should be some marking of the fact that element is the global input and energy is the global output. On top of this, we can obviously also append all the ontological information that I've talked about in this issue.
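As a rough sketch of the triples listed above, plain Python tuples can stand in for RDF triples. The helper names `workflow_triples` and `global_ports` are hypothetical, and the rule used to mark global ports (an input that no `equalTo` feeds, an output that no `equalTo` consumes) is only one possible assumption:

```python
# Plain tuples stand in for RDF triples; all helper names are illustrative.
nodes = {
    "create_structure": {"inputs": ["element"], "outputs": ["structure"]},
    "calculate_energy": {"inputs": ["structure"], "outputs": ["energy"]},
}
# Each connection is (upstream output port, downstream input port).
connections = [
    ("create_structure.output.structure", "calculate_energy.input.structure"),
]


def workflow_triples(nodes, connections):
    """Build hasInput / hasOutput / equalTo triples for a workflow."""
    triples = []
    for name, ports in nodes.items():
        for arg in ports["inputs"]:
            triples.append((name, "hasInput", f"{name}.input.{arg}"))
        for arg in ports["outputs"]:
            triples.append((name, "hasOutput", f"{name}.output.{arg}"))
    for source, target in connections:
        triples.append((source, "equalTo", target))
    return triples


def global_ports(triples):
    """Assumed rule: unconnected inputs and outputs are the global ones."""
    fed = {o for _, p, o in triples if p == "equalTo"}
    consumed = {s for s, p, _ in triples if p == "equalTo"}
    inputs = [o for _, p, o in triples if p == "hasInput" and o not in fed]
    outputs = [
        o for _, p, o in triples if p == "hasOutput" and o not in consumed
    ]
    return inputs, outputs


triples = workflow_triples(nodes, connections)
global_inputs, global_outputs = global_ports(triples)
```

With this rule the global markers do not need to be declared by hand, although whether they should be stored as explicit triples is left open here.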

This being said, all the input/output definition + equalTo looks extremely redundant. I guess I'm gonna try to make a prototype in the coming days, but I would appreciate it if you could leave a comment if you guys have an idea.

@samwaseda added the enhancement (New feature or request) label Jan 2, 2025
@liamhuber
Member

Only very shallow feedback, but: I feel like

create_structure.output.structure - equalTo - calculate_energy.input.structure

is indeed redundant: it is an instance-level feature that gets (and indeed should and needs to be) obtained at the level of the graph connection, i.e. inferred from a combination of semantic typing and graph connections. What I see missing is rather the connection across the node body, something like

calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure

This is pretty straightforward, but the idea gets more complicated to me if we start considering something like, e.g., a vacancy formation energy, which is somehow a property of multiple structures.

I would also recommend that we push for examples requiring transitive behaviour, as this is going to be the hard part. This means examples of at least (and preferably at most, for now) three nodes.

@liamhuber
Member

I would note also that the existing implementation for transitive features is actually still missing something -- we can link output to upstream o-types with the transitive capability, but in the event of multiple outputs, I don't think we have the infrastructure currently to link a specific output to a specific upstream input. The next generation solution should offer this level of control.
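To make the "specific output to specific upstream input" idea concrete, here is a minimal sketch, not the existing implementation. It assumes a hypothetical three-node chain with a `relax` node that has two outputs, a `body_links` mapping that records which inputs each individual output depends on inside a node, and a `connections` mapping for the graph edges:

```python
# Hypothetical three-node chain; every name below is illustrative.
# Body links: which inputs each specific output depends on inside a node.
body_links = {
    "create_structure.output.structure": ["create_structure.input.element"],
    "relax.output.structure": ["relax.input.structure"],
    "relax.output.n_steps": ["relax.input.max_steps"],
    "calculate_energy.output.energy": ["calculate_energy.input.structure"],
}
# Graph connections: downstream input port -> upstream output port.
connections = {
    "relax.input.structure": "create_structure.output.structure",
    "calculate_energy.input.structure": "relax.output.structure",
}


def upstream_inputs(port, body_links, connections):
    """Return every input port that `port` transitively depends on."""
    found = []
    stack = [port]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in body_links:
            # An output port: step across the node body.
            stack.extend(body_links[current])
        else:
            # An input port: record it, then follow its connection, if any.
            found.append(current)
            if current in connections:
                stack.append(connections[current])
    return found


energy_sources = upstream_inputs(
    "calculate_energy.output.energy", body_links, connections
)
```

Because the body links are stored per output, `relax.input.max_steps` never shows up among the sources of the energy, even though it is an input of a node on the path.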

@samwaseda
Member Author

calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure

I should have stated my intention clearly in my first post above (to be honest, I was just doodling because I didn't really know how to proceed). In this post, we talked about how to connect data using triples, which is the category the case you mentioned here belongs to. I opened this issue because I started having the feeling that the knowledge graph should know the full workflow automatically, maybe in the form of

Input A - node B - Output/Input C - node D - Output/Input E - node F - Output G

At this point, there's only a workflow connection between inputs and outputs, like between A and C, and the user is free to define triples stating for example calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure, which would then be included in the knowledge graph.

Here's maybe a more important question: Why am I doing this? Well, I'm still exploring a practical case, where the user has the possibility to look up something scientifically meaningful. A simple example is something like "What's the vacancy formation energy of a given element?". And a simple way of looking it up is to see whether a workflow has element as an input and vacancy formation energy as an output, which could be done by looking up the triples. This is still far far far far from addressing transitive properties, or even a connection between the input element and the output vacancy formation energy, but it's still something that my previous prototype cannot handle [1].
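The lookup described above could be sketched roughly as follows, again with plain tuples in place of RDF triples. The predicates `hasGlobalInput`, `hasGlobalOutput` and `hasType`, the workflow names, and the helper functions are all assumptions made up for illustration:

```python
# Hypothetical triples for two workflows; all predicates are assumed names.
triples = [
    ("wf_vacancy", "hasGlobalInput", "wf_vacancy.input.element"),
    ("wf_vacancy.input.element", "hasType", "element"),
    ("wf_vacancy", "hasGlobalOutput", "wf_vacancy.output.energy"),
    ("wf_vacancy.output.energy", "hasType", "vacancy_formation_energy"),
    ("wf_volume", "hasGlobalInput", "wf_volume.input.element"),
    ("wf_volume.input.element", "hasType", "element"),
    ("wf_volume", "hasGlobalOutput", "wf_volume.output.volume"),
    ("wf_volume.output.volume", "hasType", "volume"),
]


def port_types(triples, predicate):
    """Map each workflow to the set of types of its ports for `predicate`."""
    types = {s: o for s, p, o in triples if p == "hasType"}
    result = {}
    for s, p, o in triples:
        if p == predicate:
            result.setdefault(s, set()).add(types.get(o))
    return result


def find_workflows(triples, input_type, output_type):
    """Return workflows with the given global input and output types."""
    inputs = port_types(triples, "hasGlobalInput")
    outputs = port_types(triples, "hasGlobalOutput")
    return sorted(
        wf
        for wf in inputs
        if input_type in inputs[wf] and output_type in outputs.get(wf, set())
    )


matches = find_workflows(triples, "element", "vacancy_formation_energy")
```

This only matches on the types of the global ports; it says nothing about how the output depends on the input, which is exactly the missing transitive part.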

Footnotes

  1. This being said, it's totally unclear to me why we are doing this via a semantic graph and not a workflow graph.

@samwaseda mentioned this issue Jan 7, 2025
@samwaseda
Member Author

In the meantime, there's a prototype that I implemented in this PR, that should already do some work. The example is the same as in this issue.

from ase import Atoms, build
from pyiron_ontology.parser import get_inputs_and_outputs, get_triples
from pyiron_workflow import Workflow
from rdflib import Graph, Namespace
from semantikon.typing import u

# Example namespace for the URIs used below
EX = Namespace("http://example.org/")

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "outputs.output_structure"), uri=EX["element"])
) -> u(Atoms, uri=EX["ComputationalSample"]):
    output_structure = build.bulk(element, cubic=True)
    return output_structure

@Workflow.wrap.as_function_node
def get_volume(
    input_structure: u(Atoms, uri=EX["ComputationalSample"])
) -> u(float, units="angstrom**3", triple=(EX["IsCalculatedPropertyOf"], "inputs.input_structure"), uri=EX["volume"]):
    volume = input_structure.get_volume()
    return volume

wf = Workflow("my_workflow")

wf.my_structure = create_structure(element="Al")
wf.my_volume = get_volume(input_structure=wf.my_structure)
wf.run()

And then the knowledge graph can be retrieved via:

graph = Graph()
for key, value in wf.children.items():
    data = get_inputs_and_outputs(value)
    graph += get_triples(data, EX)

This time I'm not gonna show the diagram anymore because it's a Persian bazaar now. The important point is that workflow nodes are now included in the knowledge graph. I also switched to @liamhuber's notation (i.e. inputs.structure instead of just structure). In the knowledge graph, the variables are named instance_name.inputs.variable_name for inputs and instance_name.outputs.variable_name for outputs. This avoids a conflict when the same workflow node is used multiple times in a workflow.
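A small illustration of why the instance-based naming avoids conflicts. The helper `port_label` is hypothetical and not part of the prototype; it only mimics the naming pattern described above:

```python
def port_label(instance_name, io, variable_name):
    """Build a label of the form instance_name.inputs.variable_name."""
    if io not in ("inputs", "outputs"):
        raise ValueError(f"io must be 'inputs' or 'outputs', got {io!r}")
    return f"{instance_name}.{io}.{variable_name}"


# Two instances of the same node definition (instance names are illustrative).
instances = ["my_structure", "my_other_structure"]
labels = [port_label(name, "inputs", "element") for name in instances]
```

Labelling by the function name instead (create_structure.inputs.element) would give both instances the same label, so their triples would collapse onto a single entry in the graph.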


2 participants