How to semantically describe a workflow #86

samwaseda · 2025-01-02T17:42:53Z

I talked a lot with ChatGPT over Christmas about how to semantically describe workflow steps, because in this issue I didn't address anything about nodes. I sort of came to the conclusion that what makes the most sense is to define triples for inputs and outputs for workflow nodes. Let's take the same example, where the workflow consists of two steps: structure creation for a given element, and the calculation of its energy:

def create_structure(element) -> Atoms:
    ....
    return structure

def calculate_energy(structure) -> float:
    ....
    return energy

My suggestion is to define:

create_structure - hasInput - create_structure.input.element
create_structure - hasOutput - create_structure.output.structure
calculate_energy - hasInput - calculate_energy.input.structure
calculate_energy - hasOutput - calculate_energy.output.energy
create_structure.output.structure - equalTo - calculate_energy.input.structure

And I think there should be some marking of the fact that element is the global input and energy is the global output. On top of this, we can obviously also append all the ontological information that I've talked about in this issue.
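
As a minimal sketch of the suggestion above, the triples can be generated from a plain description of the nodes and their connections, and the global input and output can then be derived as the ports that no connection feeds or consumes. Plain tuples stand in for real RDF triples here; the helper names `describe_workflow` and `global_ports` and the dictionary layout are assumptions made for illustration, not part of any existing package.

```python
Triple = tuple[str, str, str]


def describe_workflow(nodes, connections) -> list[Triple]:
    """Build hasInput/hasOutput triples per node and equalTo per connection."""
    triples: list[Triple] = []
    for name, ports in nodes.items():
        for arg in ports.get("inputs", []):
            triples.append((name, "hasInput", f"{name}.input.{arg}"))
        for arg in ports.get("outputs", []):
            triples.append((name, "hasOutput", f"{name}.output.{arg}"))
    for source, target in connections:
        triples.append((source, "equalTo", target))
    return triples


def global_ports(triples):
    """Return (global inputs, global outputs) of the workflow.

    An input is global if no equalTo triple points at it; an output is
    global if no equalTo triple starts from it.
    """
    inputs = {o for _, p, o in triples if p == "hasInput"}
    outputs = {o for _, p, o in triples if p == "hasOutput"}
    fed_inputs = {o for _, p, o in triples if p == "equalTo"}
    consumed_outputs = {s for s, p, _ in triples if p == "equalTo"}
    return inputs - fed_inputs, outputs - consumed_outputs


# Hypothetical description of the two-step example workflow
nodes = {
    "create_structure": {"inputs": ["element"], "outputs": ["structure"]},
    "calculate_energy": {"inputs": ["structure"], "outputs": ["energy"]},
}
connections = [
    ("create_structure.output.structure", "calculate_energy.input.structure"),
]

triples = describe_workflow(nodes, connections)
global_inputs, global_outputs = global_ports(triples)
```

With this description, `triples` holds the five statements listed above, and `global_ports` singles out `create_structure.input.element` and `calculate_energy.output.energy` without any extra marking by the user.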

This being said, all the input/output definition + equalTo looks extremely redundant. I guess I'm gonna try to make a prototype in the coming days, but I would appreciate it if you could leave a comment if you guys have an idea.


liamhuber · 2025-01-03T17:21:33Z

Only very shallow feedback, but: I feel like

create_structure.output.structure - equalTo - calculate_energy.input.structure

is indeed redundant: it is an instance-level feature that is (and indeed should and needs to be) obtained at the level of the graph connection, i.e. inferred from a combination of semantic typing and graph connections. What I see missing is rather the connection across the node body, something like

calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure

This is pretty straightforward, but the idea gets more complicated to me if we start considering something like, e.g., a vacancy formation energy, which is somehow a property of multiple structures.

I would also recommend that we push for examples requiring transitive behaviour, as this is going to be the hard part. This means examples of at least (and preferably at most, for now) three nodes.

liamhuber · 2025-01-03T17:23:14Z

I would note also that the existing implementation for transitive features is actually still missing something -- we can link output to upstream o-types with the transitive capability, but in the event of multiple outputs, I don't think we have the infrastructure currently to link a specific output to a specific upstream input. The next generation solution should offer this level of control.

samwaseda · 2025-01-03T20:39:40Z

calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure

I should have stated my intention clearly in my first post above (to be honest, I was just doodling because I didn't really know how to proceed). In this post, we talked about how to connect data using triples, and the case you mentioned here belongs to that discussion. I posted this issue because I started having the feeling that the knowledge graph should know the full workflow automatically, maybe in the form of

Input A - node B - Output/Input C - node D - Output/Input E - node F - Output G

At this point, there's only a workflow connection between inputs and outputs, like between A and C, and the user is free to define triples stating for example calculate_energy.output.energy - isPropertyOf - calculate_energy.input.structure, which would be then included into the knowledge graph.
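
To make the chain concrete, here is a small sketch that stores it as directed edges and recovers everything upstream of a given item by traversal, which is the kind of workflow-level connection meant between A and C. The edge list and the helper `upstream` are illustrative assumptions; a real knowledge graph would hold these links as triples rather than Python tuples.

```python
from collections import defaultdict


def upstream(edges, target):
    """Return every item from which `target` can be reached along the edges."""
    parents = defaultdict(set)
    for source, sink in edges:
        parents[sink].add(source)

    seen = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# The chain from the text, as consecutive directed edges
chain = [
    "Input A", "node B", "Output/Input C",
    "node D", "Output/Input E", "node F", "Output G",
]
edges = list(zip(chain, chain[1:]))

upstream_of_c = upstream(edges, "Output/Input C")
upstream_of_g = upstream(edges, "Output G")
```

Here `upstream_of_c` contains `Input A` and `node B`, while `upstream_of_g` contains every other item of the chain. Nothing in this traversal says what the data mean; that part is left to the user-defined triples.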

Here's maybe a more important question: Why am I doing this? Well, I'm still exploring a practical case, where the user has the possibility to look up something scientifically meaningful. A simple example is something like "What's the vacancy formation energy of a given element?". And a simple way of looking it up is to see whether a workflow has element as an input and vacancy formation energy as an output, which could be done by looking up the triples. This is still far far far far from addressing transitive properties, or even a connection between the input element and the output vacancy formation energy, but it's still something that my previous prototype cannot handle ¹.
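
A rough sketch of such a lookup, assuming each workflow comes with a list of plain-tuple triples and that ports are linked to a concept by a hypothetical `isA` predicate. The workflow names, node names and the helper `find_workflows` are made up for illustration.

```python
def find_workflows(catalog, input_concept, output_concept):
    """Return names of workflows with a matching input and a matching output."""
    matches = []
    for name, triples in catalog.items():
        inputs = {o for _, p, o in triples if p == "hasInput"}
        outputs = {o for _, p, o in triples if p == "hasOutput"}
        concept_of = {s: o for s, p, o in triples if p == "isA"}
        has_input = any(concept_of.get(port) == input_concept for port in inputs)
        has_output = any(concept_of.get(port) == output_concept for port in outputs)
        if has_input and has_output:
            matches.append(name)
    return matches


# Hypothetical catalog of two workflows and their triples
catalog = {
    "vacancy_workflow": [
        ("get_vacancy_energy", "hasInput", "get_vacancy_energy.inputs.element"),
        ("get_vacancy_energy", "hasOutput", "get_vacancy_energy.outputs.energy"),
        ("get_vacancy_energy.inputs.element", "isA", "element"),
        ("get_vacancy_energy.outputs.energy", "isA", "vacancy_formation_energy"),
    ],
    "volume_workflow": [
        ("get_volume", "hasInput", "get_volume.inputs.element"),
        ("get_volume", "hasOutput", "get_volume.outputs.volume"),
        ("get_volume.inputs.element", "isA", "element"),
        ("get_volume.outputs.volume", "isA", "volume"),
    ],
}

matches = find_workflows(catalog, "element", "vacancy_formation_energy")
```

The lookup only checks that both concepts occur in the same workflow. It does not check whether that input actually leads to that output.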

¹ This being said, it's totally unclear to me why we are doing this via a semantic graph and not a workflow graph.

samwaseda · 2025-01-09T21:59:18Z

In the meantime, there's a prototype that I implemented in this PR, that should already do some work. The example is the same as in this issue.

from ase import Atoms, build
from pyiron_ontology.parser import get_inputs_and_outputs, get_triples
from pyiron_workflow import Workflow
from rdflib import Graph, Namespace
from semantikon.typing import u

# Example namespace for the URIs and predicates used in the annotations below
EX = Namespace("http://example.org/")

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "outputs.output_structure"), uri=EX["element"])
) -> u(Atoms, uri=EX["ComputationalSample"]):
    output_structure = build.bulk(element, cubic=True)
    return output_structure

@Workflow.wrap.as_function_node
def get_volume(
    input_structure: u(Atoms, uri=EX["ComputationalSample"])
) -> u(float, units="angstrom**3", triple=(EX["IsCalculatedPropertyOf"], "inputs.input_structure"), uri=EX["volume"]):
    volume = input_structure.get_volume()
    return volume

wf = Workflow("my_workflow")

wf.my_structure = create_structure(element="Al")
wf.my_volume = get_volume(input_structure=wf.my_structure)
wf.run()

And then the knowledge graph can be retrieved via:

graph = Graph()
for key, value in wf.children.items():
    data = get_inputs_and_outputs(value)
    graph += get_triples(data, EX)

This time I'm not gonna show the diagram anymore because it's a Persian bazaar now. The important point is that now I included workflow nodes in the knowledge graph. I also switched to @liamhuber's notation (i.e. inputs.structure instead of just structure). And in the knowledge graph, the variables are given by instance_name.inputs.variable_name (if it's an input, otherwise obviously outputs instead). This avoids conflicts when the same workflow node is used multiple times in a workflow.
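
The naming scheme can be illustrated with a tiny sketch. The helper `port_label` is hypothetical and is not how the parser in the PR builds its labels; it only shows why prefixing with the instance name keeps two uses of the same node apart.

```python
def port_label(instance_name, direction, variable_name):
    """Build a label of the form instance_name.inputs.variable_name."""
    if direction not in ("inputs", "outputs"):
        raise ValueError(f"direction must be 'inputs' or 'outputs', got {direction!r}")
    return f"{instance_name}.{direction}.{variable_name}"


# Two instances of the same node function (get_volume) in one workflow;
# "other_volume" is a made-up second instance name
first = port_label("my_volume", "inputs", "input_structure")
second = port_label("other_volume", "inputs", "input_structure")

# Labels built from the function name instead would be identical
by_function = {
    port_label("get_volume", "inputs", "input_structure"),
    port_label("get_volume", "inputs", "input_structure"),
}
```

Here `first` and `second` differ, while the set built from the function name collapses to a single label.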

samwaseda added the enhancement label Jan 2, 2025

samwaseda mentioned this issue Jan 7, 2025

Add parsers #90

Draft

samwaseda mentioned this issue Jan 14, 2025

Let's talk about reasoning #92

Closed
