Specs for the ontology group #85

samwaseda · 2024-12-20T10:07:43Z

Dear @pyiron/ontology team, I wrote down very explicitly the parts that we can probably agree on quickly, so that the implementation can be done systematically. I'm painfully aware that I haven't included any of the tricky points discussed this Monday and in the following conversation on the discussion page, but I think it's good to have solid ground to build a house on. The list below is fairly rudimentary, but I don't think it's something to sneeze at: with just this information we can already extract quite a lot of ontological data.

Anyway, I would be very happy to have your comments. In particular, it would be nice if we could also think about how to include the points we discussed this week.

I already started writing a prototype here, but it doesn't strictly follow the points listed in this issue, so I will rewrite it and post the revamped version in a comment below.

Basics

(Technical) The user should provide a default namespace: EX = Namespace(my_namespace)
"An argument with a URI" <-> "An ontological object"
The label should be defined by the user; otherwise the label extracted by pyiron_workflow is used
The label defines the node (cf. next point)
The following information is automatically extracted and added to the knowledge graph:
- label: (URIRef(label), RDFS.label, Literal(label))
- uri: (URIRef(label), RDF.type, uri)
- data type?
- value (if available): (URIRef(label), EX.hasValue, Literal(value))
- units (if available): (URIRef(label), EX.hasUnit, EX[units]) (probably a different namespace for units)
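
A minimal sketch of the list above, under loud assumptions: RDF terms are represented by plain strings and tuples instead of rdflib objects, and `basic_triples`, `RDFS_LABEL`, `RDF_TYPE` and the example namespace are hypothetical names chosen for illustration only. It just shows which triples would be produced for one labelled argument.

```python
# Plain-string stand-ins for RDF terms; a real implementation would
# presumably use rdflib's URIRef, Literal, RDF and RDFS instead.
RDFS_LABEL = "rdfs:label"
RDF_TYPE = "rdf:type"
EX = "http://example.org/"  # assumed default namespace


def basic_triples(label, uri, value=None, units=None):
    """Return the triples listed above for one labelled argument."""
    triples = [
        (label, RDFS_LABEL, label),  # label
        (label, RDF_TYPE, uri),      # uri
    ]
    if value is not None:            # value (if available)
        triples.append((label, EX + "hasValue", value))
    if units is not None:            # units (if available)
        triples.append((label, EX + "hasUnit", EX + units))
    return triples


triples = basic_triples("volume", EX + "volume", value=66.43, units="angstrom**3")
for triple in triples:
    print(triple)
```

The data type is left out on purpose, since the list still marks it with a question mark.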

All the items above (URI, label, data type, units) can be specified either by a data class or by semantikon arguments, i.e. the following two cases are equivalent:

Case I

from dataclasses import dataclass
from semantikon.converter import semantikon_class
from ase import Atoms

@semantikon_class
@dataclass
class ComputationalSample:
    class Structure(Atoms):
        uri: 1234

    class Output:
        class Energy:
            uri: 5678
            units: "eV"
            dtype: float

def calculate_energy(structure: ComputationalSample.Structure) -> ComputationalSample.Output.Energy:
    ...
    return energy

Case II

from semantikon.typing import u
from ase import Atoms

def calculate_energy(structure: u(Atoms, uri=1234)) -> u(float, uri=5678, units="eV"):
    ...
    return energy
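
To illustrate why the two cases can carry the same information, here is a rough standard-library sketch. It assumes the metadata travels inside `typing.Annotated`; `annotate` is a hypothetical stand-in for `u`, and the `Energy` class and `read_return_metadata` are likewise made up for this example and are not semantikon's API.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


def annotate(type_, **metadata):
    """Hypothetical stand-in for u: attach metadata to a type hint."""
    return Annotated[type_, metadata]


class Energy:
    """Case I style: the metadata lives on a class (illustrative only)."""
    dtype = float
    uri = 5678
    units = "eV"


def energy_case_1() -> annotate(Energy.dtype, uri=Energy.uri, units=Energy.units):
    return 0.0


def energy_case_2() -> annotate(float, uri=5678, units="eV"):
    return 0.0


def read_return_metadata(func):
    """Return (data type, metadata dict) taken from the return annotation."""
    hint = get_type_hints(func, include_extras=True)["return"]
    if get_origin(hint) is not Annotated:
        return hint, {}
    dtype, metadata = get_args(hint)
    return dtype, metadata


print(read_return_metadata(energy_case_1))
print(read_return_metadata(energy_case_2))
```

Both calls yield the same data type and metadata, which is the sense in which the two notations are interchangeable in this sketch.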

Triples

Only triples define relationships between nodes, e.g. (EX["IsElementOf"], "structure"). A triple is given as (predicate, object); the annotated argument itself is the subject.
- For the object of the triple, the label must be given as a str
Triples can also define properties, e.g. (EX["HasDefect"], EX["vacancy"])
- For the object of the triple, the property must be given as a URI
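
The two kinds of triple could be told apart by the type of the object, roughly as sketched below. This is an assumption-laden illustration: `Uri` is a hypothetical stand-in for rdflib's `URIRef` (which also subclasses `str`), and `resolve_triple` and its return format are invented for this example.

```python
class Uri(str):
    """Hypothetical stand-in for rdflib.URIRef, which also subclasses str."""


EX = "http://example.org/"  # assumed namespace


def resolve_triple(subject_label, triple, known_labels):
    """Expand a (predicate, object) annotation into (kind, full triple)."""
    predicate, obj = triple
    if isinstance(obj, Uri):
        # Object given as a URI: the triple states a property of the subject.
        return "property", (subject_label, predicate, obj)
    # Object given as a plain str: it must be the label of another node.
    if obj not in known_labels:
        raise KeyError(f"unknown label: {obj!r}")
    return "relation", (subject_label, predicate, obj)


labels = {"element", "structure"}
relation = resolve_triple("element", (Uri(EX + "IsElementOf"), "structure"), labels)
prop = resolve_triple("structure", (Uri(EX + "HasDefect"), Uri(EX + "vacancy")), labels)
print(relation)
print(prop)
```

The URI check has to come before the label lookup because a `Uri` is also a `str`.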

Parsing

The parsing of the information above is done without an additional parser, i.e. pyiron provides a tool (probably a function) to automatically extract the ontological information mentioned above.
The extracted information is given in the form of a knowledge graph


samwaseda · 2024-12-20T13:39:09Z

This time (in contrast to my last attempts), I strictly followed the text above and wrote a prototype. It's again a workflow to calculate the volume of a chemical element, which involves two steps: element -> structure -> volume.

Parser

from rdflib import Graph, Literal, RDF, URIRef, RDFS, Namespace
from semantikon.converter import parse_input_args, parse_output_args

EX = Namespace("http://example.org/")

def get_inputs_and_outputs(node):
    """
    Read input and output arguments with their type hints and return a dictionary containing all input output information

    Args:
        node (pyiron_workflow.nodes.Node): node to be parsed

    Returns:
        (dict): dictionary containing input output args, type hints, values and variable names
    """
    inputs = parse_input_args(node.node_function)
    outputs = parse_output_args(node.node_function)
    if isinstance(outputs, dict):
        outputs = (outputs, )
    outputs = {key: out for key, out in zip(node.outputs.labels, outputs)}
    for key, value in node.inputs.to_value_dict().items():
        inputs[key]["value"] = value
        inputs[key]["var_name"] = key
    for key, value in node.outputs.to_value_dict().items():
        outputs[key]["value"] = value
        outputs[key]["var_name"] = key
    return {"input": inputs, "output": outputs}

def get_node_dict(io_dict):
    """
    Translate the dictionary returned by get_inputs_and_outputs into
    the one that contains keys with their labels (or variable names
    whenever labels are not available) and the values with the dict
    content
    """
    results = {}
    for io_ in ["input", "output"]:
        for key, value in io_dict[io_].items():
            if "uri" not in value:
                continue
            if value["label"] is not None:
                results[value["label"]] = value
            else:
                results[key] = value
    return results

def node_to_knowledge_graph(node, graph=None, EX=EX):
    """Translate a node into a knowledge graph"""
    if graph is None:
        graph = Graph()
    io_dict = get_inputs_and_outputs(node)
    node_dict = get_node_dict(io_dict)
    for key, info in node_dict.items():
        label = URIRef(key)
        label_def_triple = (label, RDFS.label, Literal(key))
        # Add the defining triples only once per label
        if label_def_triple not in graph:
            graph.add(label_def_triple)
            graph.add((label, RDF.type, info["uri"]))
            graph.add((label, EX.HasValue, Literal(info["value"])))
            if info["units"] is not None:
                graph.add((label, EX.HasUnits, EX[info["units"]]))
        # The triple annotation is (predicate, label of the object node)
        if info["triple"] is not None:
            graph.add((label, info["triple"][0], URIRef(info["triple"][1])))
    return graph

def workflow_to_knowledge_graph(wf, graph=None, EX=EX):
    if graph is None:
        graph = Graph()
    for node in wf.children.values():
        graph = node_to_knowledge_graph(node=node, graph=graph, EX=EX)
    return graph

User definition

from pyiron_workflow import Workflow
from semantikon.typing import u
from ase import Atoms, build

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "structure"), uri=EX["element"])
) -> u(Atoms, uri=EX["ComputationalSample"]):
    structure = build.bulk(element, cubic=True)
    return structure

@Workflow.wrap.as_function_node
def get_volume(
    structure: u(Atoms, uri=EX["ComputationalSample"])
) -> u(float, units="angstrom**3", triple=(EX["IsCalculatedPropertyOf"], "structure"), uri=EX["volume"]):
    volume = structure.get_volume()
    return volume

wf = Workflow("my_workflow")

wf.structure = create_structure(element="Al")
wf.volume = get_volume(structure=wf.structure)
wf.run()

Parsing

graph = workflow_to_knowledge_graph(wf)
print(list(graph.triples(3 * (None, ))))

Output:

[(rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/element')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/HasUnits'),
  rdflib.term.URIRef('http://example.org/angstrom**3')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('structure')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('element')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://example.org/IsElementOf'),
  rdflib.term.URIRef('structure')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal('66.43012500000002', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double'))),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/IsCalculatedPropertyOf'),
  rdflib.term.URIRef('structure')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('volume')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal("Atoms(symbols='Al4', pbc=True, cell=[4.05, 4.05, 4.05])")),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/volume')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/ComputationalSample')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal('Al'))]

liamhuber · 2024-12-20T15:54:14Z

Nice work, @samwaseda. At a first read it looks good to me. The only clear problem I see is that the io parsing needs to specify more precisely whether a reference to another object is referencing an input or output object. If I'm reading this correctly, labels are scraped from the function, all the labels are lumped together, and then we look for our reference. Since there's nothing stopping us from having the same label as both an input and output, this is thus not fully defined.
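
The ambiguity can be reproduced with a few lines of plain Python. The sketch below is not the prototype itself: `lump_by_label` is a hypothetical simplification of what `get_node_dict` does, and the dictionaries are made-up sample data.

```python
def lump_by_label(io_dict):
    """Merge inputs and outputs into one dict keyed by label (or variable name)."""
    merged = {}
    for channel in ("input", "output"):
        for key, info in io_dict[channel].items():
            # Same key for an input and an output: the later one wins silently.
            merged[info.get("label") or key] = {"channel": channel, **info}
    return merged


io_dict = {
    "input": {"structure": {"label": None, "uri": "ex:ComputationalSample"}},
    "output": {"structure": {"label": None, "uri": "ex:ComputationalSample"}},
}
merged = lump_by_label(io_dict)
print(len(merged))                      # 1
print(merged["structure"]["channel"])   # output
```

Only one of the two "structure" entries survives, so a reference to "structure" can no longer say which one is meant.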

I reserve the right to find other complaints later 😝 but so far this is the only thing that I see as definitively problematic. This is a nice strong foundation.

samwaseda · 2024-12-20T16:24:02Z

The only clear problem I see is that the io parsing needs to specify more precisely whether a reference to another object is referencing an input or output object. If I'm reading this correctly, labels are scraped from the function, all the labels are lumped together, and then we look for our reference. Since there's nothing stopping us from having the same label as both an input and output, this is thus not fully defined.

Ah it's true that I removed label="structure" etc. in the example above just to test the code, but yes, otherwise people should write:

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "structure"), uri=EX["element"], label="element")
) -> u(Atoms, uri=EX["ComputationalSample"], label="structure"):
    structure = build.bulk(element, cubic=True)
    return structure

And I'm kind of hoping that the introduction of label enables the user to specify whether it's an input or an output by choosing distinct labels, i.e. they could say something like

@Workflow.wrap.as_function_node
def some_transformation(
    structure: u(Atoms, uri=EX["ComputationalSample"], label="input.structure")
) -> u(Atoms, uri=EX["ComputationalSample"], label="output.structure"):
    ...
    return structure

liamhuber · 2024-12-20T16:57:42Z

I think we can probably arrange for the label itself to be scraped from the signature (or from io_preview() later, once this is powered up to capture the entire semantikon annotation); it's only the references to these labels that I believe need to be scoped, as in your second example with "input.structure".
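
One way such scoped references might be resolved is sketched below. Everything here is an assumption for illustration: the `input.` / `output.` prefixes follow the example earlier in the thread, and `resolve_reference` and the sample dictionary are hypothetical, not part of semantikon or pyiron_workflow.

```python
def resolve_reference(reference, io_dict):
    """Look up a scoped reference such as "input.structure"."""
    scope, sep, label = reference.partition(".")
    if not sep or scope not in ("input", "output"):
        raise ValueError(f"reference must be scoped: {reference!r}")
    try:
        return io_dict[scope][label]
    except KeyError:
        raise KeyError(f"no {scope} labelled {label!r}") from None


io_dict = {
    "input": {"structure": {"uri": "ex:ComputationalSample", "role": "before"}},
    "output": {"structure": {"uri": "ex:ComputationalSample", "role": "after"}},
}
print(resolve_reference("input.structure", io_dict)["role"])
print(resolve_reference("output.structure", io_dict)["role"])
```

With inputs and outputs kept in separate dictionaries, the same bare label can appear on both sides without a collision; only the reference needs the scope.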

samwaseda added the enhancement New feature or request label Dec 20, 2024

samwaseda mentioned this issue Dec 20, 2024

How to connect inputs and outputs #84

Closed

This was referenced Jan 2, 2025

How to semantically describe a workflow #86

Open

Add parsers #90

Draft

samwaseda mentioned this issue Jan 9, 2025

How to compose a workflow automatically? #91

Closed
