Specs for the ontology group #85

Open
samwaseda opened this issue Dec 20, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@samwaseda
Member

samwaseda commented Dec 20, 2024

Dear @pyiron/ontology team, I wrote down very explicitly the parts that we can probably agree on quickly, so that the implementation can be done systematically. I'm painfully aware that I haven't included any of the tricky points discussed this Monday and in the following conversation on the discussion page, but I think it's good to have solid ground to build a house on. The list below is fairly rudimentary, but I don't think it's anything to sneeze at: with just this information we can already extract a considerable amount of ontological data.

Anyway, I would be very happy to have your comments. In particular, it would be nice if we could also think about how to include the points we discussed this week.

I already started writing a prototype here, but it doesn't strictly follow the points listed in this issue, so I will rewrite it and post the revamped version below.

Basics

  • (Technical) The user should provide a default namespace, e.g. EX = Namespace(my_namespace)
  • "An argument with a URI" <-> "An ontological object"
  • The label should be defined by the user, otherwise the label extracted by pyiron_workflow is used
  • The label defines the node (cf. next point)
  • The following information is automatically extracted and added to the knowledge graph
    • label: (URIRef(label), RDFS.label, Literal(label))
    • uri: (URIRef(label), RDF.type, uri)
    • data type?
    • value (if available) (URIRef(label), EX.hasValue, Literal(value))
    • units (if available) (URIRef(label), EX.hasUnit, EX[units]) (probably a different namespace for units)
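
The extraction rules in this list can be sketched without any dependencies. This is only an illustration: plain strings stand in for rdflib's URIRef and Literal, and the helper name `argument_to_triples`, the `Energy` URI and the example value are assumptions, not part of pyiron or semantikon.

```python
# Illustrative sketch only: plain strings replace rdflib terms, and
# `argument_to_triples` is a hypothetical helper, not a pyiron/semantikon API.
EX = "http://example.org/"  # assumed default namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def argument_to_triples(label, uri, value=None, units=None):
    """Return the triples listed in the spec for one annotated argument."""
    triples = [
        (label, RDFS_LABEL, label),  # label
        (label, RDF_TYPE, uri),      # uri
    ]
    if value is not None:
        triples.append((label, EX + "hasValue", value))
    if units is not None:
        triples.append((label, EX + "hasUnit", EX + units))
    return triples


# Arbitrary example argument; the value 1.0 carries no physical meaning.
energy_triples = argument_to_triples(
    label="energy", uri=EX + "Energy", value=1.0, units="eV"
)
for triple in energy_triples:
    print(triple)
```

The label serves as the subject of every triple, which is what "the label defines the node" amounts to in practice.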

All the items above (URI, label, data type, units) can be specified either by a data class or by semantikon arguments, i.e. the following two cases are equivalent:

Case I

from dataclasses import dataclass
from semantikon.converter import semantikon_class
from ase import Atoms

@semantikon_class
@dataclass
class ComputationalSample:
    class Structure(Atoms):
        uri: 1234

    class Output:
        class Energy:
            uri: 5678
            units: "eV"
            dtype: float

def calculate_energy(structure: ComputationalSample.Structure) -> ComputationalSample.Output.Energy:
    ...
    return energy

Case II

from semantikon.typing import u
from ase import Atoms

def calculate_energy(structure: u(Atoms, uri=1234)) -> u(float, uri=5678, units="eV"):
    ...
    return energy
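
Both cases attach metadata to a type hint. As a rough illustration of how such metadata can travel with an annotation and be read back, here is a sketch that uses only the standard library's `typing.Annotated`. The helper `annotate` is a hypothetical stand-in and not semantikon's `u`, `read_metadata` is likewise made up for this example, and `FakeAtoms` replaces `ase.Atoms` so that the code has no dependencies.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class FakeAtoms:
    """Placeholder for ase.Atoms so the sketch runs without ase."""


def annotate(type_, **metadata):
    """Hypothetical stand-in for a `u`-like helper: attach a metadata dict."""
    return Annotated[type_, metadata]


def calculate_energy(
    structure: annotate(FakeAtoms, uri=1234),
) -> annotate(float, uri=5678, units="eV"):
    ...


def read_metadata(func):
    """Collect {argument name: (type, metadata)} for all annotated hints."""
    result = {}
    for name, hint in get_type_hints(func, include_extras=True).items():
        if get_origin(hint) is Annotated:
            type_, metadata = get_args(hint)
            result[name] = (type_, metadata)
    return result


metadata = read_metadata(calculate_energy)
print(metadata["structure"])
print(metadata["return"])
```

The return annotation shows up under the key `"return"`, so input and output metadata can be read through the same mechanism.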

Triples

  • Relationships between nodes are defined only via triples, e.g. (EX["IsElementOf"], "structure")
    • In this case, the object of the triple must be given as the label (str) of the other node
  • Triples can also define properties, e.g. (EX["HasDefect"], EX["vacancy"])
    • In this case, the object of the triple must be given as a URI
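
One possible way to tell the two kinds of triple objects apart is by type. The sketch below assumes this convention; `URI` is a hypothetical stand-in for rdflib's URIRef (which is also a str subclass), and `describe_triple` is a made-up helper rather than an existing function.

```python
class URI(str):
    """Hypothetical stand-in for rdflib.URIRef, which also subclasses str."""


EX = "http://example.org/"  # assumed namespace


def describe_triple(subject_label, triple):
    """Classify a (predicate, object) pair attached to `subject_label`."""
    predicate, obj = triple
    if isinstance(obj, URI):
        kind = "property"      # object is an ontological term
    elif isinstance(obj, str):
        kind = "relationship"  # object is the label of another node
    else:
        raise TypeError("triple object must be a label (str) or a URI")
    return {
        "subject": subject_label,
        "predicate": predicate,
        "object": obj,
        "kind": kind,
    }


relation = describe_triple("element", (URI(EX + "IsElementOf"), "structure"))
defect = describe_triple("structure", (URI(EX + "HasDefect"), URI(EX + "vacancy")))
print(relation["kind"], defect["kind"])
```

The URI check has to come first, because a URI is itself a string and would otherwise be mistaken for a label.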

Parsing

  • The parsing of the information above is done without an additional parser, i.e. pyiron provides a tool (probably a function) to automatically extract the ontological information mentioned above.
  • The extracted information is given in the form of a knowledge graph
samwaseda added the enhancement label Dec 20, 2024
@samwaseda
Member Author

This time (in contrast to my last attempts), I strictly followed the text above and wrote a prototype. It's again a workflow to calculate the volume of a chemical element, which involves two steps: element -> structure -> volume.

Parser

from rdflib import Graph, Literal, RDF, URIRef, RDFS, Namespace
from semantikon.converter import parse_input_args, parse_output_args

EX = Namespace("http://example.org/")

def get_inputs_and_outputs(node):
    """
    Read input and output arguments with their type hints and return a dictionary containing all input output information

    Args:
        node (pyiron_workflow.nodes.Node): node to be parsed

    Returns:
        (dict): dictionary containing input output args, type hints, values and variable names
    """
    inputs = parse_input_args(node.node_function)
    outputs = parse_output_args(node.node_function)
    if isinstance(outputs, dict):
        outputs = (outputs, )
    outputs = {key: out for key, out in zip(node.outputs.labels, outputs)}
    for key, value in node.inputs.to_value_dict().items():
        inputs[key]["value"] = value
        inputs[key]["var_name"] = key
    for key, value in node.outputs.to_value_dict().items():
        outputs[key]["value"] = value
        outputs[key]["var_name"] = key
    return {"input": inputs, "output": outputs}

def get_node_dict(io_dict):
    """
    Translate the dictionary returned by get_inputs_and_outputs into
    the one that contains keys with their labels (or variable names
    whenever labels are not available) and the values with the dict
    content
    """
    results = {}
    for io_ in ["input", "output"]:
        for key, value in io_dict[io_].items():
            if "uri" not in value:
                continue
            if value["label"] is not None:
                results[value["label"]] = value
            else:
                results[key] = value
    return results

def node_to_knowledge_graph(node, graph=None, EX=EX):
    """Translate a node into a knowledge graph"""
    if graph is None:
        graph = Graph()
    io_dict = get_inputs_and_outputs(node)
    node_dict = get_node_dict(io_dict)
    for key, content in node_dict.items():
        label = URIRef(key)
        label_def_triple = (label, RDFS.label, Literal(key))
        # Add the defining triples only once per label
        if len(list(graph.triples(label_def_triple))) == 0:
            graph.add(label_def_triple)
            graph.add((label, RDF.type, content["uri"]))
            graph.add((label, EX.HasValue, Literal(content["value"])))
            if content["units"] is not None:
                graph.add((label, EX.HasUnits, EX[content["units"]]))
        if content["triple"] is not None:
            graph.add((label, content["triple"][0], URIRef(content["triple"][1])))
    return graph

def workflow_to_knowledge_graph(wf, graph=None, EX=EX):
    """Translate all child nodes of a workflow into one knowledge graph"""
    if graph is None:
        graph = Graph()
    for node in wf.children.values():
        graph = node_to_knowledge_graph(node=node, graph=graph, EX=EX)
    return graph

User definition

from pyiron_workflow import Workflow
from semantikon.typing import u
from ase import Atoms, build

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "structure"), uri=EX["element"])
) -> u(Atoms, uri=EX["ComputationalSample"]):
    structure = build.bulk(element, cubic=True)
    return structure

@Workflow.wrap.as_function_node
def get_volume(
    structure: u(Atoms, uri=EX["ComputationalSample"])
) -> u(float, units="angstrom**3", triple=(EX["IsCalculatedPropertyOf"], "structure"), uri=EX["volume"]):
    volume = structure.get_volume()
    return volume

wf = Workflow("my_workflow")

wf.structure = create_structure(element="Al")
wf.volume = get_volume(structure=wf.structure)
wf.run()

Parsing

graph = workflow_to_knowledge_graph(wf)
print(list(graph.triples(3 * (None, ))))

Output:

[(rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/element')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/HasUnits'),
  rdflib.term.URIRef('http://example.org/angstrom**3')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('structure')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('element')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://example.org/IsElementOf'),
  rdflib.term.URIRef('structure')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal('66.43012500000002', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double'))),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://example.org/IsCalculatedPropertyOf'),
  rdflib.term.URIRef('structure')),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
  rdflib.term.Literal('volume')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal("Atoms(symbols='Al4', pbc=True, cell=[4.05, 4.05, 4.05])")),
 (rdflib.term.URIRef('volume'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/volume')),
 (rdflib.term.URIRef('structure'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://example.org/ComputationalSample')),
 (rdflib.term.URIRef('element'),
  rdflib.term.URIRef('http://example.org/HasValue'),
  rdflib.term.Literal('Al'))]

@liamhuber
Member

Nice work, @samwaseda. At first read it looks good to me. The only clear problem I see is that the io parsing needs to specify more precisely whether a reference to another object refers to an input or an output object. If I'm reading this correctly, labels are scraped from the function, all the labels are lumped together, and then we look for our reference. Since there's nothing stopping us from having the same label as both an input and an output, the reference is not fully defined.
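
The ambiguity can be reproduced with plain dictionaries. The sketch below mimics the merging step of get_node_dict in a simplified form; the helpers `merge_flat` and `merge_scoped`, the `inputs.`/`outputs.` prefixes and the example values are assumptions for illustration, not existing pyiron code.

```python
# Simplified stand-in for the io dictionary of a node whose input and
# output share the label "structure" (URIs and values are made-up examples).
io_dict = {
    "input": {
        "structure": {"uri": "http://example.org/ComputationalSample", "value": "initial"}
    },
    "output": {
        "structure": {"uri": "http://example.org/ComputationalSample", "value": "relaxed"}
    },
}


def merge_flat(io_dict):
    """Lump all labels together, as in the prototype: later entries win."""
    merged = {}
    for scope in ("input", "output"):
        for label, content in io_dict[scope].items():
            merged[label] = content
    return merged


def merge_scoped(io_dict):
    """Prefix every label with its scope so that nothing is overwritten."""
    merged = {}
    for scope in ("input", "output"):
        for label, content in io_dict[scope].items():
            merged[f"{scope}s.{label}"] = content
    return merged


flat = merge_flat(io_dict)
scoped = merge_scoped(io_dict)
print(sorted(flat))    # a single key: the input entry was overwritten
print(sorted(scoped))  # both entries survive
```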

I reserve the right to find other complaints later 😝 but so far this is the only thing that I see as definitively problematic. This is a nice strong foundation.

@samwaseda
Member Author

The only clear problem I see is that the io parsing needs to specify more precisely whether a reference to another object refers to an input or an output object. If I'm reading this correctly, labels are scraped from the function, all the labels are lumped together, and then we look for our reference. Since there's nothing stopping us from having the same label as both an input and an output, the reference is not fully defined.

Ah it's true that I removed label="structure" etc. in the example above just to test the code, but yes, otherwise people should write:

@Workflow.wrap.as_function_node
def create_structure(
    element: u(str, triple=(EX["IsElementOf"], "structure"), uri=EX["element"], label="element")
) -> u(Atoms, uri=EX["ComputationalSample"], label="structure"):
    structure = build.bulk(element, cubic=True)
    return structure

And I'm kind of hoping that the introduction of label enables the user to specify whether it's an input or an output by choosing distinct labels, i.e. they could write something like

@Workflow.wrap.as_function_node
def some_transformation(
    structure: u(Atoms, uri=EX["ComputationalSample"], label="input.structure")
) -> u(Atoms, uri=EX["ComputationalSample"], label="output.structure"):
    ...
    return structure

@liamhuber
Member

I think we can probably make it work so that the label itself is scraped from the signature (or from io_preview() later, once this is powered up to capture the entire semantikon annotation). Rather, it's only references to these labels that I believe need to be scoped, like in your second example with "inputs.structure".
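
A scoped reference of this kind could be resolved roughly as follows. This is a sketch under assumptions: the function `resolve_reference`, the `inputs.`/`outputs.` prefix convention and the `<node>.<scope>.<label>` identifier format are hypothetical and not taken from pyiron_workflow or semantikon.

```python
def resolve_reference(reference, node_label, io_labels):
    """
    Turn a scoped reference such as "inputs.structure" into a unique
    identifier of the form "<node>.<scope>.<label>".

    `io_labels` maps each scope ("inputs"/"outputs") to the labels scraped
    from the function signature. All names here are hypothetical.
    """
    scope, _, label = reference.partition(".")
    if scope not in io_labels or not label:
        raise ValueError(
            f"reference {reference!r} must start with 'inputs.' or 'outputs.'"
        )
    if label not in io_labels[scope]:
        raise KeyError(f"{label!r} is not among the {scope} of {node_label!r}")
    return f"{node_label}.{scope}.{label}"


# The same label on both sides is no longer ambiguous once it is scoped.
io_labels = {"inputs": {"structure"}, "outputs": {"structure"}}
source = resolve_reference("inputs.structure", "some_transformation", io_labels)
target = resolve_reference("outputs.structure", "some_transformation", io_labels)
print(source)
print(target)
```

With this split, the labels themselves stay unscoped in the signature and only the references carry the scope.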
