
pyiron_workflow-independent dict containing workflow info #576

Open
samwaseda opened this issue Feb 5, 2025 · 39 comments · May be fixed by #586
Labels
enhancement New feature or request

Comments

@samwaseda
Member

Note: I will work on this myself. You can just tell me what you think @liamhuber

Currently, we have graph_as_dict, which returns a dict containing all the information about the workflow. However, when we try to make semantikon work with any workflow manager, the most natural approach for me is to ask every workflow manager to provide a dictionary, so that semantikon does not have to depend on the workflow manager. In addition, that would be consistent with the effort that @jan-janssen is making here.

I was initially thinking we should extend @jan-janssen's universal representation of workflows by adding the optional args required by semantikon, and potentially also allowing nodes_dict to contain sub-nodes to also represent Macro nodes. However, I started to realize that this might not be the right way, because:

  • There's no consensus on how macro nodes or metadata should be defined
  • (Maybe more importantly) @jan-janssen's universal representation specifies the numerical logic of execution and in this regard it is conceptually complete

So my first suggestion is to separate the semantikon dict from the universal representation dict, so that it can represent the workflow including the scientific logic.

Coming to the actual content, semantikon would need:

  • IO-values
  • Edges
  • Unique node labels
  • IO labels
  • Underlying python function
  • (Optionally) type hints

The type hints are optional because semantikon can parse them itself, but that could be a bit dangerous, since it might fail to assign the types to the IO labels. The reason we need the original python function is that, since this PR, the function metadata is attached to f._semantikon_metadata, which I would not ask the workflow manager to provide.
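
For illustration, most of these ingredients can be scraped from a plain annotated function with the standard library; a minimal sketch (the dict layout here is only an example, not a proposed format):

import inspect
from typing import get_type_hints

def node_to_dict(function):
    """Collect the per-node ingredients listed above from a plain function.
    Output labels are deliberately missing -- those would still have to come
    from the workflow manager."""
    signature = inspect.signature(function)
    return {
        "function": function,
        "inputs": tuple(signature.parameters),
        # keep Annotated metadata such as semantikon's u(...) wrappers
        "type_hints": get_type_hints(function, include_extras=True),
        # attached by semantikon's decorators as mentioned above; None otherwise
        "metadata": getattr(function, "_semantikon_metadata", None),
    }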

Concerning the Macro nodes, I was testing the following code:

@Workflow.wrap.as_function_node
def add_one(a: int):
    result = a + 1
    return result

@Workflow.wrap.as_function_node
def add_two(b: int):
    result = b + 2
    return result

@Workflow.wrap.as_macro_node
def add_three(macro, c: int):
    macro.one = add_one(a=c)
    macro.two = add_two(b=macro.one)
    result = macro.two
    return result

wf = Workflow("my_wf")
wf.three = add_three(c=1)
wf.four = add_one(a=wf.three)
wf.run()

Then I realized that wf.graph_as_dict explicitly does not provide the connection between wf.three.inputs.c and wf.three.one.inputs.a, nor between wf.three.two.outputs.result and wf.three.outputs.result. They may not be edges in the sense of workflow edges, but they still somehow have to be included in the dict.

Well, thanks for reading, but actually I'm probably the one who profited from it the most as it helped me figure out what I actually want XD

@samwaseda samwaseda added the enhancement New feature or request label Feb 5, 2025
@samwaseda
Member Author

samwaseda commented Feb 5, 2025

Today I had a discussion with @jan-janssen while jogging, and he sounded like he's actually interested in extending the universal representation, and I have now been thinking about macro nodes for quite some time. With this in mind, a question for @jan-janssen, but also for @liamhuber: on this page @jan-janssen defined the universal representation of the following code:

def add_x_and_y(x, y):
    z = x + y
    return x, y, z

def add_x_and_y_and_z(x, y, z):
    w = x + y + z
    return w

x, y, z = add_x_and_y(x=1, y=2)
w = add_x_and_y_and_z(x=x, y=y, z=z)

via (I defined a dict instead of individual variables for the reasons that will follow, but the content is the same)

workflow_dict = {
    "nodes_dict": {
        0: add_x_and_y,
        1: add_x_and_y_and_z,
        2: 1,
        3: 2,
    },
    
    "edges_lst": [
        {'target': 0, 'targetHandle': 'x', 'source': 2, 'sourceHandle': None},
        {'target': 0, 'targetHandle': 'y', 'source': 3, 'sourceHandle': None},
        {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'x'},
        {'target': 1, 'targetHandle': 'y', 'source': 0, 'sourceHandle': 'y'},
        {'target': 1, 'targetHandle': 'z', 'source': 0, 'sourceHandle': 'z'},
    ]
}

Now my suggestion: How about changing it to the following?

nodes_dict = {
    0: add_x_and_y,
    1: add_x_and_y_and_z,
}

edges_lst = [
    {'target': 0, 'targetHandle': 'x', 'source': "input", 'sourceHandle': "X", "value": 1},
    {'target': 0, 'targetHandle': 'y', 'source': "input", 'sourceHandle': "Y", "value": 2},
    {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'x'},
    {'target': 1, 'targetHandle': 'y', 'source': 0, 'sourceHandle': 'y'},
    {'target': 1, 'targetHandle': 'z', 'source': 0, 'sourceHandle': 'z'},
    {'target': "output", 'targetHandle': "W", 'source': 1, 'sourceHandle': 'w'},
]

The reason is fairly easy: Let's say I have the following macro node:

def all_at_once(X, Y):
    x, y, z = add_x_and_y(x=X, y=Y)
    w = add_x_and_y_and_z(x=x, y=y, z=z)
    W = w
    return W

Then I can define:

macro_dict = {
    "nodes_dict": {
        0: add_x_and_y,
        1: add_x_and_y_and_z,
    },
    "edges_lst": [
        {'target': 0, 'targetHandle': 'x', 'source': "input", 'sourceHandle': "X", "value": 1},
        {'target': 0, 'targetHandle': 'y', 'source': "input", 'sourceHandle': "Y", "value": 2},
        {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'x'},
        {'target': 1, 'targetHandle': 'y', 'source': 0, 'sourceHandle': 'y'},
        {'target': 1, 'targetHandle': 'z', 'source': 0, 'sourceHandle': 'z'},
        {'target': "output", 'targetHandle': "W", 'source': 1, 'sourceHandle': 'w'},
    ]
}

Now let's say we have the following workflow:

w = all_at_once(X=1, Y=2)
x, y, z = add_x_and_y(x=w, y=3)

Then the new definition would allow us to write something like:

workflow_dict = {
    "nodes_dict": {
        0: macro_dict,
        1: add_x_and_y_and_z,
    },
    "edges_lst": [
        {'target': 0, 'targetHandle': 'X', 'source': "input", 'sourceHandle': None, "value": 1},
        {'target': 0, 'targetHandle': 'Y', 'source': "input", 'sourceHandle': None, "value": 2},
        {'target': 1, 'targetHandle': 'y', 'source': "input", 'sourceHandle': None, "value": 3},
        {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'W'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'x'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'y'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'z'},
    ]
}

From my point of view, this conceptually incorporates macro nodes. Besides, after execution the values can optionally be attached to all the edges as well. What do you think?

@samwaseda
Member Author

I forgot to mention: if we replace the enumeration in the nodes_dict keys by the node labels, this representation lets us extract all the information that I listed above except for the type hints, which we could probably also optionally return.

@jan-janssen
Member

jan-janssen commented Feb 5, 2025

I like the idea of an output handle, as previously it was not directly clear which property the user is interested in.

In terms of nesting the macro dictionary inside the node dictionary, I am not so happy, as I initially wanted to separate the nodes / python functions from the edges / logic. With the nested structure, both get merged.

Another suggestion would be:

workflow_dict = {
  "mymacro" : {
    "nodes_dict": {
        0: add_x_and_y,
        1: add_x_and_y_and_z,
    },
    "edges_lst": [
        {'target': 0, 'targetHandle': 'x', 'source': "input", 'sourceHandle': "X", "value": 1},
        {'target': 0, 'targetHandle': 'y', 'source': "input", 'sourceHandle': "Y", "value": 2},
        {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'x'},
        {'target': 1, 'targetHandle': 'y', 'source': 0, 'sourceHandle': 'y'},
        {'target': 1, 'targetHandle': 'z', 'source': 0, 'sourceHandle': 'z'},
        {'target': "output", 'targetHandle': "W", 'source': 1, 'sourceHandle': 'w'},
    ]
  },
  "main": {
    "nodes_dict": {
        0: "mymacro",
        1: add_x_and_y_and_z,
    },
    "edges_lst": [
        {'target': 0, 'targetHandle': 'X', 'source': "input", 'sourceHandle': None, "value": 1},
        {'target': 0, 'targetHandle': 'Y', 'source': "input", 'sourceHandle': None, "value": 2},
        {'target': 1, 'targetHandle': 'y', 'source': "input", 'sourceHandle': None, "value": 3},
        {'target': 1, 'targetHandle': 'x', 'source': 0, 'sourceHandle': 'W'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'x'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'y'},
        {'target': "output", 'targetHandle': None, 'source': 1, 'sourceHandle': 'z'},
    ]
 },
}

@jan-janssen
Member

Comment on the use of values in the edges. I feel it is dangerous as the user might not realize that two values are related to each other. Let's say I hard code the lattice constant for aluminium in my workflow. When I use it in multiple structure generators, I might define the same value in multiple edges. For somebody using my workflow it is not clear how these values are related.

@liamhuber
Member

At a high level I have three pieces:

  1. Macro subgraphs should be kept isolated from the main graph. Don't connect them. The macro should behave like any other function, it just so happens that in this case it's a "function" that we can look into and represent as a graph. In that way I like @jan-janssen's last example best, but I still feel like there might be too much connection going on.
  2. I agree that attaching values to the edges is dangerous. IMO it falls into the general danger category of mixing up the workflow class (recipe) with the workflow instance. Maybe we can find a way to do it, but probably better to first get a representation we're happy with that only handles the recipe, and then think if it's extensible to handle the data too.
  3. I find the whole source/handle business very hard to read. I think we can probably improve readability. I'll try and cook up an example of my own before you guys wake up tomorrow.

@liamhuber
Member

Given

# Function node
def add_x_and_y(x, y):
    z = x + y
    return x, y, z

# Function Node
def add_x_and_y_and_z(x, y, z):
    w = x + y + z
    return w


# Macro node
def all_at_once(X, Y):
    x, y, z = add_x_and_y(x=X, y=Y)
    w = add_x_and_y_and_z(x=x, y=y, z=z)
    W = w
    return W

# Workflow
all_at_once(1, 2)

I would imagine a graph representation like this:

my_first_node = {
    "function": add_x_and_y,
    "inputs": ("x", "y",),  # These can be automatically scraped
    "outputs": ("x", "y", "z"),  # This can be scraped in pyiron_workflow
    # but not in general -- nevertheless we need some way of identifying outputs,
    # so I guess some meta data like this dictionary is necessary to accompany functions
}

my_second_node = {
    "function": add_x_and_y_and_z,
    "inputs": ("x", "y",), 
    "outputs": ("w",),
}

my_macro = {
    "inputs": ("X", "Y",),
    "outputs": ("W",),
    "nodes": {  # Label to callable
        "xy": my_first_node,
        "xyz": my_second_node,
    },
    "edges": (
        ("inputs.X", "xy.inputs.x"),
        ("inputs.Y", "xy.inputs.y"),
        ("xy.outputs.x", "xyz.inputs.x"),
        ("xy.outputs.y", "xyz.inputs.y"),
        ("xy.outputs.z", "xyz.inputs.z"),
        ("xy.outputs.w", "outputs.W"),
    )
}

workflow = {
    "inputs": ("X", "Y",),
    "outputs": ("W",),
    "nodes": {"m": my_macro},
    "edges": (
        ("inputs.X", "m.inputs.X"),
        ("inputs.Y", "m.inputs.Y"),
        ("m.outputs.W", "outputs.W"),
    )
}
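
Just to convince myself that such a dictionary carries enough information to be computed, here is a minimal interpreter sketch (an illustration only: it assumes acyclic data edges, that every node explicitly lists its "inputs"/"outputs" with the output labels in return order, and it ignores annotations):

def run_graph(graph, **kwargs):
    """Naively evaluate the dictionary representation above.
    Sketch assumptions: the graph is acyclic, every node explicitly lists its
    "inputs"/"outputs", and a node either wraps a "function" or is another
    graph of the same shape."""
    values = {f"inputs.{name}": value for name, value in kwargs.items()}
    pending = dict(graph["nodes"])
    while pending:
        progressed = False
        for label, node in list(pending.items()):
            # Find the edges feeding this node's input ports.
            sources = {
                target: source
                for source, target in graph["edges"]
                if target.startswith(f"{label}.inputs.")
            }
            if not all(source in values for source in sources.values()):
                continue  # an upstream node has not produced its output yet
            node_kwargs = {
                target.rsplit(".", 1)[-1]: values[source]
                for target, source in sources.items()
            }
            if "function" in node:  # atomic node
                result = node["function"](**node_kwargs)
                if len(node["outputs"]) == 1:
                    result = (result,)
            else:  # composite node: recurse into the sub-graph
                sub_result = run_graph(node, **node_kwargs)
                result = tuple(sub_result[o] for o in node["outputs"])
            for output_label, output_value in zip(node["outputs"], result):
                values[f"{label}.outputs.{output_label}"] = output_value
            del pending[label]
            progressed = True
        if not progressed:
            raise RuntimeError(f"could not resolve inputs for {sorted(pending)}")
    return {
        target.split(".", 1)[1]: values[source]
        for source, target in graph["edges"]
        if target.startswith("outputs.")
    }

# e.g. run_graph(workflow, X=1, Y=2) recurses through "m" and returns {"W": 6},
# provided the example dictionaries above are fully wired up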

Note that the IO is always explicitly defined, and thus macros are able to indicate how their IO communicates with their children, but there is no direct connection between a macro's siblings and its children. (Maybe this is already the case in @jan-janssen's example! I am just having a bit of trouble grokking it.)

Then notice that here even the functions are required to have an explicit "graphy" representation dictionary. In pyiron_workflow we can always scrape all of this. For general python functions it gets harder when it comes to the output. We have tools in pwf to scrape labels even from the symbols of the code in the return statement, but you still can't do it generically, because what do you do if there's more than one return statement? So no matter how we slice it, I think we will need to force generic python callables into some more constrained and explicit form. Here I do it with a dictionary that is based on the returned tuple. One advantage here is that if you imagine I wanted to be working with the tuple (x, y, z) in my graph, the same underlying function could get the following representation to accomplish that:

my_first_node = {
    "function": add_x_and_y,
    "inputs": ("x", "y",), 
    "outputs": ("tup"),  # Now the whole return statement is parsed as a single tuple
    # Of course this flexiblity only goes so far -- we couldn't pull out "(x, y), z" 
    # It's all together or all apart
}

Ok, and finally note that we can imagine extending such a terminology to include annotations:

"inputs": ( ("x", None), ("y", u(int, units="meters"),),

or

"inputs": {
    "x": None,
    "y": u(int, units="meters"),
}

So coming back to a high level after a bit more percolation: I like where you guys are going with this. Excluding the fact that I've dropped values, I think we've got basically all the pieces you mentioned in your first post, @samwaseda. I think my favourite thing so far is my example above + the (label, annotation) dictionary for inputs and outputs.

Until I wrote up this example, I hadn't really thought about it in this generic workflow language context, but now I see we really do need to constrain the bottom-most graph callables -- i.e. the root-most functions. We certainly can't graph-ize just anything, e.g. functions with multiple return branches with different numbers/natures of items would totally sink this approach, so we'd better constrain them.
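
To make that constraint concrete, here is a rough sketch of the kind of return-statement scraping this implies (an assumption-laden illustration, not pwf's actual implementation):

import ast
import inspect
import textwrap

def scrape_output_labels(function):
    """Best-effort scraping of output labels from a plain function.
    Only a single return of a name or a tuple of names is accepted,
    mirroring the constraint discussed above."""
    source = textwrap.dedent(inspect.getsource(function))
    returns = [node for node in ast.walk(ast.parse(source)) if isinstance(node, ast.Return)]
    if len(returns) != 1:
        raise ValueError("need exactly one return statement to scrape output labels")
    value = returns[0].value
    if isinstance(value, ast.Name):
        return (value.id,)
    if isinstance(value, ast.Tuple) and all(isinstance(e, ast.Name) for e in value.elts):
        return tuple(element.id for element in value.elts)
    raise ValueError("return value is not a plain name or a tuple of names")

# scrape_output_labels(add_x_and_y)       -> ("x", "y", "z")
# scrape_output_labels(add_x_and_y_and_z) -> ("w",)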

@samwaseda
Member Author

I feel it is dangerous as the user might not realize that two values are related to each other. Let's say I hard code the lattice constant for aluminium in my workflow. When I use it in multiple structure generators, I might define the same value in multiple edges. For somebody using my workflow it is not clear how these values are related.

The whole idea is to represent a workflow that a user has run or is going to run, which means there is no way to tell whether two values are supposed to be the same or just accidentally the same, i.e. we could think about:

output_1 = my_function(structure_1)
output_2 = my_function(structure_2)
output_3 = my_function(structure_2)

Let's assume structure_1 and structure_2 happen to be the same. In this case all three inputs are the same, but while the last two are "supposed" to be the same, the first one "happens" to be the same. So actually it might be more dangerous to group them by their values. Instead, I think the user should make another node to explicitly say that the last two should get the same structure, i.e.:

structure = structure_generator(some_input)
output_1 = my_function(structure_1)
output_2 = my_function(structure)
output_3 = my_function(structure)

and include structure_generator into the workflow.

I can live with @liamhuber's representation (and of course also with @jan-janssen's), except for the fact that I would like to have a straightforward extension to include values, i.e. instead of:

workflow = {
    "inputs": ("X", "Y",),
    "outputs": ("W",),
    "nodes": {"m": my_macro},
    "edges": (
        ("inputs.X", "m.inputs.X"),
        ("inputs.Y", "m.inputs.Y"),
        ("m.outputs.W", "outputs.W"),
    )
}

I would suggest:

workflow = {
    "inputs": ("X", "Y",),
    "outputs": ("W",),
    "nodes": {"m": my_macro},
    "edges": (
        {"input": "inputs.X", "output": "m.inputs.X"},
        {"input": "inputs.Y", "output": "m.inputs.Y"},
        {"input": "m.outputs.W", "output": "outputs.W"},
    )
}

That said, it might be slightly more difficult for @jan-janssen to accept this representation, because it would involve more parsing when it's read by the workflow manager. Still, our suggestions are not so different from each other, so I'm gonna make an example for @liamhuber's version for now and open a PR here.

@samwaseda
Member Author

Sorry, actually, since a single dictionary should be returned, the representation should look like this:

result = { 
    "my_first_node": {
        "function": add_x_and_y,
        "inputs": ("x", "y",),  # These can be automatically scraped
        "outputs": ("x", "y", "z"),  # This can be scraped in pyiron_workflow
        # but not in general -- nevertheless we need some way of identifying outputs,
        # so I guess some meta data like this dictionary is necessary to accompany functions
    },

    "my_second_node": {
        "function": add_x_and_y_and_z,
        "inputs": ("x", "y",), 
        "outputs": ("w",),
    },

    "my_macro": {
        "inputs": ("X", "Y",),
        "outputs": ("W",),
        "edges": (
            {"input": "inputs.X", "output": "my_first_node.inputs.x"},
            {"input": "inputs.Y", "output": "my_first_node.inputs.y"},
            {"input": "my_first_node.outputs.x", "output": "my_second_node.inputs.x"},
            {"input": "my_first_node.outputs.y", "output": "my_second_node.inputs.y"},
            {"input": "my_first_node.outputs.z", "output": "my_second_node.inputs.z"},
            {"input": "my_second_node.outputs.w", "output": "outputs.W"},
        )   
    },

    "workflow": {
        "inputs": ("X", "Y",),
        "outputs": ("W",),
        "edges": (
            {"input": "inputs.X", "output": "my_macro.inputs.X"},
            {"input": "inputs.Y", "output": "my_macro.inputs.Y"},
            {"input": "my_macro.outputs.W", "output": "outputs.W"},
        )   
    }   
}

I also removed nodes because they are defined by the keys.

@liamhuber
Member

About to go to sleep, but quickly: I think you probably still want a nodes field, otherwise it's purely implicit which node belongs to which macro/workflow, accessible only by parsing the edges. It might be fine not to require it when specifying the dictionary, but for reading the dictionary I'd still like to see it.

@jan-janssen
Member

For general python functions it gets harder when it comes to the output. We have tools in pwf to scrape labels even from the symbols of the code in the return statement, but you still can't do it generically, because what do you do if there's more than one return statement? So no matter how we slice it, I think we will need to force generic python callables into some more constrained and explicit form.

At least for the general workflow definition, I do not expect anybody to write this graph directly. Rather, I expect them to use one of the existing workflow frameworks and export the graph in this format. The advantage is that the current workflow frameworks all have a way to declare the expected output, even though they might not agree on doing it in the same way.

@samwaseda
Member Author

samwaseda commented Feb 6, 2025

About to go to sleep, but quickly: I think you probably still want a nodes field, otherwise it's purely implicit which node belongs to which macro/workflow, accessible only by parsing the edges. It might be fine not to require it when specifying the dictionary, but for reading the dictionary I'd still like to see it.

That's true... One possibility is to merge your two ideas: I changed @jan-janssen's version so that the node and the variable are combined into a single path

workflow_dict = { 
  "mymacro" : { 
    "nodes_dict": {
        "xy": add_x_and_y,
        "xyz": add_x_and_y_and_z,
    },  
    "edges_lst": [
        {'target': 'xy.inputs.x', 'source': "input.X", "value": 1}, 
        {'target': 'xy.inputs.y', 'source': "input.Y", "value": 2}, 
        {'target': 'xyz.inputs.x', 'source': 'xy.outputs.x'},
        {'target': 'xyz.inputs.y', 'source': 'xy.outputs.y'},
        {'target': 'xyz.inputs.z', 'source': 'xy.outputs.z'},
        {'target': "output.W", 'source': 'xyz.outputs.w'},
    ]   
  },  
  "main": {
    "nodes_dict": {
        "mymacro": "mymacro",
        "xyz": add_x_and_y_and_z,
    },  
    "edges_lst": [
        {'target': 'mymacro.inputs.X', 'source': "input", "value": 1}, 
        {'target': 'mymacro.inputs.Y', 'source': "input", "value": 2}, 
        {'target': 'xyz.inputs.y', 'source': "input", "value": 3}, 
        {'target': 'xyz.inputs.x', 'source': 'mymacro.outputs.W'},
        {'target': "output", 'source': 'xyz.outputs.x'},
        {'target': "output", 'source': 'xyz.outputs.y'},
        {'target': "output", 'source': 'xyz.outputs.z'},
    ]   
 },
}

And I claim this is closer to what pyiron_workflow does than your representation, because in your representation the combination of my_first_node = {"function": add_x_and_y, ...} and my_macro = {... "nodes": {"xy": my_first_node, ...} ...} implies add_x_and_y is first instantiated (to my_first_node) and then assigned to my_macro.xy, which is not the case. In reality, add_x_and_y is directly instantiated as my_macro.xy, which is what @jan-janssen's representation corresponds to.

@jan-janssen
Member

I am not bound to a specific notation and for me the two are equivalent:

{'target': 'xyz', 'targetHandle': 'x', 'source': 'xy', 'sourceHandle': 'x'},

and:

{'target': 'xyz.inputs.x', 'source': 'xy.outputs.x'},

so from that side, I am fine with either solution.
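
For the record, converting between the two notations is mechanical; a small sketch (assuming targets always address input ports and sources output ports):

def expand_edge(edge):
    """Translate the path-style notation into the handle-style notation
    (sketch; assumes 'target' addresses an input port and 'source' an output port)."""
    target_node, _, target_handle = edge["target"].partition(".inputs.")
    source_node, _, source_handle = edge["source"].partition(".outputs.")
    expanded = {
        "target": target_node,
        "targetHandle": target_handle or None,
        "source": source_node,
        "sourceHandle": source_handle or None,
    }
    if "value" in edge:
        expanded["value"] = edge["value"]
    return expanded

# expand_edge({'target': 'xyz.inputs.x', 'source': 'xy.outputs.x'})
# -> {'target': 'xyz', 'targetHandle': 'x', 'source': 'xy', 'sourceHandle': 'x'}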

The part where I am still not sure, is where to store the values. I could live with using the "value" field of the edges for optional storage of outputs, but at least for the default input values and the parameters the user is able to change, I would prefer those to be nodes.

@samwaseda
Member Author

The part where I am still not sure, is where to store the values. I could live with using the "value" field of the edges for optional storage of outputs, but at least for the default input values and the parameters the user is able to change, I would prefer those to be nodes.

I'm not sure if I'm understanding the problem correctly. I was thinking about writing a function like get_dict_representation with a default argument values=False. If values is True and the workflow has already run, then I would attach the values to all the edges; otherwise only the default values would be attached, which would correspond to the case that I presented above. Would that not be good enough?

@jan-janssen
Member

Does it make sense to have the outside input and output variables specifically marked? I mean, when I search for a workflow which calculates a bulk modulus, I do not really want to parse all the edges to see whether the bulk modulus is an output edge.

@samwaseda
Member Author

Does it make sense to have the outside input and output variables specifically marked?

I guess that's definitely an extension that we can think of, but I thought the universal representation was the minimum representation to communicate with other nodes, and as long as there is a possibility to extract data, you don't want to give more requirements to the workflow managers? In other words, doesn't it make sense to agree on a minimum representation and run everything else outside?

@samwaseda
Member Author

@liamhuber How do I know that there's a connection between macro.w.outputs.w and my_macro.outputs.W in pyiron_workflow?

@Workflow.wrap.as_macro_node
def all_at_once(macro, X, Y):
    macro.x, macro.y, macro.z = add_x_and_y(x=X, y=Y)
    macro.w = add_x_and_y_and_z(x=x, y=y, z=z)
    W = macro.w
    return W

my_macro = all_at_once()

@liamhuber
Member

At least for the general workflow definition, I do not expect anybody to write this graph directly. Rather, I expect them to use one of the existing workflow frameworks and export the graph in this format. The advantage is that the current workflow frameworks all have a way to declare the expected output, even though they might not agree on doing it in the same way.

Sounds good to me. At any rate, I think we need to include such a function-input-output representation at this dictionary level. For instance, @samwaseda's immediately following example here is, as far as I can tell, incomplete and uncomputable because the generic representation has no way to know that "xy.outputs.z" exists or what it should be.

@liamhuber
Member

Does it make sense to have the outside input and output variables specifically marked?

I guess that's definitely an extension that we can think of, but I thought the universal representation was the minimum representation to communicate with other nodes, and as long as there is a possibility to extract data, you don't want to give more requirements to the workflow managers? In other words, doesn't it make sense to agree on a minimum representation and run everything else outside?

I think this comes naturally from my last example, where "inputs" and "outputs" are explicit fields of each abstract object. In my example it was just a tuple of string names, but these fields could hold pairs of (name, annotation) or triples of (name, annotation, value). Not that these need to be tuples at all; we could just as well have dict[str, tuple[Annotation | None, object | None]] -- my point is just that the dictionary representation of the workflow object should definitely have "input" and "output" fields.

@samwaseda
Member Author

For instance, @samwaseda's immediately following example #576 (comment) is, as far as I can tell, incomplete and uncomputable because the generic representation has no way to know that "xy.outputs.z" exists or what it should be.

Ah yeah I see that I actually totally ignored your inline comment XD

@liamhuber
Member

@liamhuber How do I know that there's a connection between macro.w.outputs.w and my_macro.outputs.W in pyiron_workflow?

@Workflow.wrap.as_macro_node
def all_at_once(macro, X, Y):
    macro.x, macro.y, macro.z = add_x_and_y(x=X, y=Y)
    macro.w = add_x_and_y_and_z(x=x, y=y, z=z)
    W = macro.w
    return W

my_macro = all_at_once()

So first off, you can't run that. The example is incomplete anyhow, but all children of macros need to be nodes, so I'm going to operate on the assumption that add_x_and_y and add_x_and_y_and_z are intended to be function nodes, not raw functions. Then in that case we are not powerful enough to decompose tuple output like macro.x, macro.y, macro.z = add_x_and_y(x=X, y=Y) and you are missing the macro. scoping in macro.w = add_x_and_y_and_z(x=x, y=y, z=z).

Here's an operational example that I think is equivalent:

import pyiron_workflow as pwf

@pwf.as_function_node
def add_x_and_y(x, y):
    z = x + y
    return x, y, z

@pwf.as_function_node
def add_x_and_y_and_z(x, y, z):
    w = x + y + z
    return w


@pwf.as_macro_node
def all_at_once(macro, X, Y):
    macro.xy = add_x_and_y(x=X, y=Y)
    macro.xyz = add_x_and_y_and_z(
        x=macro.xy.outputs.x,
        y=macro.xy.outputs.y,
        z=macro.xy.outputs.z,
    )
    W = macro.xyz
    return W

print(all_at_once.preview_io())

my_macro = all_at_once()
my_macro(1,2)

We can then return to the question "How do I know that there's a connection between macro.xyz.outputs.w and my_macro.outputs.W in pyiron_workflow?"

The answer is that at the class level, you don't. You know that all_at_once has an output W, and if you had given a return annotation for the macro definition, you'd know about its type. This we scrape from the actual source code of the macro definition. Already here we have some safeguards/freedoms/constraints:

  • We could have instead simply said return macro.xyz, and the macro output label would be xyz instead of W, but functionality would be identical
    • Since macro matches the first argument of the function definition, we scrape it off and the macro output label is just xyz
  • We can mix-and-match
    • since add_x_and_y_and_z has a single output, we could say return W, macro.xyz, and it would work fine (although we'd get the same output twice, which is boring, but in general it is a fine thing to do)
    • since xy has multiple outputs, we need to be more specific, but we can return W, macro.xy.outputs.z
      • As-is, this will give an error because we have more to strip off than macro. -- ValueError: Tried to scrape cleaned labels for all_at_once, but at least one of ['W', 'xy.outputs.z'] still contains a '.' -- please provide explicit labels
      • Adding @pwf.as_macro_node("W", "foobar") resolves this

But at the end of the day, at the class level the macro is basically taking our word for it that we've defined the IO of the macro in a valid way.

Once we move to the instance, we'll get errors when we try to create the macro instance if we've screwed things up. For instance, if we try return W, 42, we'll get AttributeError: 'int' object has no attribute 'channel'. At this point, we have the actual subgraph objects in hand and can make sure that everything the macro is returning is a valid channel representation.

Now, finally, we get to your answer: we demand that everything coming out of the macro be channel data because the role of a macro is to execute a subgraph; the macro's IO is merely an interface to the IO of this subgraph. We handle the relationship between the macro IO and its child IO with "value receiver" pairs. Unlike a connection which needs to form between (input/output)-(output/input), there is only ever one value receiver and it forms (input/output)-(input/output). All it does is keep the .value synchronized between the two channels. You can go look at pyiron_workflow.channels to dig into this; getting it mypy compliant made all the differences between connections and value pairs extremely explicit.

Thus, in the running example above:

my_macro.xyz.outputs.w.value_receiver is my_macro.outputs.W
>>> True

my_macro.xyz.value_receiver is my_macro.outputs.W  # exploiting single-output property
>>> True

my_macro.outputs.W in my_macro.xyz.connections
>>> False
# Obviously, they are both outputs!

It would be super cool to know more of this stuff at the class level, but I don't currently see a way how.

At any rate, for the generic workflow representation, I still think we don't need to know that these two have a connection; a macro is just like any other function, and the workflow only cares (and only should care) what its IO is. Then, in the macro's own generic definition, the macro knows how its IO maps to the IO of its subgraph (again, because we need explicit "input" and "output" fields in our generic workflow description!)

Now the big caveat -- what about ontological information? Ideally, we want the macro output channels to somehow be able to read their subgraph and make sure that their output channels assume the same ontological- (and, for that matter type-)information as the child channels to which they are value coupled. Right now, I don't even do this for types. I do think it's possible at the instance level, but I don't yet see how to guarantee it at the class level. So for now I would suggest that we simply demand that users go to the trouble of explicitly annotating their workflows, and that we go in and add the pyiron_workflow infrastructure to merely verify that these annotations are correct at instantiation time, once we have all the information we will need to make such a validation.

@liamhuber
Member

Now it is my turn to feel like a Russian novelist 😝

@samwaseda
Member Author

Now it is my turn to feel like a Russian novelist 😝

Yeah and I feel like I'm the one who's drunk XD

@samwaseda
Member Author

Now the big caveat -- what about ontological information? Ideally, we want the macro output channels to somehow be able to read their subgraph and make sure that their output channels assume the same ontological- (and, for that matter type-)information as the child channels to which they are value coupled. Right now, I don't even do this for types.

wowowowow now I understand the punchline of this story. It feels like you actually declared both Raskolnikov and Sonya dead in the epilogue XD

@samwaseda
Member Author

But then I don't really get it: When you draw the graph, it looks like the inputs and the outputs are correctly connected. How are these arrows drawn?

@liamhuber
Member

But then I don't really get it: When you draw the graph, it looks like the inputs and the outputs are correctly connected. How are these arrows drawn?

When who draws it? When drawing a graph, the data edges for "connections" are always shown between ports of sibling nodes. It would be perfectly reasonable to then open up and resolve a macro in the same image and show lines from its input channels to the input channels of its children, and from the output channels of its children to its output channels, but (a) these lines, while meaningful, aren't true data connection edges, and (b) when we define a macro as we do currently -- by decorating some function -- we can't inspect the innards like this until it has been instantiated. Right now only the IO of the macro is machine-verifiable at the class level.

I'd love to change (b), and maybe we can make some headway by making the graph constructor into a graph-constructor-recipe that is available as a class method... I think such an attack should be doable, but it would take some effort and I don't know how easy it would be to pull off while still maintaining the current user-ease for defining macros. So, like, I'm open to it, but it will cost time at a minimum and likely some UX penalty.

@liamhuber
Member

Actually, talking through it, I'd really like to change (b). It's going to take some thinking for how to not make it too much more verbose for users though

@samwaseda
Member Author

OK, actually it looks like I misunderstood something. Let me reformulate my question: how can I know, in the example above, that my_macro.xyz.value_receiver is my_macro.outputs.W? Do I just compare all the nodes and all the outputs?

@liamhuber
Member

How can you know as a human, or how does the computer know?

As a human you just read the macro definition and know that return values are getting mapped to outputs, and that value receivers are the functionality we use to handle that mapping.

As a computer, so far we only care about knowing at subgraph creation time (aka macro instantiation time). For the technicalities of building the mapping it's best to just go read the macro class.

As a computer post facto, yes, you could iterate over the macro inputs and all the child outputs and search for their value receivers (all the macro inputs will have one, most of the macro child outputs usually won't, but the fraction depends totally on your graph).

@liamhuber
Member

liamhuber commented Feb 6, 2025

Ah, unless you meant my common workflow definition. There I stuck it in the edges, despite the fact that I keep asserting it isn't one: ("xy.outputs.w", "outputs.W"). Of course it can't be a "true" edge though, because it's output-to-output. I think we need these value "edges" to get user data into and results out of workflows anyhow.

I also had a typo, that should be "xyz"

@samwaseda
Member Author

Ah, now I finally got it. I don't know how I got stuck, but it was for some weird reasoning not even worth explaining. I'm gonna work on the prototype and will probably present it today.

@samwaseda
Member Author

Now my next suggestion:

workflow_dict = { 
    "mymacro": {
        "nodes_dict": {
            "xy": add_x_and_y,
            "xyz": add_x_and_y_and_z,
        },
        "inputs": ("X", "Y"),
        "outputs": ("W", ),
        "edges_lst": [
            {'target': 'xy.inputs.x', 'source': "input.X", "value": 1}, 
            {'target': 'xy.inputs.y', 'source': "input.Y", "value": 2}, 
            {'target': 'xyz.inputs.x', 'source': 'xy.outputs.x'},
            {'target': 'xyz.inputs.y', 'source': 'xy.outputs.y'},
            {'target': 'xyz.inputs.z', 'source': 'xy.outputs.z'},
            {'target': "output.W", 'source': 'xyz.outputs.w'},
        ]   
    },  
    "main": {
        "nodes_dict": {
            "mymacro": "mymacro",
            "xyz": add_x_and_y_and_z,
        },  
        "edges_lst": [
            {'target': 'mymacro.inputs.X', 'source': "input", "value": 1}, 
            {'target': 'mymacro.inputs.Y', 'source': "input", "value": 2}, 
            {'target': 'xyz.inputs.y', 'source': "input", "value": 3}, 
            {'target': 'xyz.inputs.x', 'source': 'mymacro.outputs.W'},
            {'target': "output", 'source': 'xyz.outputs.x'},
            {'target': "output", 'source': 'xyz.outputs.y'},
            {'target': "output", 'source': 'xyz.outputs.z'},
        ]   
    },  
}

I'm still reluctant to put input and output arguments in the main workflow definition, because {'target': 'mymacro.inputs.X', 'source': "input.X", "value": 1} simply does not correspond to what pyiron_workflow does. If it's really really required, then I would put inputs.mymacro__X.

@liamhuber
Member

Yeah, by and large I'm happy with this. I still think it's technically critical to wrap the underlying functions -- again, in this example I don't see a way for the machine to parse 'xy.outputs.z' without something like

    "xy": {
        "function": add_x_and_y,
        "inputs": ("x", "y"),
        "outputs": ("x", "y", "z"),
    }

I'm still reluctant to put input and output arguments in the main workflow definition, because {'target': 'mymacro.inputs.X', 'source': "input.X", "value": 1} simply does not correspond to what pyiron_workflow does. If it's really really required, then I would put inputs.mymacro__X.

At a high level, I will still push for these to be included. First, because I don't see a reason for "macros" to have a fundamentally different syntax from "workflows". We need IO for macros, so I think it improves consistency and readability to give them to workflows too. Second, I will obviously promote pyiron_workflow above all other tools, but at this abstract level I do think we should avoid getting too hung up on it -- rather let's only focus on whether it makes sense for describing workflows in general.

Very specifically, I'm totally content with inputs.mymacro__X. You're right this is what pyiron_workflow does automatically. But! There is also no barrier to giving it a more specific name like inputs.X -- the Workflow class automatically scrapes its IO from unconnected child channels, but it does allow you to provide an IO map for renaming these to something more convenient. So there's zero conflict here.

As a nit, what do you guys think of "from" and "to" instead of "source" and "target"? I also think we can truncate "nodes_dict" down to "nodes", and truncate and then specify "edges_lst" to "data_edges".

Finally, I was trying to think whether there's a way to express the I-I/O-O value coupling by exploiting the "value" field of the edge dictionary. I didn't manage to find a way that I both liked more and that carried the right logical information for the machine, but I want to bring it up in case you guys can see a more clever way to attack along that route than I do. It would be nice to have only "true" edges in the from-to edge data, and to have some really clear indication when we're supplying information for a "false" edge.

@liamhuber
Member

Ok, I slept on it and some more ideas came.

First, regarding where to store values: they can live either on the inputs/outputs, or on the edges themselves. There is no difference between these in the quality of the information we store, but they differ in readability and memory efficiency; storing on the edges, the value gets repeated n_connections times; storing on the I/O channels, it gets repeated n_connections + 1 times -- i.e. the worst case (and most typical case) is that IO storage is 2x worse. For now, let's optimize for human readability, and we can always convert our representation to/from a more efficient machine-storage representation as needed.

Next, if we are storing values on the IO itself, then the edges only need to be string references to the scoped IO labels. I realized that we can nicely distinguish "true" data edges from "fake" value synchronizing edges by the fact that the former must be directional (output->input) whereas the latter, in principle, is a non-directional property. Of course, in practice input-input value synchronizations always see the outer-scope value get updated and this propagates to the inner-scope, and vice-versa for output-output value synchronizations, but the result is that the values are always synchronized. Thus, they can be easily distinguished by an ordered tuple ("output", "input") vs an unordered set {"outer", "inner"}.
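
In code, telling the two kinds apart would then be trivial; a sketch against that tuple-vs-set convention:

def split_edges(data_edges):
    """Separate directional data edges (ordered tuples) from non-directional
    value-synchronization pairs (unordered sets)."""
    true_edges, value_pairs = [], []
    for edge in data_edges:
        if isinstance(edge, (set, frozenset)):
            value_pairs.append(edge)
        else:
            true_edges.append(tuple(edge))
    return true_edges, value_pairs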

Finally, I extended the underlying example functions a bit to demonstrate how important the input and output fields are:

from semantikon.typing import u

# Function node
def add_x_and_y(x, y):
    z = x + y
    return x, y, z

# Function Node
def add_x_and_y_and_z(x, y=4, z: u(int, "meters") = 5) -> int:
    w = x + y + z
    return w


# Macro node
def all_at_once(X, Y: int):
    x, y, z = add_x_and_y(x=X, y=Y)
    w = add_x_and_y_and_z(x=x, y=y, z=z)
    W = w
    return W

# Workflow
all_at_once(1, 2)

all_at_once_dict = {
    "inputs": {
        "mymacro__X": {"value": 1},
        "mymacro__Y": {"value": 2},
    },
    "outputs": {"mymacro__W": {"value": 6}},
    "nodes": {
        "mymacro": {
            "inputs": {
                "X": {"value": 1},
                "Y": {
                    "value": 2, 
                    "annotation": int,
                },
            },
            "outputs": {"W": {"value": 6}},
            "nodes": {
                "xy": {
                    "function": add_x_and_y,
                    "inputs": {
                        "x": {"value": 1},
                        "y": {"value": 2},
                    },
                    "outputs": {
                        "x": {"value": 1},
                        "y": {"value": 2 },
                        "z": {"value": 3},
                    }
                },
                "xyz": {
                    "function": add_x_and_y_and_z,
                    "inputs": {
                        "x": {"value": 1},
                        "y": {
                            "value": 2,
                            "default": 4,
                        },
                        "z": {
                            "value": 3, 
                            "default": 5,
                            "annotation": u(int, "meters"),
                        },
                    },
                    "outputs": {"w": {"value": 6, "annotation": int}},
                },
            },
            "data_edges": (
                {"inputs.X", "xy.inputs.X"},
                {"inputs.Y", "xy.inputs.Y"},
                ("xy.outputs.x", "xyz.inputs.x"),
                ("xy.outputs.y", "xyz.inputs.y"),
                ("xy.outputs.z", "xyz.inputs.z"),
                {"xyz.outputs.w", "outputs.W"},
            ),
        },
    },
    "data_edges": (
        {"inputs.mymacro__X", "mymacro.inputs.X"},
        {"inputs.mymacro__Y", "mymacro.inputs.Y"},
        {"mymacro.outputs.W", "outputs.mymacro__W"},
    ),
}

I think this gives us the power to capture everything we want, and (for me at least...) it is still pretty readable given the density of information it contains. One thing I really like is that now the only difference between the "recipe" class-view and the post-run instance view is that the "value": fields on the various "inputs" and "outputs" get populated!

A big change here is that everything has been defined in a nested way. This is fine, and probably preferable, when we don't reuse nodes. I could imagine breaking these into reusable parts by simply having the node/macro dictionaries defined independently in the same scope as the macro/workflow is defined. E.g.,

xyz_dict = {
    "function": add_x_and_y_and_z,
    "inputs": {
        "x": {},
        "y": {"default": 4},
        "z": {
            "default": 5,
            "annotation": u(int, "meters"),
        },
    }
}

all_at_once_dict = {
    ...
                "xyz": xyz_dict,
    ...
}

Since this is now reusable, note that I've purged the values. This works beautifully for the "recipe" view, but for the instance-view, I guess we'd again want to convert the dictionary to a more verbose nested view so that each occurrence of xyz_dict is able to get its own "value": fields populated individually.
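
As a sanity check that the nested view stays easy to walk programmatically, here is a small sketch that collects every port path and its metadata from such a dictionary (only the fields used above are assumed):

def iter_ports(node, prefix=""):
    """Yield (port_path, metadata) for every input and output in the nested
    representation drafted above, recursing into child nodes."""
    for io in ("inputs", "outputs"):
        for label, metadata in node.get(io, {}).items():
            yield f"{prefix}{io}.{label}", metadata
    for child_label, child in node.get("nodes", {}).items():
        yield from iter_ports(child, prefix=f"{prefix}{child_label}.")

# dict(iter_ports(all_at_once_dict))["mymacro.xyz.inputs.z"]
# -> {"value": 3, "default": 5, "annotation": u(int, "meters")}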

@samwaseda
Member Author

Now I also slept on it, and I guess we can agree on your final version. Here, I try to formulate the specs:

Requirements

  • A node must at least consist of inputs and outputs, in the form of a dictionary with the argument names as keys and their metadata as values.

Optional fields

  • A node can contain
    • nodes and edges [1] (if composite node), or
    • function (if atomic node)
  • Inside nodes, more nodes are nested
  • edges consists of a tuple of length-2 tuples, indicating the data flow from one port to another.
  • A port name is made up of io.variable (io = inputs / outputs). It can be preceded by the child node names (nested_node_1.nested_node_2.io.variable). The variable names must correspond to the ones specified in the inputs and outputs fields.
  • function contains the underlying python function of the node
  • Optional fields for the inputs and outputs are: value, annotation, default [2]

Footnotes

  1. I'm also fine with data_edges, but if nodes are just nodes, why not just edges?

  2. Not sure if we need default
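
A minimal check of these requirements could look something like the following sketch (the field names follow the spec above; it is not tied to any particular workflow manager):

def validate_node(node, path="node"):
    """Collect violations of the spec sketched above for one (possibly
    nested) node dictionary."""
    problems = []
    for io in ("inputs", "outputs"):
        if not isinstance(node.get(io), dict):
            problems.append(f"{path}: required field '{io}' must be a dict")
    if "nodes" in node and "function" in node:
        problems.append(f"{path}: 'nodes' and 'function' are mutually exclusive")
    for edge in node.get("edges", ()):
        if len(edge) != 2:
            problems.append(f"{path}: edge {edge!r} must connect exactly two ports")
    for label, child in node.get("nodes", {}).items():
        problems.extend(validate_node(child, path=f"{path}.{label}"))
    return problems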

@niklassiemer
Member

I am in favor of an optional default field. It keeps the workflow reproducible even if the implicit defaults change after some time.

@samwaseda
Member Author

samwaseda commented Feb 11, 2025

all_at_once_dict = {
    ...
                "xyz": xyz_dict,
    ...
}

I support this idea, but I find it also somewhat inconvenient that the label is not included in the dictionary, because I would need it in the knowledge graph.

@samwaseda samwaseda linked a pull request Feb 11, 2025 that will close this issue
@liamhuber
Member

Here, I try to formulate the specs:
...

  • edges consists of a tuple of length-2 tuples, indicating the data flow from one port to another.

You don't mention how we pass information from the scope of one graph to another graph (i.e. where I used sets earlier). Even if we don't go for the set notation, this layer of information passing should be explicitly identified.

@liamhuber
Member

all_at_once_dict = {
    ...
                "xyz": xyz_dict,
    ...
}

I support this idea, but I find it also somewhat inconvenient that the label is not included in the dictionary, because I would need it in the knowledge graph.

Can you clarify what you mean by "label"? From my reading, the label is "xyz" in both the full nested view and also the abstracted view where we insert an existing dict. In the latter case, the variable xyz_dict is just that, a variable that has some scope and only lives while the interpreter lives.

In principle, we could add a "label" field:

xyz_dict = {
    "label": "xyz_dict",
    "function": add_x_and_y_and_z,
    "inputs": {
        "x": {},
        "y": {"default": 4},
        "z": {
            "default": 5,
            "annotation": u(int, "meters"),
        },
    }
}

But at this stage it would really be a "recipe label", and IMO would anyhow get overridden to the "instance label" of "xyz" once it appears in the other workflow/macro.

@liamhuber
Member

I'm also fine with data_edges but if nodes are just nodes why not just edges?

With nodes I was trimming off a data type specifier, which is redundant because we can anyhow just look at the code and see the data type. With data_edges I did this trimming, then added back a "meaning" prefix. This is a little bit of forward-engineering and perhaps YAGNI, but my best understanding is that there are certain graph topologies that we will not be able to express with data edges alone. In particular, consider a bifurcating graph with an "if" clause. In my experience the most practical way to resolve this is to additionally have execution_edges (Signal channels in pwf). If we have an alternate plan in the works for how to extend this generic description to a wider set of graphs that doesn't reuse the term edges, then I'm fine with truncating data_edges down...but I want to hear a reasonable concept of such a plan before we ditch the data_ 😝
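
To illustrate the kind of topology meant here (all node, port, and signal names below are made up for the sketch and do not correspond to an existing pwf or agreed-upon schema): both branches are wired up with data edges, but which branch actually runs can only be expressed with execution edges.

def check(x):
    is_positive = x > 0
    return is_positive

def increment(x):
    y = x + 1
    return y

def decrement(x):
    y = x - 1
    return y

branching_dict = {
    "inputs": ("x",),
    "outputs": ("y",),
    "nodes": {
        "check": {"function": check, "inputs": ("x",), "outputs": ("is_positive",)},
        "up": {"function": increment, "inputs": ("x",), "outputs": ("y",)},
        "down": {"function": decrement, "inputs": ("x",), "outputs": ("y",)},
    },
    "data_edges": (
        ("inputs.x", "check.inputs.x"),
        ("inputs.x", "up.inputs.x"),
        ("inputs.x", "down.inputs.x"),
        ("up.outputs.y", "outputs.y"),
        ("down.outputs.y", "outputs.y"),
    ),
    "execution_edges": (
        # hypothetical signal ports: only the branch that receives the signal runs
        ("check.signals.if_true", "up.signals.run"),
        ("check.signals.if_false", "down.signals.run"),
    ),
}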
