-
One more potential ease-of-use addition would be some sort of Maestro CLI command that uses an existing pgen to populate this block in a spec, something like:

```shell
$ maestro attach mypgen.py spec.yaml -o new_spec.yaml
```

And perhaps a component of making this work would be requiring (or encouraging, if introspection might work too) numpydoc/googledoc docstrings on the pgen function, from which the description and pargs could be parsed, ensuring the spec stays in sync with the pgen function. So, taking one of the parg examples from the docs and updating it:

```python
from maestrowf.datastructures.core import ParameterGenerator
import itertools as iter


def get_custom_generator(env, **kwargs):
    '''
    Simple filtering parameter generator example

    Parameters
    ----------
    env: dict
        Environment block from the yaml specification
    SIZE_MIN: int, default=10
        Set minimum size of generated parameter set
    SIZE_STEP: int, default=10
        Increment between sizes in generated parameter set
    NUM_SIZES: int, default=3
        Number of parameter sets to generate for sampling the SIZE parameter

    Returns
    -------
    ParameterGenerator containing selected parameter sets
    '''
    p_gen = ParameterGenerator()

    # Unpack any pargs passed in
    size_min = int(kwargs.get('SIZE_MIN', '10'))
    size_step = int(kwargs.get('SIZE_STEP', '10'))
    num_sizes = int(kwargs.get('NUM_SIZES', '3'))

    sizes = range(size_min, size_min + num_sizes*size_step, size_step)
    iterations = (10, 20, 30)

    size_values = []
    iteration_values = []
    trial_values = []

    for trial, param_combo in enumerate(iter.product(sizes, iterations)):
        size_values.append(param_combo[0])
        iteration_values.append(param_combo[1])
        trial_values.append(trial)

    params = {
        "TRIAL": {
            "values": trial_values,
            "label": "TRIAL.%%"
        },
        "SIZE": {
            "values": size_values,
            "label": "SIZE.%%"
        },
        "ITER": {
            "values": iteration_values,
            "label": "ITER.%%"
        },
    }

    for key, value in params.items():
        p_gen.add_parameter(key, value["values"], value["label"])

    return p_gen
```

That could then be used to auto-generate the following section in the yaml spec with the initial `attach` invocation:

```yaml
parameter.generator:
  default_pgen: mypgen  # must be one of the names below, which schema can check for
  generators:
    - name: mypgen
      description: Simple filtering parameter generator example
      path: mypgen.py
      pargs:
        - SIZE_MIN: 10
        - SIZE_STEP: 10
        - NUM_SIZES: 3
```

This still leaves the question of whether we could automate documenting constants in some manner this way (the `iterations` parameter in the above pgen)... Though maybe that would be best left to the docstring of the pgen?
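To make the docstring-driven idea concrete, here is a rough, stdlib-only sketch of how an `attach` command might pull the description and parg defaults out of a numpydoc-style docstring. The function name `parse_pgen_docstring` and the `NAME: type, default=X` matching convention are assumptions for illustration, not an existing Maestro API:

```python
import inspect
import re


def parse_pgen_docstring(func):
    """Hypothetical helper: extract a description and parg defaults
    from a numpydoc-style docstring on a pgen function.

    The summary line becomes the generator description; any
    'NAME: type, default=X' entries in the Parameters section
    become pargs.
    """
    doc = inspect.getdoc(func) or ''
    lines = doc.splitlines()
    description = lines[0].strip() if lines else ''

    pargs = []
    for line in lines:
        # Match entries such as 'SIZE_MIN: int, default=10'
        match = re.match(r'^(\w+):\s*\w+,\s*default=(\S+)$', line.strip())
        if match:
            pargs.append({match.group(1): match.group(2)})
    return {'description': description, 'pargs': pargs}
```

The returned dict maps directly onto the `description` and `pargs` keys in the proposed yaml block; entries without a `default=` (like `env`) are skipped, which conveniently filters out non-parg arguments.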
-
Further thoughts/discussion on this regarding provenance. Upon executing a study, serializing this block into the yaml spec in that specific study's workspace would be a good way to document what was executed. It removes any confusion about the parameter combos in the meta files not matching what's in the global.parameters block. Additional concerns I'm documenting here for the eventual implementation so I don't forget: serializing/copying the pgens themselves. For the copy in the executed study's workspace, I think it makes more sense to keep only the generator that was executed in the generators list, still have the default_pgen key calling it out, and update the pargs (if any) to reflect the actual invocation. One caveat: if users add path dependencies in the env block to verify the pgens are there (useful when relying on default_pgen and skipping the manual command-line invocation), and all generators are in those dependencies, then copying them all into the workspace and leaving them all in that record spec might make more sense.
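The trim-to-executed-generator idea above could be sketched roughly like this. The helper name `record_executed_pgen` and its signature are hypothetical; it just shows the bookkeeping: keep only the executed generator, overwrite its pargs with the actual invocation, and copy the pgen script into the workspace:

```python
import shutil
from pathlib import Path


def record_executed_pgen(spec, pgen_name, actual_pargs, workspace):
    """Hypothetical sketch: build a provenance copy of the spec for a
    study workspace, keeping only the generator that was executed and
    updating its pargs to reflect the actual invocation."""
    generators = spec['parameter.generator']['generators']
    executed = next(g for g in generators if g['name'] == pgen_name)
    # Replace the spec-default pargs with what was actually passed
    executed = dict(executed, pargs=actual_pargs)

    record = dict(spec)
    record['parameter.generator'] = {
        'default_pgen': pgen_name,
        'generators': [executed],
    }
    # Copy the pgen script itself alongside the record spec
    shutil.copy(executed['path'], Path(workspace) / Path(executed['path']).name)
    return record
```

Serializing `record` back to yaml in the workspace (omitted here) would then give the record spec discussed above.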
-
Just starting a discussion on some potential design changes to the spec regarding the parameter generators. The problem this aims to solve is the lack of embedded documentation in a spec that's designed to be run with pgen functions. So, maybe we can add a new block to the spec that functions as a default invocation of the pgen, which would also document it, while still allowing CLI invocations to override it. So, something like this:
One thought here: what about specs that could have multiple pgens instead of one giant one that can do many things (i.e., different parameter sampling strategies)? Perhaps enable making the generator entries a list, with another key under parameter.generator specifying the default_pgen:
Additionally, we could treat these similarly to the dependency block, with a check that they exist at the specified path, blocking study launch if not. Or maybe some more flexibility, like keying on whether default_pgen is empty: if it is, fall back to global.parameters or something?
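The launch-time check described above could look something like the following sketch. The function name `validate_parameter_generators` and the fallback sentinel are assumptions for illustration only:

```python
from pathlib import Path


def validate_parameter_generators(spec):
    """Hypothetical pre-launch check: ensure default_pgen names one of
    the listed generators and each generator's script exists on disk;
    an empty/missing default_pgen falls back to global.parameters."""
    block = spec.get('parameter.generator', {})
    default = block.get('default_pgen')
    if not default:
        # No default pgen configured: use the global.parameters block
        return 'use-global-parameters'

    names = [g['name'] for g in block.get('generators', [])]
    if default not in names:
        raise ValueError(
            f"default_pgen '{default}' is not one of the listed generators: {names}")

    # Dependency-style path check: block launch if any script is missing
    missing = [g['path'] for g in block['generators'] if not Path(g['path']).exists()]
    if missing:
        raise FileNotFoundError(f"pgen scripts not found: {missing}")
    return default
```

This keeps the schema check (default_pgen must name a listed generator) and the dependency-style path check in one place, and makes the global.parameters fallback explicit.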