Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Establish clear flow schema #506

Open
vyasr opened this issue Apr 26, 2021 · 5 comments
Open

Establish clear flow schema #506

vyasr opened this issue Apr 26, 2021 · 5 comments
Labels

Comments

@vyasr
Copy link
Contributor

vyasr commented Apr 26, 2021

This issue is part of glotzerlab/signac#527 but requires a more focused thread.

signac-flow currently stores a number of pieces of internal data, such as bundles, submission status, and configuration information for FlowProjects. The current data organization is ad hoc and relies on signac's config, and we want to decouple flow from these signac internals. #504 is the first step, establishing an explicit schema versioning for flow's internal data and enabling migration. The next step is defining a clear schema for signac-flow.

Concrete proposals:

  • Running flow init in a directory should run signac init if necessary (already happens), then create an additional directory .flow and a file .flow/config for flow config information.
  • Bundles should be stored in a file .flow/${FLOWPROJECT_NAME}_bundles.
  • Status should be stored in a file .flow/${FLOWPROJECT_NAME}_status. It should not be stored in the project document, which is user-facing.
  • Submission ids should use FlowProject subclass names rather than signac project names. This change is also related to Removing signac project id signac#541.

Possible proposals:

  • Config information should be stored per-FlowProject, not per signac project. Many config values conceptually belong to a FlowProject more than a signac project.
  • Config information should be stored directly in the FlowProject class definition rather than in a config files.
    • Pros:
      • Removes the need to maintain an external config file
      • Clearly and visibly associates config data with a FlowProject
    • Cons:
      • Would require storing possibly "private" data like account names on supercomputers into the "public" project.py files (note that we already do this in signac.rc, but maybe we can find a way to avoid that). One possibility is that some information ends up belonging to the FlowProject class, while other information (like accounts) goes in a config file.
      • This may make it harder to copy over from one machine to another. The problem could be ameliorated by formalizing an association between accounts and environments so that users have a way to specify the account on one supercomputer vs another (which we could internally resolve using our hostname pattern logic), but it will invariably be more verbose.
  • Bundle and status files can use JSON to get the migration going, but we should consider using alternative formats (e.g. a compressed binary file) to reduce I/O and speed up flow's internals. One option would be to formalize something like the gzip format used for the signac project cache into a new synced collections backend.

Open questions:

  • Status/bundle information needs to be associated with a FlowProject to disambiguate the case of multiple FlowProjects in a signac project, but my proposed naming solution above opens us up to the possibility of users renaming their FlowProject class and orphaning the files. We could simply add a dialog whenever we fail to load prior information and ask the user if there is a previous FlowProject name that might be associated with the data, but I don't much like that and would be very open to better ideas.
  • We need to carefully consider whether there are any gotchas associated with putting a project.py file (containing a FlowProject) somewhere other than within a signac project. To what extent should we support this? I think everything should work fine if we simply require users in that case to call FlowProjectSubclass.init_project("/path/to/root") in order for that to work, but we need to make sure that's the case. Decoupling flow's schema from signac's should help somewhat since it'll remove some of the fragile monkey-patching of the config that we do.
@bdice
Copy link
Member

bdice commented May 10, 2021

@vyasr Generally looks good to me, and I think it's close enough that we can begin work on this set of proposals.


Comments on Concrete Proposals

My understanding is that names like .flow/config and .flow/${FLOWPROJECT_NAME}_bundles are relative to the signac.rc file, not project.py. That means that this directory/file structure (which I like) is a valid manifestation of that proposal, and should address the second "open question":

/data/my_project_data
├── .flow
│   ├── Project_bundles
│   ├── Project_status
│   └── config
├── signac.rc
└── workspace
/home/user/my_project_code
└── src
     └── project.py
# project.py
from flow import FlowProject


class Project(FlowProject):
    pass

# operations...

if __name__ == '__main__':
    Project.get_project("/data/my_project_data").main()

Comments on Possible Proposals

Config information should be stored per-FlowProject, not per signac project. Many config values conceptually belong to a FlowProject more than a signac project.

The current config values are:

[flow]
import_packaged_environments = boolean()
status_performance_warn_threshold = float(default=0.2)
show_traceback = boolean()
eligible_jobs_max_lines = int(default=10)
status_parallelization = string(default='none')

and account names, which are specified in a sub-namespace like

[flow]
[[SummitEnvironment]]
account = ABC123

I would be comfortable with all of the first set of values being per-project or global, and to the second point, I would also be perfectly fine with making these values FlowProject constructor arguments. I don't think it matters. I do think that it is important that users can specify account names globally, ideally via an equivalent to $HOME/.signacrc that can apply to all signac usage on that cluster. I would not support a solution that requires account names to be specified in the code (or anything that is expected to be committed). This might require a $HOME/.flowconfig or $HOME/.flowaccounts or something of the sort, if $HOME/.signacrc is not a good place.


This may make it harder to copy over from one machine to another. The problem could be ameliorated by formalizing an association between accounts and environments so that users have a way to specify the account on one supercomputer vs another (which we could internally resolve using our hostname pattern logic), but it will invariably be more verbose.

I don't understand this suggestion enough to evaluate whether it's a good or bad idea. Can you be more concrete about what you're thinking?


Lastly, I support json/gzip for status and bundling. I would even suggest that we make that a SyncedCollection backend that specializes in large documents. 😉

Comments on Open Questions

The concern about orphaning bundles/status when renaming a FlowProject class is a good question. I'd be willing to just accept the cost of orphaning data and warn users not to do that in the documentation.


Hope that hit all the highlights.

@bdice bdice added the proposal label May 10, 2021
@vyasr
Copy link
Contributor Author

vyasr commented May 10, 2021

Thanks for the feedback! Responses inline:

My understanding is that names like .flow/config and .flow/${FLOWPROJECT_NAME}_bundles are relative to the signac.rc file, not project.py. That means that this directory/file structure (which I like) is a valid manifestation of that proposal, and should address the second "open question":

Yes, paths are relative to the signac project root, not project.py. My personal preference and what I've seen done more frequently by other users is to keep project.py and signac.rc in the same place and symlink the workspace directory alone, but as long as your folder layout doesn't cause any problems I see no reason not to support it. I say we move forward assuming that this is fine and reassess if and when we run into any hurdles.

I would be comfortable with all of the first set of values being per-project or global, and to the second point, I would also be perfectly fine with making these values FlowProject constructor arguments. I don't think it matters. I do think that it is important that users can specify account names globally, ideally via an equivalent to $HOME/.signacrc that can apply to all signac usage on that cluster. I would not support a solution that requires account names to be specified in the code (or anything that is expected to be committed). This might require a $HOME/.flowconfig or $HOME/.flowaccounts or something of the sort, if $HOME/.signacrc is not a good place.

The question of the proper owner of config info (system-wide, signac project, or FlowProject) is closely intertwined with the question about public vs. private information, and also begs the question of whether we need to support "stacking" configs (e.g. ./.flowrc overrides variables in $HOME/.flowrc). On second pass, I think we may as well future-proof ourselves against flow's configuration becoming more complicated, so even though I'd like to use the class variables approach because it's much simpler in most cases, I don't want to mix class variables with config files so I think we need to stick with the more general solution.

That also means that the public/private question is partly moot. However, accounts are a bit different from our other config variables: they're really submission directives masquerading as config data so that we can hide them in .signacrc (and share them across projects). Should we instead be formalizing the concept of a "saved directive", or is this premature?

I don't understand this suggestion enough to evaluate whether it's a good or bad idea. Can you be more concrete about what you're thinking?

It's probably no longer relevant if we're sticking to config files, but for completeness' sake I was basically envisioning something like

class Project(FlowProject):
    # Use one of a special set of known class variables to denote config. In this case, ACCOUNTS,
    # which is a mapping from environments to accounts.
    ACCOUNTS = {
        SummitEnvironment: "OLCF_GRANT_ACCOUNT",
        CometEnvironment: "XSEDE_GRANT_ACCOUNT",
    }

Lastly, I support json/gzip for status and bundling. I would even suggest that we make that a SyncedCollection backend that specializes in large documents. 😉

I think you skipped the second sentence in my proposal 😉:

Bundle and status files can use JSON to get the migration going, but we should consider using alternative formats (e.g. a compressed binary file) to reduce I/O and speed up flow's internals. One option would be to formalize something like the gzip format used for the signac project cache into a new synced collections backend.

The concern about orphaning bundles/status when renaming a FlowProject class is a good question. I'd be willing to just accept the cost of orphaning data and warn users not to do that in the documentation.

👍

@bdice
Copy link
Member

bdice commented May 10, 2021

However, accounts are a bit different from our other config variables: they're really submission directives masquerading as config data so that we can hide them in .signacrc (and share them across projects). Should we instead be formalizing the concept of a "saved directive", or is this premature?

All directives are applied to operations/groups, while accounts are not. While both directives and accounts end up generating lines in the scheduler's script preamble, I don't think we should adhere to the same usage/storage paradigm because of the distinct types of functionality.

@csadorf
Copy link
Contributor

csadorf commented May 11, 2021

I actually have a slightly different view on how we would want to handle flow-specific configuration.

  1. The "configuration" is a global data structure that depends on files that are present within the user's file system environment and the current working directory.
  2. In what files the configuration is concretely stored does not really matter. For me all the .signacrc/signac.rc files etc. are aggregated into one global "signac configuration" which can hold data for all signac-related config needs. It is not specific to signac-core. Because of this, I don't really see a need to place the flow configuration in a separate file when we can store it in the [flow] section just as well. In other words, storing it in a different file does not actually address the underlying issue: the need to associate specific configuration with specific flow projects.
  3. For this we need to directly associate it with the project class defined/ instance instantiated in the project.py module and thus I don't really see an alternative to storing the configuration directly or indirectly within that module. Here are a few suggestions how one could approach that:
from pathlib import Path

# config associated with class
class Project(FlowProject):
    # config = """my hardcoded config"
    # config = dict(my_hard="coded", config="as dict")
    # config = flow.read_config("project.cfg")  # config read from file (alternative 1)
    config = Path("project.cfg")  # config read from file (alternative 2)

# config associated with instance:
if __name__ == '__main__':
    Project.get_project().main(config=...)  # config argument like above

I am not sure whether it would work exactly like this since config is already a signac.Project property, but this is just demonstrative.

The bigger question for me is whether we consider the flow-configuration to be directly associated to the class definition or a class instance. I would assume it's the latter, but I'm interested in your thoughts.

As a final note: Let's make sure that we enable and encourage to store "staticy" information in static files. Trying to parse static information from code is basically impossible and might cause huge headaches down the line. It is really important that those two things are separated, a good example is the evolution from setup.py -> pyproject.toml + setup.cfg. In fact, this example is also an argument for not necessarily distinguishing configuration files by filename. Python packages are becoming cleaner and easier to maintain now with more and more tools reading from the same configuration file: setup.cfg.

@vyasr
Copy link
Contributor Author

vyasr commented May 12, 2021

I agree with most of that. I think my question about moving some of the config information into the class definition was made in the hope of simplification without sufficient consideration of the implications.

I think the configuration should be associated with the instance, not the definition. The class definition in my view should be restricted to defining the workflow, and any customization beyond that belongs to the instance. I don't think this is very different from how things work now. Technically your approach specifies a file rather than the project root directory, but ultimately I think the two are interchangeable since the location of configuration files relative to the project root must be clearly specified by the schema anyway.

The one thing I'm not quite aligned with is the the idea of using the same files. I think that's a reasonable goal to move towards in the long term, but so long as 1) flow remains a separate package from signac with a different release cycle, and 2) flow remains in a less mature state than signac with respect to its internals, I think coupling the two will present problems to the evolution of either. I would like to think that the changes we make here will obviate that, but I think we should be realistic and expect some tweaking to be necessary. In the event that the tweaking comes in the form of renaming files (e.g. if the move from signac.rc to .signac/config doesn't happen immediately but ends up happening in a future migration) or changing formats (if we try TOML and for some odd reason find it to be unworkable) we don't want changes in one package to hinder progress on the other. As such, I think we need to decouple the storage of their configurations, at least in the short to intermediate term. We can certainly revisit that once we're more convinced that our schema is stable, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants