Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to using composite config dataset in requested jobs #394

Merged
merged 25 commits into from
Aug 7, 2023

Conversation

robertbartel
Copy link
Contributor

Updates to Nextgen request object types, dataservice package classes, and other workflow components to support use of composite config DataFormat datasets for containing the configuration data supplied to a job, instead of always requiring individual datasets for each variant of config data needed for a job.

Closes #335.

@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Jul 17, 2023
Updating DataDeriveUtil.derive_datasets with conditional logic to branch
execution and derive composite configuration datasets when appropriate
(and fix the return for this function); also, adding stub function for
actually implementing the deriving of aforementioned composite
configuration datasets.
Adding missing attributes for an optional PartialRealizationConfig (when
supplied by the user) and t-route config dataset id (when requested job
should include routing).
Adding FromFilesInitialDataAdder, FromRawInitialDataAdder,
FromPartialRealizationConfigAdder, and CompositeConfigDataAdder classes
within dmod.dataservice package.
Add implementation of _derive_composite_job_config() using initial data
adder class, and update _derive_realization_config_from_formulations
function to also use an initial data adder class.
Removing _build_forcing_config_for_realization and
_build_ngen_realization_config_from_request functions for util class, as
this functionality is now encapsulated by initial data adder subtypes.
Fixing dependency definition for ngen-config to use git URL (since not
published in pypi), and adding similar definition for ngen-cal.
Updating dependency definitions to OWP packages not published to pypi,
which therefore use git urls, to change usage of NOAA-OWP to noaa-owp in
order to be consistent with usage of such urls within dependencies.
Adding new standard index value COMPOSITE_SOURCE_ID and updating the
NGEN_JOB_COMPOSITE_CONFIG DataFormat value to include the former as
one of the latter's indices.
Updating type with attribute for composite config dataset id, making
formerly non-Optional data_id values for other config datasets Optional,
and updating docstring and attribute descriptions.
Adding method to encapsulate logic for gathering data_id values for any
source dataset for a new, to-be-created composite configuration dataset,
while also encapsulating the logic for keeping 'None' values out for any
applicable data_id attributes that were not set.
Removing realization and BMI init config data requirements and replacing
with on for composite config data; also including t_route_config_data_id
property when such a data_id is provided for part of the source data for
a new composite config dataset.
Add support AbstractNgenRequest for getting a forcing dataset data_id
from request_body and using that, if provided, as part of the built
forcing_data_requirement property.
Fix implementation of formulation_configs with correct nested property
name, and updating type hint and docstring for partition config data
requirement property to properly indication it is an Optional value.
Updating NgenCalRequestEstimationConfig to align with super type to
implement support for use of composite configuration datasets.
Add new unit test class for this type and some initial tests.
Add new unit test class for this type and some initial tests.
Updating examples for new composite_config_data_id attribute of
NGENRequest.
Updating to latest versions of dmod.core and dmod.communication.
@robertbartel robertbartel force-pushed the f/composite_config_dataset/main branch from 0f664d5 to 92a127a Compare July 19, 2023 18:43
Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really solid! Great work, @robertbartel! I left a few comments that I would like to work through. However, they may or may not require any changes.

model_strategy: str = Field(description="The particular ngen calibration strategy to use.")
model_params: Dict[str, List[Dict[str, Any]]]
ngen_cal_config_data_id: str
model_strategy: str = Field(None, description="The particular ngen calibration strategy to use.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should model_strategy exist on this class if this class is only for estimation?

  strategy: 
      # Type of strategey, currently supported is estimation
      type: estimation

https://github.com/NOAA-OWP/ngen-cal/blob/8d46c053b4a6d553d0b6e2c4ac539b5c092a2a63/python/example_config.yaml#L12

disregard. I mistook this field for another field named strategy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few fields on the class that I think can be improved. I think that is outside of the scope of this PR though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #402 to track improving this class.

time_range: TimeRange = Field(description="The time range over which to run ngen simulation(s).")
hydrofabric_uid: str = Field(description="The (DMOD-generated) unique id of the hydrofabric to use.")
hydrofabric_data_id: str = Field(description="The dataset id of the hydrofabric dataset to use.")
composite_config_data_id: str = Field(None, description="Id of required ngen composite config dataset.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be valid to provide a non-None composite_config_data_id, realization_config_data_id, bmi_config_data_id, and t_route_config_data_id or some combination where a composite_config_data_id is non-None and one or more of the other three fields are non-None?

From a model exec standpoint, does that make sense? For example, i'm envisioning that you could have an existing composite_config_data_id that contains a realization config and a t-route config and a t_route_config_data_id is also included in this example request. I would assume that the file path links in the realization config and other necessary files sourced from the composite config dataset would be updated somewhere in the model exec pipeline so that the t_route_config_data_id dataset is correctly used during the simulation? If that is the case, that seems like a pretty advanced use case at least at this stage of the project. For simplicity, would you consider having two NGENRequestBody variants, one that requires a composite_config_data_id and another that requires a realization_config_data_id, bmi_config_data_id, and t_route_config_data_id? We may need to support merging composite config datasets with other config datasets (and you might / likely already do have the code for this :)) at some point, but unless it is a strict need at this point, it seems like a heavy lift for a speculative feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario I have in mind is when the composite config dataset does not already exist. If a request has non-None data_id values for the composite dataset and one or more specialized source datasets, this implies the composite needs to be created from the data in the source(s) and giving the provided data_id.

There could be a piece I am missing in the workflow for this all to come together, but that was what I was going for here.

FWIW, I have a feeling we are going to need to overhaul the job and request types/workflow to build in more distinction. The combinatorics of how data versus data_ids are represented and impact execution paths has become too convoluted. But that can be it's own issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario I have in mind is when the composite config dataset does not already exist. If a request has non-None data_id values for the composite dataset and one or more specialized source datasets, this implies the composite needs to be created from the data in the source(s) and giving the provided data_id.

What are your thoughts on simplifying this and requiring that the dataset already exist? So, it would be prerequisite for some agent to create a composite dataset and send the id of that dataset along with their model exec request. It seems that we could the data service logic you've added in this PR to create / derive composite datasets, but this would just be a clear separation of concerns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this might be better suited to a larger refactoring effort like you alluded to in your comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on simplifying this and requiring that the dataset already exist?

Yeah, the short version is that it may not actually simplifying things.

The request types are designed to not assume the dataset exists. They can potentially include explicit, user-supplied configuration data in places like the PartialRealizationConfig and/or several of the attributes in NgenCalRequestEstimationConfig. I suppose the same idea applies with implicit config data and source dataset ids.

The job workflow is also built to not assume datasets exist. It includes a step expressly for generating any new, required datasets (for which such generation is supported). And, at the earlier step of determining whether it has all the data needed for the job, it will (when needed) test if sufficient data is available for generating a dataset for an otherwise unfulfilled requirement.

We could separate the behavior for generating datasets from the job workflow and the job request types. Perhaps we should do this, especially if we eventually want robust CLI and API capabilities. But that would touch a lot of places. And further, if the expectation is to provide a GUI form where the user seamlessly both explicitly configures and starts a job, that would shift responsibility to the GUI for handling a good bit of behind the scenes work to create and coordinate dataset creation requests with job requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing context and background, @robertbartel! I think I went a little too far in my last comments. We have fairly good separation of concerns behind this layer that I was neglecting to consider. I dont think that we want to separate the behavior like I mentioned before. I do however think we should consider how to reduce the combinatorics of potential field combinations in this request model. So for example, splitting NGENRequestBody into two models one that supports only a single required composite dataset id and another that supports providing optional routing, realization, and bmi config dataset ids. I think this will make it more clear from the caller side what combinations of fields are allowed. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will make it more clear from the caller side what combinations of fields are allowed.

Does it? The spinoff type for an existing composite config would slightly simplify the easiest scenario - leave out everything that is optional - by instead excluding things by design. The spinoff type that handled creating a new composite dataset would basically need to be the same as the current class, unless we further separated that spinoff into different types for providing the actual config data versus providing source data_id references.

And, we'd have to do that all this again for the calibration request body type.

If the problem is ambiguity on the calling side, we could more easily add factory class methods to make the initialization paths more clear. We could also add some additional properties that indicated in a more summary fashion the different types of state represented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it? The spinoff type for an existing composite config would slightly simplify the easiest scenario - leave out everything that is optional - by instead excluding things by design. The spinoff type that handled creating a new composite dataset would basically need to be the same as the current class, unless we further separated that spinoff into different types for providing the actual config data versus providing source data_id references.

Maybe it does not simplify things, but it does make the possible request variants more explicit through using types instead of logic. I think this adds a level of clarity and maintainability that is desirable. However, as you mentioned previously, we probably need to rethink the model exec request structures in the future anyways, so it might not be worth the effort.

My main concern is that we end up with a catchall ngen request type that requires a lot of logic spread across the codebase to enforce what combination of fields are valid and we can handle. This same logic would bleed into client code and need to be updated accordingly. At face value, it seems that it would easier, although more verbose, to treat each combination of fields as its own type. This should reduce the amount of imperative code required to enforce invariants, simplify introduction of new request type, and make it more explicit what combination of fields are accepted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is that we end up with a catchall ngen request type that requires a lot of logic spread across the codebase to enforce what combination of fields are valid and we can handle.

Yeah, I agree. My main concern is that separating the types would only shuffle this the logic to different places, and not reduce it into the fundamentals of the design. I could be wrong it this, but regardless, I think that ends up being part of an otherwise-alluded-to larger overhaul.

@property
def composite_config_source_ids(self) -> List[str]:
"""
A list of the data ids for any datasets that are sources of data for a generated composite config dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little confusing to me. Im likely just taking it too literally. I read this to mean that I would get back the dataset ids from a given, existing, composite config dataset. However the implementation suggests to me that this is just a list of extra dataset ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifically to get dataset ids for datasets that are sources when we are creating a new composite config dataset, if there are any. Those will be put into the data requirement describing what kind of composite dataset that the job needs, which will eventually lead to certain things being done in the deriver util.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the part where im likely being overly literal is if a composite config dataset id is provided in the request body and no other datasets are included, this will return the empty list. But it would be confusing, at least to me, that the datasources from the provided composite dataset are not returned. I think I am just being overly pedantic about this method's name. Would you consider something instead? The only thing I can come up with at the moment is non_composite_config_ids, but I also dont really like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any way around this. In general, we can't know the details of an existing composite config dataset, beyond its data_id, at the time a request object is created.

It may also help to think of this not as describing the dataset, but describing a data requirement for the job being requested that some dataset must fulfill. If the requirement for config data doesn't include restrictions on source datasets, then that's perfectly valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any way around this. In general, we can't know the details of an existing composite config dataset, beyond its data_id, at the time a request object is created.

Yeah, I agree that this we need this, I am just not so fond of the name. However, if it is this difficult to come up with a different name, the current is probably our best option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would source_dataset_ids be a better name? Or perhaps build_from_datasets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current name is appropriate if it is not allowed to provide a realization_config_data_id, bmi_config_data_id, t_route_config_data_id, and an existing composite_config_data_id. So, should we enforce the invariant that you cannot create a NGENRequestBody instance using if composite_config_data_id and a realization_config_data_id, bmi_config_data_id, or t_route_config_data_id is provided?

Copy link
Contributor Author

@robertbartel robertbartel Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to add a validator for this invariant, but after thinking more, I'm not sure it applies.

The purpose of these types (one at least) is to generally describe the data and datasets that a job needs. The request and request body objects can't really know whether a dataset exists with some data_id; they only describe what's necessary. It isn't strictly invalid to describe the necessary dataset as being a composite of certain others and having a specific name.

Perhaps a user wants or needs to differentiate what is required for a job by specifying a particular unique composite config dataset name, to force something new to be created (e.g., so that freshly generated BMI init configs are used).

We can still add a validator for this; I'm just not certain it's appropriate. And given there is need for an overhaul of some of this anyway, it may be just as wise to wait until then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still add a validator for this; I'm just not certain it's appropriate. And given there is need for an overhaul of some of this anyway, it may be just as wise to wait until then.

That's fair. We can address this at a later time.

params['time'] = Time(start_time=self._job_request.time_range.begin, end_time=self._job_request.time_range.end)

if self.partial_realization_config.routing_config is not None:
params['routing'] = self.partial_realization_config.routing_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to possible adjust Routings config and path parameters here for the schedulers sake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I follow your question correctly, we may need to make some adjustment of this kind, and not just for the routing config. There are several places where we may need a ngen-config object that expects a config file before we are able to have that config file in place (from the perspective of the object).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasnt specifically thinking of the chicken and egg problem, although my question is related. Sorry for not communicating more effectively, let me try and re-ask. Do we need to modify the path the routing configuration file here, if it is provided, similar to what is done in _build_forcing_config_for_realization?

        forcing_cfg_params['path'] = '/dmod/datasets/'
        if self.partial_realization_config is not None and self.partial_realization_config.is_env_workaround:
            forcing_cfg_params['path'] += 'from_env'

So for example, check that it is a relative path and make any corrections to that path if it is not relative to the realization config dataset path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was approaching the same fundamental problem from the other side of it.

Yes, we probably do need to normalize path values for any created config dataset so that the paths will be valid inside the eventual worker container. Which then leads to the type of problem I alluded to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something that you'd like to address in this PR? Or just TODO and issue for later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handled that separately. I've opened #407 for that.

We may also need a separate issue (perhaps on the ngen-cal/ngen-config side of things) for dealing with validations issues for some of the config objects we are making use of in situations when paths aren't there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also need a separate issue (perhaps on the ngen-cal/ngen-config side of things) for dealing with validations issues for some of the config objects we are making use of in situations when paths aren't there.

Great point. This is something Ive been thinking a lot about. There are several field types, file and directory paths are maybe the best example, where it is desirable in some cases to validate and deserialize in a stricter mode. So, in strict mode, you would want to ensure that the paths exist for example. I think ive come up with a good solution to do this, I just need to propose it in the ngen.config package.

In the short term, we could always add a utility function that walks a pydantic model's fields and if a field is a Path type, then verify that it exists. Something like:

from pydantic import BaseModel
from pathlib import Path
from typing import Type, List

def verify_path_types_exists(m: BaseModel) -> List[Path]:
  horizon: List[BaseModel] = [m]
  errors = []

  while horizon:
    model = horizon.pop()
    for field_name in model.__fields__:
      field_value = getattr(model, field_name)
      if isinstance(field_value, Path):
        if not field_value.exists():
          errors.append(field_value)
      elif isinstance(field_value, BaseModel):
        horizon.append(field_value)

  return errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about your solution. I was under the impression there wasn't a way around this without changing the type for DirectoryPath and FilePath attributes, potentially in some kind of dynamic way (i.e., a "strict" mode).

I don't follow how the short-term solution will prevent the validation problems (unless we are again talking about changing the attribute types).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about your solution. I was under the impression there wasn't a way around this without changing the type for DirectoryPath and FilePath attributes, potentially in some kind of dynamic way (i.e., a "strict" mode).

I guess you could do it using some really complex try catch stuff and .construct(). I accomplished it by doing something closer to what you are suggesting.

I don't follow how the short-term solution will prevent the validation problems (unless we are again talking about changing the attribute types).

I think file path and directory path field types in ngen.config are not currently strictly for this reason. So, it will in all cases make sure to coerce the input to a Path, but I don't think it fails if the file or directory doesnt exist.

Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a little more discussion, but we are getting really close to finishing this one. Thanks for the dialog, @robertbartel!

params['time'] = Time(start_time=self._job_request.time_range.begin, end_time=self._job_request.time_range.end)

if self.partial_realization_config.routing_config is not None:
params['routing'] = self.partial_realization_config.routing_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also need a separate issue (perhaps on the ngen-cal/ngen-config side of things) for dealing with validations issues for some of the config objects we are making use of in situations when paths aren't there.

Great point. This is something Ive been thinking a lot about. There are several field types, file and directory paths are maybe the best example, where it is desirable in some cases to validate and deserialize in a stricter mode. So, in strict mode, you would want to ensure that the paths exist for example. I think ive come up with a good solution to do this, I just need to propose it in the ngen.config package.

In the short term, we could always add a utility function that walks a pydantic model's fields and if a field is a Path type, then verify that it exists. Something like:

from pydantic import BaseModel
from pathlib import Path
from typing import Type, List

def verify_path_types_exists(m: BaseModel) -> List[Path]:
  horizon: List[BaseModel] = [m]
  errors = []

  while horizon:
    model = horizon.pop()
    for field_name in model.__fields__:
      field_value = getattr(model, field_name)
      if isinstance(field_value, Path):
        if not field_value.exists():
          errors.append(field_value)
      elif isinstance(field_value, BaseModel):
        horizon.append(field_value)

  return errors

@property
def composite_config_source_ids(self) -> List[str]:
"""
A list of the data ids for any datasets that are sources of data for a generated composite config dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current name is appropriate if it is not allowed to provide a realization_config_data_id, bmi_config_data_id, t_route_config_data_id, and an existing composite_config_data_id. So, should we enforce the invariant that you cannot create a NGENRequestBody instance using if composite_config_data_id and a realization_config_data_id, bmi_config_data_id, or t_route_config_data_id is provided?

time_range: TimeRange = Field(description="The time range over which to run ngen simulation(s).")
hydrofabric_uid: str = Field(description="The (DMOD-generated) unique id of the hydrofabric to use.")
hydrofabric_data_id: str = Field(description="The dataset id of the hydrofabric dataset to use.")
composite_config_data_id: str = Field(None, description="Id of required ngen composite config dataset.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it? The spinoff type for an existing composite config would slightly simplify the easiest scenario - leave out everything that is optional - by instead excluding things by design. The spinoff type that handled creating a new composite dataset would basically need to be the same as the current class, unless we further separated that spinoff into different types for providing the actual config data versus providing source data_id references.

Maybe it does not simplify things, but it does make the possible request variants more explicit through using types instead of logic. I think this adds a level of clarity and maintainability that is desirable. However, as you mentioned previously, we probably need to rethink the model exec request structures in the future anyways, so it might not be worth the effort.

My main concern is that we end up with a catchall ngen request type that requires a lot of logic spread across the codebase to enforce what combination of fields are valid and we can handle. This same logic would bleed into client code and need to be updated accordingly. At face value, it seems that it would easier, although more verbose, to treat each combination of fields as its own type. This should reduce the amount of imperative code required to enforce invariants, simplify introduction of new request type, and make it more explicit what combination of fields are accepted.

Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is worth me continuing to hold this up for issues we will sort out in the future. Thanks for you patience, @robertbartel!

@aaraney aaraney merged commit 8774bce into NOAA-OWP:master Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request maas MaaS Workstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add composite DataFormat for combined configs needed for ngen jobs
2 participants