Initial SDG REST API definitions #80
base: main
Conversation
Signed-off-by: Gabe Goodhart <[email protected]>
API and 'extensibilitiy' should be in the dictionary, but yaml -> YAML is reasonable. Signed-off-by: Gabe Goodhart <[email protected]>
Thanks for drafting more API specs :) I believe the object storage piece should be more agnostic so it suits any object storage system (GCP/Azure/AWS/IBM cloud). Overall this looks nice, although, as I commented on another PR, I don't feel this repo is the right place for the API specs. Thanks!
- QUEUED
- RUNNING
- COMPLETED
- CANCELED
- ERRORED
Should we map to whatever Kubernetes uses for Job status?
Hmm, good question. From my read of the Job status API, there's no equivalent enum in k8s, since some of these states are represented by the absence of certain fields (e.g. QUEUED == missing status.startTime). I think for a REST API an enum is a more logical representation, but we could tweak the words to be a bit more in line with k8s terminology:
QUEUED -> PENDING
RUNNING -> STARTED
COMPLETED -> SUCCEEDED
CANCELED -> DELETED (I don't like this one because in k8s deletion is an actual -X DELETE)
ERRORED -> FAILED
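As a sketch, the remapped enum could read like this (names here are suggestions from the mapping above, not settled):

```yaml
# Hypothetical status enum aligned with Kubernetes Job terminology
status:
  type: string
  enum:
    - PENDING    # was QUEUED; in k8s terms, no status.startTime yet
    - STARTED    # was RUNNING
    - SUCCEEDED  # was COMPLETED
    - DELETED    # was CANCELED; awkward, since k8s deletion is an actual -X DELETE
    - FAILED     # was ERRORED
```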
Do we feel we need to model anything for when a job goes through a "temporary failure" and, say, a retry? Or would we just go from FAILED back to QUEUED? Or would we consider that "process" another job entirely?
Just thinking through how we'd like to model what could happen when a job hits a transient failure (say, due to part of it running on bad infrastructure that is then replaced) and a retry is then scheduled.
Good question. I think there are probably a lot of detailed error semantics that could shake out of the different usage patterns, but they would probably loosely fall into the 4XX (user error) vs 5XX (system error) camps. I don't think we want the API to be too prescriptive about the job framework's error handling (some implementations may retry whereas others may not), but it might be reasonable to consider having two errored states for user vs system. The challenge will then be figuring out how to encode those different error types in the backend library implementing the job body.
I like the enum too as well as the remap from the Kube terminology. Thanks!
HMACCredentials:
  type: object
  properties:
    access_key_id:
      type: string
      description: The public Access Key ID
    secret_key:
      type: string
      description: The private Secret Key

IAMCredentials:
  type: object
  properties:
    # TODO: What else goes here?
    apikey:
      type: string
      description: The IAM apikey
I believe both HMAC and IAM use the same concept of access_key and secret_key. If so, how about we consolidate both properties into one?
Something like this?

Credentials:
  type: object
  description: Credentials for accessing the service
  properties:
    access_key:
      type: string
      description: The public access key
    private_key:
      type: string
      description: The private key
I know that, at least for IBM IAM, there are different credentials (apikey + token endpoint + short-lived token). I think the AWS definition of IAM is different, though, so maybe IAM is not a unique identifier.
The general question here is whether we want to force HMAC as the only supported access mechanism, or try to keep this oneOf with an extensive list of provider-specific credential blob types.
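The oneOf shape under discussion might look roughly like this (schema names are illustrative):

```yaml
# Illustrative oneOf keeping provider-specific credential blobs
Credentials:
  oneOf:
    - $ref: '#/components/schemas/HMACCredentials'
    - $ref: '#/components/schemas/IAMCredentials'
    # Further provider-specific types could be appended here
    # if HMAC is not forced as the only mechanism
```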
After reading the proposal a second time, I’m okay with it. I think I missed something during my initial review. 🤔
@gabe-l-hart As it's a draft, is this ready for review or is more to be done?
Thanks for digging in on this @hickeyma @leseb! I have this in draft until we merge #71 since it will likely need refactoring/moving once that conversation is done. I'll dig into the comments on the content of the specs later today (I agree that the connection stuff probably needs some serious refining).
data_example:
  type: string
  description: Example of an input data file for this task
config_json_schema:
By "config", is this where you envision a flexible "json blob" through which users could pass "advanced parameters" into SDG when necessary, beyond the defaults (lower-level things like num samples, the algorithm used in SDG, etc.)?
I like the idea of it starting out really flexible; different implementations, at different moments, would potentially only allow a given subset of what can be sent at the config level.
Yep, that's exactly the idea here. We imagine the set of tasks to be extensible. Initially this would be "build time", where the owner of the docker image would rebuild with new task implementations, but eventually we'd imagine users creating their own tasks by binding proprietary datasets/prompts/etc. to existing generation algorithms. Each generation algorithm has a set of lower-level configs that can theoretically be overridden for each job, so the idea with this API is that when creating the job, the config overrides are an opaque blob, but you can query the system beforehand to learn the right schema for that blob. This avoids the need for us to keep a giant oneOf in the API definitions while still giving the user the ability to know the schemas that will be used for validation.
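Concretely, a task-discovery response could carry the per-task schema so clients can validate their opaque config blob up front (endpoint and field shapes here are hypothetical):

```yaml
# Hypothetical response from a task-discovery endpoint: each task
# advertises the JSON schema its opaque config blob must satisfy
tasks:
  - name: knowledge-generation  # hypothetical task name
    data_example: "question,answer\n..."
    config_json_schema:
      type: object
      properties:
        num_samples:
          type: integer
        # ...further algorithm-specific knobs exposed by the implementation
```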
$ref: '../common/file-pointer.yaml#/schemas/DirectoryPointer'

tasks:
  description: Mapping from task name to task config. The config will be validated against the task's config schema.
I see there's the potential for multiple "tasks" within one job: what do you envision that being long term? Would it be something along the lines of having a "generate" task, a "mix" task to do random mixing of that data, and/or a "filter" task to then filter some of the data, all still living within the context of a single SDGJob?
Good question. I took this from the CLI args to the library we're thinking of for the generic platform SDG implementation. I think for InstructLab, we'll likely only ever run a single Task (dataset + config + generation algorithm) per job.
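For the single-task InstructLab case, the tasks mapping in a job request might then simply be (task name and config keys hypothetical):

```yaml
# Hypothetical job body: one task name mapped to its opaque config,
# validated against that task's config_json_schema
tasks:
  knowledge-generation:
    num_samples: 1000
```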
…repo Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed f2a42ea to e9bd599
Putting this on hold given that the definitions will move to a new repo (we all agreed on that approach) once it is set up. Thanks!
Thanks @gabe-l-hart for your time!!!! This looks really cool and flexible
Related PRs
This PR depends on #71 being merged
Description
This PR adds the first platform API definition for Synthetic Data Generation.