
Initial SDG REST API definitions #80

Draft · wants to merge 12 commits into base: main

Conversation

@gabe-l-hart (Author):

Related PRs

This PR depends on #71 being merged

Description

This PR adds the first platform API definition for Synthetic Data Generation

@mergify mergify bot added the backend InstructLab Backend Services label Jun 6, 2024
@leseb (Contributor) left a comment:

Thanks for drafting more API specs :) I believe the object storage piece should be more agnostic so it can suit any object storage system (GCP/Azure/AWS/IBM cloud). Overall this looks nice, even though, as I commented on another PR, I don't feel this repo is the right place for the API specs. Thanks!

api-definitions/common/file-pointer.yaml (outdated)
Comment on lines +18 to +22
- QUEUED
- RUNNING
- COMPLETED
- CANCELED
- ERRORED
Contributor:

Should we map these to whatever Kubernetes uses for Job status?

Author (@gabe-l-hart):

Hmm, good question. From my read of the Job status API, there's no equivalent enum in k8s since some of these states are represented by the absence of certain fields (e.g. QUEUED == missing status.startTime). I think for a REST API, an enum is a more logical way to represent this, but I think we could tweak the words to be a bit more in line with k8s terminology:

QUEUED -> PENDING
RUNNING -> STARTED
COMPLETED -> SUCCEEDED
CANCELED -> DELETED (I don't like this one because in k8s deletion is an actual -X DELETE)
ERRORED -> FAILED
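
As a rough sketch (purely illustrative; the schema name and placement below are not part of this PR), the renamed enum might look like:

    JobStatus:
      type: string
      description: Lifecycle state of an SDG job, loosely aligned with Kubernetes Job terminology
      enum:
        - PENDING    # was QUEUED
        - STARTED    # was RUNNING
        - SUCCEEDED  # was COMPLETED
        - DELETED    # was CANCELED
        - FAILED     # was ERRORED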

Comment:

Do we feel we need to model anything for when a job goes through a "temporary failure" and, say, goes through a retry? Would we just go from FAILED back to QUEUED again? Or would we consider that "process" another job entirely?

Just thinking through how we would want to model what happens when a job hits a transient failure (say, because part of it ran on bad infrastructure that is then replaced) and a retry of it is scheduled.

Author (@gabe-l-hart):

Good question. I think there are probably a lot of detailed error semantics that could shake out of the different usage patterns, but they would probably loosely fall into the 4XX (user error) vs 5XX (system error) camps. I don't think we want to be too prescriptive about the job framework's error handling in the API (some implementations may retry whereas others may not), but I think it might be reasonable to consider having two errored states, one for user errors and one for system errors. The challenge will then be figuring out how to encode those different error types in the backend library implementing the job body.
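
For example (hypothetical state names, just to illustrate separating user vs. system failures):

    - QUEUED
    - RUNNING
    - COMPLETED
    - CANCELED
    - FAILED_USER    # analogous to a 4XX: bad task config, invalid input data, etc.
    - FAILED_SYSTEM  # analogous to a 5XX: infrastructure or backend failure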

Contributor:

I like the enum as well, along with the remap to the Kube terminology. Thanks!

Comment on lines 7 to 23
HMACCredentials:
  type: object
  properties:
    access_key_id:
      type: string
      description: The public Access Key ID
    secret_key:
      type: string
      description: The private Secret Key

IAMCredentials:
  type: object
  properties:
    # TODO: What else goes here?
    apikey:
      type: string
      description: The IAM apikey
Contributor:

I believe both HMAC and IAM use the same concept of access_key and secret_key. If so, how about we consolidate them into a single schema?

Something like this?

  Credentials:
    type: object
    description: Credentials for accessing the service
    properties:
      access_key:
        type: string
        description: The public access key
      private_key:
        type: string
        description: The private key


Author (@gabe-l-hart):

I know at least for IBM IAM, there are different credentials (apikey + token endpoint + short-lived token). I think the AWS definition of IAM is different though, so maybe IAM is not a unique identifier.

The general question here would be whether we want to force HMAC and use that as the only supported access mechanism or try to keep this oneOf with an extensive list of provider-specific credential blob types.
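
If we keep the latter approach, the connection schema would stay roughly like this (illustrative sketch only; the wrapper name and $ref paths are assumptions):

    Credentials:
      description: Credentials for accessing the object store
      oneOf:
        - $ref: '#/schemas/HMACCredentials'   # access_key_id + secret_key
        - $ref: '#/schemas/IAMCredentials'    # apikey (+ token endpoint, short-lived token, ...)
        # ...additional provider-specific credential blob types as needed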

Contributor:

After reading the proposal a second time, I’m okay with it. I think I missed something during my initial review. 🤔

@hickeyma (Member) left a comment:

@gabe-l-hart As it's a draft, is this ready for review or is there more to be done?

@gabe-l-hart (Author):

Thanks for digging in on this @hickeyma @leseb! I have this in draft until we merge #71 since it will likely need refactoring/moving once that conversation is done. I'll dig into the comments on the content of the specs later today (I agree that the connection stuff probably needs some serious refining).

data_example:
  type: string
  description: Example of an input data file for this task
config_json_schema:

Comment:

By "config" is this where you visualize a flexible "json blob" to where users could request "advanced parameters" when necessary to feed into SDG that maybe aren't default (lower level things like num samples, algorithm utilized in SDG, etc).

I like the idea of it starting out really flexible; I'd guess different implementations, at different points in time, would only allow a given subset of what can be sent at the config level.

Author (@gabe-l-hart):

Yep, that's exactly the idea here. We imagine the set of tasks to be extensible. Initially, this would be "build time" where the owner of the docker image would rebuild with new task implementations, but eventually we'd probably imagine users creating their own tasks by binding proprietary datasets/prompts/etc to existing generation algorithms. Each generation algorithm has a set of lower-level configs that can theoretically be overridden for each job, so the idea with this API is that when creating the job, the config overrides are an opaque blob, but you can query the system to understand the right schema for that blob beforehand. This avoids the need for us to keep a giant oneOf in the API definitions while still giving the user the ability to know the acceptable schemas that will be used for validation.
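
To make that flow concrete (the endpoint paths and field names below are hypothetical illustrations, not part of the spec):

    # 1. Discover the config schema for a task up front:
    #      GET /tasks/{task_name}   ->   response includes config_json_schema
    # 2. Create the job, passing opaque config overrides that the server
    #    validates against that schema:
    #      POST /jobs
    SDGJobCreate:
      type: object
      properties:
        task_name:
          type: string
        config:
          type: object
          description: Opaque config overrides, validated against the task's config_json_schema
          additionalProperties: true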

$ref: '../common/file-pointer.yaml#/schemas/DirectoryPointer'

tasks:
  description: Mapping from task name to task config. The config will be validated against the task's config schema.

Comment:

I see there's the potential for multiple "tasks" within one job: what do you envision that being long term? Would it be something along the lines of having a "generate" task, a "mix" task to do random mixing of that data, and/or a "filter" task to then filter some of the data, all still living within the context of a single SDGJob?

Author (@gabe-l-hart):

Good question. I took this from the CLI args to the library we're thinking of for the generic platform SDG implementation. I think for InstructLab, we'll likely only ever run a single Task (dataset + config + generation algorithm) per job.
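
So for InstructLab, a request body would typically carry a single entry in the tasks map, something like (the task name and config fields here are made up purely for illustration):

    tasks:
      knowledge_generation:   # hypothetical task name
        num_samples: 100      # hypothetical config overrides, validated against
        seed: 42              # this task's config_json_schema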

@gabe-l-hart (Author):

Per revisions to #71, the body of this PR will likely be moved to a new repo. I'll continue to address the comments here until the conversation on #71 is resolved.

@leseb (Contributor) commented Jun 11, 2024:

Putting this on hold given that the definitions will move to a new repo (we all agreed on that approach) once it is set up. Thanks!

@relyt0925:

Thanks @gabe-l-hart for your time!!!! This looks really cool and flexible

@markmc markmc changed the title Initial SDG API definitions Initial SDG REST API definitions Jul 30, 2024
Labels: backend (InstructLab Backend Services), hold