
Initial SDG REST API definitions #80

Draft · wants to merge 12 commits into base: main

Conversation

@gabe-l-hart (Author):

Related PRs

This PR depends on #71 being merged

Description

This PR adds the first platform API definition for Synthetic Data Generation

@mergify mergify bot added the backend InstructLab Backend Services label Jun 6, 2024
@leseb (Contributor) left a comment:

Thanks for drafting more API specs :) I believe the object storage piece should be more agnostic so it can suit any object storage system (GCP/Azure/AWS/IBM cloud). Overall this looks nice, even though, as I commented on another PR, I don't feel this repo is the right place for the API specs. Thanks!

api-definitions/common/file-pointer.yaml (outdated)
Comment on lines +18 to +22
- QUEUED
- RUNNING
- COMPLETED
- CANCELED
- ERRORED
Contributor:

Should we map these to whatever Kubernetes uses for Job status?

Author (@gabe-l-hart):

Hmm, good question. From my read of the Job status API, there's no equivalent enum in k8s since some of these states are represented by the absence of certain fields (e.g. QUEUED == missing status.startTime). I think for a REST API, an enum is a more logical way to represent this, but I think we could tweak the words to be a bit more in line with k8s terminology:

QUEUED -> PENDING
RUNNING -> STARTED
COMPLETED -> SUCCEEDED
CANCELED -> DELETED (I don't like this one because in k8s deletion is an actual -X DELETE)
ERRORED -> FAILED
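
As a rough sketch (purely illustrative; the schema name and placement below are not part of this PR), the renamed enum might look like:

    JobStatus:
      type: string
      description: Lifecycle state of an SDG job, loosely aligned with Kubernetes Job terminology
      enum:
        - PENDING    # was QUEUED
        - STARTED    # was RUNNING
        - SUCCEEDED  # was COMPLETED
        - DELETED    # was CANCELED
        - FAILED     # was ERRORED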

Comment:

Do we feel we need to model anything for when a job goes through a "temporary failure" and, say, goes through a retry? Would we just go from FAILED back to QUEUED again? Or would we consider that "process" another job entirely?

Just thinking through how we would want to model what happens when a job hits a transient failure (say, because part of it ran on bad infrastructure that is then replaced) and a retry of it is scheduled.

Author (@gabe-l-hart):

Good question. I think there are probably a lot of detailed error semantics that could shake out of the different usage patterns, but they would probably loosely fall into the 4XX (user error) vs 5XX (system error) camps. I don't think we want to be too prescriptive about the job framework's error handling in the API (some implementations may retry whereas others may not), but I think it might be reasonable to consider having two errored states, one for user errors and one for system errors. The challenge will then be figuring out how to encode those different error types in the backend library implementing the job body.
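
For example (hypothetical state names, just to illustrate separating user vs. system failures):

    - QUEUED
    - RUNNING
    - COMPLETED
    - CANCELED
    - FAILED_USER    # analogous to a 4XX: bad task config, invalid input data, etc.
    - FAILED_SYSTEM  # analogous to a 5XX: infrastructure or backend failure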

Contributor:

I like the enum as well, along with the remap to the Kube terminology. Thanks!

Comment on lines 7 to 23
HMACCredentials:
  type: object
  properties:
    access_key_id:
      type: string
      description: The public Access Key ID
    secret_key:
      type: string
      description: The private Secret Key

IAMCredentials:
  type: object
  properties:
    # TODO: What else goes here?
    apikey:
      type: string
      description: The IAM apikey
Contributor:

I believe both HMAC and IAM use the same concept of access_key and secret_key. If so, how about we consolidate them into a single schema?

Something like this?

  Credentials:
    type: object
    description: Credentials for accessing the service
    properties:
      access_key:
        type: string
        description: The public access key
      private_key:
        type: string
        description: The private key


Author (@gabe-l-hart):

I know at least for IBM IAM, there are different credentials (apikey + token endpoint + short-lived token). I think the AWS definition of IAM is different though, so maybe IAM is not a unique identifier.

The general question here would be whether we want to force HMAC and use that as the only supported access mechanism or try to keep this oneOf with an extensive list of provider-specific credential blob types.
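
If we keep the latter approach, the connection schema would stay roughly like this (illustrative sketch only; the wrapper name and $ref paths are assumptions):

    Credentials:
      description: Credentials for accessing the object store
      oneOf:
        - $ref: '#/schemas/HMACCredentials'   # access_key_id + secret_key
        - $ref: '#/schemas/IAMCredentials'    # apikey (+ token endpoint, short-lived token, ...)
        # ...additional provider-specific credential blob types as needed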

Contributor:

After reading the proposal a second time, I’m okay with it. I think I missed something during my initial review. 🤔

@hickeyma (Member) left a comment:

@gabe-l-hart As it's a draft, is this ready for review or is there more to be done?

@gabe-l-hart (Author):

Thanks for digging in on this @hickeyma @leseb! I have this in draft until we merge #71 since it will likely need refactoring/moving once that conversation is done. I'll dig into the comments on the content of the specs later today (I agree that the connection stuff probably needs some serious refining).

data_example:
  type: string
  description: Example of an input data file for this task
config_json_schema:

Comment:

By "config" is this where you visualize a flexible "json blob" to where users could request "advanced parameters" when necessary to feed into SDG that maybe aren't default (lower level things like num samples, algorithm utilized in SDG, etc).

I like the idea of it starting out really flexible; I'd guess different implementations, at different points in time, would only allow a given subset of what can be sent at the config level.

Author (@gabe-l-hart):

Yep, that's exactly the idea here. We imagine the set of tasks to be extensible. Initially, this would be "build time" where the owner of the docker image would rebuild with new task implementations, but eventually we'd probably imagine users creating their own tasks by binding proprietary datasets/prompts/etc to existing generation algorithms. Each generation algorithm has a set of lower-level configs that can theoretically be overridden for each job, so the idea with this API is that when creating the job, the config overrides are an opaque blob, but you can query the system to understand the right schema for that blob beforehand. This avoids the need for us to keep a giant oneOf in the API definitions while still giving the user the ability to know the acceptable schemas that will be used for validation.
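
To make that flow concrete (the endpoint paths and field names below are hypothetical illustrations, not part of the spec):

    # 1. Discover the config schema for a task up front:
    #      GET /tasks/{task_name}   ->   response includes config_json_schema
    # 2. Create the job, passing opaque config overrides that the server
    #    validates against that schema:
    #      POST /jobs
    SDGJobCreate:
      type: object
      properties:
        task_name:
          type: string
        config:
          type: object
          description: Opaque config overrides, validated against the task's config_json_schema
          additionalProperties: true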

$ref: '../common/file-pointer.yaml#/schemas/DirectoryPointer'

tasks:
  description: Mapping from task name to task config. The config will be validated against the task's config schema.

Comment:

I see there's the potential for multiple "tasks" within one job: what do you envision that being long term? Would it be something along the lines of having a "generate" task, a "mix" task to do random mixing of that data, and/or a "filter" task to then filter some of the data, all still living within the context of a single SDGJob?

Author (@gabe-l-hart):

Good question. I took this from the CLI args to the library we're thinking of for the generic platform SDG implementation. I think for InstructLab, we'll likely only ever run a single Task (dataset + config + generation algorithm) per job.
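
So for InstructLab, a request body would typically carry a single entry in the tasks map, something like (the task name and config fields here are made up purely for illustration):

    tasks:
      knowledge_generation:   # hypothetical task name
        num_samples: 100      # hypothetical config overrides, validated against
        seed: 42              # this task's config_json_schema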

@gabe-l-hart (Author):

Per revisions to #71, the body of this PR will likely be moved to a new repo. I'll continue to address the comments here until the conversation on #71 is resolved.

@leseb (Contributor) commented Jun 11, 2024:

Putting this on hold given that the definitions will move to a new repo (we all agreed on that approach) once it is set up. Thanks!

@relyt0925:

Thanks @gabe-l-hart for your time!!!! This looks really cool and flexible

@markmc markmc changed the title Initial SDG API definitions Initial SDG REST API definitions Jul 30, 2024
Labels: backend (InstructLab Backend Services), hold