Add integer and enum types #19

Open · wants to merge 7 commits into base: main
Conversation


@Ryul0rd Ryul0rd commented May 15, 2023

This PR adds two new types and enforces their correct generation:

  • Integers: Treated as JSON numbers, but guaranteed never to contain a "." character, so they can be parsed as ints. (Regular numbers will practically always contain a "." character currently. Is this intended?)
  • Enums: Treated as JSON strings, but constrained to one of the options specified in the schema.

Example schema:

car = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "integer"},
        "color": {
            "type": "enum",
            "values": ["red", "green", "blue", "brown", "white", "black"],
        },
    },
}
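Under this schema, the integer guarantee means decoded values can be dispatched without ambiguity. A minimal sketch of that parsing rule (the helper name parse_number is hypothetical, not part of the PR):

```python
def parse_number(decoded: str):
    # The PR guarantees generated integers never contain ".", while
    # regular numbers practically always do, so dispatch on that.
    return int(decoded) if "." not in decoded else float(decoded)

parse_number("2023")  # -> 2023 (int)
parse_number("3.5")   # -> 3.5 (float)
```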

I also went ahead and fixed the issue described in the TODO in the bool generation method, since it was an easy fix.

BTW, the performance is much better than when I last checked on this library, so great job on that!

if (
    len(decoded) > 1
    and any(c.isdigit() for c in decoded)
    and decoded[-1] in [" ", "\n"]
):
Owner
Need to check all chars to account for the case where the last sampled token was something like 1<space>h.

Author

Shouldn't be necessary since OutputIntegersTokens only allows tokens consisting of digits with optional leading and trailing whitespace.
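The constraint described above can be pictured as a vocabulary filter. A rough sketch, assuming OutputIntegersTokens boils down to a per-token pattern check (the regex and function name here are my assumptions, not the PR's actual code):

```python
import re

# Tokens allowed while generating an integer: digits with optional
# leading and trailing whitespace.
INTEGER_TOKEN = re.compile(r"\s*\d+\s*")

def allowed_integer_token_ids(vocab):
    """Return the token ids whose surface form fits the integer pattern."""
    return {tid for tok, tid in vocab.items() if INTEGER_TOKEN.fullmatch(tok)}

vocab = {"12": 0, " 3 ": 1, "1 h": 2, "abc": 3}
allowed_integer_token_ids(vocab)  # {0, 1} -- "1 h" is rejected
```

Note that a token like "1 h" fails the pattern, which is why checking every character of the decoded string should not be necessary.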

Author

Was just doing some more testing with this and am now running into some bugs when these features are used with llama-7b and dolly-v2-12b.

@Ryul0rd Ryul0rd marked this pull request as draft May 17, 2023 07:40
@Ryul0rd Ryul0rd marked this pull request as ready for review May 18, 2023 09:34
Ryul0rd commented May 18, 2023

Everything seems to be working now. One of the issues turned out to affect both integer generation and number generation, so that bug is fixed as well: the model wasn't being allowed to generate a comma, which is a valid way to terminate a JSON number. The result was that numbers would just run to the maximum allowed length in most cases.
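The fixed termination rule can be sketched as follows. The function name is hypothetical, but the key point is that a trailing comma, like whitespace, validly ends a JSON number:

```python
def number_is_complete(decoded: str) -> bool:
    # A generated number is done once it contains a digit and ends in
    # a terminator; the bug was that "," was missing from the
    # terminator set, so numbers ran to the max allowed length.
    return (
        len(decoded) > 1
        and any(c.isdigit() for c in decoded)
        and decoded[-1] in (" ", "\n", ",")
    )

number_is_complete("2023,")  # True
number_is_complete("2023")   # False -- keep generating
```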

@JamesHill0

Hey, I wanted to ask about the performance issues and see if there is any way I can help. I am running this:

state = {
    "type": "object",
    "properties": {
        "state": {
            "type": "enum",
            "values": ["CA", "WA", "VA", "PA", "NY"],
        },
    },
}

builder = Jsonformer(
    model=model,
    tokenizer=tokenizer,
    json_schema=state,
    prompt="Please generate a JSON for the state PA: ",
    max_string_token_length=20,
)

print("Generating...")
output = builder()

highlight_values(output)

And getting this result:

Generating... { state: "CA" }

This is happening when I run with num_beams=10 in the generate method.

It feels strange that the model struggles on simple cases like this. Is this an issue of model performance, or is the way we are masking tokens too restrictive? Let me know your thoughts. I have been stuck for a couple of days trying to implement an SSN format, and it really starts to struggle on comparable, seemingly simple tasks.

Ryul0rd commented Jun 6, 2023

@JamesHill0 What model are you using? It's hard to say whether that's the issue without knowing the model. All Jsonformer can do is guarantee you get valid output, and you did. num_beams also isn't going to do anything here, because we already treat each enum option like its own beam. You could try other models, or a similar library like guidance or LMQL, and see if either works better.
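To make the num_beams point concrete: enum selection already scores every candidate value as its own sequence and keeps the best, so extra beams add nothing. A toy sketch under that assumption (function name and per-token log-probs are made up for illustration):

```python
def pick_enum_option(option_token_logprobs):
    # Score each enum value by its total sequence log-probability
    # (sum of per-token log-probs) and return the most likely one;
    # this is effectively one "beam" per option.
    return max(option_token_logprobs,
               key=lambda opt: sum(option_token_logprobs[opt]))

scores = {"CA": [-1.0, -1.5], "PA": [-0.5, -0.4], "NY": [-2.0, -0.1]}
pick_enum_option(scores)  # "PA"
```

If the model assigns a higher total log-probability to the wrong option, as in the "CA" result above, that is a model-quality issue rather than a constraint bug.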


wassname commented May 10, 2024

I merged it in this branch, where I also added probabilities: https://github.com/wassname/prob_jsonformer

I also made a list of other libs here: https://github.com/wassname/awesome-interpretability/tree/main?tab=readme-ov-file#structured-output

4 participants