Add integer and enum types #19

Open · wants to merge 7 commits into base: main
Conversation


@Ryul0rd Ryul0rd commented May 15, 2023

This PR adds two new types and enforces their correct generation:

  • Integers: Treated as JSON numbers, but guaranteed never to contain a "." character, so they can be parsed as ints. (Regular numbers will practically always contain a "." character currently. Is this intended?)
  • Enums: Treated as JSON strings, but constrained to one of the options specified in the schema.

Example schema:

car = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "integer"},
        "color": {
            "type": "enum",
            "values": ["red", "green", "blue", "brown", "white", "black"],
        },
    },
}
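Under this schema, the integer guarantee means decoded values can be dispatched without ambiguity. A minimal sketch of that parsing rule (the helper name parse_number is hypothetical, not part of the PR):

```python
def parse_number(decoded: str):
    # The PR guarantees generated integers never contain ".", while
    # regular numbers practically always do, so dispatch on that.
    return int(decoded) if "." not in decoded else float(decoded)

parse_number("2023")  # -> 2023 (int)
parse_number("3.5")   # -> 3.5 (float)
```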

I also went ahead and fixed the issue described in the TODO in the bool generation method, since it was an easy fix.

BTW, the performance is much better than when I last checked on this library, so great job on that!

if (
    len(decoded) > 1
    and any(c.isdigit() for c in decoded)
    and decoded[-1] in [" ", "\n"]
):
Owner
Need to check all chars to account for the case where the last sampled token was something like 1<space>h.

Author

Shouldn't be necessary since OutputIntegersTokens only allows tokens consisting of digits with optional leading and trailing whitespace.
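The constraint described above can be pictured as a vocabulary filter. A rough sketch, assuming OutputIntegersTokens boils down to a per-token pattern check (the regex and function name here are my assumptions, not the PR's actual code):

```python
import re

# Tokens allowed while generating an integer: digits with optional
# leading and trailing whitespace.
INTEGER_TOKEN = re.compile(r"\s*\d+\s*")

def allowed_integer_token_ids(vocab):
    """Return the token ids whose surface form fits the integer pattern."""
    return {tid for tok, tid in vocab.items() if INTEGER_TOKEN.fullmatch(tok)}

vocab = {"12": 0, " 3 ": 1, "1 h": 2, "abc": 3}
allowed_integer_token_ids(vocab)  # {0, 1} -- "1 h" is rejected
```

Note that a token like "1 h" fails the pattern, which is why checking every character of the decoded string should not be necessary.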

Author

Was just doing some more testing with this and am now running into some bugs when these features are used with llama-7b and dolly-v2-12b.

@Ryul0rd Ryul0rd marked this pull request as draft May 17, 2023 07:40
@Ryul0rd Ryul0rd marked this pull request as ready for review May 18, 2023 09:34
Ryul0rd commented May 18, 2023

Everything seems to be working now. One of the issues turned out to affect both integer generation and number generation, so that bug is fixed as well: the model wasn't being allowed to generate a comma, which is a valid way to terminate a JSON number. The result was that numbers would just run to the maximum allowed length in most cases.
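The fixed termination rule can be sketched as follows. The function name is hypothetical, but the key point is that a trailing comma, like whitespace, validly ends a JSON number:

```python
def number_is_complete(decoded: str) -> bool:
    # A generated number is done once it contains a digit and ends in
    # a terminator; the bug was that "," was missing from the
    # terminator set, so numbers ran to the max allowed length.
    return (
        len(decoded) > 1
        and any(c.isdigit() for c in decoded)
        and decoded[-1] in (" ", "\n", ",")
    )

number_is_complete("2023,")  # True
number_is_complete("2023")   # False -- keep generating
```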

@JamesHill0

Hey, I wanted to ask about the performance issues and see if there is any way I can help. I am running this:

state = {
    "type": "object",
    "properties": {
        "state": {
            "type": "enum",
            "values": ["CA", "WA", "VA", "PA", "NY"],
        },
    },
}

builder = Jsonformer(
    model=model,
    tokenizer=tokenizer,
    json_schema=state,
    prompt="Please generate a JSON for the state PA: ",
    max_string_token_length=20,
)

print("Generating...")
output = builder()

highlight_values(output)

And getting this result:

Generating... { state: "CA" }

This is happening when I run with num_beams=10 in the generate method.

It feels strange that the model struggles on simple cases like this. Is this an issue of model performance, or is the way we are masking tokens too restrictive? Let me know your thoughts. I have been stuck for a couple of days trying to implement an SSN format, and it really starts to struggle on comparable, seemingly simple tasks.

Ryul0rd commented Jun 6, 2023

@JamesHill0 What model are you using? It's hard to say whether that's the issue without knowing the model. All Jsonformer can do is guarantee you get valid output, and you did. num_beams also isn't going to do anything here, because we already treat each enum option like its own beam. You could try other models, or a similar library like guidance or LMQL, and see if either works better.
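To make the num_beams point concrete: enum selection already scores every candidate value as its own sequence and keeps the best, so extra beams add nothing. A toy sketch under that assumption (function name and per-token log-probs are made up for illustration):

```python
def pick_enum_option(option_token_logprobs):
    # Score each enum value by its total sequence log-probability
    # (sum of per-token log-probs) and return the most likely one;
    # this is effectively one "beam" per option.
    return max(option_token_logprobs,
               key=lambda opt: sum(option_token_logprobs[opt]))

scores = {"CA": [-1.0, -1.5], "PA": [-0.5, -0.4], "NY": [-2.0, -0.1]}
pick_enum_option(scores)  # "PA"
```

If the model assigns a higher total log-probability to the wrong option, as in the "CA" result above, that is a model-quality issue rather than a constraint bug.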


wassname commented May 10, 2024

I merged it in this branch, where I also added probabilities: https://github.com/wassname/prob_jsonformer

I also made a list of other libs here: https://github.com/wassname/awesome-interpretability/tree/main?tab=readme-ov-file#structured-output

4 participants