feat: allow malformed json parsing #292
base: main
Conversation
- allows parsing of malformed JSON returned by LLMs
Hey! We really appreciate your PR, and your crate looks very useful for LLM applications! For rig specifically, it's important that we are very careful about which dependencies we use in our projects, as we strive to stay lean (and we're continuing to audit our current set of deps!)
For this PR specifically, I think more flexible JSON parsing for LLMs is a great idea, but it's something we want to implement and handle within the rig repo itself. We're also looking at adding better support for structured outputs, since several providers now have direct ways of guaranteeing proper JSON and even schema'd output!
Thanks for the feedback @0xMochan, I appreciate the 'low dependency' concern. Here are my thoughts:
Here is the scenario that bites often enough, in my experience: I have a pipeline using a large model and everything works fine. The moment I switch to a smaller model, things go haywire.
People can use the default extractor or any other crate. Let me know your thoughts.
This is interesting, I'll bring it up with the team. I don't want the norm to be us usurping crates from the community, but there might be some way we can co-maintain or something. From our POV, it's a big ask for rig users to add a new dep to their tree, and it's our responsibility as maintainers to be diligent about every required dep we add (and optional ones too!)
Correct, and I also agree that smaller models are going to be a big deal in larger multi-agent systems, the type of system where rig could be extremely effective!
There actually was a sizable attempt to revamp extractors to be better integrated into the agent stack. I had something roughly like:

```rust
let agent = client.agent().cast::<Vec<MyType>>().build();
let output: Vec<MyType> = agent.prompt("...");
```

There were some issues we didn't resolve before I started work on the more urgent #199, and now I'd like to come back to that so extractors feel a lot more at "home". There's also the case of structured outputs not being configurable, and for that I'm inclined to look into the #149 Client traits as a way for providers to super-power agents with extra, defined behavior.
Thanks @0xMochan for the detailed explanations. Let me know of any updates in the future. Tag me in the comments :)
addresses #227
This PR aims at more robust parsing of LLM text into JSON, and hence helps in parsing into a struct.
There are several ways the JSON returned by an LLM can be malformed, for example:
- the JSON wrapped in bare code fences (```)
- the JSON wrapped in fences with a `json` language tag (```json)
The current approach using `serde_json::from_str` does not cover these cases. Parsing with `json_partial` handles all of these fixes and yields a struct.
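The fence-stripping step described above can be sketched in plain Rust. This is a minimal illustration, not `json_partial`'s actual API: the helper name `strip_json_fences` is hypothetical, and the real crate additionally repairs truncated or partially malformed JSON, which this sketch does not attempt.

```rust
/// Strip the optional Markdown code fences (``` or ```json) that LLMs
/// often wrap around JSON output, returning the inner payload so it can
/// be handed to `serde_json::from_str`.
/// NOTE: illustrative sketch only; `json_partial` handles more cases.
fn strip_json_fences(raw: &str) -> &str {
    let trimmed = raw.trim();
    // Remove an opening fence, with or without a `json` language tag.
    let without_open = trimmed
        .strip_prefix("```json")
        .or_else(|| trimmed.strip_prefix("```"))
        .unwrap_or(trimmed);
    // Remove a matching closing fence, if present.
    without_open
        .strip_suffix("```")
        .unwrap_or(without_open)
        .trim()
}

fn main() {
    let llm_output = "```json\n{\"name\": \"rig\"}\n```";
    // Fenced output is unwrapped; plain JSON passes through unchanged.
    assert_eq!(strip_json_fences(llm_output), "{\"name\": \"rig\"}");
    assert_eq!(strip_json_fences("{\"name\": \"rig\"}"), "{\"name\": \"rig\"}");
}
```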
P.S. I am the author of `json_partial`.
More tests related to parsing JSON using `json_partial` can be seen at https://github.com/TwistingTwists/aiworker/blob/36eedcc280a624c3f30cfd5fc712a522d8dbb938/examples/structured-jsonish/src/main.rs#L117-L261