Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Sanitize function names and trim descriptions for LLM compatibility #41

Closed
wants to merge 5 commits into from

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented Jul 22, 2024

Why:

Fixes subtle naming bugs to meet LLM naming requirements for function naming patterns and description length limitations.
Not all operation names from OpenAPI spec and long descriptions are accepted by LLMs.

What:

  • Implemented function name sanitization to match pattern ^[a-zA-Z0-9_-]+$
  • Trimmed descriptions to max 1024 characters
  • Updated schema conversion and operation finding logic
  • Added and expanded tests

How:

sanitized_name = sanitize_function_name("function-name/with/special_chars")
# Result: "function_name_with_special_chars"

sanitized_desc = sanitize({"description": "Long text..."})
# Result: {"description": "Trimmed to 1024 characters..."}

How did you test it:

  • Unit tests for sanitization functions
  • Updated integration tests for edge cases
  • Verified compatibility with existing schemas

Notes for the reviewer:

  • Regex pattern alignment with LLM conventions
  • Consistency of sanitized names across the codebase

@vblagoje vblagoje requested a review from a team as a code owner July 22, 2024 11:01
@vblagoje vblagoje requested review from davidsbatista and removed request for a team July 22, 2024 11:01
@coveralls
Copy link

coveralls commented Jul 22, 2024

Pull Request Test Coverage Report for Build 10042587214

Details

  • 37 of 37 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 90.914%

Totals Coverage Status
Change from base Build 10038469299: 0.2%
Covered Lines: 1751
Relevant Lines: 1926

💛 - Coveralls

@shadeMe shadeMe requested review from shadeMe and removed request for davidsbatista July 22, 2024 12:58

def sanitize(data: Dict[str, Any]) -> Dict[str, Any]:
"""
Sanitizes the given function calling definition by adjusting its properties to LLM requirements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLM requirements is rather vague. Is there some documentation that defines what that means, i.e., where are these requirements coming from?

Copy link
Member Author

@vblagoje vblagoje Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shadeMe I couldn't find this in documentation but you hit it in the error reports returned by LLM once you use enough openapi specs. For example:

# BadRequestError: Error code: 400 - {'error': {'message': "Invalid 'tools[0].function.name': string does not match pattern. Expected a string that matches the pattern '^[a-zA-Z0-9_-]+$'.", 'type': 'invalid_request_error', 'param': 'tools[0].function.name', 'code': 'invalid_value'}}

Once you apply this pattern to method is starts to work. Same for length of description field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which LLM? Which provider? Are there providers/models that don't accept underscores? If this code doesn't follow an official spec, that we should not make any assumptions and perform this sanitization (normalization, actually) on a per-LLM provider basis at the least.

Also, this kind of impl. detail of LLM-specific requirements shouldn't leak into code that handles OpenAPI - one should perform this normalization just-in-time, i.e, just before passing it to the generator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll close this PR and rethink it now

:param data: The function calling definition(s) to sanitize.
:return: A sanitized function calling definition.
"""
if isinstance(data, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check required?

Copy link
Member Author

@vblagoje vblagoje Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is required for recursive search of the function calling dict structure and sanitize recursive call might not be called on dict. We need recursion to find all those nested description keys and make sure they are "sanitized"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the check should be on line 58.

if isinstance(data, dict):
sanitized_data: Dict[str, Any] = {}
for key, value in data.items():
if key == "name" and isinstance(value, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the type check needed here? What are the expected types of the value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes a parameter in function calling def can have a name name. And this parameter is not a string but another dict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we formally codify this in some manner? I feel like we have far too many Any type hints w.r.t the OpenAPI specific code, and having some kind of schema (or at least better type annotation) would help make this more readable.

Copy link
Member Author

@vblagoje vblagoje Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the sentiment @shadeMe . We have nicely, as much as needed imho, typed OpenAPISpec itself. Here we are dealing with these tools (function) definitions specified by various LLM providers. They look like this:

In other words they are just dictionaries. And although it would be nice to bring some order and typing perhaps it is not worth it given how they change and get expanded with new options. Maybe just providing references to these dictionary structures would help a reader quite a lot.

for key, value in data.items():
if key == "name" and isinstance(value, str):
sanitized_data[key] = sanitize_function_name(value)
elif key == "description" and isinstance(value, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly

return sanitize_function_name(path + "_" + http_method.lower())


def sanitize(data: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with a more explicit name - sanitize_function_calling_definition?

return data


def sanitize_function_name(name: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More explicit naming.

haystack_experimental/components/tools/openapi/types.py Outdated Show resolved Hide resolved
haystack_experimental/components/tools/openapi/types.py Outdated Show resolved Hide resolved
@vblagoje
Copy link
Member Author

I'll rethink how to do this in a better way outside of OpenAPISpec across LLMs so we can reuse it in the future. Closing

@vblagoje vblagoje closed this Jul 22, 2024
@vblagoje vblagoje deleted the sanitize_functions branch July 23, 2024 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants