added llm classifier #1613
Conversation
Walkthrough

The changes in this pull request involve multiple updates across several files in the Kairon project. Key modifications include the introduction of the `LLMClassifier` for intent and entity classification.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
kairon/nlu/classifiers/llm.py (4)
18-18: Remove unused import `batch`

The import of `batch` from `tensorflow.python.ops.gen_batch_ops` is not used in the code. Apply this diff to remove the unused import:

```diff
-from tensorflow.python.ops.gen_batch_ops import batch
```

🧰 Tools
🪛 Ruff (0.8.0)
18-18: `tensorflow.python.ops.gen_batch_ops.batch` imported but unused. Remove unused import: `tensorflow.python.ops.gen_batch_ops.batch` (F401)
102-102: Remove unnecessary `f` prefix in string

The f-string at line 102 has no placeholders, so the `f` prefix is unnecessary. Apply this diff to remove the unnecessary `f` prefix:

```diff
-            f"either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
+            "either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
```

🧰 Tools
🪛 Ruff (0.8.0)
102-102: f-string without any placeholders. Remove extraneous `f` prefix (F541)
171-171: Simplify key existence checks in dictionaries

Instead of using `"key" in dict.keys()`, use `"key" in dict` when checking for the existence of keys in a dictionary. Apply this diff to simplify the key checks:

```diff
-        intent = responses["intent"] if "intent" in responses.keys() else None
+        intent = responses["intent"] if "intent" in responses else None
 ...
-            responses["explanation"]
-            if "explanation" in responses.keys()
+            responses["explanation"]
+            if "explanation" in responses
```

Also applies to: 174-174

🧰 Tools
🪛 Ruff (0.8.0)
171-171: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
178-178: Include exception traceback in logs to aid debugging

Including the exception traceback in the logs will help with debugging. Apply this diff to include the traceback in the log:

```diff
-            logger.error(e)
+            logger.error(e, exc_info=True)
```

kairon/shared/nlu/featurizer/openai.py (2)
Line range hint 44-49: Consider using `LLM_API_KEY` environment variable for consistency

Currently, the code uses `OPENAI_API_KEY` as the environment variable for the API key. For consistency with `LLMClassifier` and to simplify configuration, consider changing it to `LLM_API_KEY`. Apply this diff to update the environment variable:

```diff
         elif os.environ.get("OPENAI_API_KEY"):
-            self.api_key = os.environ.get("OPENAI_API_KEY")
+            self.api_key = os.environ.get("LLM_API_KEY")
         else:
             raise KeyError(
-                f"either set bot_id'in OpenAIFeaturizer config or set OPENAI_API_KEY in environment variables"
+                "either set bot_id'in OpenAIFeaturizer config or set LLM_API_KEY in environment variables"
             )
```
202-212: Consider processing messages in batches for improved performance

In the `process` method, messages are processed individually in a loop. To improve performance, especially for larger batches, consider processing messages in batches by modifying the method to handle multiple messages at once (a sketch follows below).
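A minimal sketch of what that could look like, offered as an illustration only: it assumes the classifier has `get_embeddings` and `predict` helpers with the shapes shown here (neither signature is confirmed by this PR), and is written as a method body:

```python
from typing import List

from more_itertools import chunked  # added to requirements in this PR
from rasa.shared.nlu.training_data.message import Message


def process(self, messages: List[Message]) -> List[Message]:
    """Sketch: classify messages in batches of 32 instead of issuing one
    embedding/API round-trip per message. get_embeddings and predict are
    assumed helpers, not verified against this PR's code."""
    for batch in chunked(messages, 32):
        texts = [message.get("text") for message in batch]
        # Assumption: one embedding call can return vectors for the whole batch.
        vectors = self.get_embeddings(texts)
        for message, vector in zip(batch, vectors):
            intent, entities = self.predict(vector)
            message.set("intent", intent, add_to_output=True)
            message.set("entities", entities, add_to_output=True)
    return messages
```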
kairon/nlu/__init__.py (1)

1-1: Add `LLMClassifier` to `__all__` to make it an explicit export

The `LLMClassifier` is imported but not declared as a public API component. To clarify the intended public API and avoid linter warnings, consider adding `LLMClassifier` to the `__all__` list. Apply this diff to add `__all__` and include `LLMClassifier`:

```diff
+__all__ = ["LLMClassifier"]
 from .classifiers.llm import LLMClassifier
```

🧰 Tools
🪛 Ruff (0.8.0)
1-1: `.classifiers.llm.LLMClassifier` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- kairon/nlu/__init__.py (1 hunks)
- kairon/nlu/classifiers/llm.py (8 hunks)
- kairon/shared/actions/utils.py (0 hunks)
- kairon/shared/nlu/featurizer/openai.py (2 hunks)
- requirements/prod.txt (1 hunks)
💤 Files with no reviewable changes (1)
- kairon/shared/actions/utils.py
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/nlu/classifiers/llm.py
- 18-18: `tensorflow.python.ops.gen_batch_ops.batch` imported but unused. Remove unused import: `tensorflow.python.ops.gen_batch_ops.batch` (F401)
- 102-102: f-string without any placeholders. Remove extraneous `f` prefix (F541)
- 171-171: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
- 174-174: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)

kairon/nlu/__init__.py
- 1-1: `.classifiers.llm.LLMClassifier` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
🔇 Additional comments (1)
requirements/prod.txt (1)
70-71: Dependencies updated appropriately

The additions of `huggingface-hub==0.25.2` and `more-itertools` are appropriate and necessary based on the code changes.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
kairon/nlu/classifiers/llm.py (5)
24-24: Remove duplicate import

The `os` module is already imported on line 15.

```diff
-import os
```

🧰 Tools
🪛 Ruff (0.8.0)
24-24: Redefinition of unused `os` from line 15. Remove definition: `os` (F811)
34-43: Add docstrings to Pydantic models

The `Entities` and `ClassifierOutput` models would benefit from docstrings explaining their purpose and field descriptions.

```diff
 class Entities(BaseModel):
+    """Represents an entity extracted from text.
+
+    Attributes:
+        value: The actual text value of the entity
+        start: Starting position of the entity in text
+        end: Ending position of the entity in text
+        entity: Type of the entity
+    """
     value: str
     start: int
     end: int
     entity: str

 class ClassifierOutput(BaseModel):
+    """Represents the output from the LLM classifier.
+
+    Attributes:
+        intent: The classified intent
+        entities: List of extracted entities, if any
+    """
     intent: str
     entities: List[Entities] = None
```
192-192: Review hardcoded confidence value

Setting a hardcoded confidence value of 1 may be misleading, as it suggests absolute certainty in the classification. Consider:

- Getting a confidence score from the LLM response
- Using a configurable default value
- Adding a comment explaining why a hardcoded value is used

A sketch of the configurable-default option follows this list.
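For illustration, the configurable-default option might look like the sketch below. `fallback_confidence` is a hypothetical config key, and `component_config` is the conventional Rasa component attribute; neither is confirmed by this PR:

```python
# Sketch: prefer a score returned by the LLM when present, otherwise fall
# back to a configurable default instead of hardcoding confidence to 1.
confidence = float(self.component_config.get("fallback_confidence", 1.0))
if isinstance(responses, dict) and "confidence" in responses:
    confidence = float(responses["confidence"])
intent_payload = {"name": intent, "confidence": confidence}
```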
103-103: Remove unnecessary f-string

The string doesn't contain any placeholders, so the `f` prefix is unnecessary.

```diff
-            f"either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
+            "either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
```

🧰 Tools
🪛 Ruff (0.8.0)
103-103: f-string without any placeholders. Remove extraneous `f` prefix (F541)
48-49: Enhance class documentation

The class documentation could be more comprehensive. Consider adding:

- Detailed description of the classification approach
- Documentation for configuration options
- Usage examples

```diff
 class LLMClassifier(IntentClassifier, GraphComponent, ABC):
-    """Intent and Entity classifiers using the OpenAI Completion framework"""
+    """Intent and Entity classifier using Large Language Models.
+
+    This classifier uses LLMs through the litellm library to perform intent classification
+    and entity extraction. It maintains a vector store of training examples using FAISS
+    for efficient similarity search.
+
+    Configuration Options:
+        bot_id: Optional bot identifier for API key lookup
+        prediction_model: LLM model to use for predictions (default: gpt-4o-mini)
+        embedding_model: Model for text embeddings (default: text-embedding-3-small)
+        embedding_size: Size of embeddings (default: 1536)
+        top_k: Number of similar examples to use (default: 5)
+        temperature: LLM temperature parameter (default: 0.0)
+        max_tokens: Maximum tokens in response (default: 50)
+        retry: Number of retry attempts (default: 3)
+    """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- kairon/nlu/classifiers/llm.py (8 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/nlu/classifiers/llm.py
- 24-24: Redefinition of unused `os` from line 15. Remove definition: `os` (F811)
- 103-103: f-string without any placeholders. Remove extraneous `f` prefix (F541)
- 172-172: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
- 175-175: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
🔇 Additional comments (2)
kairon/nlu/classifiers/llm.py (2)
143-143: Ensure proper type conversion when concatenating entity data

`self.data[i][ENTITIES]` may be a list or `None`, and concatenating it directly to a string may cause a `TypeError`. Convert entities to a JSON string before concatenating (note that this fix assumes `json` is imported in the module):

```diff
-            context += "text: "+self.data[i]["intent"]+"\nclassifier: {'intent': "+self.data[i][INTENT]+", 'entities': "+self.data[i][ENTITIES]+"}"
+            context += "text: "+self.data[i]["intent"]+"\nclassifier: {'intent': "+self.data[i][INTENT]+", 'entities': "+json.dumps(self.data[i][ENTITIES])+"}"
```
26-27: Review global litellm configuration

Setting global configurations (`drop_params` and `LITELLM_LOG`) might affect other parts of the application that use litellm. Consider moving these configurations to a more isolated scope or documenting the impact; a sketch of such scoping appears after the verification scripts below.

✅ Verification successful

Global litellm configuration is safe to use

The global configuration settings are only used in `kairon/nlu/classifiers/llm.py`, and while other files import litellm, they don't interact with these specific configurations. The `drop_params` setting and log level are reasonable defaults that won't negatively impact other parts of the application.

- `litellm.drop_params = True` is a safe optimization to reduce memory usage
- `LITELLM_LOG = "ERROR"` is a reasonable log level that only suppresses non-critical messages
🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Check for other usages of litellm in the codebase
rg "import litellm" -l
```

Length of output: 302

```shell
#!/bin/bash
# Check for actual usage of litellm.drop_params and LITELLM_LOG in the codebase
rg "litellm\.drop_params" -A 2
rg "LITELLM_LOG" -A 2
```

Length of output: 372

```shell
#!/bin/bash
# Check the context of litellm usage in the file to understand the scope
rg -B 5 -A 10 "litellm\.drop_params" kairon/nlu/classifiers/llm.py
```

Length of output: 382
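For reference, if the isolated-scope alternative were ever preferred, it could look like the following sketch: a small context manager (a hypothetical helper, not a litellm API) that flips `litellm.drop_params` only around the classifier's own calls and restores the previous value afterwards:

```python
import contextlib

import litellm


@contextlib.contextmanager
def scoped_drop_params():
    """Sketch: enable litellm.drop_params only for the duration of a call,
    then restore whatever value was set before entering."""
    previous = getattr(litellm, "drop_params", False)
    litellm.drop_params = True
    try:
        yield
    finally:
        litellm.drop_params = previous


# Hypothetical usage inside the classifier, replacing the import-time flag:
# with scoped_drop_params():
#     response = litellm.completion(model=self.prediction_model, messages=prompt)
```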
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
kairon/nlu/classifiers/llm.py (2)
90-90: Remove unnecessary f-string prefix in error message

The f-string prefix `f` is unnecessary here because the string does not contain any variables to interpolate. Removing the `f` prefix simplifies the code. Apply this diff:

```diff
-                f"either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
+                "either set bot_id'in LLMClassifier config or set LLM_API_KEY in environment variables"
```

🧰 Tools
🪛 Ruff (0.8.0)
90-90: f-string without any placeholders. Remove extraneous `f` prefix (F541)
196-198: Simplify dictionary key checks by removing `.keys()`

When checking if a key exists in a dictionary, use `key in dict` instead of `key in dict.keys()`. This is more efficient and idiomatic Python. Apply this diff to simplify the key checks:

```diff
-        intent = responses["intent"] if "intent" in responses.keys() else "nlu_fallback"
-        explanation = responses["explanation"] if "explanation" in responses.keys() else None
-        entities = responses["entities"] if "entities" in responses.keys() else []
+        intent = responses["intent"] if "intent" in responses else "nlu_fallback"
+        explanation = responses["explanation"] if "explanation" in responses else None
+        entities = responses["entities"] if "entities" in responses else []
```

🧰 Tools
🪛 Ruff (0.8.0)
196-196: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
197-197: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
198-198: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- kairon/nlu/classifiers/llm.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/nlu/classifiers/llm.py
- 90-90: f-string without any placeholders. Remove extraneous `f` prefix (F541)
- 196-196: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
- 197-197: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
- 198-198: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
```python
for i in indx[0]:
    if self.data[i].get(INTENT):
        intents.add(self.data[i][INTENT])
    entities = set([entity[ENTITY_ATTRIBUTE_TYPE] for entity in entities])
```
Potential logical error when building the `entities` set

At this line, `entities` is being reassigned using a list comprehension over itself, which was initialized as an empty set. This means `entities` will remain empty. To extract entity types correctly, you should iterate over the entities from the data point. Consider correcting the logic as follows:

```diff
-    entities = set([entity[ENTITY_ATTRIBUTE_TYPE] for entity in entities])
+    entities.update(set([entity[ENTITY_ATTRIBUTE_TYPE] for entity in self.data[i][ENTITIES] or []]))
```

Would you like assistance in updating this logic?
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
    entities.update(set([entity[ENTITY_ATTRIBUTE_TYPE] for entity in self.data[i][ENTITIES] or []]))
```
```python
with tqdm(len(training_data.intent_examples)) as pbar:
    counter = 1
    for chunks in chunked(training_data.intent_examples, batch_size):
        data = [{"text": example.get(TEXT).strip(), INTENT: example.get(INTENT).strip(), ENTITIES: example.get(ENTITIES)} for example in chunks if example.get(INTENT) and example.get(INTENT)]
```
Correct redundant condition in list comprehension filter

The condition `if example.get(INTENT) and example.get(INTENT)` is redundant and likely a typo. The second `example.get(INTENT)` should probably be `example.get(TEXT)` to ensure both the intent and text are present. Apply this diff to correct the condition:

```diff
-        data = [{"text": example.get(TEXT).strip(), INTENT: example.get(INTENT).strip(), ENTITIES: example.get(ENTITIES)} for example in chunks if example.get(INTENT) and example.get(INTENT)]
+        data = [{"text": example.get(TEXT).strip(), INTENT: example.get(INTENT).strip(), ENTITIES: example.get(ENTITIES)} for example in chunks if example.get(INTENT) and example.get(TEXT)]
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        data = [{"text": example.get(TEXT).strip(), INTENT: example.get(INTENT).strip(), ENTITIES: example.get(ENTITIES)} for example in chunks if example.get(INTENT) and example.get(TEXT)]
```
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
tests/testing_data/system.yaml (1)
236-236: Consider documenting the component changes

The replacement of traditional NLP components with LLMClassifier is a significant change. Consider adding documentation about:

- The rationale behind this architectural change
- Performance implications
- Required dependencies

Also applies to: 240-240
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- kairon/shared/nlu/featurizer/openai.py (0 hunks)
- system.yaml (1 hunks)
- tests/testing_data/system.yaml (1 hunks)
💤 Files with no reviewable changes (1)
- kairon/shared/nlu/featurizer/openai.py
🔇 Additional comments (3)
tests/testing_data/system.yaml (1)
236-236: Verify the LLMClassifier integration with existing components

The addition of `kairon.nlu.LLMClassifier` represents a significant architectural change. Ensure that it properly integrates with the existing `LanguageModelFeaturizer` component.

✅ Verification successful

Integration between LLMClassifier and LanguageModelFeaturizer is properly designed

Based on the code analysis, the LLMClassifier and LanguageModelFeaturizer are independent components that work in a pipeline architecture, where:

- LanguageModelFeaturizer is responsible for generating dense features from text using transformer models
- LLMClassifier is a separate classifier that uses LLM APIs (via litellm) for intent classification and entity extraction
- Both components are properly registered in the pipeline configuration (system.yaml) and follow Rasa's component interface
- They operate independently without direct dependencies, which is the recommended approach for Rasa pipeline components

The architectural design follows best practices by:

- Maintaining separation of concerns between feature extraction and classification
- Following Rasa's component interface and registration patterns
- Using standard input/output message formats for pipeline compatibility
🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Check for any potential conflicts or dependencies between components
ast-grep --pattern 'class LLMClassifier {
  $$$
  def process($$$) {
    $$$
  }
}'

# Check if the component is properly registered
rg -A 5 "register\(.*LLMClassifier.*\)"
```

Length of output: 127

```shell
#!/bin/bash
# Search for LLMClassifier implementation
rg -l "class LLMClassifier"

# Search for any imports of LLMClassifier
rg "from.*import.*LLMClassifier|import.*LLMClassifier"

# Look for any usage or integration with LanguageModelFeaturizer
rg -A 5 "LanguageModelFeaturizer"

# Check the system.yaml file content
cat tests/testing_data/system.yaml
```

Length of output: 19420

```shell
#!/bin/bash
# Check LLMClassifier implementation details
cat kairon/nlu/classifiers/llm.py

# Check for any integration points between LLMClassifier and LanguageModelFeaturizer
rg -A 5 "from kairon.shared.nlu.featurizer.lm_featurizer import LanguageModelFeaturizer" kairon/nlu/classifiers/llm.py
rg -A 5 "LanguageModelFeaturizer" kairon/nlu/classifiers/llm.py
```

Length of output: 10920
system.yaml (2)

240-240: Review security implications of LLM integration

The integration of LLM components requires careful security consideration:

- Ensure proper rate limiting is in place
- Verify token/key management
- Check for potential data privacy concerns

Also applies to: 236-236

240-240: Verify the LLMClassifier integration in production environment

The addition of `kairon.nlu.LLMClassifier` in production configuration should be thoroughly tested:

- Ensure backward compatibility
- Verify performance metrics
- Check resource requirements
@@ -236,11 +236,8 @@ properties:
🛠️ Refactor suggestion: Implement comprehensive testing strategy

Given the significant architectural change with LLMClassifier:

- Add integration tests for the new component (a starting-point sketch follows this list)
- Update existing test cases
- Document testing requirements for both development and production environments

Also applies to: 240-240
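As one concrete starting point, a test along the lines below would pin down the response contract using the `ClassifierOutput` model shown earlier in this review. Whether the model is importable from this exact path, and the sample payload itself, are assumptions rather than code from the PR:

```python
import json

from kairon.nlu.classifiers.llm import ClassifierOutput  # assumed import path


def test_classifier_output_contract():
    """Sketch: an LLM reply matching the documented schema should parse
    cleanly into the Pydantic models added by this PR."""
    raw = (
        '{"intent": "greet", "entities": '
        '[{"value": "john", "start": 6, "end": 10, "entity": "name"}]}'
    )
    parsed = ClassifierOutput(**json.loads(raw))
    assert parsed.intent == "greet"
    assert parsed.entities[0].entity == "name"
```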
approved
* added llm classifier
* log corrected
* prompt corrected
* removed unused import
* removed unused import
Summary by CodeRabbit

New Features
- Introduced `LLMClassifier` for improved intent and entity classification.
- Added `trigger_rephrase` for rephrasing text responses.
- Added `enable_onboarding` feature for improved user onboarding processes.

Bug Fixes

Chores
- Dependency updates: addition of `more-itertools` and re-inclusion of `huggingface-hub`.