Model training time optimization #1584
Conversation
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (1)
kairon/shared/llm/processor.py (1)
129-140: Improve method encapsulation by making 'truncate_text' private

Since `truncate_text` is intended for internal use, consider prefixing it with an underscore to indicate that it's a private method. Apply this diff:

```diff
-def truncate_text(self, texts: List[Text]) -> List[Text]:
+def _truncate_text(self, texts: List[Text]) -> List[Text]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- kairon/shared/llm/processor.py (3 hunks)
🔇 Additional comments (2)
kairon/shared/llm/processor.py (2)
152-153: ⚠️ Potential issue: Verify the correctness of the embedding model name

In line 152, the model name is specified as `"text-embedding-3-small"`. Please confirm that this is the correct model name supported by the `litellm` library. Common models include `"text-embedding-ada-002"`. Using an incorrect model name may result in API errors. Apply this diff to use a valid model name:

```diff
 result = await litellm.aembedding(
-    model="text-embedding-3-small",
+    model="text-embedding-ada-002",
     input=truncated_texts,
     # existing parameters
 )
```

Likely invalid or redundant comment.
145-148: Verify all calls to 'get_embedding' use the updated method signature

The `get_embedding` method now accepts `texts: Union[Text, List[Text]]`. Ensure that all calls to this method throughout the codebase pass the appropriate argument type to prevent runtime errors. Run the following script to locate and inspect all calls to `get_embedding`:

✅ Verification successful

Let me analyze the calls to `get_embedding` more thoroughly by checking the actual argument types being passed. This will help us verify the type of arguments being passed to `get_embedding` in both files. Let me check one more thing to verify the type of `embedding_payloads` and see how it's constructed.

Based on the gathered information, I can now provide the final response:

All calls to `get_embedding` are compatible with the updated signature

The codebase inspection reveals that `get_embedding` is called in two scenarios:

- In `processor.py`, it's called with `embedding_payloads`, which is a list constructed by appending items in a loop.
- In `qdrant.py`, it's called through `__get_embedding`, which accepts a single text parameter and passes it directly.

Both usage patterns are compatible with the new signature `texts: Union[Text, List[Text]]` and the code that handles the conversion of a single text to a list.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Find all calls to 'get_embedding' and display surrounding context
# Expected result: Calls should pass either a string or a list of strings as 'texts' parameter
rg 'get_embedding\(' -A 2
```

Length of output: 1488

```shell
#!/bin/bash
# Check the context of the first call in qdrant.py
rg -B 5 -A 2 "__get_embedding\(text," kairon/shared/vector_embeddings/db/qdrant.py

# Check the context of embedding_payloads and query variables
rg -B 10 -A 2 "embedding_payloads|query = " kairon/shared/llm/processor.py
```

Length of output: 2549

```shell
#!/bin/bash
# Check the type hints and construction of embedding_payloads
ast-grep --pattern 'embedding_payloads: $_'

# Get more context around where user_msg is used with __get_embedding
rg -B 5 -A 2 "user_msg.*__get_embedding" kairon/shared/vector_embeddings/db/qdrant.py
```

Length of output: 133
```python
batch_size = 100
```
🛠️ Refactor suggestion

Consider making 'batch_size' a configurable parameter

Currently, `batch_size` is hard-coded to `100`. To enhance flexibility and allow for optimization based on different datasets or environments, consider making `batch_size` a configurable parameter or class attribute.
```python
embeddings = await self.get_embedding(embedding_payloads, user, invocation=invocation)
```
Check API limits for embedding batch sizes
When calling `self.get_embedding(embedding_payloads, user, invocation=invocation)`, ensure that the batch size does not exceed the limits imposed by the embedding API. Large batch sizes might lead to API errors or throttling. Verify the maximum allowed batch size and adjust if necessary.
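The guard the review asks for can be sketched as a simple chunking helper: split the payload so no single request exceeds a provider cap. `MAX_INPUTS_PER_REQUEST` below is an assumed placeholder, not a documented kairon or litellm constant; check the actual limit for your embedding API.

```python
# Assumed cap; verify against your embedding provider's documented limit.
MAX_INPUTS_PER_REQUEST = 2048

def chunk_payload(texts, cap=MAX_INPUTS_PER_REQUEST):
    """Split texts into request-sized chunks, preserving order."""
    return [texts[i:i + cap] for i in range(0, len(texts), cap)]
```

Each chunk can then be passed to the embedding call in turn, with results concatenated in order.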
```python
is_single_text = isinstance(texts, str)
if is_single_text:
    texts = [texts]

truncated_texts = self.truncate_text(texts)

result = await litellm.aembedding(
    model="text-embedding-3-small",
    input=truncated_texts,
    metadata={'user': user, 'bot': self.bot, 'invocation': kwargs.get("invocation")},
    api_key=self.llm_secret_embedding.get('api_key'),
    num_retries=3
)

embeddings = [embedding["embedding"] for embedding in result["data"]]

if is_single_text:
    return embeddings[0]

return embeddings
```
Add error handling for empty or invalid embeddings

In the `get_embedding` method, consider adding error handling to manage cases where the embedding API returns empty or invalid results. This will enhance the robustness of your code. Apply this diff to include error checking:

```diff
 embeddings = [embedding["embedding"] for embedding in result["data"]]
+if not embeddings or len(embeddings) != len(truncated_texts):
+    raise AppException("Failed to retrieve embeddings for all texts.")
```
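As a self-contained sketch of the suggested check: confirm the API returned exactly one embedding per input before using the result. `AppException` here stands in for kairon's actual exception type (an assumption in this sketch).

```python
# Stand-in for kairon's exception class (assumption for this sketch).
class AppException(Exception):
    pass

def validate_embeddings(embeddings, inputs):
    """Raise if the API returned fewer/more embeddings than inputs."""
    if not embeddings or len(embeddings) != len(inputs):
        raise AppException("Failed to retrieve embeddings for all texts.")
    return embeddings
```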
```python
truncated_texts = self.truncate_text(texts)
```
🛠️ Refactor suggestion

Optimize 'truncate_text' using list comprehension

You can simplify the `truncate_text` method by using a list comprehension for improved readability and performance. Apply this diff to refactor the method:

```diff
 def truncate_text(self, texts: List[Text]) -> List[Text]:
     """
     Truncate multiple texts to 8191 tokens for openai
     """
-    truncated_texts = []
-
-    for text in texts:
-        tokens = self.tokenizer.encode(text)[:self.EMBEDDING_CTX_LENGTH]
-        truncated_texts.append(self.tokenizer.decode(tokens))
-
-    return truncated_texts
+    return [
+        self.tokenizer.decode(self.tokenizer.encode(text)[:self.EMBEDDING_CTX_LENGTH])
+        for text in texts
+    ]
```
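The refactor above can be illustrated in a runnable form. A whitespace "tokenizer" stands in for the real one (the actual tokenizer used by the processor is not shown in this excerpt, so it is an assumption here); a tiny context length makes the truncation visible.

```python
# Toy tokenizer standing in for the real encode/decode pair (assumption).
class WhitespaceTokenizer:
    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)

# Tiny cap for demonstration; the reviewed code uses 8191 for OpenAI.
EMBEDDING_CTX_LENGTH = 3

def truncate_text(tokenizer, texts):
    """Truncate each text to at most EMBEDDING_CTX_LENGTH tokens."""
    return [
        tokenizer.decode(tokenizer.encode(text)[:EMBEDDING_CTX_LENGTH])
        for text in texts
    ]
```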
```python
if content['content_type'] == CognitionDataType.json.value:
    metadata = processor.find_matching_metadata(self.bot, content['data'],
                                                content.get('collection'))
    search_payload, embedding_payload = Utility.retrieve_search_payload_and_embedding_payload(
        content['data'], metadata)
else:
    search_payload, embedding_payload = {'content': content["data"]}, content["data"]

embedding_payloads.append(embedding_payload)
search_payloads.append(search_payload)
vector_ids.append(content['vector_id'])
```
Ensure 'vector_id' is always present in 'content'

In line 90, `vector_ids.append(content['vector_id'])`, there is a potential risk of a `KeyError` if `'vector_id'` is not present in `content`. Ensure that every `content` dictionary contains the `'vector_id'` key, or handle the case where it might be missing, to prevent runtime errors. Apply this diff to add a safety check:

```diff
 for content in batch_contents:
     if content['content_type'] == CognitionDataType.json.value:
         # existing code
     else:
         # existing code
-    embedding_payloads.append(embedding_payload)
-    search_payloads.append(search_payload)
-    vector_ids.append(content['vector_id'])
+    if 'vector_id' in content:
+        embedding_payloads.append(embedding_payload)
+        search_payloads.append(search_payload)
+        vector_ids.append(content['vector_id'])
+    else:
+        logging.warning(f"Missing 'vector_id' in content: {content}")
```
Committable suggestion was skipped due to low confidence.
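A minimal sketch of the guarded collection the review proposes, using `dict.get` so a missing `'vector_id'` is logged and skipped rather than raising `KeyError` (the function name is illustrative, not from the kairon code):

```python
import logging

def collect_vector_ids(batch_contents):
    """Collect vector ids, skipping entries that lack the key."""
    vector_ids = []
    for content in batch_contents:
        vector_id = content.get('vector_id')
        if vector_id is None:
            logging.warning("Missing 'vector_id' in content: %s", content)
            continue
        vector_ids.append(vector_id)
    return vector_ids
```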
- retry action multiple times (3), lowered timeout (Co-authored-by: spandan.mondal <[email protected]>)
- …h#1585: Added and fixed testcases related to the same
- …on_schema for upload download of bot_content (digiteinfotech#1587): added forced conversion for cognition_data to strictly follow cognition_schema for upload download of bot_content; test fix (Co-authored-by: spandan.mondal <[email protected]>)
Actionable comments posted: 24
🧹 Outside diff range and nitpick comments (13)
tests/integration_test/services_test.py (1)
7922-8000: LGTM with suggestions for test improvements.

The test case is well-structured and covers different slot scenarios. However, consider:

- Adding negative test cases (e.g., non-existent slots)
- Using constants for repeated values
- Adding a docstring to explain the test purpose

Example improvement:

```python
def test_get_slot_actions(save_actions):
    """
    Test the retrieval of actions associated with different slots.

    Verifies:
    1. Actions associated with 'bot' slot
    2. Actions associated with 'location' slot
    3. Actions associated with 'name' slot
    """
    # Test implementation...
```

tests/unit_test/data_processor/data_processor2_test.py (3)
12-35: Add docstring and consider splitting into focused test cases.

The test covers multiple scenarios which could be better organized as separate test cases for improved maintainability and clarity. Add a docstring and split the test:

```python
def test_data_format_correction():
    """Test data format correction with various scenarios.

    Test cases:
    1. Basic type conversion (int, float, str)
    2. List value conversion
    3. None value handling
    4. Unsupported type handling
    """
```

Consider splitting into:

```python
def test_data_format_correction_basic_types():
    ...  # Test only basic type conversions

def test_data_format_correction_list_values():
    ...  # Test list value handling

def test_data_format_correction_none_values():
    ...  # Test None value handling

def test_data_format_correction_unsupported_types():
    ...  # Test unsupported type handling
```
37-42: Add docstring to clarify test purpose.

Add documentation to explain the edge case being tested:

```python
def test_empty_entries():
    """Test data format correction with empty data entries.

    Ensures the method handles empty input gracefully by returning an empty list.
    """
```
44-49: Improve test name and add docstring for clarity.

The test name could better reflect that it's testing mixed data type handling:

```diff
-def test_non_list_non_string_values():
+def test_data_format_correction_mixed_types():
+    """Test data format correction with mixed data types.
+
+    Ensures the method correctly handles entries containing:
+    - String values requiring conversion
+    - Already typed numeric values (float)
+    - String values not requiring conversion
+    """
```

tests/unit_test/action/kremote_action_test.py (2)
26-46: Consider extracting error message to a constant

The test effectively validates the error scenario. Consider extracting the error message to a constant for better maintainability:

```diff
+ERROR_MESSAGE = "Not Found"
+
 @pytest.mark.asyncio
 async def test_multi_try_rasa_request_failure():
     endpoint_config = EndpointConfig(url="http://test.com")
     subpath = "invalid/path"
     method = "post"
     with aioresponses() as mock:
         mock.post(
             f"http://test.com/{subpath}",
             status=404,
-            body=b'{"error": "Not Found"}'
+            body=f'{{"error": "{ERROR_MESSAGE}"}}'.encode()
         )
         with pytest.raises(ClientResponseError) as exc_info:
             await KRemoteAction.multi_try_rasa_request(
                 endpoint_config=endpoint_config,
                 method=method,
                 subpath=subpath
             )
         assert exc_info.value.status == 404
-        assert exc_info.value.message == "Not Found"
+        assert exc_info.value.message == ERROR_MESSAGE
```
1-103: Well-structured test suite with comprehensive coverage

The test suite effectively covers various scenarios for the `multi_try_rasa_request` method:

- Happy path with successful requests
- Error handling (404, SSL errors)
- Retry mechanism with both success and failure cases
- Proper use of pytest fixtures and aioresponses for HTTP mocking

Consider adding the following test cases for even better coverage:

- Testing with different HTTP methods (GET, PUT, DELETE)
- Testing with various request payloads
- Testing with different retry attempt counts

Would you like me to help implement these additional test cases?
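The retry behavior being tested can be approximated by a generic helper: retry a callable up to a fixed number of attempts and re-raise the last error. This is a sketch following the review's description (3 attempts), not the actual `multi_try_rasa_request` implementation, and the function name is illustrative.

```python
def multi_try(fn, attempts=3):
    """Call fn up to `attempts` times; return first success or re-raise last error."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc
```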
🧰 Tools
🪛 Ruff

3-3: `yarl.URL` imported but unused. Remove unused import: `yarl.URL` (F401)

95-95: Local variable `result` is assigned to but never used. Remove assignment to unused variable `result` (F841)
kairon/shared/cognition/data_objects.py (2)
43-43: LGTM! Consider renaming for consistency.

The addition of the soft deletion flag is a good practice. However, consider renaming `activeStatus` to `is_active` or `active` to better align with Python naming conventions. Apply this diff to improve naming consistency:

```diff
-    activeStatus = BooleanField(default=True)
+    is_active = BooleanField(default=True)
```

43-43: Add docstring to document the schema change.

Since this is a significant change in how schema deletion works, consider adding documentation to explain the purpose of this field and its relationship to soft deletion. Add a docstring like this:

```diff
     timestamp = DateTimeField(default=datetime.utcnow)
+    """Timestamp when the schema was last modified"""
+
     activeStatus = BooleanField(default=True)
+    """
+    Indicates if the schema is active (True) or soft-deleted (False).
+    Used instead of hard deletion to maintain data history.
+    """
```

kairon/api/app/routers/bot/data.py (2)
96-98: LGTM! Consider implementing a cleanup strategy.

The implementation of soft deletion by marking schemas as inactive is a good practice. However, consider implementing a cleanup strategy (e.g., periodic purge of old inactive schemas) to prevent unnecessary database growth.

Line range hint 279-293: Enhance error handling coverage.

While the HTTPException handling is good, consider catching other potential exceptions that could occur during file operations (e.g., FileNotFoundError, PermissionError). Here's a suggested implementation:

```diff
 try:
     file_path = processor.get_error_report_file_path(current_user.get_bot(), event_id)
     response = FileResponse(file_path, filename=os.path.basename(file_path))
     response.headers["Content-Disposition"] = f"attachment; filename={os.path.basename(file_path)}"
     return response
 except HTTPException as e:
     return Response(
         success=False,
         message=e.detail,
         data=None,
         error_code=e.status_code
     )
+except Exception as e:
+    return Response(
+        success=False,
+        message=str(e),
+        data=None,
+        error_code=500
+    )
```

tests/integration_test/action_service_test.py (1)
12831-12832: Consider using deterministic values for embedding mocks.

Using random values for embeddings in tests could lead to non-deterministic behavior. Consider using a fixed seed or predetermined embedding values to ensure consistent test results across different runs.

```diff
-embedding = list(np.random.random(OPENAI_EMBEDDING_OUTPUT))
+# Use fixed values for consistent test results
+np.random.seed(42)  # Set fixed seed if random values are necessary
+embedding = list(np.random.random(OPENAI_EMBEDDING_OUTPUT))
```

tests/unit_test/data_processor/data_processor_test.py (1)
2118-2175: Remove debug print statements.

Consider removing the `print(actions)` statements, as they are typically used for debugging and should not be part of the final test code.

```diff
-print(actions)
 assert actions == {
     'http_action': ['http_action_1', 'http_action_2'],
     # ...
 }
-print(actions)
 assert actions == {
     'http_action': ['http_action_1', 'http_action_2'],
     # ...
 }
-print(actions)
 assert actions == {
     'http_action': ['http_action_2'],
     # ...
 }
```

🧰 Tools
🪛 Ruff

2119-2119: Redefinition of unused `processor` from line 14 (F811)
tests/unit_test/llm_test.py (1)
125-128: Replace real names with placeholder values in test data

The `CognitionData` instance uses actual names like `'Nupur'` and `'Pune'`. To prevent potential exposure of personal identifiable information (PII), consider replacing these with generic placeholder values such as `'User1'` and `'City1'`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (12)
- kairon/api/app/routers/bot/bot.py (1 hunks)
- kairon/api/app/routers/bot/data.py (1 hunks)
- kairon/chat/actions.py (6 hunks)
- kairon/shared/cognition/data_objects.py (1 hunks)
- kairon/shared/cognition/processor.py (1 hunks)
- kairon/shared/data/processor.py (4 hunks)
- tests/integration_test/action_service_test.py (2 hunks)
- tests/integration_test/services_test.py (4 hunks)
- tests/unit_test/action/kremote_action_test.py (1 hunks)
- tests/unit_test/data_processor/data_processor2_test.py (1 hunks)
- tests/unit_test/data_processor/data_processor_test.py (3 hunks)
- tests/unit_test/llm_test.py (4 hunks)
🔥 Files not summarized due to errors (2)
- tests/integration_test/action_service_test.py: Error: Server error: no LLM provider could handle the message
- tests/unit_test/data_processor/data_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff
kairon/api/app/routers/bot/bot.py

1742-1742: Do not perform function call `Security` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

kairon/chat/actions.py

305-314: Use a single `with` statement with multiple contexts instead of nested `with` statements. Combine `with` statements (SIM117)

kairon/shared/data/processor.py

307-310: Return the condition `config.get("dynamic_url_slot_name") == slot` directly (SIM103)

320-324: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop (SIM110)

334-337: Return the condition directly. Inline condition (SIM103)

348-352: Use `return any(MongoProcessor.is_slot_in_field(note, slot) for note in notes)` instead of `for` loop (SIM110)

357-360: Return the condition `set_slot == slot` directly (SIM103)

374-378: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop (SIM110)

393-397: Use `return any(value == slot for value in metadata.values())` instead of `for` loop (SIM110)

406-409: Return the condition `config.get("set_slot") == slot` directly (SIM103)

414-417: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop (SIM110)

447-451: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop (SIM110)
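The SIM110 findings above all follow one pattern: a loop that returns `True` on a match and `False` at the end collapses to a single `any()` call. Both forms below are equivalent; the names mirror the flagged code but the functions are illustrative, not copied from it.

```python
def slot_in_set_slots_loop(slot, set_slots):
    # The flagged pattern: explicit loop with early return.
    for set_slot in set_slots:
        if set_slot.get("name") == slot:
            return True
    return False

def slot_in_set_slots_any(slot, set_slots):
    # The suggested refactor: same semantics, one expression.
    return any(set_slot.get("name") == slot for set_slot in set_slots)
```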
tests/unit_test/action/kremote_action_test.py

3-3: `yarl.URL` imported but unused. Remove unused import: `yarl.URL` (F401)

95-95: Local variable `result` is assigned to but never used. Remove assignment to unused variable `result` (F841)

tests/unit_test/data_processor/data_processor2_test.py

3-3: Use capitalized environment variable `SYSTEM_FILE` instead of `system_file` (SIM112)

tests/unit_test/data_processor/data_processor_test.py

2119-2119: Redefinition of unused `processor` from line 14 (F811)
🔇 Additional comments (20)
tests/integration_test/services_test.py (3)
32-32: LGTM!

The new imports are correctly added and necessary for the action-related test cases.

2127-2127: LGTM!

The assertion has been correctly updated to check for an integer value (200) instead of a string.

2347-2347: LGTM!

The assertion has been consistently updated across test cases.

tests/unit_test/data_processor/data_processor2_test.py (1)
1-49: Verify test coverage for MongoProcessor methods.

Let's ensure all scenarios from the implementation are covered by tests.

✅ Verification successful
Test coverage is comprehensive and matches implementation

The test coverage for `data_format_correction_cognition_data` is thorough and aligns well with the implementation. The tests cover:

- Data type conversions (int, float, str) with both single and list values
- Handling of unsupported data types
- Null/None values
- Empty input lists
- Non-list values
- Edge cases with mixed data types

All core functionality and edge cases from the implementation are properly tested.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check test coverage for MongoProcessor methods

# Find the implementation file
echo "Searching for MongoProcessor implementation..."
rg -l "class MongoProcessor"

# Find all test files for MongoProcessor
echo "Finding all test files for MongoProcessor..."
fd -g "*test*.py" | xargs rg -l "MongoProcessor"

# Look for data_format_correction_cognition_data implementation
echo "Checking implementation details..."
rg -A 5 "def data_format_correction_cognition_data"
```

Length of output: 1706

```shell
#!/bin/bash
# Examine the implementation and test coverage more thoroughly

# First, get the full implementation of data_format_correction_cognition_data
echo "Full implementation of data_format_correction_cognition_data:"
rg -A 20 "def data_format_correction_cognition_data" kairon/shared/data/processor.py

# Then, check all test cases for this method
echo -e "\nTest cases for data_format_correction_cognition_data:"
rg -A 5 "data_format_correction_cognition_data" tests/unit_test/data_processor/data_processor2_test.py
```

Length of output: 1906
🧰 Tools
🪛 Ruff

3-3: Use capitalized environment variable `SYSTEM_FILE` instead of `system_file` (SIM112)
tests/unit_test/action/kremote_action_test.py (3)
10-24: LGTM! Well-structured test case

The test effectively validates the successful request scenario with proper setup, mocking, and assertion.

49-59: LGTM! Good SSL error handling test

The test effectively validates the SSL certificate error scenario.

61-80: LGTM! Comprehensive retry mechanism test

The test effectively validates the retry logic with a proper sequence of failure and success responses.
kairon/shared/cognition/data_objects.py (1)
43-43: Verify the impact on existing queries.

The addition of this field might affect existing queries that fetch schemas. Let's verify that related queries have been updated to consider this field.

✅ Verification successful

Based on the search results, I can now provide a final response:

No impact on existing queries: field has default value

The addition of the `activeStatus` field with a default value of `True` does not require updates to existing queries because:

- The field is only used in one query in `cognition/processor.py`, where it's explicitly filtered: `for value in CognitionSchema.objects(bot=bot, activeStatus=True):`
- All other queries will automatically get documents with `activeStatus=True` due to the default value, maintaining backward compatibility.

The remaining queries focus on specific fields or use different filters (bot, collection_name, id) and are not impacted by this addition.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Search for queries that might need updating to include activeStatus filter
rg -l "CognitionSchema\.objects" | xargs rg "CognitionSchema\.objects\.(get|find|first)"
```

Length of output: 90

```shell
#!/bin/bash
# Broader search for CognitionSchema usage
rg -A 3 "CognitionSchema"
```

Length of output: 20955

```shell
#!/bin/bash
# Search for any database queries in the cognition module
fd "cognition" | xargs rg "objects\.(get|find|first|filter)"
```

Length of output: 60

```shell
#!/bin/bash
# Check the test files for query patterns
fd "test.*cognition" | xargs rg "CognitionSchema"
```

Length of output: 5158
kairon/api/app/routers/bot/data.py (1)
Line range hint 1-293: Clarify PR scope and documentation.

While the changes in this file (schema management and error handling) appear sound, they seem peripheral to the stated PR objective of "Model training time optimization". Consider:

- Updating the PR description to explain how these changes relate to training optimization
- Adding comments explaining the relationship between soft deletion and training performance (if any)

Let's verify the relationship between these changes and training optimization:
kairon/shared/cognition/processor.py (1)
101-101: Consider handling migration and documentation for activeStatus filter

The addition of the `activeStatus=True` filter implements a soft-delete pattern, but there are a few considerations:

- Existing documents without the `activeStatus` field might be excluded from results
- This change in behavior should be documented in the method's docstring
- Consider adding error handling for potential `SchemaValidationError`s during the transition

Let's verify the potential impact. Consider these improvements:

1. Add a data migration script to set `activeStatus=True` for existing documents
2. Update the method's docstring to reflect the filtering behavior:

```diff
 def list_cognition_schema(self, bot: Text):
     """
     fetches metadata
+
+    Note: Only returns schemas where activeStatus=True
     :param bot: bot id
     :return: yield dict
     """
```
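The soft-delete pattern under discussion can be sketched with plain dicts, leaving out the mongoengine specifics: schemas are flagged inactive rather than deleted, listing filters on the flag, and defaulting the flag to `True` mirrors how documents created before the migration are treated.

```python
def list_active_schemas(schemas):
    """Return only schemas not soft-deleted; missing flag defaults to active."""
    return [s for s in schemas if s.get("activeStatus", True)]

def soft_delete(schema):
    """Flag a schema inactive instead of removing it, preserving history."""
    schema["activeStatus"] = False
    return schema
```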
kairon/api/app/routers/bot/bot.py (2)
1736-1737: LGTM!

The return statement correctly uses the Response model to return LLM metadata.

1739-1747: Well-structured endpoint implementation!

The endpoint follows RESTful conventions, has clear documentation, and properly handles the retrieval of slot-mapped actions.

🧰 Tools
🪛 Ruff

1742-1742: Do not perform function call `Security` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

tests/integration_test/action_service_test.py (1)
12805-12805: LGTM! Test setup with mocks is well structured.

The test function is properly decorated with all necessary mocks for external dependencies, making it suitable for testing the integration between prompt actions and slot setting.
tests/unit_test/data_processor/data_processor_test.py (2)
55-59: LGTM: Import statements are well-organized.

The imports are logically grouped and necessary for the test cases.

3983-3983: LGTM: Mock setup is appropriate.

The mock configuration for embeddings and training is well-structured and provides the necessary test data.
kairon/chat/actions.py (2)
188-194: Ensure consistent timeout application across requests

The `timeout` parameter is set to `ACTION_SERVER_REQUEST_TIMEOUT` in the `multi_try_rasa_request` call. Verify that this timeout value is consistently used across all requests, and consider making it configurable if it's not already.

250-325: Ensure proper session closure to prevent resource leaks

When combining the `with` statements, verify that the `session` is properly closed after use to prevent any resource leaks. If the `session` needs to be explicitly closed, ensure that it's appropriately managed.

🧰 Tools
🪛 Ruff

305-314: Use a single `with` statement with multiple contexts instead of nested `with` statements. Combine `with` statements (SIM117)
tests/unit_test/llm_test.py (3)
108-113: Clarify the `enable_search` setting for the 'city' column

In the `CognitionSchema` for `User_details`, the 'city' column has `enable_search` set to `False` while `create_embeddings` is `True`. This means embeddings will be created for 'city', but it won't be included in search payloads. Is this the intended behavior?

141-141: Ensure `mock_embedding.side_effect` aligns with embedding calls

The `mock_embedding.side_effect` is set to return two responses: one with two embeddings and another with one embedding. Please verify that this setup matches the expected number of embeddings based on the test data and how the embeddings are processed in batches.

201-205: Confirm payload fields correspond to `enable_search` settings

In the assertion at lines 201-205, only the fields with `enable_search` set to `True` (`'country'` and `'role'`) are included in the payload, while fields like `'lang'` are excluded. This behavior seems correct based on the schema definitions. Good job on ensuring the payload respects the schema settings.
```python
    bot_response = slots
    type = text
    """
script = textwrap.dedent(script)
PyscriptActionConfig(
    name="pyscript_action_1",
    source_code=script,
    dispatch_response=True,
    bot=pytest.bot,
    user="user"
).save()
script = """
slots = {"location": "Bangalore"}
bot_response = slots
type = text
"""
script = textwrap.dedent(script)
PyscriptActionConfig(
    name="pyscript_action_2",
    source_code=script,
    dispatch_response=True,
    bot=pytest.bot,
    user="user"
).save()
DatabaseAction(
    name="database_action_1",
    collection='vector_db_collection',
    payload=[DbQuery(query_type=DbActionOperationType.payload_search.value,
                     type=DbQueryValueType.from_slot.value,
                     value="name")],
    response=HttpActionResponse(value="The value of ${data.0.city} with color ${data.0.color} is ${data.0.id}"),
    set_slots=[SetSlotsFromResponse(name="city_value", value="${data.0.id}")],
    bot=pytest.bot,
    user="user"
).save()
DatabaseAction(
    name="database_action_2",
    collection='vector_db_collection',
    payload=[DbQuery(query_type=DbActionOperationType.payload_search.value,
                     type=DbQueryValueType.from_slot.value,
                     value="search")],
    response=HttpActionResponse(value="The value of ${data.0.city} with color ${data.0.color} is ${data.0.id}"),
    set_slots=[SetSlotsFromResponse(name="location", value="${data.0.id}")],
    bot=pytest.bot,
    user="user"
).save()

CallbackConfig(
    name="callback_script2",
    pyscript_code="bot_response='hello world'",
    validation_secret=encrypt_secret(
        "gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),
    execution_mode="sync",
    bot="6697add6b8e47524eb983373",
).save()

CallbackActionConfig(
    name="callback_action1",
    callback_name="callback_script2",
    dynamic_url_slot_name="location",
    metadata_list=[],
    bot_response="Hello",
    dispatch_bot_response=True,
    bot=pytest.bot,
    user="user"
).save()
CallbackActionConfig(
    name="callback_action2",
    callback_name="callback_script2",
    dynamic_url_slot_name="name",
    metadata_list=[],
    bot_response="Hello",
    dispatch_bot_response=True,
    bot=pytest.bot,
    user="user"
).save()
ScheduleAction(
    name="schedule_action_1",
    bot=pytest.bot,
    user="user",
    schedule_time=CustomActionDynamicParameters(parameter_type=ActionParameterType.slot.value,
                                                value="name"),
    schedule_action="callback_script2",
    timezone="Asia/Kolkata",
    response_text="Action schedule",
    params_list=[CustomActionRequestParameters(key="bot", parameter_type="slot", value="bot", encrypt=True),
                 CustomActionRequestParameters(key="user", parameter_type="value", value="1011", encrypt=False)]
).save()
ScheduleAction(
    name="schedule_action_2",
    bot=pytest.bot,
    user="user",
    schedule_time=CustomActionDynamicParameters(parameter_type=ActionParameterType.slot.value,
                                                value="location"),
    schedule_action="callback_script2",
    timezone="Asia/Kolkata",
    response_text="Action schedule",
    params_list=[CustomActionRequestParameters(key="name", parameter_type="slot", value="name", encrypt=True),
                 CustomActionRequestParameters(key="user", parameter_type="value", value="1011", encrypt=False)]
).save()
```
🛠️ Refactor suggestion
Consider breaking down the fixture into smaller, focused fixtures.

The `save_actions` fixture is quite large and sets up many different types of actions. Consider splitting it into smaller fixtures based on action types (e.g., `save_http_actions`, `save_email_actions`, etc.) for better maintainability and reusability.
```python
validation_secret=encrypt_secret(
    "gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),
```
Avoid hardcoding encrypted secrets in test files.

The validation secret appears to be hardcoded. Consider moving it to a test configuration file or using a mock secret for testing.

```diff
- validation_secret=encrypt_secret(
-     "gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),
+ validation_secret=encrypt_secret(TEST_SECRET),
```
Committable suggestion was skipped due to low confidence.
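One possible shape for that change; the variable name `KAIRON_TEST_VALIDATION_SECRET` and the fallback value are assumptions for illustration, not existing kairon configuration:

```python
import os

# Read the test secret from the environment (or a test config file)
# instead of hardcoding the encrypted value in the test module.
TEST_SECRET = os.environ.get(
    "KAIRON_TEST_VALIDATION_SECRET",
    "dummy-secret-for-tests",  # safe placeholder for local runs
)
```

The suggested diff above would then read `validation_secret=encrypt_secret(TEST_SECRET)`.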
```python
os.environ["system_file"] = "./tests/testing_data/system.yaml"
Utility.load_environment()
Utility.load_system_metadata()
```
🛠️ Refactor suggestion
Consider using a pytest fixture for environment setup.

Moving the environment setup into a fixture would make it reusable across test files and ensure proper cleanup after tests.

```python
import pytest

@pytest.fixture(autouse=True)
def setup_environment():
    os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
    Utility.load_environment()
    Utility.load_system_metadata()
    yield
    # Cleanup if needed
    if "SYSTEM_FILE" in os.environ:
        del os.environ["SYSTEM_FILE"]
```
🧰 Tools
🪛 Ruff
3-3: Use capitalized environment variable `SYSTEM_FILE` instead of `system_file`. Replace `system_file` with `SYSTEM_FILE` (SIM112)
```diff
@@ -0,0 +1,49 @@
 import os
 from kairon.shared.utils import Utility
 os.environ["system_file"] = "./tests/testing_data/system.yaml"
```
Use uppercase for environment variable name.

Following Python conventions, environment variables should be in uppercase.

```diff
-os.environ["system_file"] = "./tests/testing_data/system.yaml"
+os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
```

🧰 Tools
🪛 Ruff
3-3: Use capitalized environment variable `SYSTEM_FILE` instead of `system_file`. Replace `system_file` with `SYSTEM_FILE` (SIM112)
```python
import pytest
from aioresponses import aioresponses
from yarl import URL

from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError
```
Remove unused import `yarl.URL`

The `yarl.URL` import on line 3 is not used in any of the test cases.

Apply this diff:

```diff
 import pytest
 from aioresponses import aioresponses
-from yarl import URL
 from kairon.chat.actions import KRemoteAction
 from rasa.utils.endpoints import EndpointConfig
 from rasa.utils.endpoints import ClientResponseError
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
import pytest
from aioresponses import aioresponses
from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError
```

🧰 Tools
🪛 Ruff
3-3: `yarl.URL` imported but unused. Remove unused import: `yarl.URL` (F401)
```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True

return False
```
🛠️ Refactor suggestion

Use the `any()` function to simplify loops

In the `is_slot_in_prompt_action_config` method, you can simplify the loops with the `any()` function.

Replace:

```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True
return False
```

with:

```python
return any(set_slot.get('name') == slot for set_slot in set_slots)
```

🧰 Tools
🪛 Ruff
374-378: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop. Replace with `return any(set_slot.get("name") == slot for set_slot in set_slots)` (SIM110)
```python
for value in metadata.values():
    if value == slot:
        return True

return False
```
🛠️ Refactor suggestion

Simplify the loop using the `any()` function

In the `is_slot_in_pipedrive_leads_action_config` method, you can simplify the loop.

Replace:

```python
for value in metadata.values():
    if value == slot:
        return True
return False
```

with:

```python
return any(value == slot for value in metadata.values())
```

🧰 Tools
🪛 Ruff
393-397: Use `return any(value == slot for value in metadata.values())` instead of `for` loop. Replace with `return any(value == slot for value in metadata.values())` (SIM110)
```python
if config.get("set_slot") == slot:
    return True

return False
```
🛠️ Refactor suggestion

Simplify the return statement by returning the condition directly

In the `is_slot_in_google_search_action_config` method, you can return the condition directly.

Replace:

```python
if config.get("set_slot") == slot:
    return True
return False
```

with:

```python
return config.get("set_slot") == slot
```

🧰 Tools
🪛 Ruff
406-409: Return the condition `config.get("set_slot") == slot` directly. Replace with `return config.get("set_slot") == slot` (SIM103)
```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True
return False
```
🛠️ Refactor suggestion

Use the `any()` function to simplify loops

In the `is_slot_in_slot_set_action_config` method, you can use the `any()` function.

Replace:

```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True
return False
```

with:

```python
return any(set_slot.get('name') == slot for set_slot in set_slots)
```

🧰 Tools
🪛 Ruff
414-417: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop. Replace with `return any(set_slot.get("name") == slot for set_slot in set_slots)` (SIM110)
```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True

return False
```
🛠️ Refactor suggestion

Use the `any()` function to simplify loops

In the `is_slot_in_http_config` method, simplify the loops using the `any()` function.

Replace:

```python
for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True
return False
```

with:

```python
return any(set_slot.get('name') == slot for set_slot in set_slots)
```

🧰 Tools
🪛 Ruff
447-451: Use `return any(set_slot.get("name") == slot for set_slot in set_slots)` instead of `for` loop. Replace with `return any(set_slot.get("name") == slot for set_slot in set_slots)` (SIM110)
Summary by CodeRabbit
New Features
Bug Fixes
Tests