
Model training time optimization #1584

Conversation

himanshugt16 (Contributor) commented Oct 25, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced batch processing for collections, improving efficiency in handling embeddings.
    • New endpoint for retrieving actions mapped to specific slot names, secured by access permissions.
    • Updated methods now support processing multiple texts simultaneously for truncation and embedding retrieval.
    • Added functionality to mark schemas as inactive instead of outright deletion.
  • Bug Fixes

    • Improved error logging during embedding creation or retrieval to better capture context in case of failures.
    • Enhanced error handling and structured responses for file download errors.
  • Tests

    • Comprehensive testing added for action retrieval and embedding processes, including retry logic and error handling scenarios.


coderabbitai bot commented Oct 25, 2024

Walkthrough

The changes in the LLMProcessor class within kairon/shared/llm/processor.py involve significant enhancements to the handling of collections and embeddings. The train method has been refactored to process collections in batches of 100, improving the efficiency of content handling. Additionally, the truncate_text method now accepts multiple texts, and the get_embedding method has been updated to accept either a single text or a list of texts. These modifications streamline the processing and retrieval of embeddings while maintaining error logging in the predict method.
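The batched flow described above can be sketched as follows. This is a minimal illustration, not the actual kairon implementation: `batch`, `train`, and `embed_many` are hypothetical names, and only the batch-of-100 shape mirrors the PR.

```python
from typing import Any, Callable, Dict, Iterator, List

def batch(items: List[Dict[str, Any]], batch_size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive fixed-size chunks of a collection's contents."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def train(contents: List[Dict[str, Any]],
          embed_many: Callable[[List[str]], Any]) -> int:
    """Illustrative training loop: one embedding call per batch
    instead of one API round-trip per document."""
    processed = 0
    for chunk in batch(contents, batch_size=100):
        payloads = [c["data"] for c in chunk]
        embed_many(payloads)  # single embedding call for the whole batch
        processed += len(chunk)
    return processed
```

The efficiency gain comes purely from amortizing the per-request overhead of the embedding API across up to 100 documents.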

Changes

File Path — Change Summary

kairon/shared/llm/processor.py
  • Refactored the train method to process collections in batches of 100.
  • Updated truncate_text to accept a list of texts instead of a single text.
  • Modified get_embedding to accept either a single text or a list of texts.
  • Maintained error logging in the predict method.

kairon/api/app/routers/bot/bot.py
  • Added new endpoint get_slot_actions to retrieve actions mapped to a specific slot name.
  • Secured the endpoint with TESTER_ACCESS permissions.

kairon/api/app/routers/bot/data.py
  • Modified delete_cognition_schema to mark schemas as inactive instead of deleting them.
  • Updated error handling in download_error_csv for structured responses.

kairon/chat/actions.py
  • Updated the KRemoteAction constructor to include retry_attempts.
  • Added multi_try_rasa_request for HTTP requests with retry logic.
  • Enhanced error handling in the run method.

kairon/shared/cognition/data_objects.py
  • Added activeStatus field to the CognitionSchema class.

kairon/shared/cognition/processor.py
  • Updated list_cognition_schema to filter by activeStatus.

kairon/shared/data/processor.py
  • Added multiple static methods in MongoProcessor for slot management and action retrieval.
  • Introduced the data_format_correction_cognition_data method for data format correction.

tests/integration_test/services_test.py
  • Added pytest fixture save_actions for action configurations.
  • Updated tests to validate actions associated with slots.

tests/unit_test/action/kremote_action_test.py
  • Introduced unit tests for the multi_try_rasa_request method.

tests/unit_test/data_processor/data_processor2_test.py
  • New tests for the data_format_correction_cognition_data method.

tests/unit_test/llm_test.py
  • Enhanced unit tests for LLM functionality, focusing on embedding training and error handling.
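The retry behavior described for multi_try_rasa_request can be sketched generically. This is an illustrative helper under stated assumptions, not the kairon code: the name `multi_try` is invented, and the bare `Exception` catch is a simplification — the real implementation presumably retries only transport-level errors.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def multi_try(request: Callable[[], Awaitable[T]],
                    retry_attempts: int = 3,
                    delay: float = 0.0) -> T:
    """Retry an awaitable request up to retry_attempts times,
    re-raising the last error once all attempts are exhausted."""
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(retry_attempts):
        try:
            return await request()
        except Exception as exc:  # real code would catch transport errors only
            last_error = exc
            if delay:
                await asyncio.sleep(delay)
    raise last_error
```

Paired with the lowered timeout mentioned in the commits below, this trades a slightly higher worst-case latency for resilience to transient failures.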

Poem

In the meadow where data flows,
Batches of knowledge, how quickly it grows!
Texts now truncated, all in a line,
Embeddings retrieved, oh, how they shine!
With each little change, we hop and we cheer,
For the magic of coding brings us all near! 🐇✨



coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (1)
kairon/shared/llm/processor.py (1)

129-140: Improve method encapsulation by making 'truncate_text' private

Since truncate_text is intended for internal use, consider prefixing it with an underscore to indicate that it's a private method.

Apply this diff:

-def truncate_text(self, texts: List[Text]) -> List[Text]:
+def _truncate_text(self, texts: List[Text]) -> List[Text]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3ac4d3a and 189888d.

📒 Files selected for processing (1)
  • kairon/shared/llm/processor.py (3 hunks)
🔇 Additional comments (2)
kairon/shared/llm/processor.py (2)

152-153: ⚠️ Potential issue

Verify the correctness of the embedding model name

In line 152, the model name is specified as "text-embedding-3-small". Please confirm that this is the correct model name supported by the litellm library. Common models include "text-embedding-ada-002". Using an incorrect model name may result in API errors.

Apply this diff to use a valid model name:

 result = await litellm.aembedding(
-    model="text-embedding-3-small",
+    model="text-embedding-ada-002",
     input=truncated_texts,
     # existing parameters
 )

Likely invalid or redundant comment.


145-148: Verify all calls to 'get_embedding' use the updated method signature

The get_embedding method now accepts texts: Union[Text, List[Text]]. Ensure that all calls to this method throughout the codebase pass the appropriate argument type to prevent runtime errors.

Run the following script to locate and inspect all calls to get_embedding:

✅ Verification successful

Let me analyze the calls to get_embedding more thoroughly by checking the actual argument types being passed.

This will help us verify the type of arguments being passed to get_embedding in both files.


Let me check one more thing to verify the type of embedding_payloads and see how it's constructed.


Based on the gathered information, I can now provide the final response:

All calls to get_embedding are compatible with the updated signature

The codebase inspection reveals that get_embedding is called in two scenarios:

  • In processor.py, it's called with embedding_payloads which is a list constructed by appending items in a loop
  • In qdrant.py, it's called through __get_embedding which accepts a single text parameter and passes it directly

Both usage patterns are compatible with the new signature texts: Union[Text, List[Text]] and the code that handles the conversion of single text to list.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to 'get_embedding' and display surrounding context

# Expected result: Calls should pass either a string or a list of strings as 'texts' parameter
rg 'get_embedding\(' -A 2

Length of output: 1488


Script:

#!/bin/bash
# Check the context of the first call in qdrant.py
rg -B 5 -A 2 "__get_embedding\(text," kairon/shared/vector_embeddings/db/qdrant.py

# Check the context of embedding_payloads and query variables
rg -B 10 -A 2 "embedding_payloads|query = " kairon/shared/llm/processor.py

Length of output: 2549


Script:

#!/bin/bash
# Check the type hints and construction of embedding_payloads
ast-grep --pattern 'embedding_payloads: $_'

# Get more context around where user_msg is used with __get_embedding
rg -B 5 -A 2 "user_msg.*__get_embedding" kairon/shared/vector_embeddings/db/qdrant.py

Length of output: 133

Comment on lines +57 to +58
batch_size = 100


🛠️ Refactor suggestion

Consider making 'batch_size' a configurable parameter

Currently, batch_size is hard-coded to 100. To enhance flexibility and allow for optimization based on different datasets or environments, consider making batch_size a configurable parameter or class attribute.
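One way to do that is sketched below with a stripped-down constructor. The real LLMProcessor takes more arguments; accepting batch_size as a keyword is the suggestion being illustrated, not existing kairon API.

```python
from typing import Optional

class LLMProcessor:
    DEFAULT_BATCH_SIZE = 100  # the previously hard-coded value becomes the default

    def __init__(self, bot: str, batch_size: Optional[int] = None):
        self.bot = bot
        # Callers (or deployment config) can now tune the batch size per
        # dataset or environment instead of editing train().
        self.batch_size = batch_size if batch_size is not None else self.DEFAULT_BATCH_SIZE
```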

Comment on lines +92 to +93
embeddings = await self.get_embedding(embedding_payloads, user, invocation=invocation)


⚠️ Potential issue

Check API limits for embedding batch sizes

When calling self.get_embedding(embedding_payloads, user, invocation=invocation), ensure that the batch size does not exceed the limits imposed by the embedding API. Large batch sizes might lead to API errors or throttling. Verify the maximum allowed batch size and adjust if necessary.
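A defensive split can enforce whatever cap the provider documents. The 2048 figure below is a placeholder assumption, not a verified limit — check the embedding API's actual per-request input maximum before relying on it.

```python
from typing import List

# Assumed cap; substitute your provider's documented per-request limit.
MAX_INPUTS_PER_REQUEST = 2048

def split_for_api(payloads: List[str],
                  limit: int = MAX_INPUTS_PER_REQUEST) -> List[List[str]]:
    """Split a payload list so no single embedding call exceeds the API limit."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return [payloads[i:i + limit] for i in range(0, len(payloads), limit)]
```

With the PR's batch size of 100 this guard is a no-op, but it keeps the code safe if batch_size is later raised or made configurable.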

Comment on lines +145 to +164
is_single_text = isinstance(texts, str)
if is_single_text:
    texts = [texts]

truncated_texts = self.truncate_text(texts)

result = await litellm.aembedding(
    model="text-embedding-3-small",
    input=truncated_texts,
    metadata={'user': user, 'bot': self.bot, 'invocation': kwargs.get("invocation")},
    api_key=self.llm_secret_embedding.get('api_key'),
    num_retries=3
)

embeddings = [embedding["embedding"] for embedding in result["data"]]

if is_single_text:
    return embeddings[0]

return embeddings

⚠️ Potential issue

Add error handling for empty or invalid embeddings

In the get_embedding method, consider adding error handling to manage cases where the embedding API returns empty or invalid results. This will enhance the robustness of your code.

Apply this diff to include error checking:

 embeddings = [embedding["embedding"] for embedding in result["data"]]

+if not embeddings or len(embeddings) != len(truncated_texts):
+    raise AppException("Failed to retrieve embeddings for all texts.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 is_single_text = isinstance(texts, str)
 if is_single_text:
     texts = [texts]
 truncated_texts = self.truncate_text(texts)
 result = await litellm.aembedding(
     model="text-embedding-3-small",
     input=truncated_texts,
     metadata={'user': user, 'bot': self.bot, 'invocation': kwargs.get("invocation")},
     api_key=self.llm_secret_embedding.get('api_key'),
     num_retries=3
 )
 embeddings = [embedding["embedding"] for embedding in result["data"]]
+if not embeddings or len(embeddings) != len(truncated_texts):
+    raise AppException("Failed to retrieve embeddings for all texts.")
 if is_single_text:
     return embeddings[0]
 return embeddings

Comment on lines +149 to +150
truncated_texts = self.truncate_text(texts)


🛠️ Refactor suggestion

Optimize 'truncate_text' using list comprehension

You can simplify the truncate_text method by using a list comprehension for improved readability and performance.

Apply this diff to refactor the method:

 def truncate_text(self, texts: List[Text]) -> List[Text]:
     """
     Truncate multiple texts to 8191 tokens for openai
     """
-    truncated_texts = []
-
-    for text in texts:
-        tokens = self.tokenizer.encode(text)[:self.EMBEDDING_CTX_LENGTH]
-        truncated_texts.append(self.tokenizer.decode(tokens))
-
-    return truncated_texts
+    return [
+        self.tokenizer.decode(self.tokenizer.encode(text)[:self.EMBEDDING_CTX_LENGTH])
+        for text in texts
+    ]
📝 Committable suggestion

Suggested change

truncated_texts = self.truncate_text(texts)

def truncate_text(self, texts: List[Text]) -> List[Text]:
    """
    Truncate multiple texts to 8191 tokens for openai
    """
    return [
        self.tokenizer.decode(self.tokenizer.encode(text)[:self.EMBEDDING_CTX_LENGTH])
        for text in texts
    ]

Comment on lines +80 to +91
if content['content_type'] == CognitionDataType.json.value:
    metadata = processor.find_matching_metadata(self.bot, content['data'],
                                                content.get('collection'))
    search_payload, embedding_payload = Utility.retrieve_search_payload_and_embedding_payload(
        content['data'], metadata)
else:
    search_payload, embedding_payload = {'content': content["data"]}, content["data"]

embedding_payloads.append(embedding_payload)
search_payloads.append(search_payload)
vector_ids.append(content['vector_id'])


⚠️ Potential issue

Ensure 'vector_id' is always present in 'content'

In line 90, vector_ids.append(content['vector_id']), there is a potential risk of a KeyError if 'vector_id' is not present in content. Ensure that every content dictionary contains the 'vector_id' key or handle the case where it might be missing to prevent runtime errors.

Apply this diff to add a safety check:

 for content in batch_contents:
     if content['content_type'] == CognitionDataType.json.value:
         # existing code
     else:
         # existing code

+    if 'vector_id' in content:
         embedding_payloads.append(embedding_payload)
         search_payloads.append(search_payload)
         vector_ids.append(content['vector_id'])
+    else:
+        logging.warning(f"Missing 'vector_id' in content: {content}")

Committable suggestion was skipped due to low confidence.

hasinaxp and others added 5 commits October 28, 2024 23:03
* retry action multiple times(3), lowered timeout

* retry action multiple times(3), lowered timeout

---------

Co-authored-by: spandan.mondal <[email protected]>
…on_schema for upload download of bot_content (digiteinfotech#1587)

* added forced conversion for cognition_data to strictly follow cognition_schema for upload/download of bot_content

* test fix

---------

Co-authored-by: spandan.mondal <[email protected]>
coderabbitai bot left a comment

Actionable comments posted: 24

🧹 Outside diff range and nitpick comments (13)
tests/integration_test/services_test.py (1)

7922-8000: LGTM with suggestions for test improvements.

The test case is well-structured and covers different slot scenarios. However, consider:

  1. Adding negative test cases (e.g., non-existent slots)
  2. Using constants for repeated values
  3. Adding docstring to explain the test purpose

Example improvement:

def test_get_slot_actions(save_actions):
    """
    Test the retrieval of actions associated with different slots.
    
    Verifies:
    1. Actions associated with 'bot' slot
    2. Actions associated with 'location' slot
    3. Actions associated with 'name' slot
    """
    # Test implementation...
tests/unit_test/data_processor/data_processor2_test.py (3)

12-35: Add docstring and consider splitting into focused test cases.

The test covers multiple scenarios which could be better organized as separate test cases for improved maintainability and clarity.

Add docstring and split the test:

def test_data_format_correction():
    """Test data format correction with various scenarios.
    
    Test cases:
    1. Basic type conversion (int, float, str)
    2. List value conversion
    3. None value handling
    4. Unsupported type handling
    """

Consider splitting into:

def test_data_format_correction_basic_types():
    # Test only basic type conversions

def test_data_format_correction_list_values():
    # Test list value handling

def test_data_format_correction_none_values():
    # Test None value handling

def test_data_format_correction_unsupported_types():
    # Test unsupported type handling

37-42: Add docstring to clarify test purpose.

Add documentation to explain the edge case being tested.

def test_empty_entries():
    """Test data format correction with empty data entries.
    
    Ensures the method handles empty input gracefully by returning an empty list.
    """

44-49: Improve test name and add docstring for clarity.

The test name could better reflect that it's testing mixed data type handling.

-def test_non_list_non_string_values():
+def test_data_format_correction_mixed_types():
    """Test data format correction with mixed data types.
    
    Ensures the method correctly handles entries containing:
    - String values requiring conversion
    - Already typed numeric values (float)
    - String values not requiring conversion
    """
tests/unit_test/action/kremote_action_test.py (2)

26-46: Consider extracting error message to a constant

The test effectively validates the error scenario. Consider extracting the error message to a constant for better maintainability.

+ERROR_MESSAGE = "Not Found"

@pytest.mark.asyncio
async def test_multi_try_rasa_request_failure():
    endpoint_config = EndpointConfig(url="http://test.com")
    subpath = "invalid/path"
    method = "post"

    with aioresponses() as mock:
        mock.post(
            f"http://test.com/{subpath}",
            status=404,
-           body=b'{"error": "Not Found"}'
+           body=f'{{"error": "{ERROR_MESSAGE}"}}'.encode()
        )

        with pytest.raises(ClientResponseError) as exc_info:
            await KRemoteAction.multi_try_rasa_request(
                endpoint_config=endpoint_config, method=method, subpath=subpath
            )
        assert exc_info.value.status == 404
-       assert exc_info.value.message == "Not Found"
+       assert exc_info.value.message == ERROR_MESSAGE

1-103: Well-structured test suite with comprehensive coverage

The test suite effectively covers various scenarios for the multi_try_rasa_request method:

  • Happy path with successful requests
  • Error handling (404, SSL errors)
  • Retry mechanism with both success and failure cases
  • Proper use of pytest fixtures and aioresponses for HTTP mocking

Consider adding the following test cases for even better coverage:

  1. Testing with different HTTP methods (GET, PUT, DELETE)
  2. Testing with various request payloads
  3. Testing with different retry attempt counts

Would you like me to help implement these additional test cases?

🧰 Tools
🪛 Ruff

3-3: yarl.URL imported but unused

Remove unused import: yarl.URL

(F401)


95-95: Local variable result is assigned to but never used

Remove assignment to unused variable result

(F841)

kairon/shared/cognition/data_objects.py (2)

43-43: LGTM! Consider renaming for consistency.

The addition of the soft deletion flag is a good practice. However, consider renaming activeStatus to is_active or active to better align with Python naming conventions.

Apply this diff to improve naming consistency:

-    activeStatus = BooleanField(default=True)
+    is_active = BooleanField(default=True)

43-43: Add docstring to document the schema change.

Since this is a significant change in how schema deletion works, consider adding documentation to explain the purpose of this field and its relationship to soft deletion.

Add a docstring like this:

     timestamp = DateTimeField(default=datetime.utcnow)
+    """Timestamp when the schema was last modified"""
+
     activeStatus = BooleanField(default=True)
+    """
+    Indicates if the schema is active (True) or soft-deleted (False).
+    Used instead of hard deletion to maintain data history.
+    """
kairon/api/app/routers/bot/data.py (2)

96-98: LGTM! Consider implementing a cleanup strategy.

The implementation of soft deletion by marking schemas as inactive is a good practice. However, consider implementing a cleanup strategy (e.g., periodic purge of old inactive schemas) to prevent unnecessary database growth.
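A cleanup strategy along those lines could be a scheduled job like the sketch below. `purge_inactive_schemas` is a hypothetical helper, and the `activeStatus`/`timestamp` field names are assumptions based on this PR, not the actual kairon data layer.

```python
from datetime import datetime, timedelta

def purge_inactive_schemas(collection, retention_days: int = 90) -> int:
    """Hard-delete soft-deleted schemas older than the retention window.

    `collection` is anything exposing delete_many(filter) -> result with a
    deleted_count attribute (e.g. a pymongo collection).
    """
    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    result = collection.delete_many(
        {"activeStatus": False, "timestamp": {"$lt": cutoff}}
    )
    return result.deleted_count
```

Run from a cron-style scheduler, this bounds database growth while still keeping recently deleted schemas recoverable for the retention window.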


Line range hint 279-293: Enhance error handling coverage.

While the HTTPException handling is good, consider catching other potential exceptions that could occur during file operations (e.g., FileNotFoundError, PermissionError).

Here's a suggested implementation:

     try:
         file_path = processor.get_error_report_file_path(current_user.get_bot(), event_id)
 
         response = FileResponse(file_path, filename=os.path.basename(file_path))
         response.headers["Content-Disposition"] = f"attachment; filename={os.path.basename(file_path)}"
 
         return response
     except HTTPException as e:
         return Response(
             success=False,
             message=e.detail,
             data=None,
             error_code=e.status_code
         )
+    except Exception as e:
+        return Response(
+            success=False,
+            message=str(e),
+            data=None,
+            error_code=500
+        )
tests/integration_test/action_service_test.py (1)

12831-12832: Consider using deterministic values for embedding mocks.

Using random values for embeddings in tests could lead to non-deterministic behavior. Consider using a fixed seed or predetermined embedding values to ensure consistent test results across different runs.

-    embedding = list(np.random.random(OPENAI_EMBEDDING_OUTPUT))
+    # Use fixed values for consistent test results
+    np.random.seed(42)  # Set fixed seed if random values are necessary
+    embedding = list(np.random.random(OPENAI_EMBEDDING_OUTPUT))
tests/unit_test/data_processor/data_processor_test.py (1)

2118-2175: Remove debug print statements.

Consider removing the print(actions) statements as they are typically used for debugging and should not be part of the final test code.

-        print(actions)
         assert actions == {
             'http_action': ['http_action_1', 'http_action_2'],
             # ...
         }
-        print(actions)
         assert actions == {
             'http_action': ['http_action_1', 'http_action_2'],
             # ...
         }
-        print(actions)
         assert actions == {
             'http_action': ['http_action_2'],
             # ...
         }
🧰 Tools
🪛 Ruff

2119-2119: Redefinition of unused processor from line 14

(F811)

tests/unit_test/llm_test.py (1)

125-128: Replace real names with placeholder values in test data

The CognitionData instance uses actual names like 'Nupur' and 'Pune'. To prevent potential exposure of personal identifiable information (PII), consider replacing these with generic placeholder values such as 'User1' and 'City1'.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 189888d and 1da3975.

📒 Files selected for processing (12)
  • kairon/api/app/routers/bot/bot.py (1 hunks)
  • kairon/api/app/routers/bot/data.py (1 hunks)
  • kairon/chat/actions.py (6 hunks)
  • kairon/shared/cognition/data_objects.py (1 hunks)
  • kairon/shared/cognition/processor.py (1 hunks)
  • kairon/shared/data/processor.py (4 hunks)
  • tests/integration_test/action_service_test.py (2 hunks)
  • tests/integration_test/services_test.py (4 hunks)
  • tests/unit_test/action/kremote_action_test.py (1 hunks)
  • tests/unit_test/data_processor/data_processor2_test.py (1 hunks)
  • tests/unit_test/data_processor/data_processor_test.py (3 hunks)
  • tests/unit_test/llm_test.py (4 hunks)
🔥 Files not summarized due to errors (2)
  • tests/integration_test/action_service_test.py: Error: Server error: no LLM provider could handle the message
  • tests/unit_test/data_processor/data_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff
kairon/api/app/routers/bot/bot.py

1742-1742: Do not perform function call Security in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

kairon/chat/actions.py

305-314: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

kairon/shared/data/processor.py

307-310: Return the condition config.get("dynamic_url_slot_name") == slot directly

Replace with return config.get("dynamic_url_slot_name") == slot

(SIM103)


320-324: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)


334-337: Return the condition directly

Inline condition

(SIM103)


348-352: Use return any(MongoProcessor.is_slot_in_field(note, slot) for note in notes) instead of for loop

Replace with return any(MongoProcessor.is_slot_in_field(note, slot) for note in notes)

(SIM110)


357-360: Return the condition set_slot == slot directly

Replace with return set_slot == slot

(SIM103)


374-378: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)


393-397: Use return any(value == slot for value in metadata.values()) instead of for loop

Replace with return any(value == slot for value in metadata.values())

(SIM110)


406-409: Return the condition config.get("set_slot") == slot directly

Replace with return config.get("set_slot") == slot

(SIM103)


414-417: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)


447-451: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)
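All of the SIM110 hints above are the same mechanical rewrite; a minimal before/after, using a hypothetical helper name rather than the actual MongoProcessor method:

```python
from typing import Dict, List

def slot_in_set_slots_loop(set_slots: List[Dict], slot: str) -> bool:
    # Before: explicit loop with early return (flagged by SIM110).
    for set_slot in set_slots:
        if set_slot.get("name") == slot:
            return True
    return False

def slot_in_set_slots(set_slots: List[Dict], slot: str) -> bool:
    # After: identical behavior, expressed with any().
    return any(set_slot.get("name") == slot for set_slot in set_slots)
```

Both short-circuit on the first match; the `any()` form just states the intent in one line.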

tests/unit_test/action/kremote_action_test.py

3-3: yarl.URL imported but unused

Remove unused import: yarl.URL

(F401)


95-95: Local variable result is assigned to but never used

Remove assignment to unused variable result

(F841)

tests/unit_test/data_processor/data_processor2_test.py

3-3: Use capitalized environment variable SYSTEM_FILE instead of system_file

Replace system_file with SYSTEM_FILE

(SIM112)

tests/unit_test/data_processor/data_processor_test.py

2119-2119: Redefinition of unused processor from line 14

(F811)

🔇 Additional comments (20)
tests/integration_test/services_test.py (3)

32-32: LGTM!

The new imports are correctly added and necessary for the action-related test cases.


2127-2127: LGTM!

The assertion has been correctly updated to check for an integer value (200) instead of a string.


2347-2347: LGTM!

The assertion has been consistently updated across test cases.

tests/unit_test/data_processor/data_processor2_test.py (1)

1-49: Verify test coverage for MongoProcessor methods.

Let's ensure all scenarios from the implementation are covered by tests.

✅ Verification successful

Test coverage is comprehensive and matches implementation

The test coverage for data_format_correction_cognition_data is thorough and aligns well with the implementation. The tests cover:

  • Data type conversions (int, float, str) with both single and list values
  • Handling of unsupported data types
  • Null/None values
  • Empty input lists
  • Non-list values
  • Edge cases with mixed data types

All core functionality and edge cases from the implementation are properly tested.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for MongoProcessor methods

# Find the implementation file
echo "Searching for MongoProcessor implementation..."
rg -l "class MongoProcessor"

# Find all test files for MongoProcessor
echo "Finding all test files for MongoProcessor..."
fd -g "*test*.py" | xargs rg -l "MongoProcessor"

# Look for data_format_correction_cognition_data implementation
echo "Checking implementation details..."
rg -A 5 "def data_format_correction_cognition_data"

Length of output: 1706


Script:

#!/bin/bash
# Let's examine the implementation and test coverage more thoroughly

# First, get the full implementation of data_format_correction_cognition_data
echo "Full implementation of data_format_correction_cognition_data:"
rg -A 20 "def data_format_correction_cognition_data" kairon/shared/data/processor.py

# Then, check all test cases for this method
echo -e "\nTest cases for data_format_correction_cognition_data:"
rg -A 5 "data_format_correction_cognition_data" tests/unit_test/data_processor/data_processor2_test.py

Length of output: 1906

🧰 Tools
🪛 Ruff

3-3: Use capitalized environment variable SYSTEM_FILE instead of system_file

Replace system_file with SYSTEM_FILE

(SIM112)

tests/unit_test/action/kremote_action_test.py (3)

10-24: LGTM! Well-structured test case

The test effectively validates the successful request scenario with proper setup, mocking, and assertion.


49-59: LGTM! Good SSL error handling test

The test effectively validates the SSL certificate error scenario.


61-80: LGTM! Comprehensive retry mechanism test

The test effectively validates the retry logic with proper sequence of failure and success responses.
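The failure-then-success sequence the test exercises boils down to a retry loop like the following sketch (hypothetical helper names; kairon's actual retry logic lives in KRemoteAction):

```python
def call_with_retries(fn, attempts=3, retry_on=(ConnectionError,)):
    """Call fn, retrying on transient errors; re-raise the last error if all attempts fail."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
    raise last_exc

# A flaky callable that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(call_with_retries(flaky))  # ok
```

The test's mocked responses play the role of `flaky` here: the first requests fail, the final one returns the expected payload.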

kairon/shared/cognition/data_objects.py (1)

43-43: Verify the impact on existing queries.

The addition of this field might affect existing queries that fetch schemas. Let's verify that related queries have been updated to consider this field.

✅ Verification successful

Based on the search results, I can now provide a final response:

No impact on existing queries - field has default value

The addition of the activeStatus field with a default value of True does not require updates to existing queries because:

  1. The field is only used in one query in cognition/processor.py, where it is explicitly filtered:
     for value in CognitionSchema.objects(bot=bot, activeStatus=True):
  2. All other queries will automatically get documents with activeStatus=True due to the default value, maintaining backward compatibility.
  3. The remaining queries focus on specific fields or use different filters (bot, collection_name, id) and are not impacted by this addition.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for queries that might need updating to include activeStatus filter
rg -l "CognitionSchema\.objects" | xargs rg "CognitionSchema\.objects\.(get|find|first)"

Length of output: 90


Script:

#!/bin/bash
# Let's try a broader search for CognitionSchema usage
rg -A 3 "CognitionSchema"

Length of output: 20955


Script:

#!/bin/bash
# Also search for any database queries in the cognition module
fd "cognition" | xargs rg "objects\.(get|find|first|filter)"

Length of output: 60


Script:

#!/bin/bash
# Check the test files for query patterns
fd "test.*cognition" | xargs rg "CognitionSchema"

Length of output: 5158

kairon/api/app/routers/bot/data.py (1)

Line range hint 1-293: Clarify PR scope and documentation.

While the changes in this file (schema management and error handling) appear sound, they seem peripheral to the stated PR objective of "Model training time optimization". Consider:

  1. Updating the PR description to explain how these changes relate to training optimization
  2. Adding comments explaining the relationship between soft deletion and training performance (if any)

Let's verify the relationship between these changes and training optimization:

kairon/shared/cognition/processor.py (1)

101-101: Consider handling migration and documentation for activeStatus filter

The addition of activeStatus=True filter implements a soft-delete pattern, but there are a few considerations:

  1. Existing documents without the activeStatus field might be excluded from results
  2. This change in behavior should be documented in the method's docstring
  3. Consider adding error handling for potential SchemaValidationErrors during the transition

Let's verify the potential impact:

Consider these improvements:

  1. Add a data migration script to set activeStatus=True for existing documents
  2. Update the method's docstring to reflect the filtering behavior:
     def list_cognition_schema(self, bot: Text):
         """
         fetches metadata
+        Note: Only returns schemas where activeStatus=True
 
         :param bot: bot id
         :return: yield dict
         """
kairon/api/app/routers/bot/bot.py (2)

1736-1737: LGTM!

The return statement correctly uses the Response model to return LLM metadata.


1739-1747: Well-structured endpoint implementation!

The endpoint follows RESTful conventions, has clear documentation, and properly handles the retrieval of slot-mapped actions.

🧰 Tools
🪛 Ruff

1742-1742: Do not perform function call Security in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

tests/integration_test/action_service_test.py (1)

12805-12805: LGTM! Test setup with mocks is well structured.

The test function is properly decorated with all necessary mocks for external dependencies, making it suitable for testing the integration between prompt actions and slot setting.

tests/unit_test/data_processor/data_processor_test.py (2)

55-59: LGTM: Import statements are well-organized.

The imports are logically grouped and necessary for the test cases.


3983-3983: LGTM: Mock setup is appropriate.

The mock configuration for embeddings and training is well-structured and provides the necessary test data.

kairon/chat/actions.py (2)

188-194: Ensure consistent timeout application across requests

The timeout parameter is set to ACTION_SERVER_REQUEST_TIMEOUT in the multi_try_rasa_request call. Verify that this timeout value is consistently used across all requests and consider making it configurable if it's not already.


250-325: Ensure proper session closure to prevent resource leaks

When combining the with statements, verify that the session is still closed on every exit path, including exceptions; if the session must be closed explicitly, do it in a finally block so it cannot leak.

🧰 Tools
🪛 Ruff

305-314: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)
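The SIM117 fix is mechanical; the nested and combined forms below are equivalent (shown here with files rather than an aiohttp session, but the same rewrite applies):

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
a, b = os.path.join(tmp, "a.txt"), os.path.join(tmp, "b.txt")
for p in (a, b):
    with open(p, "w") as f:
        f.write("data")

# Nested form (flagged by Ruff SIM117):
with open(a) as f:
    with open(b) as g:
        nested = f.read() + g.read()

# Combined form: one with statement, both contexts still closed on exit.
with open(a) as f, open(b) as g:
    combined = f.read() + g.read()

print(nested == combined)  # True
```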

tests/unit_test/llm_test.py (3)

108-113: Clarify the enable_search setting for the 'city' column

In the CognitionSchema for User_details, the 'city' column has enable_search set to False while create_embeddings is True. This means embeddings will be created for 'city', but it won't be included in search payloads. Is this the intended behavior?


141-141: Ensure mock_embedding.side_effect aligns with embedding calls

The mock_embedding.side_effect is set to return two responses: one with two embeddings and another with one embedding. Please verify that this setup matches the expected number of embeddings based on the test data and how the embeddings are processed in batches.
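The two-response side_effect matches a batched embedding flow; a minimal batching helper shows why three texts can yield one call with two embeddings and one with a single embedding (the batch size of 2 here is illustrative — the PR batches in groups of 100):

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = ["t1", "t2", "t3"]
batches = list(batched(texts, 2))
print([len(b) for b in batches])  # [2, 1]
```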


201-205: Confirm payload fields correspond to enable_search settings

In the assertion at lines 201-205, only the fields with enable_search set to True ('country' and 'role') are included in the payload, while fields like 'lang' are excluded. This behavior seems correct based on the schema definitions. Good job on ensuring the payload respects the schema settings.

Comment on lines +7470 to +7919
bot_response = slots
type = text
"""
script = textwrap.dedent(script)
PyscriptActionConfig(
    name="pyscript_action_1",
    source_code=script,
    dispatch_response=True,
    bot=pytest.bot,
    user="user"
).save()
script = """
slots = {"location": "Bangalore"}
bot_response = slots
type = text
"""
script = textwrap.dedent(script)
PyscriptActionConfig(
    name="pyscript_action_2",
    source_code=script,
    dispatch_response=True,
    bot=pytest.bot,
    user="user"
).save()
DatabaseAction(
    name="database_action_1",
    collection='vector_db_collection',
    payload=[DbQuery(query_type=DbActionOperationType.payload_search.value,
                     type=DbQueryValueType.from_slot.value,
                     value="name")],
    response=HttpActionResponse(value="The value of ${data.0.city} with color ${data.0.color} is ${data.0.id}"),
    set_slots=[SetSlotsFromResponse(name="city_value", value="${data.0.id}")],
    bot=pytest.bot,
    user="user"
).save()
DatabaseAction(
    name="database_action_2",
    collection='vector_db_collection',
    payload=[DbQuery(query_type=DbActionOperationType.payload_search.value,
                     type=DbQueryValueType.from_slot.value,
                     value="search")],
    response=HttpActionResponse(value="The value of ${data.0.city} with color ${data.0.color} is ${data.0.id}"),
    set_slots=[SetSlotsFromResponse(name="location", value="${data.0.id}")],
    bot=pytest.bot,
    user="user"
).save()

CallbackConfig(
    name="callback_script2",
    pyscript_code="bot_response='hello world'",
    validation_secret=encrypt_secret(
        "gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),
    execution_mode="sync",
    bot="6697add6b8e47524eb983373",
).save()

CallbackActionConfig(
    name="callback_action1",
    callback_name="callback_script2",
    dynamic_url_slot_name="location",
    metadata_list=[],
    bot_response="Hello",
    dispatch_bot_response=True,
    bot=pytest.bot,
    user="user"
).save()
CallbackActionConfig(
    name="callback_action2",
    callback_name="callback_script2",
    dynamic_url_slot_name="name",
    metadata_list=[],
    bot_response="Hello",
    dispatch_bot_response=True,
    bot=pytest.bot,
    user="user"
).save()
ScheduleAction(
    name="schedule_action_1",
    bot=pytest.bot,
    user="user",
    schedule_time=CustomActionDynamicParameters(parameter_type=ActionParameterType.slot.value,
                                                value="name"),
    schedule_action="callback_script2",
    timezone="Asia/Kolkata",
    response_text="Action schedule",
    params_list=[CustomActionRequestParameters(key="bot", parameter_type="slot", value="bot", encrypt=True),
                 CustomActionRequestParameters(key="user", parameter_type="value", value="1011", encrypt=False)]
).save()
ScheduleAction(
    name="schedule_action_2",
    bot=pytest.bot,
    user="user",
    schedule_time=CustomActionDynamicParameters(parameter_type=ActionParameterType.slot.value,
                                                value="location"),
    schedule_action="callback_script2",
    timezone="Asia/Kolkata",
    response_text="Action schedule",
    params_list=[CustomActionRequestParameters(key="name", parameter_type="slot", value="name", encrypt=True),
                 CustomActionRequestParameters(key="user", parameter_type="value", value="1011", encrypt=False)]
).save()

🛠️ Refactor suggestion

Consider breaking down the fixture into smaller, focused fixtures.

The save_actions fixture is quite large and sets up many different types of actions. Consider splitting it into smaller fixtures based on action types (e.g., save_http_actions, save_email_actions, etc.) for better maintainability and reusability.

Comment on lines +7870 to +7871
validation_secret=encrypt_secret(
"gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),

⚠️ Potential issue

Avoid hardcoding encrypted secrets in test files.

The validation secret appears to be hardcoded. Consider moving it to a test configuration file or using a mock secret for testing.

-        validation_secret=encrypt_secret(
-            "gAAAAABmqK71xDb4apnxOAfJjDUv1lrCTooWNX0GPyBHhqW1KBlblUqGNPwsX1V7FlIlgpwWGRWljiYp9mYAf1eG4AcG1dTXQuZCndCewox"),
+        validation_secret=encrypt_secret(TEST_SECRET),

Committable suggestion was skipped due to low confidence.

Comment on lines +3 to +5
os.environ["system_file"] = "./tests/testing_data/system.yaml"
Utility.load_environment()
Utility.load_system_metadata()

🛠️ Refactor suggestion

Consider using pytest fixture for environment setup.

Moving the environment setup into a fixture would make it reusable across test files and ensure proper cleanup after tests.

import pytest

@pytest.fixture(autouse=True)
def setup_environment():
    os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
    Utility.load_environment()
    Utility.load_system_metadata()
    yield
    # Cleanup if needed
    if "SYSTEM_FILE" in os.environ:
        del os.environ["SYSTEM_FILE"]
🧰 Tools
🪛 Ruff

3-3: Use capitalized environment variable SYSTEM_FILE instead of system_file

Replace system_file with SYSTEM_FILE

(SIM112)

@@ -0,0 +1,49 @@
import os
from kairon.shared.utils import Utility
os.environ["system_file"] = "./tests/testing_data/system.yaml"

⚠️ Potential issue

Use uppercase for environment variable name.

Following Python conventions, environment variables should be in uppercase.

-os.environ["system_file"] = "./tests/testing_data/system.yaml"
+os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
os.environ["system_file"] = "./tests/testing_data/system.yaml"
os.environ["SYSTEM_FILE"] = "./tests/testing_data/system.yaml"
🧰 Tools
🪛 Ruff

3-3: Use capitalized environment variable SYSTEM_FILE instead of system_file

Replace system_file with SYSTEM_FILE

(SIM112)

Comment on lines +1 to +8
import pytest
from aioresponses import aioresponses
from yarl import URL

from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError


⚠️ Potential issue

Remove unused import yarl.URL

The yarl.URL import on line 3 is not used in any of the test cases.

Apply this diff:

import pytest
from aioresponses import aioresponses
-from yarl import URL

from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError
📝 Committable suggestion


Suggested change
import pytest
from aioresponses import aioresponses
from yarl import URL
from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError
import pytest
from aioresponses import aioresponses
from kairon.chat.actions import KRemoteAction
from rasa.utils.endpoints import EndpointConfig
from rasa.utils.endpoints import ClientResponseError
🧰 Tools
🪛 Ruff

3-3: yarl.URL imported but unused

Remove unused import: yarl.URL

(F401)

Comment on lines +374 to +378
for set_slot in set_slots:
if set_slot.get('name') == slot:
return True

return False

🛠️ Refactor suggestion

Use the any() function to simplify loops

In the is_slot_in_prompt_action_config method, you can simplify the loops with the any() function.

Replace:

for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True

return False

with:

return any(set_slot.get('name') == slot for set_slot in set_slots)
🧰 Tools
🪛 Ruff

374-378: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)

Comment on lines +393 to +397
for value in metadata.values():
if value == slot:
return True

return False

🛠️ Refactor suggestion

Simplify the loop using the any() function

In the is_slot_in_pipedrive_leads_action_config method, you can simplify the loop.

Replace:

for value in metadata.values():
    if value == slot:
        return True

return False

with:

return any(value == slot for value in metadata.values())
🧰 Tools
🪛 Ruff

393-397: Use return any(value == slot for value in metadata.values()) instead of for loop

Replace with return any(value == slot for value in metadata.values())

(SIM110)

Comment on lines +406 to +409
if config.get("set_slot") == slot:
return True

return False

🛠️ Refactor suggestion

Simplify the return statement by returning the condition directly

In the is_slot_in_google_search_action_config method, you can return the condition directly.

Replace:

if config.get("set_slot") == slot:
    return True

return False

with:

return config.get("set_slot") == slot
🧰 Tools
🪛 Ruff

406-409: Return the condition config.get("set_slot") == slot directly

Replace with return config.get("set_slot") == slot

(SIM103)

Comment on lines +414 to +417
for set_slot in set_slots:
if set_slot.get('name') == slot:
return True
return False

🛠️ Refactor suggestion

Use the any() function to simplify loops

In the is_slot_in_slot_set_action_config method, you can use the any() function.

Replace:

for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True

return False

with:

return any(set_slot.get('name') == slot for set_slot in set_slots)
🧰 Tools
🪛 Ruff

414-417: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)

Comment on lines +447 to +452
for set_slot in set_slots:
if set_slot.get('name') == slot:
return True

return False


🛠️ Refactor suggestion

Use the any() function to simplify loops

In the is_slot_in_http_config method, simplify the loops using the any() function.

Replace:

for set_slot in set_slots:
    if set_slot.get('name') == slot:
        return True

return False

with:

return any(set_slot.get('name') == slot for set_slot in set_slots)
🧰 Tools
🪛 Ruff

447-451: Use return any(set_slot.get("name") == slot for set_slot in set_slots) instead of for loop

Replace with return any(set_slot.get("name") == slot for set_slot in set_slots)

(SIM110)
