Updated code to save data in cognition data in respective data types #1618
Conversation
…rather than strings and updated corresponding test cases
Walkthrough
The changes in this pull request focus on saving cognition data in its respective data types rather than as strings, along with updates to the corresponding test cases.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (8)
tests/unit_test/data_processor/data_processor_test.py (5)
1637-1640: Consider adding edge case tests for data type validation

While the current tests cover column presence and naming, consider adding tests for:
- Mixed data types in the same column
- Empty values
- Null values
- Special characters in string fields
Also applies to: 1679-1682
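The edge cases listed above can be sketched as table-driven checks. Note that `validate_value` below is a hypothetical stand-in mirroring the validation rules under discussion, not the actual kairon API:

```python
# Hypothetical edge-case checks; `validate_value` stands in for the real
# CognitionDataProcessor.validate_column_values and is NOT the actual kairon API.
def validate_value(value, data_type):
    expected = {"int": int, "float": float, "str": str}[data_type]
    if value is not None and not isinstance(value, expected):
        raise ValueError(f"Expected {data_type}, got {type(value).__name__}")
    return value

edge_cases = [
    (0, "int", True),        # zero
    (-5, "int", True),       # negative
    (1.5, "int", False),     # mixed type in an int column
    (None, "int", True),     # null value passes through
    ("", "str", True),       # empty string
    ("n/a!", "str", True),   # special characters in string fields
]

for value, data_type, should_pass in edge_cases:
    try:
        validate_value(value, data_type)
        assert should_pass, f"{value!r} unexpectedly accepted as {data_type}"
    except ValueError:
        assert not should_pass, f"{value!r} unexpectedly rejected as {data_type}"
```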
1828-1838: Consider adding boundary test cases for numeric fields

While the current upsert tests are good, consider adding tests for:
- Maximum/minimum values for numeric fields
- Zero values
- Negative values (if applicable)
- Decimal precision edge cases for price
Also applies to: 1944-1954
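A minimal sketch of boundary rows such tests could feed through the upsert path; the field names mirror the test data discussed here (`id`, `price`, `quantity`), while the specific values are illustrative assumptions:

```python
import sys

# Illustrative boundary values for numeric fields; exact limits depend on the
# storage backend, so these rows are assumptions, not kairon test data.
boundary_rows = [
    {"id": 1, "price": 0.0, "quantity": 0},                           # zeros
    {"id": 2, "price": 0.01, "quantity": 1},                          # smallest price step
    {"id": 3, "price": float(sys.maxsize), "quantity": sys.maxsize},  # extreme magnitudes
    {"id": 4, "price": -1.0, "quantity": -1},                         # negatives, if the domain allows
]

# Each row should preserve its declared numeric types end to end.
for row in boundary_rows:
    assert isinstance(row["price"], float)
    assert isinstance(row["quantity"], int)
```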
2041-2044: Consider enhancing data retrieval test robustness

While the current tests verify basic retrieval, consider:
- Testing retrieval with multiple documents
- Verifying data type preservation after retrieval
- Testing sorting/filtering if supported
Also applies to: 2095-2095
18000-18000: Consider adding more data type validation test cases

While the current test verifies integer validation, consider adding tests for:
- Float/decimal validation
- Boolean validation
- Date/time validation (if applicable)
- String length/format validation
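These cases can be checked with a small helper; `check_type` below is hypothetical, not kairon's validator. One subtlety worth a dedicated test: `bool` is a subclass of `int` in Python, so a plain `isinstance(value, int)` check silently accepts `True`:

```python
# Hedged sketch of extra type-validation cases; `check_type` is a hypothetical
# helper, not kairon's validator. Because bool is a subclass of int in Python,
# the int branch must explicitly reject bools if they should not pass.
def check_type(value, data_type):
    if data_type == "bool":
        return isinstance(value, bool)
    if data_type == "int":
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "float":
        return isinstance(value, float)
    return isinstance(value, str)

assert check_type(3.14, "float")
assert not check_type("3.14", "float")   # numeric strings are not floats
assert check_type(True, "bool")
assert not check_type(True, "int")       # bools rejected for int columns
```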
Line range hint 1475-18000: Consider implementing a comprehensive test data factory

To improve test maintainability and coverage, consider:
- Creating a test data factory to generate test data with proper types
- Implementing helper methods for common test scenarios
- Using parameterized tests for edge cases
- Adding documentation about expected data types for each field
This will make it easier to maintain tests and ensure consistent data type handling across all test cases.
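A minimal sketch of such a factory, assuming illustrative field names (`id`, `item`, `price`, `quantity`) rather than the exact schema used in the suite:

```python
import itertools

# Hypothetical test data factory: defaults carry schema-correct types and each
# call gets a fresh integer id; individual tests override only what they need.
_ids = itertools.count(1)

def make_row(**overrides):
    """Build a row with schema-correct default types, overridable per test."""
    row = {"id": next(_ids), "item": "Widget", "price": 9.99, "quantity": 10}
    row.update(overrides)
    return row

row = make_row(price=0.0)
assert isinstance(row["id"], int)
assert isinstance(row["price"], float)
assert row["quantity"] == 10
```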
tests/integration_test/services_test.py (1)
2047-2056: Consider documenting the error response format

The error response format is well-structured and includes detailed information about validation failures. Consider adding a comment or documentation that describes this format for better maintainability.
Add a comment above the assertion:
```diff
+ # Error response format:
+ # {document_id: [{status: str, primary_key: Any, message: str}]}
  assert actual["data"] == {'2': [{'status': 'Document does not exist', 'primary_key': 2, 'message': "No document found for 'id': 2"}]}
```

kairon/shared/cognition/processor.py (2)
386-399: Improved type validation, but could be more maintainable

The type validation logic is more robust with explicit type checking, but could be further improved for maintainability.

Consider this more maintainable approach using a type mapping:

```diff
 @staticmethod
 def validate_column_values(data: Any, schema: Dict) -> Any:
     if schema and isinstance(data, dict):
         data_type = schema['data_type']
         column_name = schema['column_name']
         if column_name in data and data[column_name] is not None:
             value = data[column_name]
-            if data_type == CognitionMetadataType.int.value and not isinstance(value, int):
-                raise AppException(
-                    f"Invalid data type for '{column_name}': Expected integer value")
-
-            if data_type == CognitionMetadataType.float.value and not isinstance(value, float):
-                raise AppException(
-                    f"Invalid data type for '{column_name}': Expected float value")
+            type_mapping = {
+                CognitionMetadataType.int.value: (int, "integer"),
+                CognitionMetadataType.float.value: (float, "float")
+            }
+
+            if data_type in type_mapping:
+                expected_type, type_name = type_mapping[data_type]
+                if not isinstance(value, expected_type):
+                    raise AppException(
+                        f"Invalid data type for '{column_name}': Expected {type_name} value")
             return value
         else:
             raise AppException(f"Column '{column_name}' does not exist or has no value.")
```
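As a self-contained illustration of the mapping approach, with `AppException` stubbed locally and plain strings standing in for the `CognitionMetadataType` enum values:

```python
# Standalone sketch of the type-mapping suggestion above. AppException and the
# string keys are local stand-ins for the kairon types, not the real imports.
class AppException(Exception):
    pass

TYPE_MAPPING = {
    "int": (int, "integer"),
    "float": (float, "float"),
}

def validate_column_values(data, schema):
    if schema and isinstance(data, dict):
        data_type = schema["data_type"]
        column_name = schema["column_name"]
        if column_name in data and data[column_name] is not None:
            value = data[column_name]
            if data_type in TYPE_MAPPING:
                expected_type, type_name = TYPE_MAPPING[data_type]
                if not isinstance(value, expected_type):
                    raise AppException(
                        f"Invalid data type for '{column_name}': Expected {type_name} value")
            return value
        raise AppException(f"Column '{column_name}' does not exist or has no value.")

assert validate_column_values({"price": 9.5}, {"column_name": "price", "data_type": "float"}) == 9.5
```

Adding a new typed column then only requires one more entry in the mapping, rather than another `if` branch.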
488-489: Simplify nested conditions

The nested if statements can be combined for better readability.

```diff
-if "document_non_existence" in event_validations:
-    if row_key not in existing_document_map:
+if "document_non_existence" in event_validations and row_key not in existing_document_map:
     row_errors.append({
         "status": "Document does not exist",
         "primary_key": row_key,
         "message": f"No document found for '{primary_key_col}': {row_key}"
     })
```

🧰 Tools
🪛 Ruff (0.8.0)
488-489: Use a single `if` statement instead of nested `if` statements (SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
kairon/shared/cognition/processor.py (2 hunks)
tests/integration_test/services_test.py (12 hunks)
tests/unit_test/data_processor/data_processor_test.py (22 hunks)
🔥 Files not summarized due to errors (1)
- tests/unit_test/data_processor/data_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/shared/cognition/processor.py
488-489: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (6)
tests/unit_test/data_processor/data_processor_test.py (2)
1475-1478: LGTM: Test data uses appropriate data types

The test data correctly uses proper data types for numeric fields:
- `id`: integer type for identifiers
- `price`: float type for monetary values
- `quantity`: integer type for count values
Also applies to: 1492-1495
1737-1739: LGTM: Proper document existence validation
The test correctly verifies:
- Document lookup using primary key
- Error handling for non-existent documents
- Appropriate error messages
tests/integration_test/services_test.py (3)
1455-1458: LGTM! Data types are properly updated
The conversion from string representations to proper numeric types (int for id/quantity, float for price) is consistent and aligns with the PR objective of storing data in appropriate data types.
Also applies to: 1474-1475, 1493-1494
Line range hint 1455-2056: Verify coverage of data type edge cases
While the current test cases cover basic numeric types, we should verify handling of edge cases.
Let's check for existing edge case tests:
✅ Verification successful
Edge cases for numeric data types are well covered
The codebase has comprehensive test coverage for numeric data type edge cases:
- Type validation tests:
- Invalid float/numeric value validation
- Integer type validation
- Type conversion checks
- Boundary tests:
- Min/max value validation
- Min value must be less than max value
- Initial value validation against min/max bounds
- Format validation:
- String to numeric conversion validation
- Numeric string format validation
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for test cases handling extreme values or invalid types
rg -A 5 "test.*invalid.*type" tests/
rg -A 5 "test.*edge.*case" tests/
```
Length of output: 45633
Script:

```shell
#!/bin/bash
# Search for numeric data type tests and edge cases
rg -A 5 "test.*numeric" tests/
rg -A 5 "test.*float" tests/
rg -A 5 "test.*integer" tests/
rg -A 5 "test.*decimal" tests/
rg -A 5 "test.*number" tests/
rg -A 5 "test.*max.*value" tests/
rg -A 5 "test.*min.*value" tests/
```
Length of output: 11229
1578-1581: LGTM! Field update test cases properly handle data types
The test cases correctly validate both full and partial updates while maintaining proper data types. The expected data accurately reflects the merged state after updates.
Let's verify the schema validation in the main code:
Also applies to: 1593-1596, 1612-1613, 1631-1632
kairon/shared/cognition/processor.py (1)
489-489: LGTM! Simplified document existence check
The direct key comparison is more efficient than string representation comparison.
🧰 Tools
🪛 Ruff (0.8.0)
488-489: Use a single `if` statement instead of nested `if` statements (SIM102)
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
kairon/shared/cognition/processor.py (1)
492-493: Simplify nested if statements into a single condition

Combine the nested `if` statements into a single condition to improve code readability and reduce unnecessary nesting.

Apply this diff to make the change:

```diff
-if "document_non_existence" in event_validations:
-    if row_key not in existing_document_map:
+if "document_non_existence" in event_validations and row_key not in existing_document_map:
     row_errors.append({
         "status": "Document does not exist",
         "primary_key": row_key,
         "message": f"No document found for '{primary_key_col}': {row_key}"
     })
```

🧰 Tools
🪛 Ruff (0.8.0)
492-493: Use a single `if` statement instead of nested `if` statements (SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
kairon/shared/cognition/processor.py (2 hunks)
kairon/shared/data/processor.py (1 hunks)
tests/integration_test/services_test.py (15 hunks)
tests/unit_test/data_processor/data_processor2_test.py (2 hunks)
tests/unit_test/llm_test.py (1 hunks)
tests/unit_test/validator/content_importer_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration_test/services_test.py
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/shared/data/processor.py
8737-8738: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
kairon/shared/cognition/processor.py
492-493: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (8)
kairon/shared/data/processor.py (1)
8728-8729: Correct indentation for nested loops

The `for column, value in row.items():` loop should be nested inside the `for row in reversed(doc_content):` loop. Currently, both loops are at the same indentation level, which will cause `row` to be undefined in the inner loop.

Apply this diff to fix the indentation:

```diff
-for row in reversed(doc_content):
-for column, value in row.items():
+for row in reversed(doc_content):
+    for column, value in row.items():
```

Likely invalid or redundant comment.
tests/unit_test/data_processor/data_processor2_test.py (1)
56-138: LGTM: Newly added test functions appropriately cover validation scenarios

The test functions `test_validate_metadata_and_payload_valid`, `test_validate_metadata_and_payload_invalid_str`, `test_validate_metadata_and_payload_invalid_int`, `test_validate_metadata_and_payload_invalid_float`, and `test_validate_metadata_and_payload_missing_column` effectively test the `validate_column_values` method for different cases, ensuring robust validation logic.
tests/unit_test/validator/content_importer_test.py (3)
154-157: Update expected values to match correct data types

The expected values for `'order_id'`, `'sales'`, and `'profit'` have been updated from strings to numeric types, aligning with the data types defined in the schema. This ensures that the assertions correctly reflect the actual data returned.
188-191: Update expected values to match correct data types

In `test_import_data_partial_success`, the expected values for `'order_id'`, `'sales'`, and `'profit'` have been updated to numeric types. This change ensures that the test accurately validates the data according to the schema.
196-199: Update expected values to match correct data types

The adjustments to the expected values for `'order_id'`, `'sales'`, and `'profit'` ensure that the test assertions are consistent with the actual data types, improving the test's reliability.
kairon/shared/cognition/processor.py (1)
386-403: LGTM: Enhanced validation logic in validate_column_values

The updated `validate_column_values` method now includes precise type-checking for integer, float, and string data types, raising specific exceptions for invalid data types and missing columns. This enhancement improves the robustness and clarity of data validation.
tests/unit_test/llm_test.py (2)
238-238: LGTM! Data type consistency improvement.
The change correctly sets the age field as an integer value, maintaining consistency with the schema definition where age is specified as an integer data type.
Line range hint 1-1186: Excellent test coverage and structure!
The test suite demonstrates comprehensive coverage of the LLMProcessor functionality:
- Thorough testing of data type handling
- Robust error scenario coverage
- Clear separation of test cases
- Proper use of mocking for external dependencies
- Detailed verification of API interactions and responses
```python
raise ValueError(
    f"Error converting column '{column}' with value '{value}' to type '{data_type}'")
```
🛠️ Refactor suggestion
Use `raise ... from ...` in except clause to preserve original traceback

When re-raising an exception in the `except` block, include the original exception using `from err` to preserve the traceback and help distinguish it from errors in exception handling.

Apply this diff to fix the issue:

```diff
-except (ValueError, TypeError):
-    raise ValueError(
-        f"Error converting column '{column}' with value '{value}' to type '{data_type}'")
+except (ValueError, TypeError) as err:
+    raise ValueError(
+        f"Error converting column '{column}' with value '{value}' to type '{data_type}'") from err
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
except (ValueError, TypeError) as err:
    raise ValueError(
        f"Error converting column '{column}' with value '{value}' to type '{data_type}'") from err
```
🧰 Tools
🪛 Ruff (0.8.0)
8737-8738: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
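A runnable illustration of why the chaining matters; `convert` is a hypothetical helper mirroring the conversion logic under review, not the actual `processor.py` code:

```python
# With `raise ... from err`, the original failure survives as __cause__, so the
# traceback shows both the conversion error and the underlying TypeError/ValueError.
def convert(column, value, data_type):
    try:
        return data_type(value)
    except (ValueError, TypeError) as err:
        raise ValueError(
            f"Error converting column '{column}' with value '{value}' "
            f"to type '{data_type.__name__}'") from err

try:
    convert("price", "abc", float)
except ValueError as exc:
    assert isinstance(exc.__cause__, ValueError)  # original error preserved
```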
```python
schema = CognitionSchema.objects(bot=bot, collection_name=table_name).first()

metadata_map = {meta['column_name']: meta['data_type'] for meta in schema.metadata}
```
Handle the case where the schema is not found

In the `save_doc_content` method, if no schema is found for the given `table_name`, `schema` will be `None`, leading to an `AttributeError` when accessing `schema.metadata`. Please add a check to handle the case where the schema does not exist and raise an appropriate exception.

Apply this diff to fix the issue:

```diff
 schema = CognitionSchema.objects(bot=bot, collection_name=table_name).first()
+if schema is None:
+    raise ValueError(f"No schema found for table '{table_name}'")
 metadata_map = {meta['column_name']: meta['data_type'] for meta in schema.metadata}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
schema = CognitionSchema.objects(bot=bot, collection_name=table_name).first()
if schema is None:
    raise ValueError(f"No schema found for table '{table_name}'")
metadata_map = {meta['column_name']: meta['data_type'] for meta in schema.metadata}
```
Approved