
Updated code to save data in cognition data in respective data types #1618

Merged

Conversation

himanshugt16
Contributor

@himanshugt16 himanshugt16 commented Dec 9, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced data validation with clearer error reporting and more efficient checks.
    • Introduced a method for saving document content with type validation and error handling.
  • Bug Fixes

    • Improved handling of data types in validation methods.
  • Tests

    • Updated test cases to reflect correct data types for fields, ensuring consistency and accuracy.
    • Added new tests to validate metadata and payloads in the data processor.
    • Enhanced test coverage for the LLMProcessor, focusing on data integrity and error handling.

…rather than strings and updated corresponding test cases

coderabbitai bot commented Dec 9, 2024

Walkthrough

The changes in this pull request focus on enhancing the CognitionDataProcessor class within the kairon/shared/cognition/processor.py file, specifically improving the validation methods validate_column_values and validate_data. The validation logic has been refined to enhance error handling and clarity, including more descriptive exception messages. Additionally, the tests/integration_test/services_test.py file has been updated to reflect changes in data types for various fields in test cases, ensuring consistency with the modified validation logic. New tests have also been added for the CognitionDataProcessor class to validate metadata and payloads.

Changes

  • kairon/shared/cognition/processor.py: Modified validate_column_values and validate_data methods for improved validation logic and error handling.
  • kairon/shared/data/processor.py: Added save_doc_content method for saving document content with type validation and error handling; introduced get_error_report_file_path static method.
  • tests/integration_test/services_test.py: Updated data types for id, price, and quantity fields in multiple test cases to align with the refined validation logic.
  • tests/unit_test/data_processor/data_processor2_test.py: Added new tests for metadata and payload validation using CognitionDataProcessor.
  • tests/unit_test/llm_test.py: Adjusted data types and enhanced assertions in tests for the LLMProcessor class.
  • tests/unit_test/validator/content_importer_test.py: Updated expected values in assertions from string representations to numeric types for imported data validation.

Possibly related PRs

  • Upload Document Content(CSV) #1519: The changes in this PR involve enhancements to document content management, which relates to the validation logic in the CognitionDataProcessor class in the main PR, as both focus on improving data handling and validation processes.
  • Extended knowledge vault sync functionality to accomodate different events #1606: This PR extends the knowledge vault sync functionality, which includes validation logic that aligns with the changes made in the CognitionDataProcessor for handling different event types during data validation.
  • Added api to update knowldege vault real time #1599: The introduction of the knowledge_vault_sync API in this PR is directly related to the validation logic enhancements in the CognitionDataProcessor, as both involve data validation and synchronization processes.
  • Model training time optimization #1589: This PR optimizes the LLMProcessor class for batch processing, which may indirectly relate to the validation improvements in the CognitionDataProcessor as both deal with data processing efficiency.

Suggested reviewers

  • hiteshghuge

🐇 In the land of code where bunnies hop,
Validation's refined, it won't ever stop!
With types now corrected, and errors so clear,
Our tests are all ready, let’s give a cheer!
For each little change, we celebrate bright,
In the world of our logic, all feels just right! 🌟


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dadc3f4 and 0bd395c.

📒 Files selected for processing (1)
  • tests/integration_test/services_test.py (16 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration_test/services_test.py



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (8)
tests/unit_test/data_processor/data_processor_test.py (5)

1637-1640: Consider adding edge case tests for data type validation

While the current tests cover column presence and naming, consider adding tests for:

  • Mixed data types in the same column
  • Empty values
  • Null values
  • Special characters in string fields

Also applies to: 1679-1682


1828-1838: Consider adding boundary test cases for numeric fields

While the current upsert tests are good, consider adding tests for:

  • Maximum/minimum values for numeric fields
  • Zero values
  • Negative values (if applicable)
  • Decimal precision edge cases for price

Also applies to: 1944-1954


2041-2044: Consider enhancing data retrieval test robustness

While the current tests verify basic retrieval, consider:

  • Testing retrieval with multiple documents
  • Verifying data type preservation after retrieval
  • Testing sorting/filtering if supported

Also applies to: 2095-2095


18000-18000: Consider adding more data type validation test cases

While the current test verifies integer validation, consider adding tests for:

  • Float/decimal validation
  • Boolean validation
  • Date/time validation (if applicable)
  • String length/format validation

Line range hint 1475-18000: Consider implementing a comprehensive test data factory

To improve test maintainability and coverage, consider:

  1. Creating a test data factory to generate test data with proper types
  2. Implementing helper methods for common test scenarios
  3. Using parameterized tests for edge cases
  4. Adding documentation about expected data types for each field

This will make it easier to maintain tests and ensure consistent data type handling across all test cases.

tests/integration_test/services_test.py (1)

2047-2056: Consider documenting the error response format

The error response format is well-structured and includes detailed information about validation failures. Consider adding a comment or documentation that describes this format for better maintainability.

Add a comment above the assertion:

+   # Error response format:
+   # {document_id: [{status: str, primary_key: Any, message: str}]}
    assert actual["data"] == {'2': [{'status': 'Document does not exist', 'primary_key': 2, 'message': "No document found for 'id': 2"}]}
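The suggested comment documents the shape `{document_id: [{status, primary_key, message}]}`. As a hedged illustration (the dict below is lifted from the assertion above, not from additional API output), a consumer could flatten that structure into readable lines:

```python
# Sketch of the documented error-report shape:
# {document_id: [{"status": str, "primary_key": Any, "message": str}]}
error_report = {
    "2": [{
        "status": "Document does not exist",
        "primary_key": 2,
        "message": "No document found for 'id': 2",
    }]
}

# Flatten the nested structure into one human-readable line per error.
lines = [
    f"doc {doc_id}: {err['status']} ({err['message']})"
    for doc_id, errors in error_report.items()
    for err in errors
]
```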
kairon/shared/cognition/processor.py (2)

386-399: Improved type validation, but could be more maintainable

The type validation logic is more robust with explicit type checking, but could be further improved for maintainability.

Consider this more maintainable approach using a type mapping:

@staticmethod
def validate_column_values(data: Any, schema: Dict) -> Any:
    if schema and isinstance(data, dict):
        data_type = schema['data_type']
        column_name = schema['column_name']
        
        if column_name in data and data[column_name] is not None:
            value = data[column_name]
            
-            if data_type == CognitionMetadataType.int.value and not isinstance(value, int):
-                raise AppException(
-                    f"Invalid data type for '{column_name}': Expected integer value")
-
-            if data_type == CognitionMetadataType.float.value and not isinstance(value, float):
-                raise AppException(
-                    f"Invalid data type for '{column_name}': Expected float value")
+            type_mapping = {
+                CognitionMetadataType.int.value: (int, "integer"),
+                CognitionMetadataType.float.value: (float, "float")
+            }
+            
+            if data_type in type_mapping:
+                expected_type, type_name = type_mapping[data_type]
+                if not isinstance(value, expected_type):
+                    raise AppException(
+                        f"Invalid data type for '{column_name}': Expected {type_name} value")
            
            return value
        else:
            raise AppException(f"Column '{column_name}' does not exist or has no value.")
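The type-mapping dispatch can be exercised in isolation. This standalone sketch uses a stand-in `AppException` and plain string keys in place of `CognitionMetadataType` values, so it only demonstrates the pattern, not the project's actual classes:

```python
# Stand-in for kairon's AppException; the mapping keys stand in for
# CognitionMetadataType.int.value / CognitionMetadataType.float.value.
class AppException(Exception):
    pass

TYPE_MAPPING = {
    "int": (int, "integer"),
    "float": (float, "float"),
}

def check_value(column_name, data_type, value):
    """Raise AppException when value does not match the declared data_type."""
    if data_type in TYPE_MAPPING:
        expected_type, type_name = TYPE_MAPPING[data_type]
        if not isinstance(value, expected_type):
            raise AppException(
                f"Invalid data type for '{column_name}': Expected {type_name} value"
            )
    return value
```

One caveat with this pattern: `bool` is a subclass of `int` in Python, so `isinstance(True, int)` passes the integer check; a stricter validator might exclude booleans explicitly.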

488-489: Simplify nested conditions

The nested if statements can be combined for better readability.

-            if "document_non_existence" in event_validations:
-                if row_key not in existing_document_map:
+            if "document_non_existence" in event_validations and row_key not in existing_document_map:
                    row_errors.append({
                        "status": "Document does not exist",
                        "primary_key": row_key,
                        "message": f"No document found for '{primary_key_col}': {row_key}"
                    })
🧰 Tools
🪛 Ruff (0.8.0)

488-489: Use a single if statement instead of nested if statements

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b7f6d23 and 14ab65c.

📒 Files selected for processing (3)
  • kairon/shared/cognition/processor.py (2 hunks)
  • tests/integration_test/services_test.py (12 hunks)
  • tests/unit_test/data_processor/data_processor_test.py (22 hunks)
🔥 Files not summarized due to errors (1)
  • tests/unit_test/data_processor/data_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/shared/cognition/processor.py

488-489: Use a single if statement instead of nested if statements

(SIM102)

🔇 Additional comments (6)
tests/unit_test/data_processor/data_processor_test.py (2)

1475-1478: LGTM: Test data uses appropriate data types

The test data correctly uses proper data types for numeric fields:

  • id: integer type for identifiers
  • price: float type for monetary values
  • quantity: integer type for count values

Also applies to: 1492-1495


1737-1739: LGTM: Proper document existence validation

The test correctly verifies:

  • Document lookup using primary key
  • Error handling for non-existent documents
  • Appropriate error messages
tests/integration_test/services_test.py (3)

1455-1458: LGTM! Data types are properly updated

The conversion from string representations to proper numeric types (int for id/quantity, float for price) is consistent and aligns with the PR objective of storing data in appropriate data types.

Also applies to: 1474-1475, 1493-1494


Line range hint 1455-2056: Verify coverage of data type edge cases

While the current test cases cover basic numeric types, we should verify handling of edge cases.

Let's check for existing edge case tests:

✅ Verification successful

Edge cases for numeric data types are well covered

The codebase has comprehensive test coverage for numeric data type edge cases:

  • Type validation tests:
    • Invalid float/numeric value validation
    • Integer type validation
    • Type conversion checks
  • Boundary tests:
    • Min/max value validation
    • Min value must be less than max value
    • Initial value validation against min/max bounds
  • Format validation:
    • String to numeric conversion validation
    • Numeric string format validation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases handling extreme values or invalid types
rg -A 5 "test.*invalid.*type" tests/
rg -A 5 "test.*edge.*case" tests/

Length of output: 45633


Script:

#!/bin/bash
# Search for numeric data type tests and edge cases
rg -A 5 "test.*numeric" tests/
rg -A 5 "test.*float" tests/
rg -A 5 "test.*integer" tests/
rg -A 5 "test.*decimal" tests/
rg -A 5 "test.*number" tests/
rg -A 5 "test.*max.*value" tests/
rg -A 5 "test.*min.*value" tests/

Length of output: 11229


1578-1581: LGTM! Field update test cases properly handle data types

The test cases correctly validate both full and partial updates while maintaining proper data types. The expected data accurately reflects the merged state after updates.

Let's verify the schema validation in the main code:

Also applies to: 1593-1596, 1612-1613, 1631-1632

kairon/shared/cognition/processor.py (1)

489-489: LGTM! Simplified document existence check

The direct key comparison is more efficient than string representation comparison.
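A tiny illustration of why the direct comparison matters once ids are stored as integers (the map contents here are hypothetical sample data):

```python
# With integer primary keys, comparing against the string form of the
# key silently misses every existing document.
existing_document_map = {2: {"id": 2, "price": 299.99}}

assert 2 in existing_document_map        # direct key comparison: found
assert "2" not in existing_document_map  # str form would report "missing"
```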

🧰 Tools
🪛 Ruff (0.8.0)

488-489: Use a single if statement instead of nested if statements

(SIM102)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
kairon/shared/cognition/processor.py (1)

492-493: Simplify nested if statements into a single condition

Combine the nested if statements into a single condition to improve code readability and reduce unnecessary nesting.

Apply this diff to make the change:

-                if "document_non_existence" in event_validations:
-                    if row_key not in existing_document_map:
+                if "document_non_existence" in event_validations and row_key not in existing_document_map:
                     row_errors.append({
                         "status": "Document does not exist",
                         "primary_key": row_key,
                         "message": f"No document found for '{primary_key_col}': {row_key}"
                     })
🧰 Tools
🪛 Ruff (0.8.0)

492-493: Use a single if statement instead of nested if statements

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14ab65c and dadc3f4.

📒 Files selected for processing (6)
  • kairon/shared/cognition/processor.py (2 hunks)
  • kairon/shared/data/processor.py (1 hunks)
  • tests/integration_test/services_test.py (15 hunks)
  • tests/unit_test/data_processor/data_processor2_test.py (2 hunks)
  • tests/unit_test/llm_test.py (1 hunks)
  • tests/unit_test/validator/content_importer_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration_test/services_test.py
🧰 Additional context used
🪛 Ruff (0.8.0)
kairon/shared/data/processor.py

8737-8738: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

kairon/shared/cognition/processor.py

492-493: Use a single if statement instead of nested if statements

(SIM102)

🔇 Additional comments (8)
kairon/shared/data/processor.py (1)

8728-8729: ⚠️ Potential issue

Correct indentation for nested loops

The for column, value in row.items(): loop should be nested inside the for row in reversed(doc_content): loop. Currently, both loops are at the same indentation level, which will cause row to be undefined in the inner loop.

Apply this diff to fix the indentation:

-                for row in reversed(doc_content):
-                for column, value in row.items():
+                for row in reversed(doc_content):
+                    for column, value in row.items():

Likely invalid or redundant comment.

tests/unit_test/data_processor/data_processor2_test.py (1)

56-138: LGTM: Newly added test functions appropriately cover validation scenarios

The test functions test_validate_metadata_and_payload_valid, test_validate_metadata_and_payload_invalid_str, test_validate_metadata_and_payload_invalid_int, test_validate_metadata_and_payload_invalid_float, and test_validate_metadata_and_payload_missing_column effectively test the validate_column_values method for different cases, ensuring robust validation logic.
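The PR only lists these test names; their bodies are not shown here. As a hedged sketch of what a case like test_validate_metadata_and_payload_invalid_int plausibly verifies, using a minimal stand-in for the real validator:

```python
import pytest

class AppException(Exception):
    pass

def validate_column_values(data, schema):
    """Minimal stand-in mirroring the behaviour the tests exercise."""
    column = schema["column_name"]
    if column not in data or data[column] is None:
        raise AppException(f"Column '{column}' does not exist or has no value.")
    if schema["data_type"] == "int" and not isinstance(data[column], int):
        raise AppException(f"Invalid data type for '{column}': Expected integer value")
    return data[column]

def test_validate_metadata_and_payload_valid():
    schema = {"column_name": "quantity", "data_type": "int"}
    assert validate_column_values({"quantity": 5}, schema) == 5

def test_validate_metadata_and_payload_invalid_int():
    schema = {"column_name": "quantity", "data_type": "int"}
    with pytest.raises(AppException, match="Expected integer value"):
        validate_column_values({"quantity": "5"}, schema)
```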

tests/unit_test/validator/content_importer_test.py (3)

154-157: Update expected values to match correct data types

The expected values for 'order_id', 'sales', and 'profit' have been updated from strings to numeric types, aligning with the data types defined in the schema. This ensures that the assertions correctly reflect the actual data returned.


188-191: Update expected values to match correct data types

In test_import_data_partial_success, the expected values for 'order_id', 'sales', and 'profit' have been updated to numeric types. This change ensures that the test accurately validates the data according to the schema.


196-199: Update expected values to match correct data types

The adjustments to the expected values for 'order_id', 'sales', and 'profit' ensure that the test assertions are consistent with the actual data types, improving the test's reliability.

kairon/shared/cognition/processor.py (1)

386-403: LGTM: Enhanced validation logic in validate_column_values

The updated validate_column_values method now includes precise type-checking for integer, float, and string data types, raising specific exceptions for invalid data types and missing columns. This enhancement improves the robustness and clarity of data validation.

tests/unit_test/llm_test.py (2)

238-238: LGTM! Data type consistency improvement.

The change correctly sets the age field as an integer value, maintaining consistency with the schema definition where age is specified as an integer data type.


Line range hint 1-1186: Excellent test coverage and structure!

The test suite demonstrates comprehensive coverage of the LLMProcessor functionality:

  • Thorough testing of data type handling
  • Robust error scenario coverage
  • Clear separation of test cases
  • Proper use of mocking for external dependencies
  • Detailed verification of API interactions and responses

Comment on lines +8737 to +8738
raise ValueError(
f"Error converting column '{column}' with value '{value}' to type '{data_type}'")

🛠️ Refactor suggestion

Use raise ... from ... in except clause to preserve original traceback

When re-raising an exception in the except block, include the original exception using from err to preserve the traceback and help distinguish it from errors in exception handling.

Apply this diff to fix the issue:

-                        except (ValueError, TypeError):
-                            raise ValueError(
-                                f"Error converting column '{column}' with value '{value}' to type '{data_type}'")
+                        except (ValueError, TypeError) as err:
+                            raise ValueError(
+                                f"Error converting column '{column}' with value '{value}' to type '{data_type}'") from err
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
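The reason B904 matters: `raise ... from err` records the original exception on `__cause__`, so the traceback shows both failures. A self-contained demonstration of the pattern (the `convert` helper here is illustrative, not the project's actual function):

```python
# "raise ... from err" chains the original error via __cause__,
# preserving the real cause of the conversion failure.
def convert(column, value, data_type):
    try:
        return data_type(value)
    except (ValueError, TypeError) as err:
        raise ValueError(
            f"Error converting column '{column}' with value '{value}' "
            f"to type '{data_type.__name__}'"
        ) from err
```

Without `from err`, the traceback would show the re-raised ValueError with the ambiguous "During handling of the above exception, another exception occurred" framing instead of an explicit cause.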
🧰 Tools
🪛 Ruff (0.8.0)

8737-8738: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Comment on lines +8720 to +8723
schema = CognitionSchema.objects(bot=bot, collection_name=table_name).first()

metadata_map = {meta['column_name']: meta['data_type'] for meta in schema.metadata}


⚠️ Potential issue

Handle the case where the schema is not found

In the save_doc_content method, if no schema is found for the given table_name, schema will be None, leading to an AttributeError when accessing schema.metadata. Please add a check to handle the case where the schema does not exist and raise an appropriate exception.

Apply this diff to fix the issue:

            schema = CognitionSchema.objects(bot=bot, collection_name=table_name).first()
+           if schema is None:
+               raise ValueError(f"No schema found for table '{table_name}'")

            metadata_map = {meta['column_name']: meta['data_type'] for meta in schema.metadata}

@sfahad1414 sfahad1414 (Collaborator) left a comment
Approved

@sfahad1414 sfahad1414 merged commit cfbf50e into digiteinfotech:master Dec 10, 2024
8 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 19, 2024