fix: Every time a vectorized document is generated, the entire vectorized data of the document is deleted #2721

Merged
shaohuzhang1 merged 1 commit into main from pr@main@fix_document_embedding
Mar 28, 2025

shaohuzhang1
Contributor

fix: Every time a vectorized document is generated, the entire vectorized data of the document is deleted

f2c-ci-robot bot commented Mar 28, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

f2c-ci-robot bot commented Mar 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -272,8 +272,6 @@ def is_the_task_interrupted():
ListenerManagement.update_status(QuerySet(Document).filter(id=document_id), TaskType.EMBEDDING,
State.STARTED)

# Delete the document's vector data
VectorStore.get_embedding_vector().delete_by_document_id(document_id)

# Vectorize the document paragraph by paragraph
page_desc(QuerySet(Paragraph)
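
The PR title suggests the problem was that every embedding run wiped all of the document's vectors before processing paragraphs. The sketch below illustrates why that can lose data, under one assumed scenario: a later run re-embeds only some of the paragraphs. `InMemoryVectorStore`, `embed`, and `demo` are hypothetical stand-ins written for this illustration, not the project's actual classes; only the method name `delete_by_document_id` comes from the diff.

```python
from collections import defaultdict


class InMemoryVectorStore:
    """Hypothetical stand-in for the project's vector store."""

    def __init__(self):
        # Maps document_id -> {paragraph_id: vector}
        self._vectors = defaultdict(dict)

    def save(self, document_id, paragraph_id, vector):
        self._vectors[document_id][paragraph_id] = vector

    def delete_by_document_id(self, document_id):
        # Removes every vector stored for the document
        self._vectors.pop(document_id, None)

    def count(self, document_id):
        return len(self._vectors.get(document_id, {}))


def embed(store, document_id, paragraph_ids, delete_first):
    """Embed the given paragraphs, optionally wiping the document first."""
    if delete_first:
        store.delete_by_document_id(document_id)
    for paragraph_id in paragraph_ids:
        store.save(document_id, paragraph_id, [0.0])  # placeholder vector


def demo(delete_first):
    """Embed three paragraphs, then re-embed one; return vectors kept."""
    store = InMemoryVectorStore()
    embed(store, "doc-1", ["p1", "p2", "p3"], delete_first=False)
    # Assumed scenario: a later run only processes paragraph p3
    embed(store, "doc-1", ["p3"], delete_first=delete_first)
    return store.count("doc-1")
```

With `delete_first=True` the second run leaves a single vector for the document; with `delete_first=False` all three survive. Whether the real task re-embeds a subset of paragraphs is an assumption here, not something the diff confirms.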
Contributor Author

The provided code snippet does not contain any obvious irregularities or potential issues. However, there are some general optimizations that can be made:

Potential Improvements

  1. Use of Generators for Data Handling: If the QuerySet can be consumed lazily, iterate over it instead of loading every row at once. This keeps memory use bounded when a document has many paragraphs.

  2. Exception Handling: Consider adding try-except blocks around database operations and document management tasks to ensure robustness against exceptions during execution.

  3. Code Clarity: Ensure that variable names and function structures are clear and concise. This makes the code easier to understand and maintain.

  4. Performance Optimization: If VectorStore's delete_by_document_id method turns out to be slow (for example because of index maintenance), consider batching deletions where applicable.
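
Points 1 and 4 both come down to handling items lazily and in groups. A minimal sketch of that idea follows, assuming nothing about the project's own helpers: `in_batches` is a hypothetical utility built only on the standard library, and it accepts any iterable, including a generator.

```python
from itertools import islice


def in_batches(items, batch_size):
    """Lazily yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # Pull the next batch_size items without materializing the rest
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be passed to whatever bulk embedding or bulk deletion call the store offers, if it offers one.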

Here's a slightly optimized version of the relevant part of the code (assuming document_id is valid):

def is_the_task_interrupted():
    try:
        # Update listener status
        ListenerManagement.update_status(
            QuerySet(Document).filter(id=document_id),
            TaskType.EMBEDDING,
            State.STARTED
        )
        
        # Delete document vector data
        VectorStore.get_embedding_vector().delete_by_document_id(document_id)
        
        # Vectorize the document paragraph by paragraph
        page_desc(QuerySet(Paragraph))
    
    except Exception as e:
        print(f"An error occurred: {e}")

By implementing these suggestions, the code will be more robust, efficient, and readable. Adjustments may vary based on specific requirements and constraints within your project.
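
One caveat on the exception-handling suggestion: an `except` block that only prints the error hides the failure from the caller, so a failed embedding run could look like a successful one. A hedged alternative is to log the traceback and re-raise. `run_embedding_step` and its `step` argument are hypothetical names used for illustration, not part of the project.

```python
import logging

logger = logging.getLogger(__name__)


def run_embedding_step(step, document_id):
    """Run one step; on failure, log the traceback and re-raise."""
    try:
        return step(document_id)
    except Exception:
        # logger.exception records the message together with the traceback
        logger.exception("Embedding step failed for document %s", document_id)
        raise
```

Re-raising lets the surrounding task decide how to mark the document's status instead of continuing silently.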

@shaohuzhang1 shaohuzhang1 merged commit dcc80a4 into main Mar 28, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_document_embedding branch March 28, 2025 06:15