fix: Every time a vectorized document is generated, the entire vectorized data of the document is deleted #2721

Merged
merged 1 commit into from
Mar 28, 2025
2 changes: 0 additions & 2 deletions apps/common/event/listener_manage.py
@@ -272,8 +272,6 @@ def is_the_task_interrupted():
ListenerManagement.update_status(QuerySet(Document).filter(id=document_id), TaskType.EMBEDDING,
State.STARTED)

# Delete the document's vector data
VectorStore.get_embedding_vector().delete_by_document_id(document_id)

# Vectorize paragraph by paragraph
page_desc(QuerySet(Paragraph)

The snippet shows no obvious defects, but a few general improvements are worth considering:

Potential Improvements

  1. Use of Generators for Data Handling: For large result sets, iterate lazily instead of loading every row at once. If QuerySet here is Django's, QuerySet.iterator() streams rows without filling the queryset cache.

  2. Exception Handling: Wrap database operations and document management tasks in try-except blocks so that a failure is logged and handled explicitly.

  3. Code Clarity: Keep variable names and function structure clear and concise so the code is easier to understand and maintain.

  4. Performance Optimization: If VectorStore's delete_by_document_id method proves slow, consider deleting in batches where applicable.
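
As a rough illustration of point 1, a generator can hand paragraphs to the embedding step one page at a time. This is a minimal sketch in plain Python, assuming a list of paragraph IDs stands in for the QuerySet; iter_pages and embed_in_pages are hypothetical names and are not part of the project.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_pages(items: Sequence[T], page_size: int) -> Iterator[List[T]]:
    """Yield successive pages of at most page_size items (hypothetical helper)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(items), page_size):
        # Only one page is materialized at a time
        yield list(items[start:start + page_size])


def embed_in_pages(paragraph_ids: Sequence[T], page_size: int,
                   embed_page: Callable[[List[T]], None]) -> int:
    """Call embed_page once per page and return the number of items processed."""
    processed = 0
    for page in iter_pages(paragraph_ids, page_size):
        embed_page(page)
        processed += len(page)
    return processed
```

Here embed_page is whatever callable performs the actual embedding for one page, so the paging logic stays independent of the vector store.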

Here's a slightly optimized version of the relevant part of the code (assuming document_id is valid):

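import logging

logger = logging.getLogger(__name__)


def is_the_task_interrupted():
    try:
        # Mark the document's embedding task as started
        ListenerManagement.update_status(
            QuerySet(Document).filter(id=document_id),
            TaskType.EMBEDDING,
            State.STARTED
        )

        # The whole-document vector deletion is omitted here, matching this PR

        # Vectorize paragraph by paragraph (the remaining page_desc arguments
        # are truncated in the diff and are not reproduced)
        page_desc(QuerySet(Paragraph))

    except Exception:
        # Log with the traceback and re-raise instead of swallowing the error
        logger.exception("Embedding failed for document %s", document_id)
        raise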

These changes should make the code more robust and easier to read. Adapt them to your project's specific requirements and constraints.
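
The batch-deletion idea from point 4 could be sketched as follows. This assumes a simple in-memory store; InMemoryVectorStore and delete_by_document_ids are hypothetical stand-ins and do not reflect the project's actual VectorStore API.

```python
from typing import Dict, Iterable, List, Tuple


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector store, used only for illustration."""

    def __init__(self) -> None:
        # Maps paragraph_id -> (document_id, embedding vector)
        self._rows: Dict[str, Tuple[str, List[float]]] = {}

    def add(self, paragraph_id: str, document_id: str, vector: Iterable[float]) -> None:
        self._rows[paragraph_id] = (document_id, list(vector))

    def delete_by_document_ids(self, document_ids: Iterable[str]) -> int:
        """Delete the vectors of all given documents in one pass; return the count removed."""
        targets = set(document_ids)
        doomed = [pid for pid, (doc_id, _) in self._rows.items() if doc_id in targets]
        for pid in doomed:
            del self._rows[pid]
        return len(doomed)

    def __len__(self) -> int:
        return len(self._rows)
```

A single call covering several documents replaces one scan per document, which is the main saving a batched delete is meant to offer.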
