[Feature Request] General Deduplication Utility for String-based Data. #1517

Closed
2 of 4 tasks
keli-wen opened this issue Jan 28, 2025 · 0 comments · Fixed by #1568
@keli-wen
Collaborator

keli-wen commented Jan 28, 2025

Required prerequisites

Motivation / Background

We frequently see duplicate or near-duplicate text across different parts of our pipeline,
such as user inputs, knowledge graph nodes, and other text-based data.

This results in:

  • Increased confusion and potential redundancy in search or retrieval,
  • Larger storage requirements,
  • Potential performance degradation over time.

Previously, we had a specialized deduplication utility targeting Node objects (FinDKG project),
but this approach does not generalize well to other text-based data.

Proposal

Introduce a deduplication.py module with more flexible deduplication-related functions:

  • deduplicate_internally
  • deduplicate_externally

Key features and improvements:

  1. Works with strings: Instead of requiring Node objects, we now accept lists of raw strings.
  2. Optional embeddings parameter:
    • Users can supply a BaseEmbedding instance from the Camel embedding library,
      and the function will internally handle embeddings.
    • Alternatively, users can pass in precomputed embeddings directly if they have
      their own embedding process or their data is already embedded.
  3. Multiple strategies:
    • Initially supports a "top1" strategy (i.e., for each text, find the most similar
      other text and treat it as a duplicate if the similarity is above the threshold).
    • A future "llm-supervise" strategy will rely on an LLM to decide whether two
      texts are duplicates, especially in borderline cases or when semantic similarity
      is unclear. (Currently not implemented, but planned.)
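
To make the "top1" idea concrete, here is a minimal sketch working on precomputed embeddings. It is not the proposed implementation: the helper names (cosine_similarity, top1_duplicates) are hypothetical, cosine similarity is an assumed metric, and comparing each text only against earlier texts (so the first occurrence is always kept) is an assumption of this sketch.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_duplicates(
    embeddings: List[List[float]], threshold: float = 0.65
) -> Dict[int, int]:
    """Map each duplicate index to the index of its most similar earlier item.

    Hypothetical helper: an item counts as a duplicate only if its single
    best match among earlier items has similarity strictly above `threshold`.
    """
    duplicate_to_target: Dict[int, int] = {}
    for i in range(1, len(embeddings)):
        best_j, best_sim = -1, -1.0
        # Only look at earlier items so the first occurrence is kept.
        for j in range(i):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim > threshold:
            duplicate_to_target[i] = best_j
    return duplicate_to_target
```

With the toy vectors [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], the second vector is nearly parallel to the first and would be mapped to it, while the third is orthogonal to the first and stays unique.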

Example

See the updated function deduplicate_internally in deduplication.py:

def deduplicate_internally(
    texts: List[str],
    threshold: float = 0.65,
    embedding_instance: Optional[BaseEmbedding[str]] = None,
    embeddings: Optional[List[List[float]]] = None,
    strategy: Literal["top1", "llm-supervise"] = "top1",
) -> DeduplicationResult:
    ...
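
The signature accepts either an embedding instance or precomputed embeddings. One way this could be resolved internally is sketched below. Everything here is an assumption for illustration: the SupportsEmbedList protocol is a hypothetical stand-in for BaseEmbedding, resolve_embeddings and CharCountEmbedding are made-up names, and preferring precomputed embeddings when both are given is a choice of this sketch, not something the proposal specifies.

```python
from typing import List, Optional, Protocol


class SupportsEmbedList(Protocol):
    """Hypothetical stand-in for the embedding interface used by the proposal."""

    def embed_list(self, objs: List[str]) -> List[List[float]]: ...


def resolve_embeddings(
    texts: List[str],
    embedding_instance: Optional[SupportsEmbedList] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    """Return one vector per text, from whichever input path was supplied."""
    if embeddings is None:
        if embedding_instance is None:
            raise ValueError("Provide either embedding_instance or embeddings.")
        # No precomputed vectors: let the embedding instance compute them.
        embeddings = embedding_instance.embed_list(texts)
    if len(embeddings) != len(texts):
        raise ValueError("Need exactly one embedding per text.")
    return embeddings


class CharCountEmbedding:
    """Toy embedder for demonstration only: counts of 'a' and 'b' per text."""

    def embed_list(self, objs: List[str]) -> List[List[float]]:
        return [[float(t.count("a")), float(t.count("b"))] for t in objs]
```

Checking the length up front means a mismatch between texts and precomputed embeddings fails early with a clear error instead of surfacing later as an index error during the similarity step.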
@keli-wen keli-wen added the enhancement New feature or request label Jan 28, 2025
@keli-wen keli-wen self-assigned this Jan 28, 2025
@Wendong-Fan Wendong-Fan added New Feature and removed enhancement New feature or request labels Feb 6, 2025
@Wendong-Fan Wendong-Fan added this to the Sprint 22 milestone Feb 6, 2025
@Wendong-Fan Wendong-Fan linked a pull request Feb 6, 2025 that will close this issue
@keli-wen keli-wen linked a pull request Feb 7, 2025 that will close this issue
@keli-wen keli-wen removed a link to a pull request Feb 7, 2025
2 participants