[Feature Request] General Deduplication Utility for String-based Data. #1517

keli-wen · 2025-01-28T08:00:26Z

Required prerequisites

I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
Consider asking first in a Discussion.

Motivation / Background

We have observed frequent duplicates or near-duplicates across different modules,
such as user inputs, knowledge graph nodes, or text-based data in our pipeline.

This results in:

- Increased confusion and potential redundancy in search or retrieval
- Larger storage requirements
- Potential performance degradation over time

Previously, we had a specialized deduplication utility targeting Node objects (FinDKG project),
but this approach does not generalize well to other text-based data.

Proposal

Introduce a deduplication.py module with more flexible deduplication-related functions:

- deduplicate_internally
- deduplicate_externally

Key features and improvements:

Works with strings: Instead of requiring Node objects, we now accept lists of raw strings.
Optional embeddings parameter:
- Users can supply a BaseEmbedding instance from the Camel embedding library,
  and the function will internally handle embeddings.
- Alternatively, users can pass in precomputed embeddings directly if they have
  their own embedding process or their data is already embedded.
Multiple strategies:
- Initially supports a "top1" strategy (i.e., pick the highest-similarity match that exceeds the threshold).
- A future "llm-supervise" strategy will rely on an LLM to decide whether two
  texts are duplicates, especially when the similarity score is borderline or
  the semantic similarity is unclear. (Planned, not yet implemented.)
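To make the "top1" idea concrete, here is a minimal sketch that works on precomputed embeddings. It is not the implementation from the linked PR: the helper names `cosine_similarity` and `top1_duplicates`, the choice of cosine similarity, and the rule that an item is only compared against earlier items are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_duplicates(
    embeddings: List[List[float]], threshold: float = 0.65
) -> Dict[int, int]:
    """Map each duplicate index to the earlier index it is most similar to.

    Hypothetical helper: item i is compared only against earlier items j < i
    and is flagged when its best similarity is strictly above `threshold`.
    """
    duplicate_of: Dict[int, int] = {}
    for i in range(1, len(embeddings)):
        best_j, best_sim = -1, -1.0
        for j in range(i):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim > threshold:
            duplicate_of[i] = best_j
    return duplicate_of


texts = ["What is AI?", "What is AI", "Weather today"]
# Toy 2-D vectors standing in for real embeddings (made up for this example).
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]

duplicate_of = top1_duplicates(vectors, threshold=0.65)
unique_texts = [t for i, t in enumerate(texts) if i not in duplicate_of]
```

With these toy vectors, only the second text is flagged, as a duplicate of the first. The pairwise loop is quadratic in the number of texts, so a real implementation would likely use a vectorized similarity matrix instead.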

Example

See the updated function deduplicate_internally in deduplication.py:

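from typing import List, Literal, Optional

from camel.embeddings import BaseEmbedding

# DeduplicationResult is the function's result type; its definition is not shown here.


def deduplicate_internally(
    texts: List[str],
    threshold: float = 0.65,
    embedding_instance: Optional[BaseEmbedding[str]] = None,
    embeddings: Optional[List[List[float]]] = None,
    strategy: Literal["top1", "llm-supervise"] = "top1",
) -> DeduplicationResult:
    ...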
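The signature accepts either an embedding instance or precomputed embeddings. One way the two inputs could be reconciled is sketched below. Everything here is an assumption for illustration: the `resolve_embeddings` helper, the `SupportsEmbedList` stand-in protocol with its `embed_list` method, and the rule that exactly one of the two arguments must be given are not confirmed by this issue.

```python
from typing import List, Optional, Protocol


class SupportsEmbedList(Protocol):
    """Stand-in for an embedding backend; the method name is an assumption."""

    def embed_list(self, objs: List[str]) -> List[List[float]]: ...


def resolve_embeddings(
    texts: List[str],
    embedding_instance: Optional[SupportsEmbedList] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    """Return one vector per text, taken from exactly one of the two sources."""
    # Assumed rule: passing both or neither is treated as a caller error.
    if (embedding_instance is None) == (embeddings is None):
        raise ValueError(
            "Provide exactly one of `embedding_instance` or `embeddings`."
        )
    if embedding_instance is not None:
        embeddings = embedding_instance.embed_list(texts)
    assert embeddings is not None
    if len(embeddings) != len(texts):
        raise ValueError("Number of embeddings must match number of texts.")
    return embeddings


class LengthEmbedding:
    """Toy backend for demonstration only: embeds a string as [length, 1.0]."""

    def embed_list(self, objs: List[str]) -> List[List[float]]:
        return [[float(len(s)), 1.0] for s in objs]


computed = resolve_embeddings(["ab", "abcd"], embedding_instance=LengthEmbedding())
supplied = resolve_embeddings(["ab"], embeddings=[[0.1, 0.2]])
```

Checking the length up front keeps a mismatch between texts and vectors from surfacing later as a confusing index error inside the similarity step.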


keli-wen added the enhancement (New feature or request) label Jan 28, 2025

keli-wen self-assigned this Jan 28, 2025

keli-wen mentioned this issue Jan 28, 2025: feat: Internal deduplication impl. #1518 (Closed)

Wendong-Fan added the New Feature label and removed the enhancement (New feature or request) label Feb 6, 2025

Wendong-Fan added this to Project Camel Feb 6, 2025

Wendong-Fan added this to the Sprint 22 milestone Feb 6, 2025

Wendong-Fan linked a pull request Feb 6, 2025 that will close this issue: feat: Internal deduplication impl. #1518 (Closed)

keli-wen mentioned this issue Feb 7, 2025: feat: Internal deduplication impl. #1568 (Merged)

keli-wen linked a pull request Feb 7, 2025 that will close this issue: feat: Internal deduplication impl. #1568 (Merged)

keli-wen removed a link to a pull request Feb 7, 2025: feat: Internal deduplication impl. #1518 (Closed)

keli-wen closed this as completed in #1568 Feb 9, 2025
