langchain-huggingface: use separate kwargs for queries and docs #27857

Samoed · 2024-11-02T14:31:49Z

Now encode_kwargs used for both for documents and queries and this leads to wrong embeddings. E. g.:

    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?",)
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # output: tensor([[0.8421, 0.3317]], dtype=torch.float64)

But from the model card expexted like this:

    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False}
    query_encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_encode_kwargs=query_encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?", )
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # tensor([[0.8398, 0.2990]], dtype=torch.float64)

vercel · 2024-11-02T14:31:54Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Nov 5, 2024 3:08pm

vbarda · 2024-11-05T14:58:42Z

libs/partners/huggingface/langchain_huggingface/embeddings/huggingface.py

@@ -65,11 +70,17 @@ def __init__(self, **kwargs: Any):
        protected_namespaces=(),
    )

-    def embed_documents(self, texts: List[str]) -> List[List[float]]:
-        """Compute doc embeddings using a HuggingFace transformer model.
+    def embed(


consider making private

@vbarda Done!

Samoed · 2024-11-06T08:40:00Z

@vbarda Can you merge this PR, please?

…chain-ai#27857) Now `encode_kwargs` used for both for documents and queries and this leads to wrong embeddings. E. g.: ```python model_kwargs = {"device": "cuda", "trust_remote_code": True} encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"} model = HuggingFaceEmbeddings( model_name="dunzhang/stella_en_400M_v5", model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, ) query_embedding = np.array( model.embed_query("What are some ways to reduce stress?",) ) document_embedding = np.array( model.embed_documents( [ "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.", "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.", ] ) ) print(model._client.similarity(query_embedding, document_embedding)) # output: tensor([[0.8421, 0.3317]], dtype=torch.float64) ``` But from the [model card](https://huggingface.co/dunzhang/stella_en_400M_v5#sentence-transformers) expexted like this: ```python model_kwargs = {"device": "cuda", "trust_remote_code": True} encode_kwargs = {"normalize_embeddings": False} query_encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"} model = HuggingFaceEmbeddings( model_name="dunzhang/stella_en_400M_v5", model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_encode_kwargs=query_encode_kwargs, ) query_embedding = np.array( model.embed_query("What are some ways to reduce stress?", ) ) document_embedding = np.array( model.embed_documents( [ "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.", "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.", ] ) ) print(model._client.similarity(query_embedding, document_embedding)) # tensor([[0.8398, 0.2990]], dtype=torch.float64) ```

add encode_query kwargs

a043d76

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Nov 2, 2024

efriis assigned vbarda Nov 4, 2024

efriis approved these changes Nov 4, 2024

View reviewed changes

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Nov 4, 2024

vbarda reviewed Nov 5, 2024

View reviewed changes

vbarda approved these changes Nov 5, 2024

View reviewed changes

make embed private

ebf31a4

vbarda approved these changes Nov 6, 2024

View reviewed changes

vbarda merged commit 0f85dea into langchain-ai:master Nov 6, 2024
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

langchain-huggingface: use separate kwargs for queries and docs #27857

langchain-huggingface: use separate kwargs for queries and docs #27857

Samoed commented Nov 2, 2024

vercel bot commented Nov 2, 2024 •

edited

Loading

vbarda Nov 5, 2024

Samoed Nov 5, 2024

Samoed commented Nov 6, 2024

langchain-huggingface: use separate kwargs for queries and docs #27857

langchain-huggingface: use separate kwargs for queries and docs #27857

Conversation

Samoed commented Nov 2, 2024

vercel bot commented Nov 2, 2024 • edited Loading

vbarda Nov 5, 2024

Choose a reason for hiding this comment

Samoed Nov 5, 2024

Choose a reason for hiding this comment

Samoed commented Nov 6, 2024

vercel bot commented Nov 2, 2024 •

edited

Loading