Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(document-search): add support for images #121

Merged
merged 9 commits into from
Oct 22, 2024

Conversation

konrad-czarnota-ds
Copy link
Collaborator

No description provided.

@konrad-czarnota-ds konrad-czarnota-ds linked an issue Oct 21, 2024 that may be closed by this pull request
Copy link
Contributor

github-actions bot commented Oct 21, 2024

badge

Code Coverage Summary

Filename                                                                                                     Stmts    Miss  Cover    Missing
---------------------------------------------------------------------------------------------------------  -------  ------  -------  ----------------------------------------
packages/ragbits-core/src/ragbits/core/__init__.py                                                               0       0  100.00%
packages/ragbits-core/src/ragbits/core/config.py                                                                 6       0  100.00%
packages/ragbits-core/src/ragbits/core/embeddings/__init__.py                                                   11       0  100.00%
packages/ragbits-core/src/ragbits/core/embeddings/base.py                                                        4       0  100.00%
packages/ragbits-core/src/ragbits/core/embeddings/exceptions.py                                                 14       6  57.14%   7-8, 17, 26-27, 36
packages/ragbits-core/src/ragbits/core/embeddings/litellm.py                                                    28      19  32.14%   7-8, 43-51, 69-85
packages/ragbits-core/src/ragbits/core/embeddings/local.py                                                      39      25  35.90%   9-10, 35-45, 58-70, 74-76, 80-81
packages/ragbits-core/src/ragbits/core/embeddings/noop.py                                                        4       0  100.00%
packages/ragbits-core/src/ragbits/core/llms/__init__.py                                                          4       0  100.00%
packages/ragbits-core/src/ragbits/core/llms/base.py                                                             33       4  87.88%   33, 52, 88, 97
packages/ragbits-core/src/ragbits/core/llms/factory.py                                                          18       2  88.89%   47, 60
packages/ragbits-core/src/ragbits/core/llms/litellm.py                                                          25       4  84.00%   8-9, 54, 85
packages/ragbits-core/src/ragbits/core/llms/local.py                                                            24      10  58.33%   8-9, 42-47, 57, 70-71
packages/ragbits-core/src/ragbits/core/llms/types.py                                                             8       2  75.00%   24, 28
packages/ragbits-core/src/ragbits/core/llms/clients/__init__.py                                                  4       0  100.00%
packages/ragbits-core/src/ragbits/core/llms/clients/base.py                                                     23       0  100.00%
packages/ragbits-core/src/ragbits/core/llms/clients/exceptions.py                                               14       6  57.14%   7-8, 17, 26-27, 36
packages/ragbits-core/src/ragbits/core/llms/clients/litellm.py                                                  50      10  80.00%   10-11, 70, 108, 122-127
packages/ragbits-core/src/ragbits/core/llms/clients/local.py                                                    37      12  67.57%   11-12, 62-70, 92-103
packages/ragbits-core/src/ragbits/core/prompt/__init__.py                                                        2       0  100.00%
packages/ragbits-core/src/ragbits/core/prompt/base.py                                                           18       0  100.00%
packages/ragbits-core/src/ragbits/core/prompt/parsers.py                                                        34       0  100.00%
packages/ragbits-core/src/ragbits/core/prompt/prompt.py                                                        110       2  98.18%   107, 111
packages/ragbits-core/src/ragbits/core/prompt/discovery/__init__.py                                              2       0  100.00%
packages/ragbits-core/src/ragbits/core/prompt/discovery/prompt_discovery.py                                     33       2  93.94%   55-56
packages/ragbits-core/src/ragbits/core/utils/_pyproject.py                                                      25       0  100.00%
packages/ragbits-core/src/ragbits/core/utils/config_handling.py                                                  9       0  100.00%
packages/ragbits-core/src/ragbits/core/utils/decorators.py                                                      28       0  100.00%
packages/ragbits-core/src/ragbits/core/vector_store/__init__.py                                                 13       1  92.31%   29
packages/ragbits-core/src/ragbits/core/vector_store/base.py                                                     12       0  100.00%
packages/ragbits-core/src/ragbits/core/vector_store/chromadb_store.py                                           57       7  87.72%   9-10, 60-67, 84, 126
packages/ragbits-core/src/ragbits/core/vector_store/in_memory.py                                                18       0  100.00%
packages/ragbits-core/tests/unit/__init__.py                                                                     0       0  100.00%
packages/ragbits-core/tests/unit/llms/__init__.py                                                                0       0  100.00%
packages/ragbits-core/tests/unit/llms/test_litellm.py                                                           63       0  100.00%
packages/ragbits-core/tests/unit/llms/factory/__init__.py                                                        3       0  100.00%
packages/ragbits-core/tests/unit/llms/factory/test_get_default_llm.py                                            8       0  100.00%
packages/ragbits-core/tests/unit/llms/factory/test_get_llm_from_factory.py                                       8       0  100.00%
packages/ragbits-core/tests/unit/llms/factory/test_has_default_llm.py                                            8       0  100.00%
packages/ragbits-core/tests/unit/prompts/__init__.py                                                             0       0  100.00%
packages/ragbits-core/tests/unit/prompts/test_parsers.py                                                        65       0  100.00%
packages/ragbits-core/tests/unit/prompts/test_prompt.py                                                        143       0  100.00%
packages/ragbits-core/tests/unit/prompts/discovery/__init__.py                                                   0       0  100.00%
packages/ragbits-core/tests/unit/prompts/discovery/prompt_classes_for_tests.py                                  30       0  100.00%
packages/ragbits-core/tests/unit/prompts/discovery/test_prompt_discovery.py                                     18       0  100.00%
packages/ragbits-core/tests/unit/prompts/discovery/ragbits_tests_pkg_with_prompts/__init__.py                    2       1  50.00%   3
packages/ragbits-core/tests/unit/prompts/discovery/ragbits_tests_pkg_with_prompts/prompts/__init__.py            3       2  33.33%   2-4
packages/ragbits-core/tests/unit/prompts/discovery/ragbits_tests_pkg_with_prompts/prompts/temp_prompt1.py       14       0  100.00%
packages/ragbits-core/tests/unit/prompts/discovery/ragbits_tests_pkg_with_prompts/prompts/temp_prompt2.py       14       0  100.00%
packages/ragbits-core/tests/unit/utils/test_decorators.py                                                       26       2  92.31%   17, 39
packages/ragbits-core/tests/unit/utils/pyproject/test_find.py                                                   13       0  100.00%
packages/ragbits-core/tests/unit/utils/pyproject/test_get_config.py                                              9       0  100.00%
packages/ragbits-core/tests/unit/utils/pyproject/test_get_instace.py                                            27       0  100.00%
packages/ragbits-core/tests/unit/vector_stores/test_chromadb_store.py                                           69       4  94.20%   31, 34, 39, 44
packages/ragbits-core/tests/unit/vector_stores/test_simple_vector_store.py                                      16       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/__init__.py                                         2       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/_main.py                                           67       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/documents/__init__.py                               0       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/documents/document.py                              60       3  95.00%   53, 96, 143
packages/ragbits-document-search/src/ragbits/document_search/documents/element.py                               37       2  94.59%   34, 107
packages/ragbits-document-search/src/ragbits/document_search/documents/exceptions.py                            11       5  54.55%   7-8, 17, 26-27
packages/ragbits-document-search/src/ragbits/document_search/documents/sources.py                               93      16  82.80%   13-14, 63, 76, 160-165, 203-206, 210-211
packages/ragbits-document-search/src/ragbits/document_search/ingestion/__init__.py                               0       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py                    29       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/__init__.py                    13       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py                        14       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/dummy.py                       11       1  90.91%   27
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/__init__.py        0       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py        45       4  91.11%   97, 102-103, 134
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/images.py         39      15  61.54%   61-68, 75-83, 94, 107
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/pdf.py            20       6  70.00%   24, 36-44
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/utils.py          33      10  69.70%   50, 61-62, 77-80, 104-119
packages/ragbits-document-search/src/ragbits/document_search/retrieval/__init__.py                               0       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rephrasers/__init__.py                   13       3  76.92%   29-32
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rephrasers/base.py                        5       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rephrasers/noop.py                        5       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rerankers/__init__.py                    13       1  92.31%   27
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rerankers/base.py                         6       0  100.00%
packages/ragbits-document-search/src/ragbits/document_search/retrieval/rerankers/noop.py                         7       0  100.00%
packages/ragbits-document-search/tests/__init__.py                                                               0       0  100.00%
packages/ragbits-document-search/tests/helpers.py                                                                3       0  100.00%
packages/ragbits-document-search/tests/integration/__init__.py                                                   0       0  100.00%
packages/ragbits-document-search/tests/integration/test_sources.py                                              23      10  56.52%   22-32, 40-45
packages/ragbits-document-search/tests/integration/test_unstructured.py                                         47      10  78.72%   46-52, 65-71
packages/ragbits-document-search/tests/unit/__init__.py                                                          0       0  100.00%
packages/ragbits-document-search/tests/unit/test_document_processor.py                                          17       0  100.00%
packages/ragbits-document-search/tests/unit/test_document_search.py                                             75       0  100.00%
packages/ragbits-document-search/tests/unit/test_documents.py                                                   13       0  100.00%
packages/ragbits-document-search/tests/unit/test_elements.py                                                    15       0  100.00%
packages/ragbits-document-search/tests/unit/test_local_file_source.py                                           13       0  100.00%
packages/ragbits-document-search/tests/unit/test_providers.py                                                   31       0  100.00%
packages/ragbits-document-search/tests/unit/test_sources.py                                                     25       0  100.00%
TOTAL                                                                                                         2031     207  89.81%

Diff against main

Filename                                                                                                     Stmts    Miss  Cover
---------------------------------------------------------------------------------------------------------  -------  ------  --------
packages/ragbits-document-search/src/ragbits/document_search/documents/document.py                              +2       0  +0.17%
packages/ragbits-document-search/src/ragbits/document_search/documents/element.py                               +7      +1  -2.08%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py                    +2       0  +100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/__init__.py                    +2       0  +100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/__init__.py        0       0  +100.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py       +45      +4  +91.11%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/images.py        +39     +15  +61.54%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/pdf.py           +20      +6  +70.00%
packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/utils.py         +33     +10  +69.70%
packages/ragbits-document-search/tests/integration/test_unstructured.py                                         +9      +5  -8.12%
packages/ragbits-document-search/tests/unit/test_providers.py                                                   +8       0  +100.00%
TOTAL                                                                                                         +167     +41  -1.25%

Results for commit: c5ca279

Minimum allowed coverage is 60%

♻️ This comment has been updated with latest results

Copy link
Contributor

github-actions bot commented Oct 21, 2024

Trivy scanning results.

.venv/lib/python3.10/site-packages/PyJWT-2.9.0.dist-info/METADATA (secrets)

Total: 1 (MEDIUM: 1, HIGH: 0, CRITICAL: 0)

MEDIUM: JWT (jwt-token)
════════════════════════════════════════
JWT token
────────────────────────────────────────
.venv/lib/python3.10/site-packages/PyJWT-2.9.0.dist-info/METADATA:80
────────────────────────────────────────
78 >>> encoded = jwt.encode({"some": "payload"}, "secret", algorithm="HS256")
79 >>> print(encoded)
80 [ *********************************************************************************************************
81 >>> jwt.decode(encoded, "secret", algorithms=["HS256"])
────────────────────────────────────────

.venv/lib/python3.10/site-packages/litellm/llms/huggingface_llms_metadata/hf_text_generation_models.txt (secrets)

Total: 1 (MEDIUM: 0, HIGH: 0, CRITICAL: 1)

CRITICAL: HuggingFace (hugging-face-access-token)
════════════════════════════════════════
Hugging Face Access Token
────────────────────────────────────────
.venv/lib/python3.10/site-packages/litellm/llms/huggingface_llms_metadata/hf_text_generation_models.txt:36162
────────────────────────────────────────
36160 mncai/Llama2-7B-Active_3rd-floor-LoRA-dim64_epoch4
36161 ajcdp/CM
36162 [ Nagharjun17/*************************************
36163 BigSalmon/InformalToFormalLincoln114Paraphrase
────────────────────────────────────────

.venv/lib/python3.10/site-packages/litellm/proxy/_types.py (secrets)

Total: 1 (MEDIUM: 1, HIGH: 0, CRITICAL: 0)

MEDIUM: Slack (slack-web-hook)
════════════════════════════════════════
Slack Webhook
────────────────────────────────────────
.venv/lib/python3.10/site-packages/litellm/proxy/_types.py:1288
────────────────────────────────────────
1286 alert_to_webhook_url: Optional[Dict] = Field(
1287 None,
1288 [ bhook_url: {'budget_alerts': '*****************************************************************************'}`",
1289 )
────────────────────────────────────────


from pydantic import BaseModel
from typing_extensions import TypeVar

ChatFormat = List[Dict[str, str]]
ChatFormat = List[Dict[str, Any]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right - even when sending images the format is still dict of strings, isn't it? Besides, you don't seem to use ChatFormat / the Prompt classes in this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it because mypy was complaining about this part:

messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"{prompt}"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ],
            }
        ]
        return await self.model.client.call(messages, self.model_options)

there is a key with list instead of string. I'm not sure why it doesn't complain now when I reverted it to str 🤷

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it's back:

 error: Argument 1 to "call" of "LiteLLMClient" has incompatible type "list[dict[str, Sequence[object]]]"; expected "list[dict[str, str]]"  [arg-type]
Found 1 error in 1 file (checked 70 source files)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank change it to list[dict[str, str | list[dict]]]. Any is very rarely the right answer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it didn't help, but introduced one more error from mypy. I tried to fight with it a bit but without success and decided to simply add ignore line there (this is to be changed anyway to use our prompt template with image support)

if self.__class__ == UnstructuredDefaultProvider:
chunked_elements = chunk_elements(elements, **self.chunking_kwargs)
return [to_text_element(element, document_meta) for element in chunked_elements]
image_elements = [e for e in elements if e.category == "Image"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably use Unstructured's ElementType.Image instead of a raw string

async def _chunk_and_convert(
self, elements: list[UnstructuredElement], document_meta: DocumentMeta, document_path: Path
) -> list[Element]:
if self.__class__ == UnstructuredDefaultProvider:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks confusing to me. Is this about having a behavior on the base class that the classes inheriting from it don't share?

Inheriting a class and having the behavior change magically, even though I haven't changed it myself, would be surprising and not something I would like to encounter as a programmer. Seems like this behavior should be in a method that inheriting class can overwrite if they want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both inheriting classes need to do the same preprocessing:

        image_elements = [e for e in elements if e.category == ElementType.IMAGE]
        other_elements = [e for e in elements if e.category != ElementType.IMAGE]
        chunked_other_elements = chunk_elements(other_elements, **self.chunking_kwargs)

        text_elements: list[Element] = [to_text_element(element, document_meta) for element in chunked_other_elements]

I didn't want to duplicate this code in both classes nor introduce a third "in-between" class so I came up with this idea. Happy to change it if you have a better suggestion how to handle it.

Copy link
Collaborator

@ludwiktrammer ludwiktrammer Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely change this - it's a very surprising behavior and you don't want unexpected surprises from your code. Even though both classes that currently inherit from UnstructuredDefaultProvider need this behavior, the base class should be ready to be subclassed any time for other reasons. It shouldn't contain code specific to functionality that is not related to its intended behavior and definitely shouldn't magically switch to it for all of its children.

This is an important principle: child classes are downstream from the base class and they can know and use the specifics of the base class. On the other hand, the base class is upstream from its children. It cannot know when it will be subclassed and for what reason, it cannot predict how many children it has and what functionality this children are intended to have. As a user of ragbits, I should be able to subclass UnstructuredDefaultProvider in order to tweak some specific behavior, without everything breaking with a cryptic "UnstructuredDefaultProvider doesn't support image conversion" exception.

Seems that UnstructuredPdfProvider and UnstructuredImageProvider are similar because they both deal with images - the only difference is where they get the images from.

You can reflect that in the inheritance structure. Maybe pdf provider is just an image provider enhanced with the ability to find images in PDFs? Than the inheritance could look like that:

UnstructuredDefaultProvider -> UnstructuredImageProvider -> UnstructuredPdfProvider

If this doesn't seem correct to you in this case, alternatively you could have an intermediate (possibly abstract) class that adds image-related functionality, inherits from UnstructuredDefaultProvider and is a base for both UnstructuredImageProvider and UnstructuredPdfProvider

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that turned out to be a good idea, now pdf extends the images and there are only two small functions that needed to be implemented

@@ -70,13 +71,16 @@ def __init__(
variable will be used.
api_server: The API server URL to use for the Unstructured API. If not specified, the
UNSTRUCTURED_SERVER_URL environment variable will be used.
use_api: whether to use unstructured API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more detailed documentation would be helpful - what is the alternative that will be used when use_api is set to False?

async def _to_image_element(
self, element: UnstructuredElement, document_meta: DocumentMeta, document_path: Path
) -> ImageElement:
image_coordinates = extract_image_coordinates(element)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unpacking in a way that the coordinates have meaningful names might make the code clearer (when I see image_coordinates[0] I'm not sure what I'm looking at)

)


def set_or_raise(name: str, value: Optional[str], env_var: str) -> str:
Copy link
Collaborator

@ludwiktrammer ludwiktrammer Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name, the description, and the actual behavior don't match, at least when it comes to my mental model.

  • I would expect a function named set_or_raise to set a value and if it can't - to raise an exception. I also wouldn't know that it has anything to do with environmental variables
  • I would expect a function described as "Checks if given environment variable is set and returns it or raises an error" to simply look for an env variable and if it's not present - to raise an exception
  • In reality this is more like a fallback mechanism - first it tries to use the value given as an argument, if it's not present it tries to get it from the environment, and if it's not present there it throws an exception. The argument called "name" is not actually used anywhere other than the error message, and the error message makes it clear that the function is specifically about checking arguments. I would have never gotten all this without reading the code.

Maybe rename to something like check_required_argument(value: Optional[str], arg_name: str, fallback_env: str)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a function that I simply moved from another class, it came without documentation so I wrote something myself, but I guess it wasn't enough. Changed to your suggestion.


def __init__(self, model_name: str, model_options: LiteLLMOptions):
self.model_name = model_name
self.model: Optional[LiteLLM] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we usually refer to LLM objects as llm, not model

to_text_element,
)

DEFAULT_LLM_IMAGE_SUMMARIZATION_MODEL = "gpt-4o-mini"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a blocker for this PR but maybe this should use our built-in core_config.default_llm_factory functionality? Or if not (because there is no guarantee that default_llm_factory returns LLM that supports images) maybe we should have a separate config option for default visual LLM?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue for that, as discussed.

@konrad-czarnota-ds konrad-czarnota-ds merged commit defd0b2 into main Oct 22, 2024
3 checks passed
@konrad-czarnota-ds konrad-czarnota-ds deleted the kc/add-support-for-images branch October 22, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(document-search): add support for images
2 participants