Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add file and indexing related super components #184

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

mathislucka
Copy link
Member

@mathislucka mathislucka commented Jan 31, 2025

Related Issues

Proposed Changes:

Adds a file converter and a document indexer.

How did you test it?

Notes for the reviewer

Needs to be merged after: #183
Changes and naming need some more discussion.

Checklist

@mathislucka mathislucka requested a review from a team as a code owner January 31, 2025 10:16
@mathislucka mathislucka requested review from vblagoje and removed request for a team January 31, 2025 10:16
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coveralls
Copy link

coveralls commented Jan 31, 2025

Pull Request Test Coverage Report for Build 13260292199

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on feat/ready_made_supercomponents at 73.608%

Totals Coverage Status
Change from base Build 13253060260: 73.6%
Covered Lines: 1983
Relevant Lines: 2694

💛 - Coveralls

pipeline.add_component(
"writer",
DocumentWriter(
document_store=InMemoryDocumentStore(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this super component document store agnostic by passing the doc store as an argument? It'd also simplify the Retriever initialization step. Right now, it'll look like this

retriever = InMemoryEmbeddingRetriever(document_store=indexer.pipeline.get_component("writer").document_store)

@mathislucka mathislucka requested a review from a team as a code owner February 6, 2025 17:15
@mathislucka mathislucka requested review from dfokina and removed request for a team February 6, 2025 17:15
@julian-risch
Copy link
Member

I made a small change to the way FileTypeRouter is initialized to make it work on Windows. An alternative would be to change the implementation of the __init__ of the FileTypeRouter so that it always registers docx, pptx, xlsx MIME types. I decided against it because the docstring of the additional_mimetypes parameter explains how users can do it themselves and it only affects Windows.
The tests that are now failing are unrelated and we can merge regardless of them once the PR is reviewed and approved by @vblagoje . The failing tests are tracked by #201

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, 99% there, left minor comments to consider and a few requests for changes

date: 1.1.2023
---
```bash
pip install farm-haystack
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just an example document but I can also see how it can create problems from us and confusion on discord 🤣

Let's switch it to haystack-ai?

TEXT = "text/plain"
PDF = "application/pdf"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to serialize and deserialize ok? We usually used something like:

class DOCXTableFormat(Enum):
    """
    Supported formats for storing DOCX tabular data in a Document.
    """

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self):
        return self.value

    @staticmethod
    def from_str(string: str) -> "DOCXTableFormat":
        """
        Convert a string to a DOCXTableFormat enum.
        """
        enum_map = {e.value: e for e in DOCXTableFormat}
        table_format = enum_map.get(string.lower())
        if table_format is None:
            msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return table_format

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only using this internally, so no need to serialize.

def __init__( # noqa: PLR0915
self,
encoding: str = "utf-8",
json_content_key: str = "content",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear what this is? Add init pydoc?



@component
class DocumentIndexer(SuperComponent):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not SentenceTransformerDocumentIndexer? It'll be a bit confusing for people who want to use embedding models from other providers not available via sentence-transformers architecture....

date: 1.1.2023
---
```bash
pip install farm-haystack
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants