feat: add file and indexing related super components #184

mathislucka · 2025-01-31T10:16:07Z

Related Issues

split from: feat: add SuperComponent #174

Proposed Changes:

Adds a file converter and a document indexer.

How did you test it?

Notes for the reviewer

Needs to be merged after: #183
Changes and naming need some more discussion.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

review-notebook-app · 2025-01-31T10:16:13Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coveralls · 2025-01-31T10:20:43Z

Pull Request Test Coverage Report for Build 13260292199

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall first build on feat/ready_made_supercomponents at 73.608%

Totals
Change from base Build 13253060260:	73.6%
Covered Lines:	1983
Relevant Lines:	2694

💛 - Coveralls

haystack_experimental/super_components/converters/file_converter.py

bilgeyucel · 2025-01-31T21:23:15Z

haystack_experimental/super_components/indexers/document_indexer.py

+        pipeline.add_component(
+            "writer",
+            DocumentWriter(
+                document_store=InMemoryDocumentStore(),


Is it possible to make this super component document store agnostic by passing the doc store as an argument? It'd also simplify the Retriever initialization step. Right now, it'll look like this

retriever = InMemoryEmbeddingRetriever(document_store=indexer.pipeline.get_component("writer").document_store)
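The suggestion above amounts to injecting the document store through the constructor instead of creating it inside the component. A minimal sketch of that pattern follows, using only the standard library. `DocumentStoreLike`, `ToyInMemoryStore`, and `ToyIndexer` are hypothetical stand-ins for illustration; they are not Haystack classes and do not reflect the actual `DocumentIndexer` signature.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class DocumentStoreLike(Protocol):
    """Hypothetical minimal interface, not Haystack's real DocumentStore protocol."""

    def write_documents(self, documents: List[str]) -> int: ...

    def count_documents(self) -> int: ...


@dataclass
class ToyInMemoryStore:
    """Stand-in store that keeps documents in a plain list."""

    docs: List[str] = field(default_factory=list)

    def write_documents(self, documents: List[str]) -> int:
        self.docs.extend(documents)
        return len(documents)

    def count_documents(self) -> int:
        return len(self.docs)


class ToyIndexer:
    """Stand-in indexer that receives its store instead of constructing one."""

    def __init__(self, document_store: DocumentStoreLike) -> None:
        self.document_store = document_store

    def run(self, documents: List[str]) -> Dict[str, int]:
        written = self.document_store.write_documents(documents)
        return {"documents_written": written}


# The caller owns the store, so the same instance can be handed to a retriever
# without reaching into the indexer's internal pipeline.
store = ToyInMemoryStore()
indexer = ToyIndexer(document_store=store)
result = indexer.run(["first doc", "second doc"])
```

With this shape, the retriever setup would take the caller's own store object directly, rather than going through `indexer.pipeline.get_component("writer").document_store`.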

julian-risch · 2025-02-18T20:24:41Z

I made a small change to the way FileTypeRouter is initialized to make it work on Windows. An alternative would be to change the implementation of the __init__ of the FileTypeRouter so that it always registers docx, pptx, xlsx MIME types. I decided against it because the docstring of the additional_mimetypes parameter explains how users can do it themselves and it only affects Windows.
The tests that are now failing are unrelated, and we can merge regardless of them once the PR is reviewed and approved by @vblagoje. The failing tests are tracked in #201.
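For context on the Windows issue, here is a minimal sketch of registering the Office MIME types with Python's standard `mimetypes` module, so that extension-based guessing works even where the platform does not know these types by default. This thread does not show how `FileTypeRouter` handles `additional_mimetypes` internally, so the sketch uses only the standard library and should not be read as the change made in this PR.

```python
import mimetypes

# Office Open XML MIME types that some platforms do not register by default.
OFFICE_MIME_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Register each mapping explicitly so guessing no longer depends on the OS registry.
for extension, mime_type in OFFICE_MIME_TYPES.items():
    mimetypes.add_type(mime_type, extension)

# guess_type returns a (type, encoding) tuple; only the type is of interest here.
guessed = {ext: mimetypes.guess_type(f"sample{ext}")[0] for ext in OFFICE_MIME_TYPES}
```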

vblagoje

Very cool, 99% there, left minor comments to consider and a few requests for changes

vblagoje · 2025-02-19T10:17:18Z

examples/sample_files/sample.md

+date: 1.1.2023
+---
+```bash
+pip install farm-haystack


I know this is just an example document, but I can also see how it could create problems for us and confusion on Discord 🤣

Let's switch it to haystack-ai?

vblagoje · 2025-02-19T10:19:42Z

haystack_experimental/super_components/converters/multi_file_converter.py

+    TEXT = "text/plain"
+    PDF = "application/pdf"
+    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
+    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


Is this going to serialize and deserialize ok? We usually used something like:

    class DOCXTableFormat(Enum):
        """
        Supported formats for storing DOCX tabular data in a Document.
        """

        MARKDOWN = "markdown"
        CSV = "csv"

        def __str__(self):
            return self.value

        @staticmethod
        def from_str(string: str) -> "DOCXTableFormat":
            """
            Convert a string to a DOCXTableFormat enum.
            """
            enum_map = {e.value: e for e in DOCXTableFormat}
            table_format = enum_map.get(string.lower())
            if table_format is None:
                msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
                raise ValueError(msg)
            return table_format

mathislucka · 2025-02-19

We are only using this internally, so no need to serialize.
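If serialization were ever needed, a value-based round trip would be one option. The sketch below assumes a plain `Enum` holding the MIME strings from the diff above. The class name `MimeType` is hypothetical, and the JSON payload shape is made up for illustration.

```python
import json
from enum import Enum


class MimeType(Enum):  # hypothetical name, not taken from the PR
    TEXT = "text/plain"
    PDF = "application/pdf"
    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


# Serialize by value: the enum member itself is not JSON-serializable, its string is.
payload = json.dumps({"mime_type": MimeType.PDF.value})

# Deserialize by value lookup: Enum(value) returns the matching member
# or raises ValueError for an unknown string.
restored = MimeType(json.loads(payload)["mime_type"])
```

Calling the enum with a value already raises `ValueError` on unknown input, so a dedicated `from_str` helper is mainly useful for case-insensitive matching or a custom error message.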

vblagoje · 2025-02-19T10:20:51Z

haystack_experimental/super_components/converters/multi_file_converter.py

+    def __init__( # noqa: PLR0915
+        self,
+        encoding: str = "utf-8",
+        json_content_key: str = "content",


Unclear what this is? Add init pydoc?

vblagoje · 2025-02-19T10:22:41Z

haystack_experimental/super_components/indexers/document_indexer.py

+
+
+@component
+class DocumentIndexer(SuperComponent):


Why not SentenceTransformerDocumentIndexer? The current name will be a bit confusing for people who want to use embedding models from other providers that are not available via the sentence-transformers architecture.

vblagoje · 2025-02-19T10:25:49Z

test/test_files/markdown/sample.md

+date: 1.1.2023
+---
+```bash
+pip install farm-haystack


And this one 🙏

mathislucka and others added 18 commits January 31, 2025 10:07

add super component utils

6af8913

add super component base

fc20b49

add super component

ca66f49

add init

251b9ae

add SuperComponent tests

948574f

add type utils tests

7b3009f

format

5aebc78

add init

144cde8

add file converter

d08a37a

add example notebook

0ad8b96

feat: implement DocumentIndexer super component

402e91b

chore: add license headers

647c4a2

fix: raise error on import of AsyncPipeline (#178)

0da3c59

* raise error for async import * Remove all async pipeline tests (cherry picked from commit cbbf088)

test: add unit tests for DocumentIndexer

78e1fcd

update indexer to take model names instead of component

56412be

more tests

eca5d67

fix

301b523

remove unused imports

4201998

mathislucka requested a review from a team as a code owner January 31, 2025 10:16

mathislucka requested review from vblagoje and removed request for a team January 31, 2025 10:16

mathislucka mentioned this pull request Jan 31, 2025

feat: add SuperComponent #174

Draft

bilgeyucel reviewed Feb 4, 2025

mathislucka added 5 commits February 6, 2025 10:53

refactor: use single base class

159af04

Merge branch 'main' into feat/supercomponent_base

6c2226a

# Conflicts: # haystack_experimental/core/__init__.py

fix: format

5ce83ca

fix: SuperComponent is a component

a6fca36

Merge branch 'feat/supercomponent_base' into feat/ready_made_supercom…

3f33ddd

…ponents # Conflicts: # haystack_experimental/core/__init__.py

test: MultiFileConverter

eac0142

mathislucka requested a review from a team as a code owner February 6, 2025 17:15

mathislucka requested review from dfokina and removed request for a team February 6, 2025 17:15

mathislucka and others added 11 commits February 7, 2025 13:39

Merge branch 'refs/heads/main' into feat/ready_made_supercomponents

4f71375

# Conflicts: # haystack_experimental/core/super_component/super_component.py

feat: add preprocessor

8c13dc8

feat: update document indexer

d92fe9e

feat: update notebook

0aee13d

lint and format

ba932fc

license headers

cbbb445

Merge branch 'main' into feat/ready_made_supercomponents

2d10891

add test dependencies

e6149d0

Merge remote-tracking branch 'origin/feat/ready_made_supercomponents'…

4ec386a

… into feat/ready_made_supercomponents

ensure docx, pptx, xlsx MIME types are registered

276bbfb

Merge branch 'main' into feat/ready_made_supercomponents

84aa2a7

vblagoje requested changes Feb 19, 2025
