-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add file and indexing related super components #184
base: main
Are you sure you want to change the base?
Conversation
* raise error for async import * Remove all async pipeline tests (cherry picked from commit cbbf088)
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Pull Request Test Coverage Report for Build 13260292199Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
haystack_experimental/super_components/converters/file_converter.py
Outdated
Show resolved
Hide resolved
pipeline.add_component( | ||
"writer", | ||
DocumentWriter( | ||
document_store=InMemoryDocumentStore(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to make this super component document store agnostic by passing the doc store as an argument? It'd also simplify the Retriever initialization step. Right now, it'll look like this
retriever = InMemoryEmbeddingRetriever(document_store=indexer.pipeline.get_component("writer").document_store)
# Conflicts: # haystack_experimental/core/__init__.py
…ponents # Conflicts: # haystack_experimental/core/__init__.py
# Conflicts: # haystack_experimental/core/super_component/super_component.py
… into feat/ready_made_supercomponents
I made a small change to the way FileTypeRouter is initialized to make it work on Windows. An alternative would be to change the implementation of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool, 99% there, left minor comments to consider and a few requests for changes
date: 1.1.2023 | ||
--- | ||
```bash | ||
pip install farm-haystack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is just an example document but I can also see how it can create problems from us and confusion on discord 🤣
Let's switch it to haystack-ai?
TEXT = "text/plain" | ||
PDF = "application/pdf" | ||
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation" | ||
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to serialize and deserialize ok? We usually used something like:
class DOCXTableFormat(Enum):
"""
Supported formats for storing DOCX tabular data in a Document.
"""
MARKDOWN = "markdown"
CSV = "csv"
def __str__(self):
return self.value
@staticmethod
def from_str(string: str) -> "DOCXTableFormat":
"""
Convert a string to a DOCXTableFormat enum.
"""
enum_map = {e.value: e for e in DOCXTableFormat}
table_format = enum_map.get(string.lower())
if table_format is None:
msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
raise ValueError(msg)
return table_format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only using this internally, so no need to serialize.
def __init__( # noqa: PLR0915 | ||
self, | ||
encoding: str = "utf-8", | ||
json_content_key: str = "content", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear what this is? Add init pydoc?
|
||
|
||
@component | ||
class DocumentIndexer(SuperComponent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not SentenceTransformerDocumentIndexer
? It'll be a bit confusing for people who want to use embedding models from other providers not available via sentence-transformers architecture....
date: 1.1.2023 | ||
--- | ||
```bash | ||
pip install farm-haystack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one 🙏
Related Issues
Proposed Changes:
Adds a file converter and a document indexer.
How did you test it?
Notes for the reviewer
Needs to be merged after: #183
Changes and naming need some more discussion.
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.