Add support for Reader API to convert HTMLs into Documents #663

Closed
7 of 8 tasks
bilgeyucel opened this issue Apr 16, 2024 · 14 comments
Labels: contributions wanted!, feature request, integration:jina


@bilgeyucel
Contributor

bilgeyucel commented Apr 16, 2024

Is your feature request related to a problem? Please describe.
There's no component to use Jina's Reader API with Haystack.

Describe the solution you'd like
A new JinaHTMLtoDocument (name TBD) component to use Jina's Reader API to convert URLs into Haystack Documents. This component should accept a URL and output a Haystack Document.

Describe alternatives you've considered

  • The component could output a Markdown file, which users would then pass to MarkdownConverter in a pipeline (less intuitive in Haystack, but it might have advantages)
  • Depending on how the Reader API works, the component could accept a list of URLs and return a list of Haystack Documents
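
A rough sketch of the list-of-URLs variant described above, assuming the Reader API is called by prefixing the target URL with https://r.jina.ai/ and returns Markdown as the response body. `SimpleDocument`, `reader_url`, and `to_documents` are hypothetical names used only for illustration, not Haystack or Jina APIs; the fetch function is injected so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed Reader endpoint prefix; check the Jina documentation before relying on it.
READER_PREFIX = "https://r.jina.ai/"


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def reader_url(url: str) -> str:
    """Build the Reader request URL for a target page."""
    return READER_PREFIX + url


def to_documents(urls: List[str], fetch: Callable[[str], str]) -> List[SimpleDocument]:
    """Return one document per URL; `fetch` maps a request URL to its Markdown body."""
    return [
        SimpleDocument(content=fetch(reader_url(url)), meta={"url": url})
        for url in urls
    ]


def fake_fetch(request_url: str) -> str:
    """Offline replacement for an HTTP GET, used only in this example."""
    return f"# Markdown for {request_url}"


docs = to_documents(["https://example.com"], fake_fetch)
```

In a real component the injected `fetch` would be an HTTP call carrying the API key, and `SimpleDocument` would be replaced by Haystack's own Document class.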


Tasks

  1. P1 type:documentation
    dfokina

@bilgeyucel added the labels contributions wanted!, feature request, and integration:jina on Apr 16, 2024
@anakin87
Member

I was the one who proposed this component.
Unfortunately, I tried the service and it is quite unstable at the moment.

@jlonge4
Contributor

jlonge4 commented Oct 16, 2024

Hey there @bilgeyucel @anakin87!

Is this still a thing? I just toyed around with the API and got good results. Would be happy to knock this out if you guys think it's valuable.

@anakin87
Member

@jlonge4 I think the API improved over time.

I see they now have different endpoints for converting a page into markdown, searching the web and grounding (experimental).
What's your idea?

@jlonge4
Contributor

jlonge4 commented Oct 17, 2024

@anakin87 I think it's pretty cool. Do you think the existing LinkContentFetcher/Web Search components have too much overlap in functionality with it?

@anakin87
Member

I would say it is just another nice option.

Are you thinking of a single component or more than one?

@jlonge4
Contributor

jlonge4 commented Oct 17, 2024

@anakin87 I agree! Would passing modes at init time to a single component make sense?
Like reader = JinaReader(mode="read") or something to designate which endpoint to use.

@anakin87
Member

I'm thinking of something like:

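from typing import Union

from haystack import Document, component
from haystack.utils import Secret

@component
class JinaReader:

    def __init__(
        self,
        # required parameter first: a parameter without a default cannot follow one with a default
        mode: Union["Mode", str],
        api_key: Secret = Secret.from_env_var("JINA_API_KEY"),
        # ... (further options elided)
    ):
        ...

    @component.output_types(document=Document)
    def run(self, input: str):
        # check input depending on mode
        ...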

Mode can be an Enum like this (with a convenient from_str method):
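
A minimal sketch of what such an enum could look like, assuming one member per endpoint mentioned in this thread (read, search, ground). The member names and the error message are illustrative assumptions, not the code from the original comment.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical modes, one per Reader API endpoint discussed in this thread."""

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "Mode":
        """Map a string such as "read" to the matching enum member."""
        by_value = {member.value: member for member in cls}
        mode = by_value.get(string)
        if mode is None:
            raise ValueError(
                f"Unknown mode '{string}'. Supported modes: {list(by_value)}"
            )
        return mode


# Accept either form at init time, as in mode: Union[Mode, str]
mode = Mode.from_str("read")
```

With this in place, a constructor that receives a plain string can normalize it once with `Mode.from_str` and work with enum members afterwards.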

@bilgeyucel
Contributor Author

@anakin87 @jlonge4, what are the exact features of Jina Reader API now? I'm asking because we use reader components for extractive QA tasks, and I don't think the JinaReader component will fit well into that category. Does it make sense to name it JinaReaderConverter, maybe?

@anakin87
Member

  • Convert URL into Markdown
  • Search the web and convert results to Markdown
  • Ground a statement with web knowledge (only paid, haven't tried)

https://jina.ai/reader/
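
To make the "check input depending on mode" idea concrete, here is a hedged sketch that maps each of the three features above to a request URL. The endpoint prefixes are assumptions based on the public Reader page and should be verified there; `ENDPOINTS` and `build_request_url` are hypothetical names, and no request is actually sent.

```python
from urllib.parse import quote

# Assumed endpoint prefixes; verify against https://jina.ai/reader/ before use.
ENDPOINTS = {
    "read": "https://r.jina.ai/",    # convert a URL into Markdown
    "search": "https://s.jina.ai/",  # search the web, results as Markdown
    "ground": "https://g.jina.ai/",  # ground a statement with web knowledge
}


def build_request_url(mode: str, query: str) -> str:
    """Validate `query` for the given mode and return the URL to request."""
    if mode not in ENDPOINTS:
        raise ValueError(f"Unknown mode '{mode}'. Supported modes: {list(ENDPOINTS)}")
    if mode == "read":
        # read mode expects a full URL, which is appended as-is
        if not query.startswith(("http://", "https://")):
            raise ValueError("read mode expects a URL starting with http:// or https://")
        return ENDPOINTS[mode] + query
    # search and ground take free text, so percent-encode it
    return ENDPOINTS[mode] + quote(query)


read_url = build_request_url("read", "https://example.com")
search_url = build_request_url("search", "haystack pipelines")
```

Keeping the validation in one function means a single component can serve all three modes while still rejecting, for example, free text passed in read mode.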

@jlonge4
Copy link
Contributor

jlonge4 commented Oct 17, 2024

@anakin87 looks great, I'll get it cooking asap!
@bilgeyucel you have a great point, it definitely is more of a converter or fetcher vs a reader.

@anakin87
Member

@jlonge4 I have added a tasklist to #663 (comment).

Could you maybe help with opening a PR to mention the JinaReaderConnector in https://github.com/deepset-ai/haystack-integrations/blob/main/integrations/jina.md?
(I see the focus is on embedding models, so maybe a brief mention + link to examples is OK)

@jlonge4
Contributor

jlonge4 commented Nov 22, 2024

@anakin87 you've got it, no problem 😎

@jlonge4
Contributor

jlonge4 commented Nov 23, 2024

@anakin87 deepset-ai/haystack-integrations#288

@anakin87
Member

Closing this issue.

(Only social media announcement is missing.)
Added an item for this component to Weekly Announcements - https://github.com/deepset-ai/devrel-board/issues/533
