Question regarding Markdown structure in Jina Reader API #152

Open
medmabcf opened this issue Nov 6, 2024 · 1 comment

medmabcf commented Nov 6, 2024

Hi,

I’m trying to understand the specific markdown structure used by the Jina Reader API when converting HTML to markdown. For instance, I’ve observed the following mappings:

  • <h1> tags are mapped to ==========
  • <h2> tags are mapped to ------

Is this the standard markdown structure followed by the Jina Reader API? Additionally, I’ve noticed that the output can sometimes vary. Is this due to the use of a heuristic method or some other factor?

Thanks!

@nomagick
Member

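We use turndown for the HTML-to-Markdown transformation. Whether h1/h2 are rendered as ATX headings (# / ##) or as Setext headings (text underlined with === / ---) is configurable in turndown through its headingStyle option. We have not customized this option, so Reader follows turndown's default, which is the Setext style.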
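To make the two heading styles concrete, here is a minimal Python sketch of the difference. It is not turndown's code (turndown is a JavaScript library); `render_heading` is a hypothetical helper written only for illustration, and it assumes the Setext underline matches the length of the heading text and that levels 3 and deeper always use the ATX form, since Setext has no notation for them.

```python
def render_heading(text: str, level: int, style: str = "setext") -> str:
    """Render a heading in Setext or ATX style.

    Illustrative only: this mimics the two Markdown heading conventions,
    it is not the implementation used by turndown or Reader.
    """
    if not 1 <= level <= 6:
        raise ValueError("heading level must be between 1 and 6")
    if style not in ("setext", "atx"):
        raise ValueError("style must be 'setext' or 'atx'")

    # Setext only defines two levels: '=' underlines h1, '-' underlines h2.
    if style == "setext" and level < 3:
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"

    # ATX uses one '#' per level and works for every level.
    return f"{'#' * level} {text}"


if __name__ == "__main__":
    print(render_heading("Title", 1))          # Setext h1
    print(render_heading("Section", 2))        # Setext h2
    print(render_heading("Title", 1, "atx"))   # ATX h1
```

Both forms are valid Markdown and render to the same HTML heading, so a consumer that parses the output should accept either one.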

The default output sometimes varies because Reader automatically decides, page by page, whether to apply readability for some level of smart trimming.
If readability clearly would not work for a page, we fall back to a rule-based approach, which is what the markdown format refers to.

If you prefer the markdown format, you can send the request header x-respond-with: markdown or x-return-format: markdown to get a stable return format.
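As a rough sketch of how such a request could be built with Python's standard library: the header name and value come from the comment above, while the `https://r.jina.ai/` endpoint prefix and the helper names `build_reader_request` and `fetch_markdown` are assumptions made for this example and should be checked against the Reader documentation.

```python
import urllib.request

# Assumed endpoint prefix; not stated in this thread.
READER_ENDPOINT = "https://r.jina.ai/"


def build_reader_request(target_url: str,
                         return_format: str = "markdown") -> urllib.request.Request:
    """Build a Reader request that pins the return format via a header."""
    return urllib.request.Request(
        READER_ENDPOINT + target_url,
        headers={"X-Return-Format": return_format},
    )


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    request = build_reader_request(target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch_markdown("https://example.com"))
```

Pinning the format this way avoids the automatic switch between readability and the rule-based path, so repeated requests for the same page should use the same conversion.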
