Remove linebreaks within paragraphs on export to ODT / Apply font size in HTML #666

Open
Moini opened this issue Feb 6, 2024 · 3 comments

Comments

@Moini

Moini commented Feb 6, 2024

Hi 👋

I've been wanting to make a PDF zoomable for my e-ink ereader, and I've found that in order to do that, I need text that wraps automatically (else I have to perpetually move the page around to read the text).

I would like to retain font sizes, as the export to ODT does, so I can tell titles from paragraphs. The export to HTML does not apply them, for some odd reason, even though the data is in the file. With HTML, though, reflowing works...

For reflowing to work in ODT, recognized paragraphs must not contain any hard line breaks.

Could you please add an option to remove those from recognized paragraphs?

And add an option to insert/keep hard linebreaks within a paragraph when the length of the line is less than x percent of the paragraph width? Those are usually lines where it makes sense to have that hard break.
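A minimal sketch of the short-line heuristic requested above, in Python. It is not gImageReader code: the function name `reflow_paragraph` and the `keep_below` threshold are hypothetical, and the character count of the longest line is used as a rough stand-in for the paragraph width, which a real implementation would more likely take from the recognized bounding boxes.

```python
def reflow_paragraph(lines, keep_below=0.6):
    """Join the lines of one paragraph, keeping a hard break after short lines.

    keep_below is a hypothetical threshold: a line shorter than this fraction
    of the longest line in the paragraph keeps its line break.
    """
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return ""
    # Assumption: character count approximates the rendered width.
    width = max(len(line) for line in lines)
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i == len(lines) - 1:
            break
        # A short line is probably an intentional break (salutation, verse, list item).
        out.append("\n" if len(line) < keep_below * width else " ")
    return "".join(out)
```

With the default threshold, a short salutation line keeps its break while the full-width lines after it are joined with spaces.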

And/or add an option to apply the recognized styling to the text in HTML? It's frustrating to have it sit in the title attribute without being used... Or am I misunderstanding something?

@manisandro
Owner

Something like the strip line break functionality of the plain text mode should be doable.

Regarding applying styling to the HTML: as far as I know this is how hOCR HTML files are structured, but do feel free to research the format further.
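For context, hOCR stores recognition properties as semicolon-separated `name value` pairs in the `title` attribute of each element. The sketch below shows how such a string could be turned into a CSS font size. It is only an illustration: the function names are hypothetical, the property name `x_fsize` and the `pt` unit are assumptions about what a given file contains, and nothing here is taken from gImageReader's export code.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a dict of property name -> value string."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def font_size_style(title, prop="x_fsize"):
    """Return a CSS declaration for the font size, or "" if the property is absent.

    Assumption: the size is stored under `prop` and is expressed in points.
    """
    size = parse_hocr_title(title).get(prop)
    return f"font-size: {size}pt" if size else ""
```

A post-processing script could write the returned declaration into a `style` attribute on the same element, which would make the sizes visible in a browser without changing the hOCR data itself.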

@Moini
Author

Moini commented Feb 7, 2024

@manisandro Thanks, I see, it's a special kind of XHTML, and not supposed to be used in a browser, but for overlay PDFs with image / text layer. I thought it meant HTML in the save dialog. What 'hOCR' in the dropdown meant wasn't clear to me, but it provided recognition of font sizes and paragraphs, according to the available settings, and that was what I had been looking for.

Being able to strip line breaks would help a lot!

@lukruh

lukruh commented Feb 12, 2024

Right now I have a tiny script bound to a hotkey that removes single line breaks from text in the clipboard. I'd upvote a solution in this nice tool (at least for plain text). It could be as simple as replacing "-\n" with "" and then "\n" with " ". Maybe touching the double "\n\n" can be avoided using some regex?
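The two replacements described above can be sketched with Python's `re` module. This is one possible reading of the suggestion, not the script mentioned in the comment; the function name `strip_line_breaks` is hypothetical. Lookarounds leave blank-line paragraph breaks ("\n\n") untouched.

```python
import re


def strip_line_breaks(text):
    """Remove single line breaks while keeping blank-line paragraph breaks."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"-\n(?=\S)", "", text)
    # Replace a newline with a space only if it has no newline before or after it.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

One caveat: the first substitution also removes genuine hyphens that happen to fall at a line end, so "e-\nink" becomes "eink". Telling the two cases apart would probably need a dictionary lookup.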
