Formatting within a word is not supported by translations #544

Open
lukasrad02 opened this issue Mar 11, 2024 · 4 comments
@lukasrad02
Contributor

When a translation uses inline formatting that is not surrounded by spaces, e.g. H<strong>e</strong>llo, spaces are automatically added around the formatted text when the content is converted back to Markdown.

Translation editor:
image

Rendered Page:
image

@lukasrad02 lukasrad02 added the [T] bug Something isn't working label Mar 11, 2024
@dropforge
Collaborator

Is this due to newlines being added? What is the HTML output?

@lukasrad02
Contributor Author

No newlines are added, just spaces.

The HTML passed to html2text (see `return html2text.html2text(restore_strings(template, strings), bodywidth=0)`) is identical to the HTML entered in the translation editor.

I think (but haven't verified this yet) that html2text parses the whole HTML input into some AST-like structure that does not preserve formatting and uses some generic formatting rules when rewriting it as markdown, thus adding the spaces.
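For comparison, here is a minimal sketch of a converter that writes Markdown markers exactly where the inline tags were, without any spacing rules. It is not how html2text works; `InlineMarkdownWriter`, `INLINE_MARKERS`, and `inline_html_to_markdown` are hypothetical names, only a handful of inline tags are handled, and it assumes the Markdown renderer accepts intraword `**` emphasis (CommonMark does).

```python
from html.parser import HTMLParser

# Hypothetical mapping from a few inline HTML tags to Markdown markers.
INLINE_MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}


class InlineMarkdownWriter(HTMLParser):
    """Emit Markdown for a few inline tags without touching whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Write the marker exactly where the tag was; no padding is added.
        # Tags that are not in the mapping are dropped silently.
        self.parts.append(INLINE_MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(INLINE_MARKERS.get(tag, ""))

    def handle_data(self, data):
        # Text is copied verbatim, so intraword formatting stays intraword.
        self.parts.append(data)


def inline_html_to_markdown(html):
    writer = InlineMarkdownWriter()
    writer.feed(html)
    writer.close()  # flush any text buffered after the last tag
    return "".join(writer.parts)


print(inline_html_to_markdown("H<strong>e</strong>llo"))  # H**e**llo
```

A real replacement would also need to handle links, block-level elements, and escaping of literal `*` characters, none of which this sketch attempts.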

@dropforge
Collaborator

Is it viable to switch from html2text to a library that translates the source directly as Markdown? @jeriox Some considerations for that:

  1. We might then have to split the source into segments ourselves, e.g. paragraphs, list items, etc.
  2. DeepL intelligently translates link descriptions in HTML, e.g. by moving the semantically equivalent parts of a sentence into or out of <a> tags.
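To give a rough idea of what the first consideration could involve, here is a minimal sketch that splits Markdown source into translatable segments. `split_segments` and `LIST_ITEM` are hypothetical names, not part of the project. The sketch assumes blocks are separated by blank lines and that a list block has exactly one item per line; nested lists, continuation lines, code fences, and headings are not handled.

```python
import re

# Matches a bullet ("-", "*", "+") or numbered ("1.", "1)") list marker.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def split_segments(markdown):
    """Split Markdown into paragraph and list-item segments."""
    segments = []
    # Blocks are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", markdown.strip()):
        lines = block.splitlines()
        if all(LIST_ITEM.match(line) for line in lines):
            # Every line is a list item: one segment per item, marker removed.
            segments.extend(LIST_ITEM.sub("", line) for line in lines)
        else:
            # Otherwise treat the block as one paragraph and join its lines.
            segments.append(" ".join(line.strip() for line in lines))
    return segments


source = "First paragraph\nstill first.\n\n- one\n- two\n\nLast."
print(split_segments(source))
# ['First paragraph still first.', 'one', 'two', 'Last.']
```

Reassembling the translated segments into the original structure would need the markers and block boundaries to be recorded as well, which this sketch leaves out.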

@jeriox
Contributor

jeriox commented Mar 14, 2024

@dropforge I think it would be feasible, and given how many problems the HTML representation has already caused, I think it would be a good way forward. Back when we implemented the prototype/MVP it worked well enough, so we decided to go with it as it was quicker. If you are willing to do a deep dive on that, I'd highly appreciate it!
