rework text processors #793
Conversation
✔️ No visual differences introduced by this PR. View Playwright Report (note: open the "playwright-report" artifact).
Dumb question, but what is a source map?
An object for keeping track of which input characters were replaced by more or fewer than one character during preprocessing (a term apparently coined by nutbread; it isn't in common use). It helps keep track of the length of the original scanned text so that the right number of characters can be highlighted on scan. The current code preprocesses the text, then deinflects its substrings starting from the longest. This PR includes a loop interchange: we now take substrings first, then preprocess them (necessary for Hangul). That makes keeping track of the original text's length simpler and makes the TextSourceMap redundant.
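To make the loop interchange concrete, here is a rough sketch of the two orderings. The names and stand-in callbacks are illustrative only, not the actual Yomitan code:

```js
// Before: preprocess the whole scanned text once, then deinflect its substrings.
// Because preprocessing can change the character count, a TextSourceMap was needed
// to map processed offsets back to the original text for highlighting.
function lookupOld(text, preprocess, deinflect) {
    const processed = preprocess(text);
    const results = [];
    for (let i = processed.length; i > 0; --i) {
        results.push(...deinflect(processed.substring(0, i)));
    }
    return results;
}

// After: take substrings of the *original* text first, then preprocess each one.
// The original substring (and its length) is known directly, so no source map is needed.
function lookupNew(text, preprocess, deinflect) {
    const results = [];
    for (let i = text.length; i > 0; --i) {
        const rawSource = text.substring(0, i);
        results.push(...deinflect(preprocess(rawSource)).map((entry) => ({...entry, rawSource})));
    }
    return results;
}

// Tiny demo with trivial stand-ins:
const preprocess = (s) => s.toLowerCase();
const deinflect = (s) => [{term: s}];
console.log(lookupNew('Grilled', preprocess, deinflect)[0]); // { term: 'grilled', rawSource: 'Grilled' }
```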
ngl this PR feels a bit out of my league. Asking some questions for now for my own edification and I'll take another look tomorrow
@@ -2548,8 +2548,8 @@
"audio": "",
"clipboard-image": "",
"clipboard-text": "",
"cloze-body": "打",
"cloze-body-kana": "だ",
"cloze-body": "打(う)",
What is this change?
The test inputs include two cases with some fancy text replacements for parentheses.
https://github.com/themoeway/yomitan/blob/69b57e5efdc8a25025606569f8f20ea508a5263d/test/data/translator-test-inputs.json#L180-L225
This is a small change to the behavior of the {cloze-body} handlebars in this case.
I think no one will be affected by this, as the set of users who use these handlebars markers, use the obscure text replacements at all, and use them in this particular way is probably empty.
for (const [key, value] of textPreprocessorOptionsSpace) {
    variantSpace.set(key, value);
}
const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
For my own edification: what is "variant space" here?
Each processor has an array of its possible options (e.g. [false, true], ['off', 'direct', 'inverse'], or [[false, false], [true, false], [true, true]]). Since we need to run all combinations of the processors, this is like each processor's options being a dimension of a matrix, or an axis of a coordinate system / vector space, and we get all of the combinations by traversing all of the cells in the matrix / points in the space.
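For anyone curious what "traversing the space" looks like in practice, here is a minimal sketch (not the actual _getArrayVariants implementation; the option keys are only examples) that enumerates the Cartesian product of an options-space Map:

```js
// Enumerate every combination of per-processor options: the Cartesian product of
// the option arrays in the options-space Map. Each variant maps processor id -> one value.
function getVariants(optionsSpace) {
    let variants = [new Map()];
    for (const [id, values] of optionsSpace) {
        const next = [];
        for (const variant of variants) {
            for (const value of values) {
                next.push(new Map([...variant, [id, value]]));
            }
        }
        variants = next;
    }
    return variants;
}

// Example: a 2-option axis and a 3-option axis give 2 * 3 = 6 variants.
const space = new Map([
    ['convertHalfWidthCharacters', [false, true]],
    ['collapseEmphaticSequences', [[false, false], [true, false], [true, true]]],
]);
console.log(getVariants(space).length); // 6
```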
Starting to understand the high-level logic and it looks fine to me. Just a few questions for my understanding.
const preprocessorVariantSpace = new Map(preprocessorOptionsSpace);
preprocessorVariantSpace.set('textReplacements', this._getTextReplacementsVariants(options));
const preprocessorVariants = this._getArrayVariants(preprocessorVariantSpace);
const postprocessorVariants = this._getArrayVariants(postprocessorOptionsSpace);

/** @type {import('translation-internal').DatabaseDeinflection[]} */
const deinflections = [];
const used = new Set();
What does used here mean semantically?
It's a set of strings the deinflector has already been run on; there's no need to run it again on those, since their results are already in the deinflections array. It's purely a performance optimization.
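In other words, it is a plain dedupe. Roughly (illustrative stand-ins, not the real deinflector):

```js
// `deinflect` is a trivial stand-in for the real deinflector.
const deinflect = (text) => [{term: text, rules: []}];

const deinflections = [];
const used = new Set();
// Different preprocessor variants can produce the same processed string.
const candidates = ['食べました', '食べました', '食べた'];
for (const text of candidates) {
    if (used.has(text)) { continue; } // already deinflected this exact string
    used.add(text);
    deinflections.push(...deinflect(text));
}
console.log(deinflections.length); // 2, not 3
```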
_applyTextProcessors(textProcessors, processorVariant, text, textCache) {
    for (const {id, textProcessor: {process}} of textProcessors) {
        const setting = processorVariant.get(id);
        let level1 = textCache.get(text);
This textCache logic seems to be what sourceMap was doing originally. Is that correct?
No, sourceMap had a functional purpose; it was necessary for the algorithm. The textCache is there just to improve performance by minimizing calls to textProcessor.process.
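For reference, the caching idea amounts to memoizing process(text, setting) per input text, processor id, and setting. A hedged sketch (the real textCache structure in the PR may differ):

```js
// Memoize process(text, setting) keyed by input text -> processor id -> setting.
// Assumes settings are primitives or stable references so Map lookups hit.
function processWithCache(textCache, text, id, process, setting) {
    let byProcessor = textCache.get(text);
    if (typeof byProcessor === 'undefined') {
        byProcessor = new Map();
        textCache.set(text, byProcessor);
    }
    let bySetting = byProcessor.get(id);
    if (typeof bySetting === 'undefined') {
        bySetting = new Map();
        byProcessor.set(id, bySetting);
    }
    if (!bySetting.has(setting)) {
        bySetting.set(setting, process(text, setting));
    }
    return bySetting.get(setting);
}
```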
Oh and please resolve conflicts.
Part of #787; these are the changes necessary to support Korean.
Text preprocessing happens before deinflection; text postprocessing, added here, happens after deinflection. This is primarily to support jamo de/recomposition for Hangul, but it could have other uses.
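As a conceptual illustration only (not necessarily how the PR's Korean processors are implemented), the jamo round trip can be seen with standard Unicode normalization:

```js
// NFD splits precomposed hangul syllables into conjoining jamo (useful before
// deinflection); NFC recomposes them afterwards.
const word = '먹었다';                    // 3 precomposed syllables
const decomposed = word.normalize('NFD'); // conjoining jamo
console.log([...decomposed].length);      // 8 code points
const recomposed = decomposed.normalize('NFC');
console.log(recomposed === word);         // true
```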
Changes the order of operations in the translator so that substrings of the looked-up text are preprocessed and then deinflected (instead of deinflecting substrings of the preprocessed text). As a consequence, the (complicated) TextSourceMap is no longer used. This too is primarily motivated by Hangul, but it also fixes an issue where textProcessors that change the number of characters caused the wrong text to be selected on scan (this affected non-Japanese languages). There is a slight change in behavior for some textReplacements.