-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add whitespace collapsing and text stripping by default #518
Conversation
Review changes with SemanticDiff. Analyzed 6 of 7 files. Overall, the semantic diff is 6% smaller than the GitHub diff.
|
f6a821a
to
28bc336
Compare
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev.ap/text-fixes #518 +/- ##
=====================================================
+ Coverage 74.50% 74.54% +0.04%
=====================================================
Files 45 45
Lines 3016 3021 +5
Branches 485 487 +2
=====================================================
+ Hits 2247 2252 +5
Misses 676 676
Partials 93 93 ☔ View full report in Codecov by Sentry. |
b0305e1
to
a751931
Compare
28bc336
to
b840ce8
Compare
a751931
to
b83714c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it an expected behaviour that, if no text normalization is applied, the character list remains duplicates of empty symbols?
For example, creating a project from /sgile/data/MohawkCorpus/am_corpus
without text normalization results in
moh_characters: ['', '', (, ), '0', '1', '3', '8', a, c, d, e, g, h, i, k, l, m,
n, o, p, r, s, t, u, v, w, x, y, z, '', à, á, è, é, ì, í, ò, ó]
where you can find three instances of ''.
I also noticed that changes from #515 are highlighted as changes of this PR as well.
b83714c
to
9004aad
Compare
b840ce8
to
73a03ec
Compare
73a03ec
to
114297b
Compare
great find @wiitt - no, this is not intended behaviour. It was an unnecessary |
for the e2e config, we don't want to require contact information declarations on each submodel
PR Goal?
Whitespace collapsing wasn't being applied by the wizard, only the chosen cleaners. This changes that.
Fixes?
Feedback sought?
sanity, code check
Priority?
medium
Tests added?
✅
How to test?
Confidence?
medium-high
Version change?
no
Related PRs?
#515 and #516