Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add whitespace collapsing and text stripping by default #518

Merged
merged 3 commits into from
Jul 30, 2024

Conversation

roedoejet
Copy link
Member

PR Goal?

Whitespace collapsing wasn't being applied by the wizard, only the chosen cleaners. This changes that.

Fixes?

Feedback sought?

sanity, code check

Priority?

medium

Tests added?

How to test?

Confidence?

medium-high

Version change?

no

Related PRs?

#515 and #516

Copy link

semanticdiff-com bot commented Jul 24, 2024

Review changes with SemanticDiff.

Analyzed 6 of 7 files.

Overall, the semantic diff is 6% smaller than the GitHub diff.

Filename Status
✔️ everyvoice/wizard/dataset.py 2.26% smaller
✔️ everyvoice/utils/__init__.py 0.0% smaller
✔️ everyvoice/tests/test_text.py Analyzed
✔️ everyvoice/tests/test_wizard.py 2.92% smaller
everyvoice/tests/data/unit-test-case1.psv Unsupported file format
✔️ everyvoice/model/e2e/config/__init__.py 12.5% smaller
✔️ everyvoice/config/text_config.py 92.33% smaller

@roedoejet roedoejet requested a review from wiitt July 24, 2024 18:36
@roedoejet roedoejet force-pushed the dev.ap/collapse-whitespace branch from f6a821a to 28bc336 Compare July 24, 2024 18:37
Copy link
Contributor

github-actions bot commented Jul 24, 2024

CLI load time: 0:00.23
Pull Request HEAD: 30fde0c53efb793567ab7b5af7d6f51b1e46c976
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

Copy link

codecov bot commented Jul 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.54%. Comparing base (9004aad) to head (73a03ec).

Additional details and impacted files
@@                  Coverage Diff                  @@
##           dev.ap/text-fixes     #518      +/-   ##
=====================================================
+ Coverage              74.50%   74.54%   +0.04%     
=====================================================
  Files                     45       45              
  Lines                   3016     3021       +5     
  Branches                 485      487       +2     
=====================================================
+ Hits                    2247     2252       +5     
  Misses                   676      676              
  Partials                  93       93              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roedoejet roedoejet force-pushed the dev.ap/text-fixes branch from b0305e1 to a751931 Compare July 24, 2024 18:56
@roedoejet roedoejet force-pushed the dev.ap/collapse-whitespace branch from 28bc336 to b840ce8 Compare July 24, 2024 18:58
@roedoejet roedoejet force-pushed the dev.ap/text-fixes branch from a751931 to b83714c Compare July 24, 2024 19:26
Copy link
Collaborator

@wiitt wiitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an expected behaviour that, if no text normalization is applied, the character list remains duplicates of empty symbols?
For example, creating a project from /sgile/data/MohawkCorpus/am_corpus without text normalization results in

moh_characters: ['', '', (, ), '0', '1', '3', '8', a, c, d, e, g, h, i, k, l, m,
    n, o, p, r, s, t, u, v, w, x, y, z, '', à, á, è, é, ì, í, ò, ó]

where you can find three instances of ''.

I also noticed that changes from #515 are highlighted as changes of this PR as well.

@roedoejet roedoejet force-pushed the dev.ap/collapse-whitespace branch from 73a03ec to 114297b Compare July 30, 2024 21:36
@roedoejet
Copy link
Member Author

Is it an expected behaviour that, if no text normalization is applied, the character list remains duplicates of empty symbols? For example, creating a project from /sgile/data/MohawkCorpus/am_corpus without text normalization results in

moh_characters: ['', '', (, ), '0', '1', '3', '8', a, c, d, e, g, h, i, k, l, m,
    n, o, p, r, s, t, u, v, w, x, y, z, '', à, á, è, é, ì, í, ò, ó]

where you can find three instances of ''.

I also noticed that changes from #515 are highlighted as changes of this PR as well.

great find @wiitt - no, this is not intended behaviour. It was an unnecessary if response: condition that was preventing it from applying. I've fixed this and added a test to catch this in the future.

for the e2e config, we don't want to require contact information declarations on each submodel
@roedoejet roedoejet merged commit 9ab79eb into dev.ap/text-fixes Jul 30, 2024
3 checks passed
@roedoejet roedoejet deleted the dev.ap/collapse-whitespace branch July 30, 2024 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants