Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change shape of filelist list data instead of re-reading it #515

Merged
merged 7 commits into from
Jul 30, 2024

Conversation

roedoejet
Copy link
Member

this allows for any processes that take place on the filelist list data (i.e. normalization) to be preserved

fixes #504

PR Goal?

Ensures that we are outputting a set of symbols without duplicates

Fixes?

#504

Feedback sought?

sanity. code clarity

Priority?

medium

Tests added?

tests updated

How to test?

You could try re-running on your dataset to see how the config changes

Confidence?

medium

Version change?

none. this isn't a breaking change. no version change needed really.

Related PRs?

Copy link

semanticdiff-com bot commented Jul 22, 2024

Review changes with SemanticDiff.

Analyzed 6 of 8 files.

Overall, the semantic diff is 12% smaller than the GitHub diff.

Filename Status
✔️ everyvoice/wizard/dataset.py 21.11% smaller
✔️ everyvoice/utils/__init__.py 0.0% smaller
✔️ everyvoice/tests/test_text.py Analyzed
✔️ everyvoice/tests/test_wizard.py 2.39% smaller
everyvoice/tests/data/unit-test-case1.psv Unsupported file format
everyvoice/tests/data/relative/config/everyvoice-shared-text.yaml Unsupported file format
✔️ everyvoice/model/e2e/config/__init__.py 12.5% smaller
✔️ everyvoice/config/text_config.py 21.31% smaller

@roedoejet roedoejet requested review from SamuelLarkin and wiitt July 22, 2024 17:01
Copy link

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.48%. Comparing base (83742ff) to head (9ab79eb).
Report is 1 commits behind head on main.

Files Patch % Lines
everyvoice/wizard/dataset.py 83.33% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #515      +/-   ##
==========================================
+ Coverage   74.40%   74.48%   +0.07%     
==========================================
  Files          45       45              
  Lines        3004     3029      +25     
  Branches      479      491      +12     
==========================================
+ Hits         2235     2256      +21     
- Misses        676      679       +3     
- Partials       93       94       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

github-actions bot commented Jul 22, 2024

CLI load time: 0:00.24
Pull Request HEAD: 9ab79eba4a9a99f9bb90c888bdb3b91889a8eb00
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

@roedoejet roedoejet requested a review from MENGZHEGENG July 22, 2024 17:59
Copy link
Collaborator

@SamuelLarkin SamuelLarkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -684,6 +685,10 @@ def reload_filelist_data_as_dict(state):
state.get(StepNames.data_has_header_line_step, "yes") == "yes"
),
)
state["filelist_data"] = []
for row in state["filelist_data_list"][1:]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is dropped in state["filelist_data_list"][0]?
Why are we dropping it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first element of state["filelist_data_list"] is the header

Copy link
Collaborator

@SamuelLarkin SamuelLarkin Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah forgot about the header. Got it.
I guess I got confused with ilne 679 which gets the headers.

@marctessier
Copy link
Collaborator

marctessier commented Jul 25, 2024

I ran some tests. List of characters is much shorter now without the duplicates.

ex: NEW: ( sdte note I manually added ':' for moh )

` TEST_characters: [' ', ':' , a, e, h, i, k, l, m, n, o, r, s, t, w, y, à, á, è, é,
    ì, í, ò, ó, ě, ό, –, ’]
  TEST_phones: [d, d͡z, d͡ʒ, e, èː, é, éː, ě, f, h, i, ìː, í, íː, j, k,
    kʷ, l, m, n, n̥, o, òː, ó, óː, s, t, t͡s, ũ, ũ̀ː, ṹ, ṹː, w, z, ɑ, ɑ̀ː,
    ɑ́, ɑ́ː, ɡ, ɡʷ, ɽ, ɽ̥, ʌ̃, ʌ̃̀ː, ʌ̃́, ʌ̃́ː, ʔ, ό]`

Previous version:

`TEST_characters: [' ', '!', '"', '''', (, ), ',', ., /, ':', ;, '?', a, á, b, e,
    é, h, i, ì, í, k, m, n, o, ó, p, r, s, t, v, w, x, y, a, à, á, b, c, d, e, è,
    é, g, h, i, ì, í, k, l, m, n, o, ò, ó, p, r, s, t, u, v, w, x, y, ' ', á, è, é,
    ì, í, ò, ó, à, á, è, é, ì, í, ò, ó, ě, ό, –, —, ’, “, ”]
  TEST_phones: [b, c, d, d͡z, d͡ʒ, e, èː, é, éː, ě, f, g, h, i, ìː, í, íː, j, k, kʷ,
    l, m, n, n̥, o, òː, ó, óː, p, s, t, t͡s, t͡ʃ, u, ũ, ũ̀ː, ṹ, ṹː, v, w, x, z, ɑ,
    ɑ̀ː, ɑ́, ɑ́ː, ɡ, ɡʷ, ɽ, ɽ̥, ʃ, ʌ̃, ʌ̃̀ː, ʌ̃́, ʌ̃́ː, ʔ, ό]` 

Training worked with no issues.

Copy link
Collaborator

@wiitt wiitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me except the pre-commit part in git workflows which I don't feel confident about.

this allows for any processes that take place on the filelist list data (i.e. normalization) to be preserved

fixes #504
validator ensures that symbols are not defined as both punctuation and other symbols

fixes #450
@roedoejet roedoejet force-pushed the dev.ap/text-fixes branch from b83714c to 9004aad Compare July 30, 2024 21:25
@roedoejet roedoejet merged commit 8fc4099 into main Jul 30, 2024
7 checks passed
@roedoejet roedoejet deleted the dev.ap/text-fixes branch July 30, 2024 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Are we creating the set for characters in the wizard prior to normalization!?
4 participants