fix: change shape of filelist list data instead of re-reading it #515

roedoejet · 2024-07-22T17:00:48Z

This preserves the results of any processing already applied to the filelist list data (e.g. normalization).

fixes #504

PR Goal?

Ensures that we are outputting a set of symbols without duplicates

Fixes?

#504

Feedback sought?

Sanity check, code clarity.

Priority?

medium

Tests added?

tests updated

How to test?

You could try re-running on your dataset to see how the config changes

Confidence?

medium

Version change?

None. This isn't a breaking change, so no version change is needed.

Related PRs?

semanticdiff-com · 2024-07-22T17:00:51Z

Review changes with SemanticDiff.

Analyzed 6 of 8 files.

Overall, the semantic diff is 12% smaller than the GitHub diff.

	Filename	Status
✔️	everyvoice/wizard/dataset.py	21.11% smaller
✔️	everyvoice/utils/__init__.py	0.0% smaller
✔️	everyvoice/tests/test_text.py	Analyzed
✔️	everyvoice/tests/test_wizard.py	2.39% smaller
❔	everyvoice/tests/data/unit-test-case1.psv	Unsupported file format
❔	everyvoice/tests/data/relative/config/everyvoice-shared-text.yaml	Unsupported file format
✔️	everyvoice/model/e2e/config/__init__.py	12.5% smaller
✔️	everyvoice/config/text_config.py	21.31% smaller

codecov · 2024-07-22T17:03:11Z

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.48%. Comparing base (83742ff) to head (9ab79eb).
Report is 1 commits behind head on main.

Files	Patch %	Lines
everyvoice/wizard/dataset.py	83.33%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #515      +/-   ##
==========================================
+ Coverage   74.40%   74.48%   +0.07%     
==========================================
  Files          45       45              
  Lines        3004     3029      +25     
  Branches      479      491      +12     
==========================================
+ Hits         2235     2256      +21     
- Misses        676      679       +3     
- Partials       93       94       +1


github-actions · 2024-07-22T17:03:47Z

CLI load time: 0:00.24
Pull Request HEAD: 9ab79eba4a9a99f9bb90c888bdb3b91889a8eb00
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

SamuelLarkin

LGTM

SamuelLarkin · 2024-07-24T15:30:22Z

everyvoice/wizard/dataset.py

@@ -684,6 +685,10 @@ def reload_filelist_data_as_dict(state):
                state.get(StepNames.data_has_header_line_step, "yes") == "yes"
            ),
        )
+        state["filelist_data"] = []
+        for row in state["filelist_data_list"][1:]:


What is dropped in state["filelist_data_list"][0]?
Why are we dropping it?

The first element of state["filelist_data_list"] is the header

Yeah, I forgot about the header. Got it.
I guess I got confused with line 679, which gets the headers.
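
For readers following the thread, the reshaping discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `everyvoice/wizard/dataset.py`: the helper name `rows_to_dicts` and the sample column names are hypothetical, and the sketch assumes the first element of the list is always the header row.

```python
def rows_to_dicts(rows_with_header):
    """Turn [header, row1, row2, ...] into a list of dicts keyed by the header.

    Hypothetical helper: assumes element 0 is the header row, as described
    in the review thread, and that every row has one value per header field.
    """
    if not rows_with_header:
        return []
    header, *rows = rows_with_header  # element 0 is the header, so it is skipped
    return [dict(zip(header, row)) for row in rows]


# Example data (made up for illustration); any processing already applied
# to these rows, such as normalization, carries over to the dict form.
filelist_data_list = [
    ["basename", "text"],
    ["utt001", "hello world"],
    ["utt002", "good morning"],
]
filelist_data = rows_to_dicts(filelist_data_list)
```

Because the dicts are built from the in-memory rows rather than from the file on disk, nothing done to those rows earlier is lost.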

marctessier · 2024-07-25T14:08:39Z

I ran some tests. The list of characters is much shorter now without the duplicates.

e.g. NEW (side note: I manually added ':' for moh):

` TEST_characters: [' ', ':' , a, e, h, i, k, l, m, n, o, r, s, t, w, y, à, á, è, é,
    ì, í, ò, ó, ě, ό, –, ’]
  TEST_phones: [d, d͡z, d͡ʒ, e, èː, é, éː, ě, f, h, i, ìː, í, íː, j, k,
    kʷ, l, m, n, n̥, o, òː, ó, óː, s, t, t͡s, ũ, ũ̀ː, ṹ, ṹː, w, z, ɑ, ɑ̀ː,
    ɑ́, ɑ́ː, ɡ, ɡʷ, ɽ, ɽ̥, ʌ̃, ʌ̃̀ː, ʌ̃́, ʌ̃́ː, ʔ, ό]`

Previous version:

`TEST_characters: [' ', '!', '"', '''', (, ), ',', ., /, ':', ;, '?', a, á, b, e,
    é, h, i, ì, í, k, m, n, o, ó, p, r, s, t, v, w, x, y, a, à, á, b, c, d, e, è,
    é, g, h, i, ì, í, k, l, m, n, o, ò, ó, p, r, s, t, u, v, w, x, y, ' ', á, è, é,
    ì, í, ò, ó, à, á, è, é, ì, í, ò, ó, ě, ό, –, —, ’, “, ”]
  TEST_phones: [b, c, d, d͡z, d͡ʒ, e, èː, é, éː, ě, f, g, h, i, ìː, í, íː, j, k, kʷ,
    l, m, n, n̥, o, òː, ó, óː, p, s, t, t͡s, t͡ʃ, u, ũ, ũ̀ː, ṹ, ṹː, v, w, x, z, ɑ,
    ɑ̀ː, ɑ́, ɑ́ː, ɡ, ɡʷ, ɽ, ɽ̥, ʃ, ʌ̃, ʌ̃̀ː, ʌ̃́, ʌ̃́ː, ʔ, ό]`

Training worked with no issues.
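
As a rough illustration of why the new list is shorter: collecting symbols into a set before sorting guarantees each one appears once. The function below is a hypothetical sketch, not EveryVoice's implementation, and it assumes each utterance has already been split into tokens (a plain string also works, in which case each code point is treated as a token, which would not be appropriate for multi-character phones).

```python
def unique_symbols(token_sequences):
    """Return the sorted, duplicate-free symbols found across all sequences.

    Hypothetical helper: each item in token_sequences is an iterable of
    tokens (characters or phones). A set removes repeats before sorting.
    """
    return sorted({token for tokens in token_sequences for token in tokens})


# Made-up sample input: two character strings and one pre-tokenized phone list.
chars = unique_symbols(["aba", "b c"])
phones = unique_symbols([["d", "e", "d"], ["e", "f"]])
```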

wiitt

Everything looks good to me except the pre-commit part in git workflows which I don't feel confident about.

everyvoice/wizard/dataset.py

.github/workflows/pre-commit.yml

roedoejet requested review from SamuelLarkin and wiitt July 22, 2024 17:01

roedoejet requested a review from MENGZHEGENG July 22, 2024 17:59

SamuelLarkin approved these changes Jul 24, 2024

roedoejet mentioned this pull request Jul 24, 2024

fix: add whitespace collapsing and text stripping by default #518

Merged

roedoejet force-pushed the dev.ap/text-fixes branch 2 times, most recently from a751931 to b83714c Compare July 24, 2024 19:26

roedoejet force-pushed the main branch from 618eb3d to 83742ff Compare July 24, 2024 19:31

wiitt reviewed Jul 26, 2024


roedoejet added 4 commits July 30, 2024 14:25

fix: change shape of filelist list data instead of re-reading it

87ceba1

this allows for any processes that take place on the filelist list data (i.e. normalization) to be preserved fixes #504

fix: remove all punctuation characters from symbol set by default

1c81f94

fixes #484

feat: add model validator for text

7f96d07

validator ensures that symbols are not defined as both punctuation and other symbols fixes #450

fix: remove unnecessary loading of filelist

9004aad
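
The check described in commit 7f96d07 (symbols must not be defined as both punctuation and other symbols) could look something like the sketch below. This is an assumption-laden illustration in plain Python, not the validator added to `everyvoice/config/text_config.py`; the function name and arguments are hypothetical.

```python
def check_symbol_overlap(punctuation, symbols):
    """Raise ValueError if any symbol is declared in both collections.

    Hypothetical stand-in for the config validator: both arguments are
    plain iterables of strings here.
    """
    overlap = set(punctuation) & set(symbols)
    if overlap:
        raise ValueError(
            "Symbols defined as both punctuation and other symbols: "
            f"{sorted(overlap)}"
        )


# Disjoint sets pass silently; an overlapping set raises.
check_symbol_overlap([".", ",", "!"], ["a", "b", "c"])
try:
    check_symbol_overlap([".", ","], ["a", "."])
    overlap_detected = False
except ValueError:
    overlap_detected = True
```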

roedoejet force-pushed the dev.ap/text-fixes branch from b83714c to 9004aad Compare July 30, 2024 21:25

roedoejet added 3 commits July 30, 2024 15:39

fix: add whitespace collapsing and text stripping by default

c6a6da8

fix: check if data is tabular or not before applying text processing

7c8e533

fix(ci): ignore type errors from e2e config

9ab79eb

for the e2e config, we don't want to require contact information declarations on each submodel
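
For reference, the whitespace collapsing and text stripping mentioned in commit c6a6da8 is commonly done along these lines. This is a generic sketch using the standard library, with a hypothetical function name; it is not taken from the EveryVoice code base and the actual behaviour there may differ.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space, then strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample input containing leading/trailing spaces, a tab and a newline.
cleaned = collapse_whitespace("  hello \t\n world  ")
```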

roedoejet merged commit 8fc4099 into main Jul 30, 2024
7 checks passed

roedoejet deleted the dev.ap/text-fixes branch July 30, 2024 22:45
