Test #8
So I guess that because I took some of the unit tests with me, coverage decreased by 4.7%, and that's showing an unsuccessful check. I think this is fine; once we merge this, we can coordinate our unit testing and get our coverage up.
Maybe it's time to scrap LexiconG2P and integrate a better third-party English g2p. I've already run into a case where the English name "Skyler" didn't appear in the lexicon. Or we can just treat every lexical item in there as a find/replace rule, sort them by length like usual, and add a few letter-level rules to sweep up the rest. (E.g. so "SKY" is the longest thing caught there, and then rules for "L" and "ER" pick up the rest.) Aside from this specific question, there's a bigger question of whether the g2p library should remain a single-paradigm library (only find/replace rules) or whether it could develop into a multi-paradigm library (FSTs, HMMs, NNs, etc.). I lean towards the former: I prefer the idea that any g2p mapping could be opened and manipulated in "G2P Studio", and that they all keep a straightforward execution paradigm so that people can understand what they're looking at. Either way, we should keep a loose coupling between the alignment and g2p libraries so that people can integrate existing g2p solutions without refactoring the alignment library, so that they can handle languages like English, Chinese, Japanese, etc. with more sophisticated solutions.
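As an illustration of the longest-match idea (the rules and phone strings below are made up for the example, not taken from the actual lexicon):

```python
# Toy sketch of the "sort rules by length, apply longest first" approach.
# The rule table here is hypothetical; a real one would come from the lexicon.
RULES = {
    "SKY": "s k aj",  # a whole lexicon entry, caught first because it's longest
    "ER": "ɚ",        # letter-level sweep-up rules
    "L": "l",
}

def apply_rules(word: str) -> str:
    """Greedy left-to-right application, trying longer patterns first."""
    ordered = sorted(RULES, key=len, reverse=True)  # longest rules first
    out, i = [], 0
    while i < len(word):
        for pat in ordered:
            if word.startswith(pat, i):
                out.append(RULES[pat])
                i += len(pat)
                break
        else:
            out.append(word[i])  # no rule matched: pass the character through
            i += 1
    return " ".join(out)

print(apply_rules("SKYLER"))  # "SKY" matches first, then "L", then "ER"
```

So "SKYLER" comes out as the "SKY" mapping followed by the outputs of the "L" and "ER" rules, exactly the sweep-up behaviour described above.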
I'm happy either way. I don't know any good, lightweight English g2p libraries off-hand. Any suggestions? We could also use some combination of the LexiconG2P and find/replace rules as a fallback for situations like Skyler. I'll take your lead on this!
Yes, I think this was part of my hesitation around putting in the LexiconG2P.
Agreed
Also agreed. I think I should go back in and make this coupling a little looser for exactly that reason. Thanks!
Hmm well we could use the g2p in Flite, I suppose, though I think it implies downloading an entire en-us voice...
I just tried this (
I lean towards anything we can pip install. And we already use numpy, at least, for the lang_id functionality. (I'm also planning on updating the "unidecode fallback" g2p to something more sophisticated, and I'll need numpy for that as well.) Even so, the Flite install raises another good question: eventually, people will probably want to use G2P solutions that are more heavyweight. We should probably define a REST-y API so that one can serve compatible g2p systems inside containers, rather than end up with a monster package that installs everything, or a forked codebase. Just: "If you want to bring your own G2P, spin up a web service that accepts the following requests." (Even for this g2p library. I can imagine scenarios where someone doesn't want their language-specific rules inside the library, but is fine with other people accessing them as a web service.)
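As a rough sketch of the kind of service being proposed, using only the standard library (the `/g2p` endpoint, query parameters, and JSON shape here are all made up for illustration, not anything the project has defined):

```python
# Minimal sketch of a "bring your own G2P" web service.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Stand-in G2P: a trivial find/replace table. A real service would call
# into whatever g2p engine it wraps (Flite, a neural model, etc.).
RULES = {"sky": "s k aj", "th": "θ"}

def convert(text: str) -> str:
    """Apply the stand-in rules, longest pattern first."""
    for pat in sorted(RULES, key=len, reverse=True):
        text = text.replace(pat, RULES[pat])
    return text

class G2PHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/g2p":
            self.send_error(404)
            return
        text = parse_qs(url.query).get("text", [""])[0]
        body = json.dumps({"input": text, "output": convert(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8080), G2PHandler).serve_forever()
```

Any g2p engine hidden behind the same request/response contract would then be swappable without touching the alignment library.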
Right, it's not in our requirements.txt, so this is a good reminder to add it.
This should be easy enough. We could expose a similar API through the G2P Studio and document it with Swagger; then people could use the G2P Swagger spec to bootstrap their API in whatever language makes sense for their project.
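For illustration, a spec fragment for such a service might look something like this (the endpoint path, parameter names, and response shape are all hypothetical):

```yaml
# Hypothetical OpenAPI fragment for a bring-your-own-G2P service.
openapi: "3.0.0"
info:
  title: G2P web service
  version: "0.1.0"
paths:
  /g2p:
    get:
      summary: Convert graphemes to phonemes
      parameters:
        - name: text
          in: query
          required: true
          schema:
            type: string
        - name: in_lang
          in: query
          required: true
          schema:
            type: string
      responses:
        "200":
          description: The converted string
          content:
            application/json:
              schema:
                type: object
                properties:
                  output:
                    type: string
```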
So, for action items:
Make e2e test suite work with ap test file (coverage=41%)
Hey all,
So this is hopefully the last big exorcism of the g2p functionality from ReadAlong-Studio. The only thing left to go really is the LexiconG2P, and I'm sort of only half-certain I want to bring it over.
We've been documenting modules well with a standard that puts two shebangs at the top:
The docstring is then wrapped with pound signs like:
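For reference, a module header in that style might look something like this (the exact banner shape is an assumption):

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

#######################################################
#
# module_name.py
#
# One-line description of what the module does.
#
#######################################################
```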
I think this is a good standard to keep. All of our method and function docstrings were formatted slightly differently, so I thought we could pick a standard? I put a lot of docstrings in with the NumPy docstring standard format, which I think is a nice balance of readability and features (we can autodoc in Sphinx with them). If anybody has strong feelings, we can change this, but just let me know! You can look at some of the commits here for examples or check it out here. There's a nice extension for VS Code here, but you have to change the setting (Preferences > Settings > Extensions) to use NumPy docstrings (the default is docBlockr).
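For example, a function documented in the NumPy style (the function itself is just an illustration, not code from the repo):

```python
def normalize(text, lower=True):
    """Normalize a token before g2p lookup.

    Parameters
    ----------
    text : str
        The input token.
    lower : bool, optional
        Whether to lowercase the token (default is True).

    Returns
    -------
    str
        The normalized token, stripped of surrounding whitespace.
    """
    text = text.strip()
    return text.lower() if lower else text
```

Sphinx's napoleon extension (or numpydoc) can render these sections directly, which is what makes the format autodoc-friendly.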
By gutting the G2P, you'll notice all the lang files are gone. I'd like to try and get us to enter the lang files into G2P first. Languages get added as folders by ISO code here: https://github.com/roedoejet/g2p/tree/master/g2p/mappings/langs. Then, lookup tables can be added as `xlsx`, `csv`, or `json` files. Configurations for the lookup tables are specified in YAML.

I copied over all of the lang files from ReadAlong-Studio and put them in as-is, but you can continue to add more, and the G2P module now also has a built-in IPA mapping method in its CLI.
So, let's say you just have a map from `atj` to `atj-ipa`, but no automatic mapping from `atj-ipa` to `eng-ipa` yet. After `pip install g2p` you can just type `g2p generate-mapping atj --ipa` and it will put the generated lookup table and its corresponding config file in `g2p/mappings/langs/generated`. Currently you have to then update the cached version of the mappings manually by typing `g2p update`, but this will likely be rolled into the `g2p generate-mapping` command.

Hopefully this is clear; let me know if any of you have any questions, and I'll add @joanise as a reviewer once @dhdaines adds him to the repo.