
Proper switches and voice for TTS of technical documents? #173

Open
pvonmoradi opened this issue Oct 13, 2022 · 0 comments
pvonmoradi commented Oct 13, 2022

What's the best voice and switch combination for synthesis of technical texts (like manuals, books, papers)?
@jnordberg sorry for pinging you; I think you have a good idea about this...

For example, consider the following text from Wikipedia:

Connection to artificial intelligence

Since inception, Lisp was closely connected with the artificial intelligence research community, especially on PDP-10 systems. Lisp was used as the implementation of the language Micro Planner, which was used in the famous AI system SHRDLU. In the 1970s, as AI research spawned commercial offshoots, the performance of existing Lisp systems became a growing issue, as programmers needed to be familiar with the performance ramifications of the various techniques and choices involved in the implementation of Lisp.
Genealogy and variants

Over its sixty-year history, Lisp has spawned many variations on the core theme of an S-expression language. Moreover, each given dialect may have several implementations—for instance, there are more than a dozen implementations of Common Lisp.

Differences between dialects may be quite visible—for instance, Common Lisp uses the keyword defun to name a function, but Scheme uses define.[20] Within a dialect that is standardized, however, conforming implementations support the same core language, but with different extensions and libraries.

Suppose we remove the [x]-style citation markers and other HTML artifacts.

How long should a segment be? Should a segment be limited to a single paragraph, so the system can infer the tone and cadence of the words from the preceding sentences?
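To make the question concrete, here is a rough sketch of the kind of segmenting I have in mind (plain Python, nothing tortoise-specific; the 400-character cap is an arbitrary guess on my part, not something taken from the tortoise docs):

    # Group sentences (one per line, as produced by the pipeline below) into
    # segments that never cross a paragraph boundary and stay under a length cap.
    MAX_CHARS = 400  # arbitrary guess at a comfortable segment size

    def segment(paragraphs):
        # paragraphs: list of lists of sentences
        for sentences in paragraphs:
            current, length = [], 0
            for s in sentences:
                if current and length + len(s) > MAX_CHARS:
                    yield " ".join(current)
                    current, length = [], 0
                current.append(s)
                length += len(s) + 1
            if current:
                yield " ".join(current)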

Is it possible to properly synthesize abbreviations like PDP-11 or AI?
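One workaround I can think of, sketched below, is to spell abbreviations out before synthesis; the replacement table is entirely made up and would need tuning per document:

    import re

    # Hypothetical normalization table: spell out abbreviations the model tends
    # to mispronounce. These entries are examples, not an exhaustive list.
    REPLACEMENTS = {
        r"\bPDP-11\b": "P D P eleven",
        r"\bPDP-10\b": "P D P ten",
        r"\bAI\b": "A I",
    }

    def normalize(text):
        for pattern, spoken in REPLACEMENTS.items():
            text = re.sub(pattern, spoken, text)
        return text

    print(normalize("Lisp was closely connected with AI research on PDP-10 systems."))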

I'm currently using this pipeline in a Unix environment:

# cat source.html | select-paragraphs-by-css | convert-to-plain-text | remove-symbols | NLP-to-sentences
cat source.html | pup 'p' | pandoc --from html --to plain | tr -d '⁰¹²³⁴⁵⁶⁷⁸⁹' | sentences

sentences is an NLP tool that reads text from STDIN and writes one sentence per line. I then feed the resulting text file to the tortoise-tts.py script, setting only the voice (lj), --disable-redaction, and the ultra_fast and fast presets.
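For reference, here is a rough Python equivalent of that preprocessing (assuming beautifulsoup4 is available; the regex splitter is only a crude stand-in for the sentences tool):

    import re
    from bs4 import BeautifulSoup

    SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

    def html_to_sentences(path):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        for p in soup.find_all("p"):                      # select-paragraphs-by-css
            text = p.get_text(" ", strip=True)            # convert-to-plain-text
            text = text.translate({ord(c): None for c in SUPERSCRIPTS})  # remove-symbols
            text = re.sub(r"\[\d+\]", "", text)           # drop [x]-style citation markers
            # crude sentence splitter; the real pipeline uses the sentences NLP tool
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if sentence:
                    yield sentence.strip()

    for line in html_to_sentences("source.html"):
        print(line)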
