In this lab session you will practice styling TTS output using Speech Synthesis Markup Language (SSML) and tuning ASR. It is assumed that you have read the relevant literature on the subject before attempting to solve the assignments.
For reference:
- Speech Synthesis Markup Language (SSML) Version 1.0, W3C Recommendation 7 September 2004, http://www.w3.org/TR/speech-synthesis/
- Azure Text-to-Speech: Docs, Speech Studio (audio content creation)
- Azure Speech-to-Text: Docs, Speech Studio (custom speech)
- Play with Automatic Speech Recognition:
  - Can you think of any names for fictional places, people or objects that are not recognized? (Keep your final project in mind!)
  - If not, can you try any scientific names for plants, animals, geologic terms, etc., or names for classic musical pieces and authors?
  - Did you come across any real locations or people that are also just not picked up?
  - Any specific accent you are using that makes words difficult to process?
  - While you do this, take a look at the confidence score with the help of XState’s Visualizer (or you can log it). How good is it?
  - Think about how this problem could be solved. Why do you think recognition falters for the examples that you tried?
- To solve the problem you will use Custom Speech:
  - You will have to provide data, either plain text or audio files, to help the recognition process.
  - Train and deploy your model (enable content logging). Note the Endpoint ID.
- To test your model:
  - Upgrade SpeechState: `yarn up speechstate` (you will see the version 2.0.0-beta.5 or higher).
  - Create a file `dm3.js` which implements a very basic ASR test (analogous to `dm.js` in this repository). Add the following to your `settings` object: `speechRecognitionEndpointId: "paste your Endpoint ID here",`
  - Now you can test your new ASR model! You will be able to download the log files for your model in the Custom Speech interface.
- Write a report:
  - Write a report (max 1 page) describing which new words are now supported and can be tested. The report should contain your Endpoint ID.
  - Add your report to the repository in PDF format (`report-lab3.pdf`).
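
For the ASR test in `dm3.js`, the two steps that are easiest to get wrong are adding the endpoint ID to the settings and reading the confidence score. The sketch below illustrates both under stated assumptions: only `speechRecognitionEndpointId` comes from the instructions above. The helper names, the `locale` field, and the shape of the hypothesis list (`[{ utterance, confidence }]`) are hypothetical and should be checked against `dm.js` and the SpeechState version you have installed.

```javascript
// Hypothetical helpers for dm3.js. Only the key `speechRecognitionEndpointId`
// is taken from the lab text; everything else is an illustrative assumption.

// Return a copy of the settings object with the Custom Speech endpoint added.
function withCustomEndpoint(baseSettings, endpointId) {
  if (typeof endpointId !== "string" || endpointId.trim() === "") {
    throw new Error("Endpoint ID must be a non-empty string");
  }
  return { ...baseSettings, speechRecognitionEndpointId: endpointId.trim() };
}

// Format the top hypothesis and flag it when its confidence is below a threshold.
// Assumes hypotheses look like [{ utterance: string, confidence: number }].
function describeTopHypothesis(hypotheses, threshold = 0.6) {
  if (!Array.isArray(hypotheses) || hypotheses.length === 0) {
    return "no hypothesis";
  }
  const { utterance, confidence } = hypotheses[0];
  const flag = confidence < threshold ? " (low confidence)" : "";
  return `"${utterance}" @ ${confidence.toFixed(2)}${flag}`;
}

const settings = withCustomEndpoint(
  { locale: "en-US" }, // assumed field; copy the real settings from dm.js
  "paste your Endpoint ID here"
);
console.log(settings.speechRecognitionEndpointId);
console.log(describeTopHypothesis([{ utterance: "Tatooine", confidence: 0.42 }]));
```

Logging the same words before and after deploying the custom model gives you a simple before/after comparison of confidence scores for the report. The 0.6 threshold is arbitrary and only serves to make low scores stand out in the log.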
A poetry slam is a competition at which poets read or recite original work (or, more rarely, that of others). These performances are then judged on a numeric scale by previously selected members of the audience. (Wikipedia)
Your task in this assignment is to use SSML in Azure Audio Content Creation to get an artificial poet to recite your favourite poem (just a couple of verses) at a speed and in “a style” similar to the way it is read by an actor (or by the poet her/himself).
You can refer to some poetry performance found on YouTube or elsewhere.
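
If you prefer to draft the markup outside the editor, the per-line speed, pitch and pause settings can be generated programmatically and then pasted into Audio Content Creation. The following is a minimal sketch, not a required part of the lab: the function and option names are hypothetical, and the voice `en-US-JennyNeural` and the style `sad` are assumptions that may not suit your poem or be available for your chosen voice. It uses the standard `prosody` and `break` elements together with Azure's `mstts:express-as` extension.

```javascript
// Escape characters that are special in XML so poem text cannot break the markup.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an SSML document from verse lines.
// Each line is { text, rate?, pitch?, pauseMs? }; these option names are hypothetical.
function buildPoemSsml(lines, { voice = "en-US-JennyNeural", style = "sad", lang = "en-US" } = {}) {
  const body = lines
    .map(({ text, rate = "-10%", pitch = "default", pauseMs = 400 }) =>
      `      <prosody rate="${rate}" pitch="${pitch}">${escapeXml(text)}</prosody>` +
      `<break time="${pauseMs}ms"/>`)
    .join("\n");
  return [
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"`,
    `       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="${lang}">`,
    `  <voice name="${voice}">`,
    `    <mstts:express-as style="${style}">`,
    body,
    `    </mstts:express-as>`,
    `  </voice>`,
    `</speak>`,
  ].join("\n");
}

// Example: slow the first line down and lengthen the pause after it.
console.log(buildPoemSsml([
  { text: "Two roads diverged in a yellow wood,", rate: "-15%", pauseMs: 600 },
  { text: "And sorry I could not travel both", pitch: "low" },
]));
```

Listening to the reference performance line by line and adjusting `rate` and `pauseMs` for each verse is usually easier than tuning one global setting.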
Sources for inspiration:
- California Dreaming (386DX art project).
- Without Me, which was made by Robert Rhys Thomas in 2019 for this course.
- Bad Guy, which was made by Fang Yuan in 2020 for this course.
In your submission provide:
- report for Part A
- text file with your SSML code (`Code/lab3.txt`); at the beginning of the file include a reference to the original performance.
- audio file for Part B (`Code/lab3.mp3`)
These files can be placed in your GitHub repository.
- Commit your changes and push them to your repository (your fork of this repository)
- We will see the changes in the same pull request that you used for Lab 2.
- On Canvas, submit the pull request URL.