Create a new Python 3.6 Notebook in Azure Notebooks. Next, create a Speech API key in the Azure Portal. We'll be using West Europe as the region in this example. You can find your API key under the service, under Keys.
You can use this test.wav file for testing.
First, we need to request a token from the Issue Token endpoint of the Speech API. Each token is valid for 10 minutes, so we can either reuse it multiple times (to minimize network traffic and latency) or request a new one for each call:
import requests, json
import IPython.display as ipd
api_key = "xxxx" # Enter your API key here
token_url = "https://westeurope.api.cognitive.microsoft.com/sts/v1.0/issuetoken"
headers = {'Ocp-Apim-Subscription-Key': api_key}
response = requests.post(token_url, headers=headers)
token = response.text
print("Token: " + token)
Once we have the token, we can form our request for generating speech:
url = "https://westeurope.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {'Authorization': 'Bearer ' + token,
'Content-Type': 'application/ssml+xml',
'User-Agent': 'Test',
'X-Microsoft-OutputFormat': 'riff-16khz-16bit-mono-pcm'}
data = "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'> \
<voice name='Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'> \
Hello, welcome to the Cognitive Services Bootcamp! \
</voice></speak>"
response = requests.post(url, headers=headers, data=data)
audio_data = response.content
print(response.headers)
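Before writing out the audio, it can be worth checking that the request actually succeeded; a minimal check (using the response variable from above) could be:

if response.status_code != 200:
    # A 401 here usually means the token has expired or the key/region does not match
    print("Request failed:", response.status_code, response.text)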
We can write the response content out to a *.wav file and then download or play it:
with open("test.wav", "wb") as f:
    f.write(audio_data)

ipd.Audio('test.wav')
There are many different voices available to choose from. By updating the XML request, we can easily specify a different voice or language. From here on, it should be easy to generate German speech.
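For example, a German version of the request might look like the following sketch (the voice name below is one of the standard de-DE voices; check the voice list for your region if it is not available):

data = "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='de-DE'> \
        <voice name='Microsoft Server Speech Text to Speech Voice (de-DE, Hedda)'> \
        Hallo und herzlich willkommen zum Cognitive Services Bootcamp! \
        </voice></speak>"
# Encode explicitly in case the text contains umlauts or other non-ASCII characters
response = requests.post(url, headers=headers, data=data.encode('utf-8'))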
Let's take the generated or provided test.wav from the previous example and convert it back to text. Again, let's first create a token:
import requests, json
api_key = "xxx" # Enter your API key here
token_url = "https://westeurope.api.cognitive.microsoft.com/sts/v1.0/issuetoken"
headers = {'Ocp-Apim-Subscription-Key': api_key}
response = requests.post(token_url, headers=headers)
token = response.text
print("Token: " + token)
Now that we have a token, we can call the speech-to-text endpoint and include the wav data:
url = "https://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {'Authorization': 'Bearer ' + token,
'Accept': 'application/json',
'Ocp-Apim-Subscription-Key': api_key,
'Content-Type': 'audio/wav; codec=audio/pcm; samplerate=16000'}
params = {'language': 'en-US', 'format': 'detailed'}
with open("test.wav", 'rb') as f:
    data = f.read()
response = requests.post(url, headers=headers, params=params, data=data)
print(json.dumps(response.json(), indent=2))
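With the format set to detailed, the response typically contains a RecognitionStatus field and an NBest list of alternatives; a small sketch to extract the top hypothesis (assuming that response shape) could look like this:

result = response.json()
if result.get("RecognitionStatus") == "Success":
    # The first NBest entry is the highest-confidence alternative
    print("Recognized text:", result["NBest"][0]["Display"])
else:
    print("Recognition failed:", result.get("RecognitionStatus"))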
For recognizing longer audio with multiple sentences, you can follow this tutorial.
Note:
As of May 2019, compressed audio (e.g., MP3) is also supported; see here. Besides that, the speech-to-text API expects audio with the following properties:
- 16-bit WAV format with PCM, or OGG format with OPUS
- Single channel (mono) at 8 or 16 kHz
For more details, see here.
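If you are unsure whether your audio file meets these requirements, Python's built-in wave module can inspect a WAV file:

import wave

with wave.open("test.wav", "rb") as w:
    print("Channels:   ", w.getnchannels())      # should be 1 (mono)
    print("Sample rate:", w.getframerate())      # should be 8000 or 16000
    print("Bit depth:  ", w.getsampwidth() * 8)  # should be 16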
Now that we've converted the user's speech into text, we can detect the intent of the text in the next challenge!