IF YOU FIND THIS REPOSITORY HELPFUL, PLEASE CONSIDER STARRING IT.
This microservice provides the same API as OpenAI's API.
The authors claim that Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
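The table can be turned into a small lookup when a model has to be chosen automatically. The following is a minimal sketch, not part of the service: the function name `pick_model` and the selection rule (largest model whose approximate VRAM requirement fits) are assumptions made for illustration, and the VRAM figures are the approximate values from the table.

```python
# Values copied from the table above: (multilingual model name, approximate VRAM in GB).
# The list is ordered from the smallest to the largest model.
MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(vram_gb, english_only=False):
    """Return the largest model whose approximate VRAM need fits, or None.

    Hypothetical helper: the service itself does not expose such a function.
    """
    chosen = None
    for name, required_gb in MODELS:
        if required_gb <= vram_gb:
            chosen = name
    if chosen is None:
        return None
    # The table lists no English-only variant for the large model.
    if english_only and chosen != "large":
        return chosen + ".en"
    return chosen
```

For example, `pick_model(4)` selects `small`, because `medium` needs about 5 GB.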
Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the large-v2 model; a lower WER is better. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in the paper.
https://github.com/openai/whisper
git clone https://github.com/matusstas/openai-whisper-microservice.git
cd openai-whisper-microservice
docker-compose up -d --build
- repository
https://hub.docker.com/r/matusstas/openai-whisper-microservice
- latest
docker pull matusstas/openai-whisper-microservice:latest
- 1.0.0
docker pull matusstas/openai-whisper-microservice:1.0.0
After the container starts successfully, the Swagger documentation is available at http://localhost:80/docs
The API currently provides 12 endpoints, which are divided into 4 groups according to their use.
GET /models-available
Return a list of all available Whisper ASR models.
[
"tiny.en",
"tiny",
"base.en",
"base",
"small.en",
"small",
"medium.en",
"medium",
"large-v1",
"large-v2",
"large"
]
GET /models-downloading
Return a list of all Whisper ASR models that are currently being downloaded.
[
"tiny.en"
]
GET /models-downloaded
Return a list of all downloaded Whisper ASR models.
[
"tiny.en",
"tiny"
]
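The three list endpoints above can be called with any HTTP client. Below is a minimal sketch using only the Python standard library. It assumes the service is reachable at http://localhost:80 (the same address as the Swagger documentation); the helper names `build_request`, `fetch_json`, and `not_yet_downloaded` are hypothetical and not part of the service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:80"  # assumption: same host and port as the Swagger docs


def build_request(path, base_url=BASE_URL):
    """Build a GET request for one of the list endpoints, e.g. /models-available."""
    return urllib.request.Request(base_url.rstrip("/") + path, method="GET")


def fetch_json(path, base_url=BASE_URL):
    """Send the request and decode the JSON body (needs a running container)."""
    with urllib.request.urlopen(build_request(path, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


def not_yet_downloaded(available, downloaded):
    """Return the models the service offers that are not stored locally yet."""
    local = set(downloaded)
    return [name for name in available if name not in local]
```

With a running container, `not_yet_downloaded(fetch_json("/models-available"), fetch_json("/models-downloaded"))` would list the models that still have to be downloaded.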
POST /model/{model_name}
Download a Whisper ASR model using a background task.
model_name
- required
- string (path)
- Name of the model
{
"detail": "Model is being downloaded"
}
{
"detail": "Invalid model"
}
{
"detail": "Model not downloaded yet"
}
{
"detail": "Model already exist"
}
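Because the download runs as a background task, the request returns before the model is ready to use. One way to wait for it is to poll GET /models-downloaded until the model name appears. The sketch below shows only the polling logic: `wait_until_downloaded` is a hypothetical helper, `list_downloaded` is assumed to be a callable that returns the parsed response of GET /models-downloaded, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_downloaded(model_name, list_downloaded,
                          timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Poll until model_name appears in the downloaded list.

    Returns True once the model is listed, or False when roughly
    timeout_s seconds of waiting have passed without success.
    """
    waited = 0.0
    while True:
        if model_name in list_downloaded():
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)  # injectable so that the loop can be tested without waiting
        waited += poll_s
```

Passing the HTTP call in as a callable keeps the loop independent of the HTTP client in use.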
GET /model/{model_name}
Return a Whisper ASR model.
model_name
- required
- string (path)
- Name of the model
JSON RESPONSE IS TOO LONG TO DISPLAY
{
"detail": "Model not downloaded yet"
}
{
"detail": "Model not found"
}
DELETE /model/{model_name}
Delete a downloaded Whisper ASR model.
model_name
- required
- string (path)
- Name of the model
{
"detail": "Model was deleted"
}
{
"detail": "Model not downloaded yet"
}
{
"detail": "Model not found"
}
POST /model/{model_name}/language
Return all detected languages with their scores, sorted from the highest score to the lowest.
model_name
- required
- string (path)
- Name of the model
file
- required
- string ($binary)
- Chosen audio file
{
"en": 0.38421738147735596,
"cy": 0.2614089846611023,
"zh": 0.10288530588150024,
"nn": 0.04161091521382332,
"ko": 0.03617018833756447,
...
"uz": 9.862478833611021e-9
}
{
"detail": "Model not multilingual"
}
{
"detail": "Model not downloaded yet"
}
{
"detail": "Model not found"
}
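The successful response maps each language code to a score, so a client usually only needs the first few entries. A minimal sketch of that post-processing follows; `top_languages` and its `min_score` parameter are hypothetical names, and the sample scores are copied from the truncated example response above.

```python
# First entries of the example response shown above.
SAMPLE_SCORES = {
    "en": 0.38421738147735596,
    "cy": 0.2614089846611023,
    "zh": 0.10288530588150024,
    "nn": 0.04161091521382332,
    "ko": 0.03617018833756447,
}


def top_languages(scores, n=3, min_score=0.0):
    """Return up to n (language_code, score) pairs, highest score first."""
    # Sort explicitly instead of relying on the key order of the JSON object.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(code, score) for code, score in ranked if score >= min_score][:n]
```

Calling `top_languages(SAMPLE_SCORES, n=1)` yields the single most likely language together with its score.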
POST /model/{model_name}/transcript
Transcribe audio with a Whisper ASR model.
model_name
- required
- string (path)
- Name of the model
task
- required
- string
- Task: [transcribe, translate]
language_code
- required
- string
- Language code: [af, am, ar, as, az, ..., zh]
media_type
- required
- string
- Media type: [application/json, text/plain]
format
- required
- string
- Output format: [json, srt, tsv, txt, vtt]
file
- required
- string ($binary)
- Chosen audio file
{
"text": " I found that nothing in life is worthwhile unless you take risks. Nothing. Nelson Mandela said, there is no passion to be found playing small and settling for a life that's less than the one you're capable of living. Now I'm sure in your experiences in school and applying to college and...",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0,
"end": 7,
"text": " I found that nothing in life is worthwhile unless you take risks.",
"tokens": [
50363,
314,
1043,
326,
2147,
287,
1204,
318,
24769,
4556,
345,
1011,
7476,
13,
50713
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 1,
"seek": 0,
"start": 7,
"end": 9,
"text": " Nothing.",
"tokens": [
50713,
10528,
13,
50813
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 2,
"seek": 0,
"start": 9,
"end": 15,
"text": " Nelson Mandela said, there is no passion to be found playing small",
"tokens": [
50813,
12996,
40233,
531,
11,
612,
318,
645,
7506,
284,
307,
1043,
2712,
1402,
51113
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 3,
"seek": 0,
"start": 15,
"end": 20,
"text": " and settling for a life that's less than the one you're capable of living.",
"tokens": [
51113,
290,
25446,
329,
257,
1204,
326,
338,
1342,
621,
262,
530,
345,
821,
6007,
286,
2877,
13,
51363
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 4,
"seek": 0,
"start": 20,
"end": 24,
"text": " Now I'm sure in your experiences in school and applying to college and...",
"tokens": [
51363,
2735,
314,
1101,
1654,
287,
534,
6461,
287,
1524,
290,
11524,
284,
4152,
290,
986,
51563
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
}
],
"language": "en"
}
{
"detail": "Model not downloaded yet"
}
{
"detail": "Model not found"
}
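The JSON response carries per-segment timing and a no_speech_prob value, which makes simple post-processing straightforward. The sketch below turns the segments into timestamped lines and skips segments that are probably silence. The function names and the 0.5 threshold are assumptions chosen for illustration; only the field names (segments, start, end, text, no_speech_prob) come from the example response above.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(result, max_no_speech_prob=0.5):
    """Return one '[start - end] text' line per segment of a transcript response.

    Segments whose no_speech_prob exceeds the threshold are skipped.
    The 0.5 default is an arbitrary example value.
    """
    lines = []
    for segment in result["segments"]:
        if segment["no_speech_prob"] > max_no_speech_prob:
            continue
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines
```

When only subtitles are needed, requesting the srt or vtt output format from the endpoint avoids this step entirely.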
GET /languages
Return all available languages.
{
"af": "afrikaans",
"am": "amharic",
"ar": "arabic",
"as": "assamese",
"az": "azerbaijani",
...
"zh": "chinese"
}
GET /language/{language_code}
Return the English name of the language for the given language code.
"italian"
{
"detail": "Language not found"
}
GET /cuda
Check whether a GPU with CUDA support is available on the current system. Return a boolean value.
true
GET /language/{language_code}
Return a message from the container to confirm that it is running.
{
"detail": "Whisper API is running"
}
To avoid losing important data, we added volumes. They are configured in the docker-compose.yml file.
All downloaded Whisper ASR models are stored in the /root/.cache/whisper folder. These are the models we work with.
Because I also provide management of individual models (download or delete), I needed a simple database to track them. I use TinyDB (a document-oriented database written in pure Python with no external dependencies), whose data file models.json is stored in the db folder.
For example, the data in the file might look like this:
{
"_default":{
"1":{
"name":"tiny",
"downloaded":false
},
"2":{
"name":"tiny.en",
"downloaded":true
}
}
}
- downloaded: true = Whisper ASR model was downloaded
- downloaded: false = Whisper ASR model is still downloading
- GPU support (I'm working on it right now)
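TinyDB's default storage writes plain JSON, with records kept under the default table name `_default`, so the models.json layout shown above can be inspected without TinyDB itself. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the file has exactly the structure of the example above.

```python
import json


def _entries(db_text):
    """Parse a models.json document and return the records of the default table."""
    data = json.loads(db_text)
    return list(data.get("_default", {}).values())


def downloaded_models(db_text):
    """Names of models whose download has finished (downloaded: true)."""
    return sorted(e["name"] for e in _entries(db_text) if e.get("downloaded"))


def downloading_models(db_text):
    """Names of models that are still downloading (downloaded: false)."""
    return sorted(e["name"] for e in _entries(db_text) if not e.get("downloaded"))
```

The file content can be read with `open("db/models.json").read()`, assuming the db folder is mounted in the working directory.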