The transcription service is the API for requesting transcriptions.
The service allows you to:
- Request asynchronous transcriptions from a variety of audio or video file formats.
- Specify transcription subtasks such as diarization and punctuation.
- Follow transcription task state and progress.
- Automatically store transcription results in a database (so that the same audio file does not have to be transcribed multiple times).
- Fetch transcription results with different formats and options.
To use the transcription service you must have at least:
- One or multiple instances of linto-stt running and configured with the same SERVICE_NAME (LANGUAGE must be compatible).
- A Redis broker running at SERVICES_BROKER.
- A MongoDB instance running at MONGO_HOST:MONGO_PORT.
Optionally, for diarization or punctuation the following are needed:
- One or multiple instances of linto-diarization-worker > 1.2.0 for speaker diarization, configured on the same service broker (LANGUAGE must be compatible).
- One or multiple instances of linto-punctuation-worker > 1.2.0 for text punctuation, configured on the same service broker (LANGUAGE must be compatible).
To share audio files across the different services, they must be configured with the same shared volume RESSOURCE_FOLDER.
1- First build the image:
cd linto-transcription-service &&
docker build . -t transcription_service
2- Create and fill the .env
cp .envdefault .env
Fill the .env with the values described in Environment Variables below.
3- Launch a container:
docker run --rm -it -p $SERVING_PORT:80 \
-v $YOUR_SHARED_FOLDER:/opt/audio \
--env-file .env \
--name my_transcription_api \
transcription_service \
/bin/bash
Replace SERVING_PORT and YOUR_SHARED_FOLDER with your values.
1- Create and fill the .env
cp .envdefault .env
Fill the .env with the values described in Environment Variables below.
2- Compose
docker-compose up
Env variable | Description | Example |
---|---|---|
LANGUAGE | Language code (BCP-47 code) used for text normalization (digits to words, punctuation normalization, ...) | fr-FR |
KEEP_AUDIO | Whether audio files are kept after the request | 1 (true), 0 (false) |
CONCURRENCY | Number of workers (default 10) | 10 |
SERVICE_NAME | STT service name, used to connect to the proper Redis channel and MongoDB collection | my_stt_service |
SERVICES_BROKER | Message broker address | redis://broker_address:6379 |
BROKER_PASS | Broker password | Password |
MONGO_HOST | MongoDB results host | my-mongo-service |
MONGO_PORT | MongoDB results port | 27017 |
RESOLVE_POLICY | Subservice resolve policy (default ANY) * | ANY, DEFAULT, STRICT |
<SERVICE_TYPE>_DEFAULT | Default serviceName for subtask <SERVICE_TYPE> * | punctuation-1 |
*: See Subservice resolution
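As an illustration of the table above, the following Python sketch checks a set of environment values before launching the service. It is not part of the project: the function name check_env is hypothetical, and treating SERVICE_NAME, SERVICES_BROKER, MONGO_HOST and MONGO_PORT as mandatory is an assumption drawn from the requirements listed earlier.

```python
# Assumption: these four variables are treated as mandatory in this sketch.
REQUIRED_VARS = ("SERVICE_NAME", "SERVICES_BROKER", "MONGO_HOST", "MONGO_PORT")
RESOLVE_POLICIES = ("ANY", "DEFAULT", "STRICT")


def check_env(env):
    """Return a list of human-readable problems found in `env` (a dict-like)."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # RESOLVE_POLICY defaults to ANY when it is not declared.
    policy = env.get("RESOLVE_POLICY", "ANY")
    if policy not in RESOLVE_POLICIES:
        problems.append(f"RESOLVE_POLICY must be one of {RESOLVE_POLICIES}, got {policy!r}")
    port = env.get("MONGO_PORT", "")
    if port and not port.isdigit():
        problems.append(f"MONGO_PORT must be an integer, got {port!r}")
    return problems


example_env = {
    "SERVICE_NAME": "my_stt_service",
    "SERVICES_BROKER": "redis://broker_address:6379",
    "MONGO_HOST": "my-mongo-service",
    "MONGO_PORT": "27017",
}
```

Calling check_env(os.environ) after loading the .env file would list the missing or malformed entries, if any.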
The transcription service offers a REST API to submit transcription requests.
The transcription service revolves around 2 concepts:
- Asynchronous jobs identified with job_id: A job_id represents an ongoing transcription task.
- Transcription results identified by result_id.
A typical transcription process follows these steps:
- Submit your file and the transcription configuration on /transcribe. The route returns a 201 with the job_id.
- Use the /job/{job_id} route to follow the job's progress. When the job is finished, you'll be greeted with a 201 alongside a result_id.
- Fetch the transcription result using the /results/{result_id} route, specifying your desired format and options.
The /list-services GET route fetches the available sub-services for transcription.
It returns a JSON object containing the list of deployed services indexed by service type. The services listed are filtered using the configured LANGUAGE parameter.
{
"diarization": [ # Service type
{
"service_name": "diarization-1", # Service name. Used as parameter in transcription config to call this specific service.
"service_type": "diarization", # Service type
"service_language": "*", # Supported language
"queue_name": "diarization-queue", # Celery queue used by this service
"info": "A diarization service", # Information about the service.
"instances": [ # Instances of this specific service.
{
"host_name": "feb42aacd8ad", # Instance unique id
"last_alive": 1665996709, # Last heartbeat
"version": "1.2.0", # Service version
"concurrency": 1 # Concurrency of the instance
}
]
}
],
"punctuation": [
{
"service_name": "punctuation-1",
"service_type": "punctuation",
"service_language": "fr-FR",
"queue_name": "punctuation-queue",
"info": "A punctuation service",
"instances": [
{
"host_name": "b0e9e24349a9",
"last_alive": 1665996709,
"version": "1.2.0",
"concurrency": 1
}
]
}
]
}
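The response above can be filtered client-side to find the service names usable in a transcription config. The sketch below is only an illustration: the helper service_names is hypothetical, and it assumes that a service_language of "*" matches every language and that any other value must match exactly.

```python
# Abbreviated copy of the response shown above.
listing = {
    "diarization": [
        {
            "service_name": "diarization-1",
            "service_type": "diarization",
            "service_language": "*",
            "instances": [{"host_name": "feb42aacd8ad", "version": "1.2.0"}],
        }
    ],
    "punctuation": [
        {
            "service_name": "punctuation-1",
            "service_type": "punctuation",
            "service_language": "fr-FR",
            "instances": [{"host_name": "b0e9e24349a9", "version": "1.2.0"}],
        }
    ],
}


def service_names(listing, service_type, language):
    """Return the names of services of `service_type` usable for `language`.

    Assumption: "*" matches any language; otherwise an exact match is required.
    Services without any live instance are skipped.
    """
    names = []
    for service in listing.get(service_type, []):
        if service["service_language"] in ("*", language) and service["instances"]:
            names.append(service["service_name"])
    return names
```

A name returned by this helper can then be passed as serviceName in the corresponding subtask config.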
Subservice resolution is the mechanism that lets the transcription service select the proper optional subservice, such as diarization or punctuation prediction. Resolution is applied when no serviceName is passed in the subtask config.
There are 3 policies to resolve service names:
- ANY: Use any compatible subservice.
- DEFAULT: Use the service's default subservice (must be declared).
- STRICT: If the service is not specified, raise an error.
Resolve policy is declared at launch using the RESOLVE_POLICY environment variable: ANY | DEFAULT | STRICT (default ANY).
Default service names must be declared at launch using <SERVICE_TYPE>_DEFAULT. E.g., if the default punctuation subservice is "punctuation-1", set PUNCTUATION_DEFAULT=punctuation-1.
Language compatibility
A subservice is compatible if its language(s) is(are) compatible with the transcription-service language.
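The three policies can be sketched as a small Python function. This is a reading of the description above, not the service's actual code: the function name resolve_service, picking the first compatible service under ANY, and raising an error when an explicitly requested service is unavailable are all assumptions.

```python
def resolve_service(policy, requested, default, compatible):
    """Pick a subservice name following the ANY / DEFAULT / STRICT policies.

    policy:     "ANY", "DEFAULT" or "STRICT" (value of RESOLVE_POLICY)
    requested:  serviceName from the subtask config, or None
    default:    value of <SERVICE_TYPE>_DEFAULT, or None
    compatible: names of deployed, language-compatible subservices
    """
    # An explicit serviceName bypasses resolution entirely.
    if requested is not None:
        if requested not in compatible:
            raise LookupError(f"service {requested!r} is not available")
        return requested
    if policy == "ANY":
        if not compatible:
            raise LookupError("no compatible service available")
        return compatible[0]  # assumption: first compatible service wins
    if policy == "DEFAULT":
        if default is None:
            raise LookupError("no default service declared")
        return default
    if policy == "STRICT":
        raise LookupError("serviceName must be specified under the STRICT policy")
    raise ValueError(f"unknown policy {policy!r}")
```

For example, with RESOLVE_POLICY=DEFAULT and PUNCTUATION_DEFAULT=punctuation-1, a request that enables punctuation without a serviceName would resolve to punctuation-1.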
The /transcribe route accepts POST requests containing an audio file.
The route accepts multipart/form-data requests.
Response format can be application/json or text/plain as specified in the accept field of the header.
Form Parameter | Description | Value |
---|---|---|
transcriptionConfig | (optional object) A transcription configuration describing transcription parameters, in JSON format | See Transcription configuration |
force_sync | (optional boolean, default=false) If true, do a synchronous request | true, false, null |
If the request is accepted, the answer should be 201 with a JSON or text response containing the jobid.
With accept: application/json
{"jobid" : "the-job-id"}
With accept: text/plain
the-job-id
If the force_sync flag is set to true, the request returns a 200 with the transcription (see Transcription Results) using the same accept options as the /results/{result_id} route.
The use of force_sync for big files is not recommended as it blocks a worker for the duration of the transcription.
Additionally, a timestamps file containing the timestamps of the segments to transcribe can be uploaded alongside the audio file. Timestamps files are text files containing one segment per line, with an optional speakerid, such as:
# start stop [speakerid]
0.0 7.05 1
7.05 13.0
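To make the timestamps format concrete, here is a minimal Python parser for it. It is an illustrative sketch rather than the service's own parser: the name parse_timestamps is hypothetical, and treating lines that start with "#" as comments is an assumption based on the example above.

```python
def parse_timestamps(text):
    """Parse 'start stop [speakerid]' lines into (start, stop, speaker) tuples."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 'start stop [speakerid]', got {line!r}")
        start, stop = float(fields[0]), float(fields[1])
        if stop <= start:
            raise ValueError(f"segment must end after it starts: {line!r}")
        speaker = fields[2] if len(fields) == 3 else None
        segments.append((start, stop, speaker))
    return segments


sample = "# start stop [speakerid]\n0.0 7.05 1\n7.05 13.0\n"
```

Running such a check locally before uploading can catch malformed lines early.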
The transcription config describes the transcription input parameters and flags of the request. It lets you set:
- Target language for the transcript,
- Voice Activity Detection (VAD) parameters,
- Diarization parameters,
- Punctuation parameters.
It is structured as follows:
{
"language": "fr-FR", # Target language for the transcript (default: null).
"vadConfig": {
"enableVad": true, # Enables Voice Activity Detection (default: true).
"methodName": "WebRTC", # VAD method (default: WebRTC).
"minDuration": 30, # Minimum duration of a speech segment (default: 0).
"maxDuration": 1200 # Maximum duration of a speech segment (default: 1200).
},
"diarizationConfig": {
"enableDiarization": true, # Enables speaker diarization or not (default: false).
"numberOfSpeaker": null, # If set, forces number of speakers.
"maxNumberOfSpeaker": 50, # If set and `numberOfSpeaker` is not, limit the maximum number of speakers.
"speakerIdentification": null, # Names of speakers to identify (depends on the installation of the diarization worker).
"serviceName": null # Force serviceName (See Subservice resolution).
},
"punctuationConfig": {
"enablePunctuation": false, # Applies punctuation or not (default: false).
"serviceName": null # Force serviceName (See SubService resolution).
}
}
serviceName can be set to use a specific subservice version. The available services are listed by /list-services.
The target language can be "*" for automatic language detection, or one of the usual tags to describe a language ("fr", "fr-FR", "French"; see https://github.com/linto-ai/linto-stt/tree/master/whisper#language).
Note that the role of this parameter is different from the role of the env variable LANGUAGE, which is used for text normalization (and limited to BCP-47 codes).
To enable speaker identification, the speakerIdentification field of the diarization configuration can be set to the wildcard "*" to enable all speakers, or to a list of speaker names in JSON format (example: "["John Doe", "Bob"]").
The diarization worker must have been set up so that all speaker names can be matched to a set of speech samples.
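The configuration described above can be assembled in Python and serialized for the transcriptionConfig form field. This is a minimal sketch: the helper build_transcription_config is hypothetical, it only uses keys shown in the structure above, and it assumes that omitted sections (such as vadConfig) fall back to the documented defaults.

```python
import json


def build_transcription_config(language=None, diarization=False, punctuation=False):
    """Return the value of the transcriptionConfig form field as a JSON string."""
    config = {
        # "*" requests automatic language detection.
        "language": language,
        "diarizationConfig": {
            "enableDiarization": diarization,
            "serviceName": None,  # None leaves the choice to subservice resolution
        },
        "punctuationConfig": {
            "enablePunctuation": punctuation,
            "serviceName": None,
        },
    }
    return json.dumps(config)


config_json = build_transcription_config(language="*", diarization=True)
```

The resulting string can be sent as the transcriptionConfig part of the multipart/form-data request to /transcribe.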
The /job/{jobid} GET route allows you to get the state of the given transcription job.
Response format is application/json.
- If the job state is started, it returns a code 102 with information on the progress.
- If the job state is done, it returns a code 201 with the result_id.
- If the job state is pending, it returns a code 404. Pending can mean 2 things: a transcription worker is not yet available or the jobid does not exist.
- If the job state is failed, it returns a code 400.
{
#Task pending or wrong jobid: 404
{"state": "pending"}
#Task started: 102
{"state": "started", "progress": {"current": 1, "total": 3, "step": "Transcription (75%)"}}
#Task completed: 201
{"state": "done", "result_id" : "result_id"}
#Task failed: 400
{"state": "failed", "reason": "Something went wrong"}
}
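A client typically polls this route until the job is done or failed. The sketch below shows one way to do it in Python. It is not an official client: wait_for_result and its fetch_job argument are hypothetical names, and fetch_job stands for any function that performs GET /job/{jobid} and returns the status code with the decoded JSON body. The loop reads the state field of the body rather than the status code.

```python
import time


def wait_for_result(fetch_job, job_id, interval=1.0, max_polls=600):
    """Poll a job until it is done and return its result_id.

    fetch_job(job_id) must return (status_code, body_dict).
    """
    for _ in range(max_polls):
        _status, body = fetch_job(job_id)
        state = body.get("state")
        if state == "done":
            return body["result_id"]
        if state == "failed":
            raise RuntimeError(body.get("reason", "job failed"))
        # "started" and "pending" both mean: wait and ask again.
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Canned responses standing in for real HTTP calls.
_responses = iter([
    (404, {"state": "pending"}),
    (102, {"state": "started", "progress": {"current": 1, "total": 3}}),
    (201, {"state": "done", "result_id": "result_id"}),
])
result_id = wait_for_result(lambda job_id: next(_responses), "the-job-id", interval=0)
```

Because a pending state can also mean that the jobid does not exist, bounding the number of polls avoids waiting forever on a wrong identifier.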
The /results/{result_id} GET route allows you to fetch the transcription result associated with a result_id.
The accept header specifies the format of the result:
application/json returns the complete result as a JSON object:
{
"raw_transcription": "bonjour est-ce que vous allez bien mais oui et vous", # Raw transcription
"transcription_result": "spk1: Bonjour ! Est-ce que vous allez bien ?\nspk2: Mais oui et vous ?", # Final transcription
"language": "fr", # Overall transcription language (specified in the request, or predominantly detected in the audio)
"confidence": 0.9, # Overall transcription confidence score
"segments": [ # Speech segments, each representing continuous speech by a single speaker
{
"raw_segment": "bonjour est-ce que vous allez bien", # Raw transcription of the speech segment
"segment": "Bonjour ! Est-ce que vous allez bien ?", # Processed transcription of the segment (punctuation, normalisation, ...)
"start": 0, # Segment start time
"end": 5.26, # Segment stop time
"duration": 5.26, # Segment duration
"language": "fr", # Segment language (predominantly detected or specified in the request)
"spk_id": "spk1", # Segment speaker id
"words": [ # Word-level information for the segment
{
"word": "bonjour", # Word
"start": 0.0, # Word start time
"end": 1.02, # Word end time
"conf": 0.49 # Word confidence score
},
{
"word": "est-ce",
"start": 3.0,
"end": 3.84,
"conf": 0.63
},
...
]
},
...
]
}
text/plain returns the final transcription as text:
spk1: Bonjour ! Est-ce que vous allez bien ?
spk2: Mais oui et vous ?
text/vtt returns the transcription formatted as WEBVTT captions:
WEBVTT Kind: captions; Language: fr
00:00.000 --> 00:05.260
Bonjour ! Est-ce que vous allez bien ?
00:05.270 --> 00:06.710
Mais oui et vous ?
text/srt returns the transcription formatted as SubRip subtitles:
1
00:00:00,000 --> 00:00:05,260
Bonjour ! Est-ce que vous allez bien ?
2
00:00:05,270 --> 00:00:06,710
Mais oui et vous ?
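The application/json result shown earlier is the most convenient format for post-processing. As an illustration, this Python sketch sums the speaking time of each speaker from the segments list. The helper speaking_time is hypothetical and the sample data is a shortened, made-up result that follows the structure described above.

```python
from collections import defaultdict


def speaking_time(result):
    """Return total segment duration (same unit as the result) per spk_id."""
    totals = defaultdict(float)
    for segment in result["segments"]:
        totals[segment["spk_id"]] += segment["duration"]
    return dict(totals)


# Shortened sample following the JSON structure described above.
sample_result = {
    "segments": [
        {"segment": "Bonjour ! Est-ce que vous allez bien ?",
         "start": 0, "end": 5.26, "duration": 5.26, "spk_id": "spk1"},
        {"segment": "Mais oui et vous ?",
         "start": 5.27, "end": 6.71, "duration": 1.44, "spk_id": "spk2"},
    ]
}
totals = speaking_time(sample_result)
```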
Additionally, you can specify options using the query string:
- return_raw: if set to true, return the raw transcription (No punctuation and no post processing).
- convert_number: if set to true, convert numbers written out as words into digits.
- wordsub: accepts multiple values formatted as originalWord:substituteWord. Substitutes words in the final transcription.
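These options can be combined in a single query string. The Python sketch below builds such a URL with the standard library. The helper results_url is hypothetical, and passing several wordsub values by repeating the parameter is an assumption about how the route reads multiple values.

```python
from urllib.parse import urlencode


def results_url(base_url, result_id, return_raw=False, convert_number=False, wordsub=None):
    """Build the /results/{result_id} URL with optional query parameters."""
    params = []
    if return_raw:
        params.append(("return_raw", "true"))
    if convert_number:
        params.append(("convert_number", "true"))
    # Assumption: one wordsub parameter per originalWord:substituteWord pair.
    for original, substitute in (wordsub or {}).items():
        params.append(("wordsub", f"{original}:{substitute}"))
    url = f"{base_url}/results/{result_id}"
    return f"{url}?{urlencode(params)}" if params else url


url = results_url("http://MY_HOST:MY_PORT", "abc",
                  convert_number=True, wordsub={"spk1": "Alice"})
```

The colon in each wordsub pair is percent-encoded by urlencode, which standard query-string parsers decode back to originalWord:substituteWord.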
The /job-log/{jobid} GET route is used to retrieve job details for debugging. It returns logs as raw text.
The /docs route offers access to a swagger-ui interface with the API specifications (OAS3).
It also lets you test requests directly using pre-filled, modifiable parameters.
Request example:
Initial request
curl -X POST "http://MY_HOST:MY_PORT/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F '
transcriptionConfig={
"language": "fr-FR",
"vadConfig": {
"enableVad": true,
"methodName": "WebRTC",
"minDuration": 30,
"maxDuration": 1200
},
"punctuationConfig": {
"enablePunctuation": false,
"serviceName": null
},
"diarizationConfig": {
"enableDiarization": true,
"numberOfSpeaker": null,
"maxNumberOfSpeaker": 50,
"serviceName": null
}
}' -F "force_sync=" -F "file=@MY_AUDIO.wav;type=audio/x-wav"
> {"jobid": "de37224e-fd9d-464d-9004-dcbf3c5b4300"}
Request job status
curl -X GET "http://MY_HOST:MY_PORT/job/de37224e-fd9d-464d-9004-dcbf3c5b4300" -H "accept: application/json"
> {"result_id": "769d9c20-ad8c-4957-9581-437172434ec0", "state": "done"}
Fetch result
curl -X GET "http://MY_HOST:MY_PORT/results/769d9c20-ad8c-4957-9581-437172434ec0" -H "accept: text/vtt"
> WEBVTT Kind: captions; Language: en_US
00:00.000 --> 00:03.129
This is a transcription
00:03.129 --> 00:07.719
Diarization and punctuation are set
This project is licensed under the AGPLv3 license. Please refer to the LICENSE file for the full description of the license.
- celery: Distributed Task Queue.
- pymongo: A MongoDB python client.
- text2num: A text to number conversion library.
- Supervisor: A Process Control System.