It would be really cool to process the audio in real time using something like this: https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

At the moment, even though the /transcribe endpoint is marked as async, it blocks completely until the self.transcribe() call completes. It also only accepts an entire file rather than a stream of audio bytes. That means we can only process a whole file at once, and it probably means we need extra resources to read the entire file into memory and hand it to the model. If we could stream, say, 10 s at a time, we could probably transcribe longer files, and if we combined that with the cache-aware streaming example linked above, we might be able to do this without a significant degradation in performance.

Does this require changing the model to accept a collection of bytes instead of an entire file?
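As an aside, even before any streaming work, the "endpoint blocks while transcribing" part can be mitigated by pushing the blocking call off the event loop. Here is a minimal sketch of that idea; `blocking_transcribe` is just a stand-in for whatever blocking call the service makes today, and the endpoint shape is assumed:

```
import asyncio
import time

from fastapi import FastAPI, UploadFile

app = FastAPI()


def blocking_transcribe(audio_bytes: bytes) -> str:
    # Stand-in for the real self.transcribe() call, which is CPU/GPU bound
    # and blocks for the duration of the file.
    time.sleep(5)
    return "transcript goes here"


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()
    loop = asyncio.get_running_loop()
    # Run the blocking work in the default thread pool so the event loop
    # can keep serving other requests while this one is transcribing.
    text = await loop.run_in_executor(None, blocking_transcribe, audio_bytes)
    return {"text": text}
```

This doesn't give streaming, but it keeps the server responsive while a long file is being processed.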
I think to do this, you will need two Docker containers running on two different ports: one that does just the transcription, and one that handles the API requests.
The API container would have something like this:
```
import asyncio

from aiohttp import ClientSession
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

api = FastAPI()


async def forward(ws_a: WebSocket, ws_b):
    # Relay raw audio bytes from the client to the transcription container.
    while True:
        try:
            data = await ws_a.receive_bytes()
        except WebSocketDisconnect:
            break
        await ws_b.send_bytes(data)


async def reverse(ws_a: WebSocket, ws_b):
    # Relay transcription results from the transcription container back to the client.
    while True:
        try:
            response = await ws_b.receive()
            data = response.data
        except WebSocketDisconnect:
            break
        await ws_a.send_text(data)


@api.websocket("/send")
async def websocket_a(ws_a: WebSocket):
    await ws_a.accept()
    async with ClientSession() as session:
        async with session.ws_connect(
            "URL FOR WEBSOCKET RUNNING ON CONTAINER B", timeout=60
        ) as ws_b_client:
            fwd_task = asyncio.create_task(forward(ws_a, ws_b_client))
            rev_task = asyncio.create_task(reverse(ws_a, ws_b_client))
            await asyncio.gather(fwd_task, rev_task)
```
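For what it's worth, a client could then stream a file to the API container in fixed-size chunks instead of uploading it all at once. This is just a sketch: the URL, the file name, the chunk size, and the assumption of raw 16 kHz 16-bit mono PCM are all made up for illustration.

```
import asyncio

from aiohttp import ClientSession

# Roughly 10 s of 16 kHz, 16-bit mono PCM; the chunk size is illustrative.
CHUNK_BYTES = 16_000 * 2 * 10


async def stream_file(path: str) -> None:
    async with ClientSession() as session:
        # "ws://localhost:8000/send" is a placeholder for wherever the
        # API container's /send websocket is actually exposed.
        async with session.ws_connect("ws://localhost:8000/send") as ws:

            async def send_chunks() -> None:
                with open(path, "rb") as f:
                    while chunk := f.read(CHUNK_BYTES):
                        await ws.send_bytes(chunk)

            async def print_transcripts() -> None:
                # Print whatever partial transcripts come back; this loop
                # runs until the server closes the connection.
                async for msg in ws:
                    print(msg.data)

            await asyncio.gather(send_chunks(), print_transcripts())


if __name__ == "__main__":
    asyncio.run(stream_file("audio.raw"))
```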
and the transcription container would have something like this:
```
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/generate")
async def websocket_b(ws_b_server: WebSocket):
    await ws_b_server.accept()
    while True:
        # Receive the audio bytes relayed by the API container.
        audio_bytes = await ws_b_server.receive_bytes()
        resp = ...  # DO TRANSCRIPTION STUFF with audio_bytes
        await ws_b_server.send_text(json.dumps(resp))
```
I think the hard part here will be figuring out how to send the bytes directly to the model without writing a bunch of temporary files.
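One way to avoid temp files, at least for the decode step, is to turn the received bytes into a numpy array in memory. How that array then gets into the model depends entirely on the model's API, so the model call itself is left as a placeholder below; `soundfile` and the raw-PCM assumption are my own choices, not something already in this repo.

```
import io

import numpy as np
import soundfile as sf


def encoded_bytes_to_array(audio_bytes: bytes) -> tuple[np.ndarray, int]:
    # Decode an in-memory WAV/FLAC/OGG payload without touching disk;
    # soundfile accepts any file-like object, so a BytesIO works here.
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
    return audio, sample_rate


def pcm_bytes_to_array(pcm_bytes: bytes) -> np.ndarray:
    # If the stream is raw 16-bit PCM with no header, numpy can decode it
    # directly and scale it to float32 in [-1, 1].
    return np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0


# Placeholder: feed the array to the model however its API allows, e.g.
# transcript = model.transcribe(...); if it only accepts file paths, a
# tempfile fallback may still be needed.
```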
Lmk if this doesn't make any sense or if you run into trouble. I've done something similar before and it worked well, but that repo is not yet open source.