Skip to content

One way translation #1706

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 25, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions authors.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -257,3 +257,8 @@ thli-openai:
name: "Thomas Li"
website: "https://www.linkedin.com/in/thli/"
avatar: "https://avatars.githubusercontent.com/u/189043632?v=4"

erikakettleson-openai:
name: "Erika Kettleson"
website: "https://www.linkedin.com/in/erika-kettleson-85763196/"
avatar: "https://avatars.githubusercontent.com/u/186107044?v=4"
161 changes: 161 additions & 0 deletions examples/voice_solutions/one_way_translation_using_realtime_api.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
# Multi-Language Conversational Translation with the Realtime API

One of the most exciting things about the Realtime API is that the emotion, tone and pace of speech are all passed to the model for inference. Traditional cascaded voice systems (involving STT and TTS) introduce an intermediate transcription step, relying on SSML or prompting to approximate prosody, which inherently loses fidelity. The speaker's expressiveness is literally lost in translation. Because it can process raw audio, the Realtime API preserves those audio attributes through inference, minimizing latency and enriching responses with tonal and inflectional cues. Because of this, the Realtime API makes LLM-powered speech translation closer to a live interpreter than ever before.

This cookbook demonstrates how to use OpenAI's [ Realtime API](https://platform.openai.com/docs/guides/realtime) to build a multi-lingual, one-way translation workflow with WebSockets. It is implemented using the [Realtime + WebSockets integration](https://platform.openai.com/docs/guides/realtime-websocket) in a speaker application and a WebSocket server to mirror the translated audio to a listener application.

A real-world use case for this demo is a multilingual, conversational translation where a speaker talks into the speaker app and listeners hear translations in their selected native language via the listener app. Imagine a conference room with a speaker talking in English and a participant with headphones in choosing to listen to a Tagalog translation. Due to the current turn-based nature of audio models, the speaker must pause briefly to allow the model to process and translate speech. However, as models become faster and more efficient, this latency will decrease significantly and the translation will become more seamless.


Let's explore the main functionalities and code snippets that illustrate how the app works. You can find the code in the [accompanying repo](https://github.com/openai/openai-cookbook/tree/main/examples/voice_solutions/one_way_translation_using_realtime_api/README.md
) if you want to run the app locally.

### High Level Architecture Overview

This project has two applications - a speaker and listener app. The speaker app takes in audio from the browser, forks the audio and creates a unique Realtime session for each language and sends it to the OpenAI Realtime API via WebSocket. Translated audio streams back and is mirrored via a separate WebSocket server to the listener app. The listener app receives all translated audio streams simultaneously, but only the selected language is played. This architecture is designed for a POC and is not intended for a production use case. Let's dive into the workflow!

![Architecture](translation_images/Realtime_flow_diagram.png)

### Step 1: Language & Prompt Setup

We need a unique stream for each language - each language requires a unique prompt and session with the Realtime API. We define these prompts in `translation_prompts.js`.

The Realtime API is powered by [GPT-4o Realtime](https://platform.openai.com/docs/models/gpt-4o-realtime-preview) or [GPT-4o mini Realtime](https://platform.openai.com/docs/models/gpt-4o-mini-realtime-preview) which are turn-based and trained for conversational speech use cases. In order to ensure the model returns translated audio (i.e. instead of answering a question, we want a direct translation of that question), we want to steer the model with few-shot examples of questions in the prompts. If you're translating for a specific reason or context, or have specialized vocabulary that will help the model understand context of the translation, include that in the prompt as well. If you want the model to speak with a specific accent or otherwise steer the voice, you can follpow tips from our cookbook on [Steering Text-to-Speech for more dynamic audio generation](https://cookbook.openai.com/examples/voice_solutions/steering_tts).

We can dynamically input speech in any language.

```js
// Define language codes and import their corresponding instructions from our prompt config file
const languageConfigs = [
{ code: 'fr', instructions: french_instructions },
{ code: 'es', instructions: spanish_instructions },
{ code: 'tl', instructions: tagalog_instructions },
{ code: 'en', instructions: english_instructions },
{ code: 'zh', instructions: mandarin_instructions },
];
```

## Step 2: Setting up the Speaker App

![SpeakerApp](translation_images/SpeakerApp.png)

We need to handle the setup and management of client instances that connect to the Realtime API, allowing the application to process and stream audio in different languages. `clientRefs` holds a map of `RealtimeClient` instances, each associated with a language code (e.g., 'fr' for French, 'es' for Spanish) representing each unique client connection to the Realtime API.

```js
const clientRefs = useRef(
languageConfigs.reduce((acc, { code }) => {
acc[code] = new RealtimeClient({
apiKey: OPENAI_API_KEY,
dangerouslyAllowAPIKeyInBrowser: true,
});
return acc;
}, {} as Record<string, RealtimeClient>)
).current;

// Update languageConfigs to include client references
const updatedLanguageConfigs = languageConfigs.map(config => ({
...config,
clientRef: { current: clientRefs[config.code] }
}));
```

Note: The `dangerouslyAllowAPIKeyInBrowser` option is set to true because we are using our OpenAI API key in the browser for demo purposes but in production you should use an [ephemeral API key](https://platform.openai.com/docs/api-reference/realtime-sessions) generated via the OpenAI REST API.

We need to actually initiate the connection to the Realtime API and send audio data to the server. When a user clicks 'Connect' on the speaker page, we start that process.

The `connectConversation` function orchestrates the connection, ensuring that all necessary components are initialized and ready for use.

```js
const connectConversation = useCallback(async () => {
try {
setIsLoading(true);
const wavRecorder = wavRecorderRef.current;
await wavRecorder.begin();
await connectAndSetupClients();
setIsConnected(true);
} catch (error) {
console.error('Error connecting to conversation:', error);
} finally {
setIsLoading(false);
}
}, []);
```

`connectAndSetupClients` ensures we are using the right model and voice. For this demo, we are using gpt-4o-realtime-preview-2024-12-17 and coral.

```js
// Function to connect and set up all clients
const connectAndSetupClients = async () => {
for (const { clientRef } of updatedLanguageConfigs) {
const client = clientRef.current;
await client.realtime.connect({ model: DEFAULT_REALTIME_MODEL });
await client.updateSession({ voice: DEFAULT_REALTIME_VOICE });
}
};
```

### Step 3: Audio Streaming

Sending audio with WebSockets requires work to manage the inbound and outbound PCM16 audio streams ([more details on that](https://platform.openai.com/docs/guides/realtime-model-capabilities#handling-audio-with-websockets)). We abstract that using wavtools, a library for both recording and streaming audio data in the browser. Here we use `WavRecorder` for capturing audio in the browser.

This demo supports both [manual and voice activity detection (VAD)](https://platform.openai.com/docs/guides/realtime-model-capabilities#voice-activity-detection-vad) modes for recording that can be toggled by the speaker. For cleaner audio capture we recommend using manual mode here.

```js
const startRecording = async () => {
setIsRecording(true);
const wavRecorder = wavRecorderRef.current;

await wavRecorder.record((data) => {
// Send mic PCM to all clients
updatedLanguageConfigs.forEach(({ clientRef }) => {
clientRef.current.appendInputAudio(data.mono);
});
});
};
```

### Step 4: Showing Transcripts

We listen for `response.audio_transcript.done` events to update the transcripts of the audio. These input transcripts are generated by the Whisper model in parallel to the GPT-4o Realtime inference that is doing the translations on raw audio.

We have a Realtime session running simultaneously for every selectable language and so we get transcriptions for every language (regardless of what language is selected in the listener application). Those can be shown by toggling the 'Show Transcripts' button.

## Step 5: Setting up the Listener App

Listeners can choose from a dropdown menu of translation streams and after connecting, dynamically change languages. The demo application uses French, Spanish, Tagalog, English, and Mandarin but OpenAI supports 57+ languages.

The app connects to a simple `Socket.IO` server that acts as a relay for audio data. When translated audio is streamed back to from the Realtime API, we mirror those audio streams to the listener page and allow users to select a language and listen to translated streams.

The key function here is `connectServer` that connects to the server and sets up audio streaming.

```js
// Function to connect to the server and set up audio streaming
const connectServer = useCallback(async () => {
if (socketRef.current) return;
try {
const socket = io('http://localhost:3001');
socketRef.current = socket;
await wavStreamPlayerRef.current.connect();
socket.on('connect', () => {
console.log('Listener connected:', socket.id);
setIsConnected(true);
});
socket.on('disconnect', () => {
console.log('Listener disconnected');
setIsConnected(false);
});
} catch (error) {
console.error('Error connecting to server:', error);
}
}, []);
```

### POC to Production

This is a demo and meant for inspiration. We are using WebSockets here for easy local development. However, in a production environment we’d suggest using WebRTC (which is much better for streaming audio quality and lower latency) and connecting to the Realtime API with an [ephemeral API key](https://platform.openai.com/docs/api-reference/realtime-sessions) generated via the OpenAI REST API.

Current Realtime models are turn based - this is best for conversational use cases as opposed to the uninterrupted, UN-style live translation that we really want for a one-directional streaming use case. For this demo, we can capture additional audio from the speaker app as soon as the model returns translated audio (i.e. capturing more input audio while the translated audio played from the listener app), but there is a limit to the length of audio we can capture at a time. The speaker needs to pause to let the translation catch up.

## Conclusion

In summary, this POC is a demonstration of a one-way translation use of the Realtime API but the idea of forking audio for multiple uses can expand beyond translation. Other workflows might be simultaneous sentiment analysis, live guardrails or generating subtitles.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
REACT_APP_OPENAI_API_KEY=sk-proj-1234567890
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
/node_modules
/.pnp
.pnp.js

# testing
/coverage

# production
/build

# packaging
*.zip
*.tar.gz
*.tar
*.tgz
*.bla

# misc
.DS_Store
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
# Translation Demo

This project demonstrates how to use the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime) to build a one-way translation application with WebSockets. It is implemented using the [Realtime + Websockets integration](https://platform.openai.com/docs/guides/realtime-websocket). A real-world use case for this demo is multilingual, conversational translation—where a speaker talks into the speaker app and listeners hear translations in their selected native languages via the listener app. Imagine a conference room with multiple participants with headphones, listening live to a speaker in their own languages. Due to the current turn-based nature of audio models, the speaker must pause briefly to allow the model to process and translate speech. However, as models become faster and more efficient, this latency will decrease significantly and the translation will become more seamless.

## How to Use

### Running the Application

1. **Set up the OpenAI API:**

- If you're new to the OpenAI API, [sign up for an account](https://platform.openai.com/signup).
- Follow the [Quickstart](https://platform.openai.com/docs/quickstart) to retrieve your API key.

2. **Clone the Repository:**

```bash
git clone <repository-url>
```

3. **Set your API key:**

- Create a `.env` file at the root of the project and add the following line:
```bash
REACT_APP_OPENAI_API_KEY=<your_api_key>
```

4. **Install dependencies:**

Navigate to the project directory and run:

```bash
npm install
```

5. **Run the Speaker & Listener Apps:**

```bash
npm start
```

The speaker and listener apps will be available at:
- [http://localhost:3000/speaker](http://localhost:3000/speaker)
- [http://localhost:3000/listener](http://localhost:3000/listener)

6. **Start the Mirror Server:**

In another terminal window, navigate to the project directory and run:

```bash
node mirror-server/mirror-server.mjs
```

### Adding a New Language

To add a new language to the codebase, follow these steps:

1. **Socket Event Handling in Mirror Server:**

- Open `mirror-server/mirror-server.cjs`.
- Add a new socket event for the new language. For example, for Hindi:
```javascript
socket.on('mirrorAudio:hi', (audioChunk) => {
console.log('logging Hindi mirrorAudio', audioChunk);
socket.broadcast.emit('audioFrame:hi', audioChunk);
});
```

2. **Instructions Configuration:**

- Open `src/utils/translation_prompts.js`.
- Add new instructions for the new language. For example:
```javascript
export const hindi_instructions = "Your Hindi instructions here...";
```

3. **Realtime Client Initialization in SpeakerPage:**

- Open `src/pages/SpeakerPage.tsx`.
- Import the new language instructions:
```typescript
import { hindi_instructions } from '../utils/translation_prompts.js';
```
- Add the new language to the `languageConfigs` array:
```typescript
const languageConfigs = [
// ... existing languages ...
{ code: 'hi', instructions: hindi_instructions },
];
```

4. **Language Configuration in ListenerPage:**

- Open `src/pages/ListenerPage.tsx`.
- Locate the `languages` object, which centralizes all language-related data.
- Add a new entry for your language. The key should be the language code, and the value should be an object containing the language name.

```typescript
const languages = {
fr: { name: 'French' },
es: { name: 'Spanish' },
tl: { name: 'Tagalog' },
en: { name: 'English' },
zh: { name: 'Mandarin' },
// Add your new language here
hi: { name: 'Hindi' }, // Example for adding Hindi
} as const;
```

- The `ListenerPage` component will automatically handle the new language in the dropdown menu and audio stream handling.

5. **Test the New Language:**

- Run your application and test the new language by selecting it from the dropdown menu.
- Ensure that the audio stream for the new language is correctly received and played.

### Demo Flow

1. **Connect in the Speaker App:**

- Click "Connect" and wait for the WebSocket connections to be established with the Realtime API.
- Choose between VAD (Voice Activity Detection) and Manual push-to-talk mode.
- the speaker should ensure they pause to allow the translation to catch up - the model is turn based and cannot constantly stream translations.
- The speaker can view live translations in the Speaker App for each language.

2. **Select Language in the Listener App:**

- Select the language from the dropdown menu.
- The listener app will play the translated audio. The app translates all audio streams simultaneously, but only the selected language is played. You can switch languages at any time.
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
// mirror_server.js
import express from 'express';
import http from 'http';
import { Server } from 'socket.io';

const app = express();
const server = http.createServer(app);
const io = new Server(server, {
cors: { origin: '*' }
});

io.on('connection', (socket) => {
console.log('Client connected', socket.id);

socket.on('mirrorAudio:fr', (audioChunk) => {
socket.broadcast.emit('audioFrame:fr', audioChunk);
});

socket.on('mirrorAudio:es', (audioChunk) => {
socket.broadcast.emit('audioFrame:es', audioChunk);
});

socket.on('mirrorAudio:tl', (audioChunk) => {
socket.broadcast.emit('audioFrame:tl', audioChunk);
});

socket.on('mirrorAudio:en', (audioChunk) => {
socket.broadcast.emit('audioFrame:en', audioChunk);
});

socket.on('mirrorAudio:zh', (audioChunk) => {
socket.broadcast.emit('audioFrame:zh', audioChunk);
});

socket.on('disconnect', () => {
console.log('Client disconnected', socket.id);
});
});

server.listen(3001, () => {
console.log('Socket.IO mirror server running on port 3001');
});
Loading