
#4 Implement AudioToTextPipeline #34

Merged: 41 commits from feature/speech-pipeline into facebookresearch:main, Sep 6, 2024

Conversation

botirk38 (Collaborator):

Why?

This PR implements an AudioToTextHFPipeline for transcribing audio datasets from HuggingFace into text using SONAR. The key reasons for this implementation are:

  1. To provide a standardized pipeline for audio transcription tasks within our project.
  2. To leverage SONAR's speech-to-text capabilities for processing audio datasets from HuggingFace.
  3. To enable easy configuration and customization of audio transcription parameters.
  4. To integrate seamlessly with our existing Pipeline framework and HuggingFace datasets.
  5. To support batch processing for efficient handling of large audio datasets.

This implementation will improve our ability to process and transcribe audio datasets consistently, making it easier to prepare text data for further analysis or model training.

How?

Key technical decisions and implementations:

  1. Extended the existing Pipeline and PipelineConfig classes to create AudioToTextHFPipeline and AudioPipelineConfig (see the sketch after this list).
  2. Integrated SONAR's SpeechToTextPipeline for audio transcription.
  3. Implemented batch processing of audio data with configurable batch sizes.
  4. Used torch.inference_mode() for efficient inference.
  5. Included detailed logging for better monitoring and debugging.
  6. Implemented error handling throughout the pipeline.
  7. Allowed for customization of encoder and decoder models, target language, and other transcription parameters.
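
A minimal sketch of how these pieces fit together. SpeechToTextModelPipeline and its predict() arguments follow SONAR's published API, while the config fields and class shape here are illustrative assumptions, not the merged implementation:

    from dataclasses import dataclass

    import torch
    from sonar.inference_pipelines.speech import SpeechToTextModelPipeline


    @dataclass
    class AudioPipelineConfig:
        # Illustrative fields; the real config extends PipelineConfig.
        encoder_model: str = "sonar_speech_encoder_eng"
        decoder_model: str = "text_sonar_basic_decoder"
        target_lang: str = "eng_Latn"
        batch_size: int = 16


    class AudioToTextHFPipeline:
        """Transcribes batches of audio files to text with SONAR."""

        def __init__(self, config: AudioPipelineConfig):
            self.config = config
            self.model = SpeechToTextModelPipeline(
                encoder=config.encoder_model,
                decoder=config.decoder_model,
                tokenizer=config.decoder_model,
            )

        def transcribe_audio(self, audio_paths: list) -> list:
            # inference_mode() skips autograd bookkeeping during prediction.
            with torch.inference_mode():
                return self.model.predict(
                    audio_paths,
                    target_lang=self.config.target_lang,
                    batch_size=self.config.batch_size,
                )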

Work in Progress (WIP) parts:

  • The error handling in the transcribe_audio method could be more specific to different types of errors that might occur during transcription.

Test plan

To test these changes, we should:

  1. Create unit tests for the AudioToTextHFPipeline and AudioPipelineConfig classes (a minimal example follows this list).
  2. Implement integration tests with sample audio datasets from HuggingFace.
  3. Test with different audio formats and languages to ensure robustness.
  4. Verify that batch processing works correctly for various batch sizes.
  5. Test the pipeline with different SONAR encoder and decoder models.
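
For instance, a unit test for the embedding-shape normalization discussed in the review below might look like this (the ensure_2d helper is a stand-in for the pipeline's internal logic, not its actual name):

    import torch


    def ensure_2d(embeddings):
        # Mirrors the pipeline's normalization: promote 1-D embeddings to 2-D.
        return [e.unsqueeze(0) if e.dim() == 1 else e for e in embeddings]


    def test_ensure_2d_handles_mixed_ranks():
        embeddings = [torch.rand(1024), torch.rand(3, 1024)]
        out = ensure_2d(embeddings)
        assert all(e.dim() == 2 for e in out)
        assert out[0].shape == (1, 1024)
        assert out[1].shape == (3, 1024)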

@facebook-github-bot added the CLA Signed label on Jul 19, 2024
@botirk38 changed the title from "Implement AudioToTextPipeline" to "#6 Implement AudioToTextPipeline" on Aug 2, 2024
@botirk38 changed the title from "#6 Implement AudioToTextPipeline" to "#4 Implement AudioToTextPipeline" on Aug 12, 2024


@dataclass
class AudioDatasetConfig(DatasetConfig):

Contributor: Where is this class used?

Collaborator Author: When we initialize audio HuggingFace datasets.

)

# Ensure all embeddings are 2D
all_embeddings = [emb.unsqueeze(0) if emb.dim() == 1 else emb for emb in all_embeddings]

Contributor: What happens if the audio inputs have multiple channels?

Collaborator Author: We can convert multi-channel audio to mono by taking the mean across channels; raising an error instead would make the pipeline less robust.
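
A sketch of that conversion, assuming the HF Audio feature's decoded dict with "array" and "sampling_rate" keys and a channel-first layout (flip the axis if your loader yields channel-last):

    import numpy as np


    def to_mono(audio_data: dict) -> np.ndarray:
        # Average across the channel axis; assumed shape (channels, samples).
        array = np.asarray(audio_data["array"])
        if array.ndim == 2:
            array = array.mean(axis=0)
        return array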

audio_data['array'], audio_data['sampling_rate'])
audio_inputs.append(temp_file.name)
else:
logger.warning(f"Invalid audio data format: {audio_data}")

Contributor: Printing all of audio_data might be overwhelming, especially in the terminal.

Collaborator Author: I added this logging in trace mode; it may be useful to check that the wav dim and shape are correct.
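
One way to keep that trace output compact is to log only summary fields rather than the raw samples (a sketch; logger and audio_data are the names from the surrounding code, np is numpy):

    if isinstance(audio_data, dict) and "array" in audio_data:
        arr = np.asarray(audio_data["array"])
        logger.warning(
            "Invalid audio data: shape=%s dtype=%s sampling_rate=%s",
            arr.shape, arr.dtype, audio_data.get("sampling_rate"),
        )
    else:
        logger.warning("Invalid audio data of type %s", type(audio_data).__name__)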

)

"""

Contributor: An extra dtype param is needed here.

Collaborator Author: Done.

Comment on lines +136 to +138
if column not in batch:
logger.warning(f"Column {column} not found in batch. Skipping.")
continue

Contributor: I would raise in this case instead of skipping.

Collaborator Author: Done.

and "sampling_rate" in audio_data
):
# Handle multi-channel audio by taking the mean across channels
audio_array = audio_data["array"]

Contributor: Is this always the convention used by HF? If not, we need to add it as a param.

audio-specific attributes and processing.

Attributes:
sampling_rate (int): The target sampling rate for audio data.

Contributor: Leave a comment about the HF integration.

Collaborator Author: Done.


# Ensure all embeddings are 2D
processed_embeddings: List[torch.Tensor] = [
emb.unsqueeze(0) if emb.dim() == 1 else emb

Contributor: Do you really need to do this?

for emb in all_embeddings
]

# Get the maximum sequence length

Contributor: Maybe we don't need this padding at all, since it's already padded?
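
For reference, the manual max-length padding under discussion is equivalent to torch.nn.utils.rnn.pad_sequence, assuming each embedding is (seq_len, dim) after the 2-D normalization above:

    import torch
    from torch.nn.utils.rnn import pad_sequence

    embeddings = [torch.rand(3, 1024), torch.rand(5, 1024)]  # ragged lengths
    padded = pad_sequence(embeddings, batch_first=True)  # (2, 5, 1024), zero-padded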

Comment on lines 217 to 221
logger.error(
f"Error in model.predict for column {column}: {str(e)}"
)
# Instead of raising, we'll set the output to None and continue processing
batch[f"{column}_{self.config.output_column_suffix}"] = None

Contributor: I would just raise, to keep things simple.

Args:
dataset (datasets.Dataset): The loaded dataset.

Returns:

Collaborator Author: Add a link to the HF cast_column API here.

datasets.Dataset: The dataset with processed audio column.
"""
if self.audio_column in dataset.column_names:
dataset = dataset.cast_column(

Collaborator Author: Add to the docs that casting the column will modify the original column.
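
The relevant datasets API, for the docs (standard usage; the dataset name here is just a common public example):

    from datasets import Audio, load_dataset

    ds = load_dataset("PolyAI/minds14", name="en-US", split="train")
    # Re-decodes the audio column at the target sampling rate; note that the
    # original column is replaced on the returned dataset.
    ds = ds.cast_column("audio", Audio(sampling_rate=16000))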

fbank_dtype: torch.dtype = torch.float32
n_parallel: int = 4
pad_idx: int = 0
dtype: np.dtype = np.dtype(np.float32)

Collaborator Author: No need for the np.dtype constructor here.


try:
# Move tensors to the specified device
audio_inputs = [

Collaborator Author: Move tensors in mini-batches to avoid out-of-memory errors.
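
A sketch of the suggested chunking (the helper, the mini_batch_size knob, and a model callable that accepts a list of tensors are all illustrative, not part of the final code):

    import torch


    def predict_in_mini_batches(model, audio_tensors, device, mini_batch_size=8):
        # Only one mini-batch of tensors lives on the device at a time,
        # which bounds peak memory during inference.
        outputs = []
        with torch.inference_mode():
            for i in range(0, len(audio_tensors), mini_batch_size):
                chunk = [t.to(device) for t in audio_tensors[i:i + mini_batch_size]]
                outputs.extend(o.cpu() for o in model(chunk))
        return outputs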

@pytest.fixture
def complex_audio_data() -> Dict[str, Dict[str, Any]]:
return {
"short_audio": {"array": np.random.rand(8000), "sampling_rate": 16000},

Collaborator Author: Verify correctness by decoding the output tensors.

@botirk38 merged commit 8c5bf17 into facebookresearch:main on Sep 6, 2024
5 checks passed
@botirk38 deleted the feature/speech-pipeline branch on Sep 6, 2024