#4 Implement AudioToTextPipeline #34
Conversation
…t use EmbeddingToText afterwards
huggingface_pipelines/speech.py (outdated)

@dataclass
class AudioDatasetConfig(DatasetConfig):
Reviewer: Where is this class used?
Author: It is used when we initialize audio HuggingFace datasets.
huggingface_pipelines/speech.py (outdated)

)

# Ensure all embeddings are 2D
all_embeddings = [
    emb.unsqueeze(0) if emb.dim() == 1 else emb
    for emb in all_embeddings
]
Reviewer: What happens if the audio inputs have multiple channels?
Author: We can convert multi-channel audio to mono by taking the mean across channels; raising an error would make the pipeline less robust.
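The downmixing the author describes could look like the sketch below. The function name and the heuristic for picking the channel axis are assumptions, not code from this PR; HF audio arrays are sometimes (channels, samples) and sometimes (samples, channels), so the smaller axis is treated as the channel axis here.

```python
import numpy as np

def to_mono(audio_array: np.ndarray) -> np.ndarray:
    """Downmix multi-channel audio to mono by averaging across channels.

    Heuristic (assumption): for 2-D input, the smaller axis is the channel axis.
    1-D input is already mono and is returned unchanged.
    """
    if audio_array.ndim == 2:
        channel_axis = int(np.argmin(audio_array.shape))
        audio_array = audio_array.mean(axis=channel_axis)
    return audio_array
```

Averaging (rather than dropping channels or raising) keeps the pipeline tolerant of stereo inputs, at the cost of losing per-channel information.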
huggingface_pipelines/speech.py (outdated)

    audio_data['array'], audio_data['sampling_rate'])
audio_inputs.append(temp_file.name)
else:
    logger.warning(f"Invalid audio data format: {audio_data}")
Reviewer: Printing all of audio_data might be overwhelming, especially in the terminal.
Author: I added this logging at trace level; it may be useful to check that the wav dim and shape are correct.
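One way to keep that log line useful without dumping the whole array is to log only a summary of the audio dict. This is a sketch, not code from the PR; `describe_audio` is a hypothetical helper name.

```python
import numpy as np

def describe_audio(audio_data: dict) -> str:
    """Summarize an HF-style audio dict ({'array', 'sampling_rate'}) for logging,
    instead of printing the full sample data."""
    arr = np.asarray(audio_data.get("array", []))
    return (
        f"shape={arr.shape}, dtype={arr.dtype}, "
        f"sampling_rate={audio_data.get('sampling_rate')}"
    )

# e.g. logger.warning(f"Invalid audio data format: {describe_audio(audio_data)}")
```

This preserves the dim/shape information the author wants to verify while keeping terminal output short.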
)

"""
Reviewer: dtype is needed as an extra param here.
Author: Done.
if column not in batch:
    logger.warning(f"Column {column} not found in batch. Skipping.")
    continue
Reviewer: I would raise in this case instead of skipping.
Author: Done.
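The fail-fast variant the reviewer asks for could be as simple as the sketch below (the helper name is hypothetical; the PR's actual fix may inline the check).

```python
def require_columns(batch: dict, columns: list) -> None:
    """Raise immediately when an expected column is missing,
    rather than logging a warning and silently skipping it."""
    missing = [c for c in columns if c not in batch]
    if missing:
        raise ValueError(f"Columns not found in batch: {missing}")
```

Raising surfaces configuration mistakes at the first batch instead of producing a partially processed dataset.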
huggingface_pipelines/audio.py (outdated)

and "sampling_rate" in audio_data
):
    # Handle multi-channel audio by taking the mean across channels
    audio_array = audio_data["array"]
Reviewer: Is this a convention HF always uses? If not, we need to add it as a param.
audio-specific attributes and processing.

Attributes:
    sampling_rate (int): The target sampling rate for audio data.
Reviewer: Leave a comment about the HF integration.
Author: Done.
huggingface_pipelines/audio.py (outdated)

# Ensure all embeddings are 2D
processed_embeddings: List[torch.Tensor] = [
    emb.unsqueeze(0) if emb.dim() == 1 else emb
Reviewer: Do you really need to do this?
huggingface_pipelines/audio.py (outdated)

    for emb in all_embeddings
]

# Get the maximum sequence length
Reviewer: Maybe we don't need this padding at all, since it's already padded?
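If variable-length embeddings really do need padding, torch's built-in `pad_sequence` avoids the manual max-length bookkeeping the snippet above starts. This is a sketch under the assumption that each embedding has shape (seq_len, dim); it is not the PR's code.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two embeddings of shape (seq_len, dim) with different seq_len (illustrative data)
embs = [torch.randn(3, 8), torch.randn(5, 8)]

# Pads every sequence to the longest one; result shape (batch, max_seq_len, dim).
padded = pad_sequence(embs, batch_first=True, padding_value=0.0)
```

If the encoder already returns uniformly sized tensors, as the reviewer suggests, a plain `torch.stack` would suffice and the padding step can be dropped.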
huggingface_pipelines/audio.py (outdated)

logger.error(
    f"Error in model.predict for column {column}: {str(e)}"
)
# Instead of raising, we'll set the output to None and continue processing
batch[f"{column}_{self.config.output_column_suffix}"] = None
Reviewer: I would just raise, to keep it simple.
Args:
    dataset (datasets.Dataset): The loaded dataset.

Returns:
Reviewer: Add a link to the HF cast-column API here.
    datasets.Dataset: The dataset with processed audio column.
"""
if self.audio_column in dataset.column_names:
    dataset = dataset.cast_column(
Reviewer: Add to the docs that casting the column will modify the original column.
huggingface_pipelines/audio.py (outdated)

fbank_dtype: torch.dtype = torch.float32
n_parallel: int = 4
pad_idx: int = 0
dtype: np.dtype = np.dtype(np.float32)
Reviewer: No need for the np.dtype constructor here; np.float32 can be passed directly.
huggingface_pipelines/audio.py (outdated)

try:
    # Move tensors to the specified device
    audio_inputs = [
Reviewer: Move tensors in mini-batches to avoid out-of-memory errors.
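The mini-batching the reviewer suggests could be sketched as below; the function name, signature, and batch size are assumptions for illustration, not this PR's implementation.

```python
import torch

def predict_in_minibatches(model_fn, inputs, device, batch_size=8):
    """Move inputs to the device a few at a time to bound peak device memory,
    instead of transferring the entire list at once."""
    outputs = []
    for i in range(0, len(inputs), batch_size):
        chunk = [t.to(device) for t in inputs[i : i + batch_size]]
        # Bring results back to CPU so device memory is freed between chunks.
        outputs.extend(o.cpu() for o in model_fn(chunk))
    return outputs
```

Only one chunk of inputs (plus its outputs) resides on the device at a time, so peak memory scales with `batch_size` rather than dataset size.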
@pytest.fixture
def complex_audio_data() -> Dict[str, Dict[str, Any]]:
    return {
        "short_audio": {"array": np.random.rand(8000), "sampling_rate": 16000},
Reviewer: Verify correctness by decoding the output tensors.
Why?
This PR implements an AudioToTextHFPipeline for transcribing audio datasets from HuggingFace into text using SONAR. It will improve our ability to process and transcribe audio datasets consistently, making it easier to prepare text data for further analysis or model training.
How?
Key technical decisions and implementations:
Work in Progress (WIP) parts:
Test plan
To test these changes, we should: