This competition aims to evaluate an audio classification system for Ewe speakers in various contexts. The solution will be deployed on edge devices. For more details, please visit https://zindi.africa/competitions/techcabal-ewe-audio-translation-challenge.
- First, I trained an `efficientnet-b0` model, achieving 0.964 accuracy (public LB).
- Second, I trained a `mobilenet_v3_small` model, reaching 0.959 accuracy (public LB).
- Third, since the first model is larger than 10 MB but more accurate, I generated pseudo-labels on the test set, merged them with the training set, and trained a better `mobilenet_v3_small` model, which achieved 0.962 accuracy (public LB). A sketch of this pseudo-labeling step follows the list.
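The following is a minimal sketch of the pseudo-labeling merge in step 3. The file paths and column names (`ID`, `label`) are hypothetical, not taken from the actual scripts:

```python
# Sketch of the pseudo-labeling step (step 3). Paths and column
# names are assumptions; the real scripts may differ in detail.
import pandas as pd

train = pd.read_csv("dataset/train.csv")                  # original training labels
pseudo = pd.read_csv("result/efficientnet_b0_preds.csv")  # step-1 predictions on the test set

# Treat the stronger model's test-set predictions as ground truth
# and merge them with the real labels to enlarge the training set.
merged = pd.concat([train, pseudo], ignore_index=True)
merged.to_csv("dataset/train_pseudo.csv", index=False)
```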
I limited audio length to 2 seconds and used log-mel features (`n_mels=128`). For models 2 and 3, delta features were also used, and `MaskFreq` augmentation was applied. A hedged feature-extraction sketch is shown below.
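This sketch shows one way to compute the 2-second log-mel + delta features with frequency masking, using torchaudio. The 16 kHz sample rate and the `freq_mask_param` value are assumptions, not values from the original scripts:

```python
# Feature-extraction sketch under stated assumptions (16 kHz audio,
# torchaudio); the actual training scripts may differ.
import torch
import torchaudio
import torchaudio.transforms as T
import torchaudio.functional as F

SAMPLE_RATE = 16_000            # assumption: audio resampled to 16 kHz
CLIP_SAMPLES = 2 * SAMPLE_RATE  # audio limited to 2 seconds

to_mel = T.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=128)
to_db = T.AmplitudeToDB()
freq_mask = T.FrequencyMasking(freq_mask_param=24)  # MaskFreq-style augmentation

def featurize(path: str, train: bool = True) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    if sr != SAMPLE_RATE:
        wav = F.resample(wav, sr, SAMPLE_RATE)
    wav = wav.mean(dim=0, keepdim=True)              # mix down to mono
    wav = wav[:, :CLIP_SAMPLES]                      # crop to 2 s
    if wav.shape[1] < CLIP_SAMPLES:                  # pad short clips
        wav = torch.nn.functional.pad(wav, (0, CLIP_SAMPLES - wav.shape[1]))
    logmel = to_db(to_mel(wav))                      # (1, 128, time)
    if train:
        logmel = freq_mask(logmel)                   # mask random mel bands
    delta = F.compute_deltas(logmel)                 # delta features (models 2 and 3)
    return torch.cat([logmel, delta], dim=0)         # 2-channel "image"
```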
- `dataset` is the folder of the raw dataset; `dataset/train` contains 5335 wav files and `dataset/test` contains 2947 wav files.
- `train_0.964.py` is the Python script for the first training step.
- `train_0.959.py` is the Python script for the second training step.
- `train_0.959_v2.py` is the Python script for the last training step.
- `saved_model` is the folder of models from all three steps.
- `result` is the folder of submissions. `result/Submission_cnn_0.962_from_0.959_v2.csv` is my final submission (0.962 public LB).
- `inference.py`: if you only want to run inference, run this file! (A rough usage sketch follows the list.)
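For orientation, a rough fastai equivalent of what `inference.py` does might look like the following. The exported model filename is hypothetical, and it assumes the learner was saved with `learn.export`:

```python
# Rough inference sketch (not the exact contents of inference.py).
from pathlib import Path
from fastai.learner import load_learner

learn = load_learner("saved_model/mobilenet_v3_small_0.959_v2.pkl")  # hypothetical name

for wav in sorted(Path("dataset/test").glob("*.wav")):
    # learn.predict returns (decoded class, class index, probabilities)
    pred_class, pred_idx, probs = learn.predict(wav)
    print(wav.name, pred_class)
```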
- I experimented with various feature-engineering approaches (MFCC, log-mels), but log-mels with `n_mels=128` gave me the best results. Decreasing `n_mels` reduced accuracy.
- I tried `mobilenet_v4`, but it did not work.
- I also tried to develop my own CNN model, but the results were worse.
- I explored other augmentation techniques such as PitchShift and adding random noise, but these did not improve results significantly.
- I tried other ML training frameworks (PyTorch, Hugging Face, tf.keras), but none of them surpassed fastai; a hedged sketch of a fastai setup is shown after this list.
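Since fastai gave the best results, here is a minimal sketch of how a timm backbone can be fine-tuned with fastai on spectrogram images. It assumes the log-mel features were exported as PNGs into per-class folders (the `spec_images` directory, epoch count, and export path are all assumptions):

```python
# Hedged sketch of a fastai + timm setup, not the exact training script.
from fastai.vision.all import ImageDataLoaders, vision_learner, accuracy, Resize

# assumption: log-mel features exported as images under spec_images/<label>/...
dls = ImageDataLoaders.from_folder(
    "spec_images", valid_pct=0.2, seed=42, item_tfms=Resize(224))

learn = vision_learner(dls, "efficientnet_b0", metrics=accuracy)  # timm backbone
learn.fine_tune(10)                          # epoch count is an assumption
learn.export("saved_model/efficientnet_b0.pkl")
```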
Thanks for this interesting challenge; this is my first time joining Zindi <3