This is a repository for speech separation tasks.
This project is heavily inspired by the paper [1], and work to improve its performance is still ongoing.
AVSpeech dataset : contains 4700 hours of video segments from a total of 290k YouTube videos.
Customized video and audio downloaders are provided in data/ (based on youtube-dl, sox, and ffmpeg).
The lib provides several preprocessing functions, including STFT, iSTFT, power-law compression, and complex mask computation, among others.
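The preprocessing steps named above can be sketched with NumPy alone. This is a minimal illustration, not the code in the lib: the function names, the frame size `n_fft=512`, the hop of 160 samples, and the compression exponent `p=0.3` are all assumptions chosen for the example.

```python
import numpy as np

def stft(signal, n_fft=512, hop=160):
    """Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array.

    n_fft and hop are illustrative defaults, not values taken from the repository.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def power_law_compress(spec, p=0.3):
    """Compress the magnitude as |S|**p while keeping the phase (p is assumed)."""
    return np.abs(spec) ** p * np.exp(1j * np.angle(spec))

def power_law_expand(spec, p=0.3):
    """Invert power_law_compress by raising the magnitude to 1 / p."""
    return np.abs(spec) ** (1.0 / p) * np.exp(1j * np.angle(spec))

# One second of synthetic audio at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
spec = stft(audio)
compressed = power_law_compress(spec)
restored = power_law_expand(compressed)
```

Compression reduces the dynamic range of the spectrogram before it is fed to a network, and the expansion step undoes it before the inverse STFT.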
Apply MTCNN to detect faces, and correct the detection by checking it against the provided face center. [2]
The visual frames are converted to 1792-dimensional face embeddings (taken from the avg pooling layer) with a pre-trained FaceNet model [3].
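One way to perform the face-center check is to keep the detected box whose center lies closest to the provided face center. The sketch below assumes boxes given as `(x1, y1, x2, y2)` in pixels and a face center normalized to [0, 1] relative to the frame size; `pick_face` is a hypothetical helper, not a function from this repository or from MTCNN.

```python
def pick_face(boxes, center, width, height):
    """Return the box whose center is nearest to the annotated face center.

    boxes  : list of (x1, y1, x2, y2) tuples in pixels (assumed detector output)
    center : (x, y) normalized to [0, 1] (assumed annotation format)
    Returns None when no face was detected.
    """
    if not boxes:
        return None
    cx, cy = center[0] * width, center[1] * height

    def squared_distance(box):
        bx = (box[0] + box[2]) / 2.0
        by = (box[1] + box[3]) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    return min(boxes, key=squared_distance)

# Two detections in a 500x500 frame; the annotation points at the second one.
detections = [(0, 0, 100, 100), (300, 100, 400, 200)]
chosen = pick_face(detections, center=(0.7, 0.3), width=500, height=500)
```

When a frame contains several people, this keeps the annotated speaker rather than whichever face the detector ranks first.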
Audio part : Dilated CNN + Bidirectional LSTM.
Video part : (pretrained MTCNN + Facenet) + dilated CNN + Bidirectional LSTM.
Loss function : modified discriminative loss function inspired by the paper [4].
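As an illustration of the general idea, a common form of discriminative loss for two-speaker separation rewards each prediction for matching its own target and penalizes similarity to the other speaker's target. The sketch below shows that generic form only; the repository's modified version may differ, and the weight `gamma=0.1` is an assumed value.

```python
import numpy as np

def discriminative_loss(pred1, pred2, true1, true2, gamma=0.1):
    """Generic two-speaker discriminative loss (illustrative, gamma is assumed).

    The first two terms pull each prediction toward its own target; the
    gamma-weighted terms push it away from the other speaker's target.
    """
    def sq_err(a, b):
        return np.sum(np.abs(a - b) ** 2)

    own = sq_err(pred1, true1) + sq_err(pred2, true2)
    cross = sq_err(pred1, true2) + sq_err(pred2, true1)
    return own - gamma * cross

# Toy targets: perfect predictions versus predictions with speakers swapped.
t1, t2 = np.ones(4), np.zeros(4)
loss_correct = discriminative_loss(t1, t2, t1, t2)
loss_swapped = discriminative_loss(t2, t1, t1, t2)
```

With `gamma=0`, this reduces to a plain sum of squared errors.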
Apply complex ratio mask (cRM) to enhance phase spectrum. Maintain the quality during transformation by hyperbolic tangent function. [5]
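A minimal sketch of the cRM and its hyperbolic tangent compression follows. The mask is the complex ratio of the clean spectrogram to the mixture, and each of its real and imaginary parts is squashed into (-K, K) so that the network targets stay bounded. The constants `K = 10` and `C = 0.1` follow values commonly used in the cRM literature and are assumptions here, as are the function names.

```python
import numpy as np

K, C = 10.0, 0.1  # assumed compression constants, not read from the repository

def complex_ratio_mask(clean, mixture, eps=1e-8):
    """cRM = clean / mixture, computed as clean * conj(mixture) / |mixture|**2."""
    return clean * np.conj(mixture) / (np.abs(mixture) ** 2 + eps)

def compress(mask):
    """Squash real and imaginary parts into (-K, K) with K * tanh(C * x / 2)."""
    return K * np.tanh(C * mask.real / 2) + 1j * K * np.tanh(C * mask.imag / 2)

def uncompress(cmask, limit=0.999):
    """Invert compress; clipping keeps arctanh finite for saturated outputs."""
    real = np.clip(cmask.real / K, -limit, limit)
    imag = np.clip(cmask.imag / K, -limit, limit)
    return (2 / C) * np.arctanh(real) + 1j * (2 / C) * np.arctanh(imag)

# Toy spectrogram bins: recover the clean signal from the mixture and the mask.
mixture = np.array([1 + 1j, 2 - 1j, -0.5 + 2j])
clean = np.array([0.5 + 0.5j, 1 + 1j, -1 + 0.5j])
mask = complex_ratio_mask(clean, mixture)
estimate = uncompress(compress(mask)) * mixture
```

Because the mask is complex, multiplying it with the mixture corrects both magnitude and phase, which a real-valued magnitude mask cannot do.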
The model will be evaluated by signal-to-distortion ratio.
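For reference, a simplified signal-to-distortion ratio can be computed as the energy of the reference signal over the energy of the residual, in decibels. This sketch is not the full BSS Eval decomposition and is not necessarily the evaluation code used in this repository; the function name and the `eps` guard are assumptions.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-10):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual = reference - estimate
    return 10 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(residual ** 2) + eps)
    )

# An estimate that is off by 10% in amplitude.
ref = np.array([1.0, 0.0, -1.0, 0.0])
score = sdr(ref, 0.9 * ref)
```

Higher values mean less distortion in the separated signal.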