Demo structure #10

mahmoudalismail opened this issue Jan 18, 2019 · 0 comments

Audlib Demo

Demo is on ~~January 22~~ January 29, 2019 in our ROBUST-MLSP group meeting.

Outlines

  1. Motivation/Why?
  2. Functionalities
  3. Contributors
  4. Difference between this package and existing packages
    1. Pythonic library
    2. Lazy evaluation
  5. Demo
    1. Easy to use interface
      1. Feature extraction
      2. Data preprocessing (Add multi-threading to audiopipe.py)
        1. HPC
      3. Pytorch compatible dataset
    2. Performance compared to librosa
      1. Optimization (mfcc computation,...etc.)
  6. Roadmap
  7. Contributing

Motivation/What is pyaudlib?

Pyaudlib is a speech processing library in Python with emphasis on deep learning.

Popular speech/audio processing libraries have no deep learning support:

  • librosa
  • voicebox
  • ...

Generic deep learning libraries have good support for image processing, but not for audio:

  • PyTorch
  • TensorFlow
  • ...

pyaudlib (name subject to change) provides a collection of utilities for developing speech-related applications using both signal processing and deep learning.

Functionalities

pyaudlib offers the following high-level features:

  • Speech signal processing utilities with ready-to-use applications
    • Feature extraction frontend
    • Speech enhancement
    • Speech activity detection
  • Deep learning architectures for speech processing tasks in PyTorch
    • SNRNN (and its variant) for speech enhancement*
    • Attention network + CTC objective for speech recognition*
  • PyTorch-compatible interface (similar to torchvision) for batch processing (see the sketch after this list)
    • Dataset class specific to speech tasks
      • For ASR: WSJ0, WSJ1
      • For speech enhancement: RATS, VCTK
      • For speech activity detection: RATS
  • I/O utilities for interfacing with CMUSPHINX*
  • A command-line interface with a unix-pipe-like syntax
    • Inspecting the spectrogram of a wave file is as easy as
    audiopipe open -i path/to/audio.wav read logspec plot

*Under development.
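
To make the torchvision-style interface above concrete, here is a minimal sketch of the pattern only; the SpeechDataset class, its constructor arguments, and the transform signature are illustrative assumptions, not pyaudlib's actual Dataset API.

    # Illustrative sketch of a torchvision-style speech dataset; not pyaudlib's API.
    import soundfile as sf
    import torch
    from torch.utils.data import Dataset, DataLoader


    class SpeechDataset(Dataset):
        """One audio file per item, with an optional feature transform."""

        def __init__(self, filepaths, transform=None):
            self.filepaths = filepaths
            self.transform = transform  # e.g. a log-spectrogram frontend

        def __len__(self):
            return len(self.filepaths)

        def __getitem__(self, idx):
            sig, sr = sf.read(self.filepaths[idx])
            if self.transform is not None:
                sig = self.transform(sig, sr)
            return torch.as_tensor(sig)


    # Usage: plug the dataset into a standard PyTorch DataLoader.
    # dataset = SpeechDataset(["a.wav", "b.wav"], transform=my_frontend)
    # loader = DataLoader(dataset, batch_size=4)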

Difference between pyaudlib and existing libraries

  1. Correctness

    • Unit testing is done on all signal processing functions

    • User inputs are checked for correctness

    >>> wind = hamming(512, hop=.75, synth=True)
    AssertionError: [wsize:512, hop:0.75] violates COLA in time.
    
    >>> wind = hamming(512, hop=.5, synth=True)  # ok!
    • No unexpected output
    >>> # Using audlib
    >>> sig, sr = audioread('samples/welcome16k.wav')
    >>> sigspec = stft(sig, sr, wind, .5, 512, synth=True)
    >>> sigsynth = istft(sigspec, sr, wind, .5, 512)
    >>> np.allclose(sig, sigsynth[:len(sig)])
    True
    
    >>> # Using librosa (you might not expect this)*
    >>> nfft = 512
    >>> sigpad = fix_length(sig, len(sig)+nfft//2)
    >>> D = stft(sigpad, n_fft=nfft)
    >>> sigsynth = istft(D, length=len(sig))
    >>> np.allclose(sig, sigsynth)
    False
    >>> np.sum(np.abs(sig-sigsynth))
    0.00012380785899053157

    *This is the official example given by librosa.

  2. Efficiency

    • All functionalities are profiled in terms of time and space complexity.
    • Frequently used utilities are already up to speed with popular libraries
    >>> %timeit stft_audlib(sig, sr, hamming(int(window_length*sr), hop=hopfrac), hopfrac, nfft)
    628 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr))
    757 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit melspec_audlib(sig, sr, wind, hopfrac, nfft, MelFreq(sr, nfft, nmels))
    1.07 ms ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit melspec_librosa(S=np.abs(stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr)))**2, n_mels=nmels)
    1.52 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
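
    For context, a plausible setup for the timings above; the concrete values below (16 kHz audio, 25 ms window, 50% hop, 512-point FFT, 40 mel bands) are assumptions, since the exact benchmark parameters are not recorded in this issue.

    # Assumed benchmark setup -- the exact parameters are not recorded above.
    import soundfile as sf
    sig, sr = sf.read('samples/welcome16k.wav')  # same file as the correctness example
    window_length = 0.025  # 25 ms analysis window (assumption)
    hopfrac = 0.5          # 50% hop between frames (assumption)
    nfft = 512             # FFT size (assumption)
    nmels = 40             # number of mel bands (assumption)
    # wind = hamming(int(window_length*sr), hop=hopfrac)  # audlib window, as in the first %timeit line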
    
    • Memory footprint still has room to improve
      • Memory usage affects speed when processing large amounts of data.
    ## For AUDLIB ##
    434476 frames processed by stft_audlib in 36.76 seconds.
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
    58    145.5 MiB    145.5 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    145.5 MiB      0.0 MiB       start_time = time.time()
    62    145.5 MiB      0.0 MiB       numframes = 0
    63    145.5 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
    64    180.8 MiB      1.5 MiB       for ii, samp in enumerate(wsjspeech):
    65    180.8 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    172.3 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    180.8 MiB      8.4 MiB           feat = transform(wsjspeech[ii])
    68    180.8 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    180.8 MiB      0.0 MiB           if (ii+1) > 500:
    70    171.0 MiB      0.0 MiB               break
    71    171.0 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time()-
    start_time:.2f} seconds.""")
    
    ## For LIBROSA ##
    434479 frames processed by stft_librosa in 36.07 seconds.
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
    58    148.6 MiB    148.6 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    148.6 MiB      0.0 MiB       start_time = time.time()
    62    148.6 MiB      0.0 MiB       numframes = 0
    63    148.6 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
    64    166.4 MiB      1.0 MiB       for ii, samp in enumerate(wsjspeech):
    65    166.4 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    164.8 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    166.4 MiB      7.0 MiB           feat = transform(wsjspeech[ii])
    68    166.4 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    166.4 MiB      0.0 MiB           if (ii+1) > 500:
    70    164.1 MiB      0.0 MiB               break
    71    164.1 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time()-
    start_time:.2f} seconds.""")
    
    • A note on programming patterns in our group

    We have seen three patterns for pre-processing audio data before feeding it into a NN:

    1. Extract all features and save to a file, then load them all at once when needed.
      • Maximum disk space
      • Unacceptable memory usage
      • Extremely slow runtime
    2. Extract and save each feature to a separate file, then load them when needed (sketched after this list).
      • Maximum disk space
      • Minimal memory footprint, given that features are loaded on-demand
      • Fastest runtime
    3. Extract features on-the-fly.
      • No disk space
      • Very small memory footprint
      • Moderate runtime
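
    A minimal sketch of pattern 2, under stated assumptions: the frontend callable, the per-utterance .npy layout, and the function names are placeholders for illustration, not part of pyaudlib.

    # Sketch of pattern 2: run the feature frontend once per utterance, cache each
    # result to its own file, then load features on demand during training.
    from pathlib import Path

    import numpy as np
    import soundfile as sf


    def precompute(wav_paths, feat_dir, frontend):
        """Extract and save one feature file per utterance."""
        feat_dir = Path(feat_dir)
        feat_dir.mkdir(parents=True, exist_ok=True)
        for wav in map(Path, wav_paths):
            sig, sr = sf.read(wav)
            np.save(feat_dir / (wav.stem + ".npy"), frontend(sig, sr))


    def load_feature(feat_dir, utt_id):
        """Load a single cached feature when the training loop asks for it."""
        return np.load(Path(feat_dir) / (utt_id + ".npy"))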
  3. Simplicity

    • Syntax is simple but does not over-simplify
    • Dataset creation follows the torchvision style
    • When in doubt, there are example IPython notebooks to reference
  4. Continuous development (for developers)

    • Codebase is written and documented according to industry standards (PEP 8, the NumPy docstring guide)
    • Continuous integration (MAHMOUD: Add something here)
    • No high-level dependencies. Credible low-level dependencies are included when absolutely required:
      • PyTorch for DNN implementations and GPU calculation
      • NumPy for multi-dimensional array computation
      • Click for command-line interface
      • SoundFile for audio I/O
      • resampy for resampling*
      • SciPy for filtering*
      • Matplotlib for plotting

    *Will be removed in the future.

Roadmap

Top-priority stack (before March):

  • A short-time analysis NN layer
    • Frame-level computations (e.g. STFT, MFCC) inside NNs (see the sketch after this list)
  • Attention network and CTC objective for ASR
  • A character-level DNN-based speech recognition system
  • Multi-threaded feature extraction
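
As a rough illustration of the short-time analysis layer idea above, frame-level analysis can live inside an nn.Module by wrapping torch.stft; this is a sketch of the concept under assumed parameters, not the planned layer.

    # Sketch of an STFT-as-a-layer module; window/FFT parameters are illustrative.
    import torch
    import torch.nn as nn


    class STFTLayer(nn.Module):
        """Computes a log-magnitude spectrogram inside the network graph."""

        def __init__(self, n_fft=512, hop_length=256):
            super().__init__()
            self.n_fft = n_fft
            self.hop_length = hop_length
            # Register the analysis window as a buffer so it follows .to(device).
            self.register_buffer("window", torch.hann_window(n_fft))

        def forward(self, waveform):
            # waveform: (batch, samples) -> (batch, n_fft // 2 + 1, frames)
            spec = torch.stft(waveform, self.n_fft, hop_length=self.hop_length,
                              window=self.window, return_complex=True)
            return torch.log1p(spec.abs())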

Mid-priority stack (before April):

  • SNRNNpost for speech enhancement
  • I/O bridge with SPHINX's language model
  • Integrating (recent) work that came out of our group
    • Phase difference channel weighting (PDCW)
    • Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF)
    • Cross-correlation across frequency (CCF)

Other ideas:

  • Local implementation of frequently used applications
    • F0 tracker
    • Phase vocoder
    • STFT phase estimation given magnitude
    • DNN-based forced aligner

Contributing

Current contributors (have pushed to the repo at least once):

Raymond Xia - [email protected]

Mahmoud Alismail - [email protected]

Shangwu Yao - [email protected]

Joining the development team, reporting issues, and requesting features are all welcome!
