Demo structure #10

mahmoudalismail opened this issue Jan 18, 2019 · 0 comments

Audlib Demo

Demo is on ~~January 22~~ January 29, 2019 in our ROBUST-MLSP group meeting.

Outlines

  1. Motivation/Why?
  2. Functionalities
  3. Contributors
  4. Difference between this package and existing packages
    1. Pythonic library
    2. Lazy evaluation
  5. Demo
    1. Easy to use interface
      1. Feature extraction
      2. Data preprocessing (Add multi-threading to audiopipe.py)
        1. HPC
      3. Pytorch compatible dataset
    2. Performance compared to librosa
      1. Optimization (mfcc computation,...etc.)
  6. Roadmap
  7. Contributing

Motivation/What is pyaudlib?

Pyaudlib is a speech processing library in Python with emphasis on deep learning.

Popular speech/audio processing libraries have no deep learning support:

  • librosa
  • voicebox
  • ...

Generic deep learning libraries have good support for image processing, but not for audio:

  • PyTorch
  • TensorFlow
  • ...

pyaudlib (name subject to change) provides a collection of utilities for developing speech-related applications using both signal processing and deep learning.

Functionalities

pyaudlib offers the following high-level features:

  • Speech signal processing utilities with ready-to-use applications
    • Feature extraction frontend
    • Speech enhancement
    • Speech activity detection
  • Deep learning architectures for speech processing tasks in PyTorch
    • SNRNN (and its variant) for speech enhancement*
    • Attention network + CTC objective for speech recognition*
  • PyTorch-compatible interface (similar to torchvision) for batch processing (see the sketch after this list)
    • Dataset class specific to speech tasks
      • For ASR: WSJ0, WSJ1
      • For speech enhancement: RATS, VCTK
      • For speech activity detection: RATS
  • I/O utilities for interfacing with CMUSPHINX*
  • A command-line interface with a unix-pipe-like syntax
    • Inspecting the spectrogram of a wave file is as easy as
    audiopipe open -i path/to/audio.wav read logspec plot

*Under development.
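
To make the torchvision-style interface above concrete, here is a minimal sketch of the pattern only; the SpeechDataset class, its constructor arguments, and the transform signature are illustrative assumptions, not pyaudlib's actual Dataset API.

    # Illustrative sketch of a torchvision-style speech dataset; not pyaudlib's API.
    import soundfile as sf
    import torch
    from torch.utils.data import Dataset, DataLoader


    class SpeechDataset(Dataset):
        """One audio file per item, with an optional feature transform."""

        def __init__(self, filepaths, transform=None):
            self.filepaths = filepaths
            self.transform = transform  # e.g. a log-spectrogram frontend

        def __len__(self):
            return len(self.filepaths)

        def __getitem__(self, idx):
            sig, sr = sf.read(self.filepaths[idx])
            if self.transform is not None:
                sig = self.transform(sig, sr)
            return torch.as_tensor(sig)


    # Usage: plug the dataset into a standard PyTorch DataLoader.
    # dataset = SpeechDataset(["a.wav", "b.wav"], transform=my_frontend)
    # loader = DataLoader(dataset, batch_size=4)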

Difference between pyaudlib and existing libraries

  1. Correctness

    • Unit testing is done on all signal processing functions

    • User inputs are checked for correctness

    >>> wind = hamming(512, hop=.75, synth=True)
    AssertionError: [wsize:512, hop:0.75] violates COLA in time.
    
    >>> wind = hamming(512, hop=.5, synth=True)  # ok!
    • No unexpected output
    >>> # Using audlib
    >>> sig, sr = audioread('samples/welcome16k.wav')
    >>> sigspec = stft(sig, sr, wind, .5, 512, synth=True)
    >>> sigsynth = istft(sigspec, sr, wind, .5, 512)
    >>> np.allclose(sig, sigsynth[:len(sig)])
    True
    
    >>> # Using librosa (you might not expect this)*
    >>> nfft = 512
    >>> sigpad = fix_length(sig, len(sig)+nfft//2)
    >>> D = stft(sigpad, n_fft=nfft)
    >>> sigsynth = istft(D, length=len(sig))
    >>> np.allclose(sig, sigsynth)
    False
    >>> np.sum(np.abs(sig-sigsynth))
    0.00012380785899053157

    *This is the official example given by librosa.

  2. Efficiency

    • All functionalities are profiled in terms of time and space complexity.
    • Frequently used utilities are already up to speed with popular libraries
    >>> %timeit stft_audlib(sig, sr, hamming(int(window_length*sr), hop=hopfrac), hopfrac, nfft)
    628 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr))
    757 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit melspec_audlib(sig, sr, wind, hopfrac, nfft, MelFreq(sr, nfft, nmels))
    1.07 ms ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit melspec_librosa(S=np.abs(stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr)))**2, n_mels=nmels)
    1.52 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
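
    For context, a plausible setup for the timings above; the concrete values below (16 kHz audio, 25 ms window, 50% hop, 512-point FFT, 40 mel bands) are assumptions, since the exact benchmark parameters are not recorded in this issue.

    # Assumed benchmark setup -- the exact parameters are not recorded above.
    import soundfile as sf
    sig, sr = sf.read('samples/welcome16k.wav')  # same file as the correctness example
    window_length = 0.025  # 25 ms analysis window (assumption)
    hopfrac = 0.5          # 50% hop between frames (assumption)
    nfft = 512             # FFT size (assumption)
    nmels = 40             # number of mel bands (assumption)
    # wind = hamming(int(window_length*sr), hop=hopfrac)  # audlib window, as in the first %timeit line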
    
    • Memory footprint still has room to improve
      • Memory usage affects speed when processing large amounts of data.
    ## For AUDLIB ##
    434476 frames processed by stft_audlib in 36.76 seconds.
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
    58    145.5 MiB    145.5 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    145.5 MiB      0.0 MiB       start_time = time.time()
    62    145.5 MiB      0.0 MiB       numframes = 0
    63    145.5 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
    64    180.8 MiB      1.5 MiB       for ii, samp in enumerate(wsjspeech):
    65    180.8 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    172.3 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    180.8 MiB      8.4 MiB           feat = transform(wsjspeech[ii])
    68    180.8 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    180.8 MiB      0.0 MiB           if (ii+1) > 500:
    70    171.0 MiB      0.0 MiB               break
    71    171.0 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time()-
    start_time:.2f} seconds.""")
    
    ## For LIBROSA ##
    434479 frames processed by stft_librosa in 36.07 seconds.
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
    58    148.6 MiB    148.6 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    148.6 MiB      0.0 MiB       start_time = time.time()
    62    148.6 MiB      0.0 MiB       numframes = 0
    63    148.6 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
    64    166.4 MiB      1.0 MiB       for ii, samp in enumerate(wsjspeech):
    65    166.4 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    164.8 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    166.4 MiB      7.0 MiB           feat = transform(wsjspeech[ii])
    68    166.4 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    166.4 MiB      0.0 MiB           if (ii+1) > 500:
    70    164.1 MiB      0.0 MiB               break
    71    164.1 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time()-
    start_time:.2f} seconds.""")
    
    • A note on programming patterns in our group

    We have seen three patterns for pre-processing audio data before feeding it into a NN:

    1. Extract all features and save to a file, then load them all at once when needed.
      • Maximum disk space
      • Unacceptable memory usage
      • Extremely slow runtime
    2. Extract and save each feature to a separate file, then load them when needed (sketched after this list).
      • Maximum disk space
      • Minimal memory footprint, given that features are loaded on-demand
      • Fastest runtime
    3. Extract features on-the-fly.
      • No disk space
      • Very small memory footprint
      • Moderate runtime
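
    A minimal sketch of pattern 2, under stated assumptions: the frontend callable, the per-utterance .npy layout, and the function names are placeholders for illustration, not part of pyaudlib.

    # Sketch of pattern 2: run the feature frontend once per utterance, cache each
    # result to its own file, then load features on demand during training.
    from pathlib import Path

    import numpy as np
    import soundfile as sf


    def precompute(wav_paths, feat_dir, frontend):
        """Extract and save one feature file per utterance."""
        feat_dir = Path(feat_dir)
        feat_dir.mkdir(parents=True, exist_ok=True)
        for wav in map(Path, wav_paths):
            sig, sr = sf.read(wav)
            np.save(feat_dir / (wav.stem + ".npy"), frontend(sig, sr))


    def load_feature(feat_dir, utt_id):
        """Load a single cached feature when the training loop asks for it."""
        return np.load(Path(feat_dir) / (utt_id + ".npy"))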
  3. Simplicity

    • Syntax is simple but does not over-simplify
    • Dataset creation follows the torchvision style
    • When in doubt, there are example IPython notebooks to reference
  4. Continuous development (for developers)

    • Codebase is written and documented according to industry standards (PEP 8, the NumPy docstring guide)
    • Continuous integration (MAHMOUD: Add something here)
    • No high-level dependencies. Credible low-level dependencies are included when absolutely required:
      • PyTorch for DNN implementations and GPU calculation
      • NumPy for multi-dimensional array computation
      • Click for command-line interface
      • SoundFile for audio I/O
      • resampy for resampling*
      • SciPy for filtering*
      • Matplotlib for plotting

    *Will be removed in the future.

Roadmap

Top-priority stack (before March):

  • A short-time analysis NN layer
    • Frame-level computations (e.g. STFT, MFCC) inside NNs (see the sketch after this list)
  • Attention network and CTC objective for ASR
  • A character-level DNN-based speech recognition system
  • Multi-threaded feature extraction
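
As a rough illustration of the short-time analysis layer idea above, frame-level analysis can live inside an nn.Module by wrapping torch.stft; this is a sketch of the concept under assumed parameters, not the planned layer.

    # Sketch of an STFT-as-a-layer module; window/FFT parameters are illustrative.
    import torch
    import torch.nn as nn


    class STFTLayer(nn.Module):
        """Computes a log-magnitude spectrogram inside the network graph."""

        def __init__(self, n_fft=512, hop_length=256):
            super().__init__()
            self.n_fft = n_fft
            self.hop_length = hop_length
            # Register the analysis window as a buffer so it follows .to(device).
            self.register_buffer("window", torch.hann_window(n_fft))

        def forward(self, waveform):
            # waveform: (batch, samples) -> (batch, n_fft // 2 + 1, frames)
            spec = torch.stft(waveform, self.n_fft, hop_length=self.hop_length,
                              window=self.window, return_complex=True)
            return torch.log1p(spec.abs())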

Mid-priority stack (before April):

  • SNRNNpost for speech enhancement
  • I/O bridge with SPHINX's language model
  • Integrating (recent) work that came out of our group
    • Phase difference channel weighting (PDCW)
    • Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF)
    • Cross-correlation across frequency (CCF)

Other ideas:

  • Local implementation of frequently used applications
    • F0 tracker
    • Phase vocoder
    • STFT phase estimation given magnitude
    • DNN-based forced aligner

Contributing

Current contributors (have pushed to the repo at least once):

Raymond Xia - [email protected]

Mahmoud Alismail - [email protected]

Shangwu Yao - [email protected]

Joining the development team, reporting issues, and requesting features are all welcome!
