# Models for AudioSet: A Large Scale Dataset of Audio Events

This repository provides models and supporting code associated with
[AudioSet](http://g.co/audioset), a dataset of over 2 million human-labeled
10-second YouTube video soundtracks, with labels taken from an ontology of
more than 600 audio event classes.

AudioSet was
[released](https://research.googleblog.com/2017/03/announcing-audioset-dataset-for-audio.html)
in March 2017 by Google's Sound Understanding team to provide a common
large-scale evaluation task for audio event detection as well as a starting
point for a comprehensive vocabulary of sound events.

For more details about AudioSet and the various models we have trained, please
visit the [AudioSet website](http://g.co/audioset) and read our papers:

* Gemmeke, J. et al.,
  [AudioSet: An ontology and human-labelled dataset for audio events](https://research.google.com/pubs/pub45857.html),
  ICASSP 2017

* Hershey, S. et al.,
  [CNN Architectures for Large-Scale Audio Classification](https://research.google.com/pubs/pub45611.html),
  ICASSP 2017

If you use the pre-trained VGGish model in your published research, we ask that
you cite [CNN Architectures for Large-Scale Audio Classification](https://research.google.com/pubs/pub45611.html).
If you use the AudioSet dataset or the released 128-D embeddings of AudioSet
segments, please cite
[AudioSet: An ontology and human-labelled dataset for audio events](https://research.google.com/pubs/pub45857.html).

## VGGish

The initial AudioSet release included 128-dimensional embeddings of each
AudioSet segment produced from a VGG-like audio classification model that was
trained on a large YouTube dataset (a preliminary version of what later became
[YouTube-8M](https://research.google.com/youtube8m)).

We provide a TensorFlow definition of this model, which we call __*VGGish*__, as
well as supporting code to extract input features for the model from audio
waveforms and to post-process the model embedding output into the same format as
the released embedding features.

### Installation

VGGish depends on the following Python packages:

* [`numpy`](http://www.numpy.org/)
* [`scipy`](http://www.scipy.org/)
* [`resampy`](http://resampy.readthedocs.io/en/latest/)
* [`tensorflow`](http://www.tensorflow.org/)
* [`six`](https://pythonhosted.org/six/)

These are all easily installable via, e.g., `pip install numpy` (as in the
example command sequence below).

Any reasonably recent version of these packages should work. TensorFlow should
be at least version 1.0. We have tested with Python 2.7.6 and 3.4.3 on an
Ubuntu-like system with NumPy v1.13.1, SciPy v0.19.1, resampy v0.1.5, TensorFlow
v1.2.1, and Six v1.10.0.

VGGish also requires downloading two data files:

* [VGGish model checkpoint](https://storage.googleapis.com/audioset/vggish_model.ckpt),
  in TensorFlow checkpoint format.
* [Embedding PCA parameters](https://storage.googleapis.com/audioset/vggish_pca_params.npz),
  in NumPy compressed archive format.

After downloading these files into the same directory as this README, you can
test the installation by running `python vggish_smoke_test.py`, which runs a
known signal through the model and checks the output.

Here's a sample installation and test session:

```shell
# You can optionally install and test VGGish within a Python virtualenv, which
# is useful for isolating changes from the rest of your system. For example, you
# may have an existing version of some packages that you do not want to upgrade,
# or you want to try Python 3 instead of Python 2. If you decide to use a
# virtualenv, you can create one by running
#   $ virtualenv vggish    # For Python 2
# or
#   $ python3 -m venv vggish    # For Python 3
# and then enter the virtual environment by running
#   $ source vggish/bin/activate    # Assuming you use bash
# Leave the virtual environment at the end of the session by running
#   $ deactivate

# Upgrade pip first.
$ python -m pip install --upgrade pip
# Install dependencies. Resampy needs to be installed after NumPy and SciPy
# are already installed.
$ pip install numpy scipy
$ pip install resampy tensorflow six

# Clone TensorFlow models repo into a 'models' directory.
$ git clone https://github.com/tensorflow/models.git
$ cd models/audioset
# Download data files into same directory as code.
$ curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
$ curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

# Installation ready, let's test it.
$ python vggish_smoke_test.py
# If we see "Looks Good To Me", then we're all set.
```

### Usage

VGGish can be used in two ways:

* *As a feature extractor*: VGGish converts audio input features into a
  semantically meaningful, high-level 128-D embedding which can be fed as input
  to a downstream classification model. The downstream model can be shallower
  than usual because the VGGish embedding is more semantically compact than raw
  audio features.

  So, for example, you could train a classifier for 10 of the AudioSet classes
  by using the released embeddings as features. Then, you could use that
  trained classifier with any arbitrary audio input by running the audio through
  the audio feature extractor and VGGish model provided here, passing the
  resulting embedding features as input to your trained model.
  `vggish_inference_demo.py` shows how to produce VGGish embeddings from
  arbitrary audio; a condensed sketch of that flow follows this list.

* *As part of a larger model*: Here, we treat VGGish as a "warm start" for the
  lower layers of a model that takes audio features as input and adds more
  layers on top of the VGGish embedding. This can be used to fine-tune VGGish
  (or parts thereof) if you have large datasets that might be very different
  from the typical YouTube video clip. `vggish_train_demo.py` shows how to add
  layers on top of VGGish and train the whole model.
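
As a rough illustration of the feature-extractor workflow, here is a minimal
sketch of the inference path, condensed from `vggish_inference_demo.py`. It
assumes the checkpoint and PCA parameter files from the Installation section are
in the working directory, and `example.wav` is a hypothetical input file:

```python
import tensorflow as tf

import vggish_input
import vggish_params
import vggish_slim

# Convert a (hypothetical) WAV file into a batch of [96, 64] log mel examples.
examples = vggish_input.wavfile_to_examples('example.wav')

with tf.Graph().as_default(), tf.Session() as sess:
  # Define the VGGish model in inference mode and load the released checkpoint.
  vggish_slim.define_vggish_slim(training=False)
  vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
  features_tensor = sess.graph.get_tensor_by_name(
      vggish_params.INPUT_TENSOR_NAME)
  embedding_tensor = sess.graph.get_tensor_by_name(
      vggish_params.OUTPUT_TENSOR_NAME)
  # Run inference: one 128-D embedding per 0.96 s example.
  [embeddings] = sess.run([embedding_tensor],
                          feed_dict={features_tensor: examples})

# 'embeddings' is now a [num_examples, 128] array that can be fed to a
# downstream classifier, optionally after the postprocessing described below.
```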

### About the Model

The VGGish code layout is as follows:

* `vggish_slim.py`: Model definition in TensorFlow Slim notation.
* `vggish_params.py`: Hyperparameters.
* `vggish_input.py`: Converter from audio waveform into input examples.
* `mel_features.py`: Audio feature extraction helpers.
* `vggish_postprocess.py`: Embedding postprocessing.
* `vggish_inference_demo.py`: Demo of VGGish in inference mode.
* `vggish_train_demo.py`: Demo of VGGish in training mode.
* `vggish_smoke_test.py`: Simple test of a VGGish installation.

#### Architecture

See `vggish_slim.py` and `vggish_params.py`.

VGGish is a variant of the [VGG](https://arxiv.org/abs/1409.1556) model, in
particular Configuration A with 11 weight layers. Specifically, here are the
changes we made:

* The input size was changed to 96x64 for log mel spectrogram audio inputs.

* We drop the last group of convolutional and maxpool layers, so we now have
  only four groups of convolution/maxpool layers instead of five.

* Instead of a 1000-wide fully connected layer at the end, we use a 128-wide
  fully connected layer. This acts as a compact embedding layer.

The model definition provided here defines layers up to and including the
128-wide embedding layer.
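
For reference, the following sketch approximates that layer stack in TF-Slim
notation. It is only an illustration of the architecture described above, not a
substitute for `vggish_slim.py` (which also sets initializers, variable scopes,
and other details):

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim


def vggish_sketch(features):
  """Approximate VGGish stack for a [batch, 96, 64] log mel input."""
  net = tf.reshape(features, [-1, 96, 64, 1])
  with slim.arg_scope([slim.conv2d],
                      kernel_size=[3, 3], stride=1, padding='SAME',
                      activation_fn=tf.nn.relu), \
       slim.arg_scope([slim.max_pool2d],
                      kernel_size=[2, 2], stride=2, padding='SAME'):
    # Four groups of convolution/maxpool layers (VGG Configuration A minus
    # its final group).
    net = slim.conv2d(net, 64, scope='conv1')
    net = slim.max_pool2d(net, scope='pool1')
    net = slim.conv2d(net, 128, scope='conv2')
    net = slim.max_pool2d(net, scope='pool2')
    net = slim.repeat(net, 2, slim.conv2d, 256, scope='conv3')
    net = slim.max_pool2d(net, scope='pool3')
    net = slim.repeat(net, 2, slim.conv2d, 512, scope='conv4')
    net = slim.max_pool2d(net, scope='pool4')
    net = slim.flatten(net)
    net = slim.repeat(net, 2, slim.fully_connected, 4096, scope='fc1')
    # The 128-wide fully connected layer is the embedding output.
    return slim.fully_connected(net, 128, scope='fc2')
```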

#### Input: Audio Features

See `vggish_input.py` and `mel_features.py`.

VGGish was trained with audio features computed as follows:

* All audio is resampled to 16 kHz mono.
* A spectrogram is computed using magnitudes of the Short-Time Fourier Transform
  with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann
  window.
* A mel spectrogram is computed by mapping the spectrogram to 64 mel bins
  covering the range 125-7500 Hz.
* A stabilized log mel spectrogram is computed by applying
  log(mel-spectrum + 0.01) where the offset is used to avoid taking a logarithm
  of zero.
* These features are then framed into non-overlapping examples of 0.96 seconds,
  where each example covers 64 mel bands and 96 frames of 10 ms each.
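
As a quick check (similar in spirit to what `vggish_smoke_test.py` does), you
can run a synthesized tone through this pipeline and inspect the shape of the
resulting examples; the sine wave below is just a stand-in input:

```python
import numpy as np
import vggish_input

# One second of a 440 Hz tone at 44.1 kHz; waveform_to_examples resamples to
# 16 kHz mono and applies the feature pipeline described above.
sample_rate = 44100
t = np.arange(sample_rate) / float(sample_rate)
waveform = 0.1 * np.sin(2 * np.pi * 440 * t)

examples = vggish_input.waveform_to_examples(waveform, sample_rate)
# Each example is a 0.96 s patch: 96 frames of 10 ms x 64 mel bands.
print(examples.shape)  # (num_examples, 96, 64)
```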

We provide our own NumPy implementation that produces features that are very
similar to those produced by our internal production code. This results in
embedding outputs that closely match the embeddings that we have already
released. Note that these embeddings will *not* be bit-for-bit identical to the
released embeddings due to small differences between the feature computation
code paths, and even between two different installations of VGGish with
different underlying libraries and hardware. However, we expect that the
embeddings will be equivalent in the context of a downstream classification
task.

#### Output: Embeddings

See `vggish_postprocess.py`.

The released AudioSet embeddings were postprocessed before release by applying a
PCA transformation (which performs both PCA and whitening) as well as
quantization to 8 bits per embedding element. This was done to be compatible
with the [YouTube-8M](https://research.google.com/youtube8m) project, which has
released visual and audio embeddings for millions of YouTube videos in the same
PCA/whitened/quantized format.

We provide a Python implementation of the postprocessing, which can be applied
to batches of embeddings produced by VGGish. `vggish_inference_demo.py` shows
how the postprocessor can be run after inference.

If you don't need compatibility with the released embeddings or YouTube-8M, you
can skip postprocessing and use the raw embedding output directly.
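
For illustration, a minimal sketch of running the postprocessor on a batch of
raw embeddings (for example, the `embeddings` array from the Usage sketch above)
might look like this, assuming the PCA parameters file from the Installation
section is in the working directory:

```python
import numpy as np
import vggish_postprocess

# Stand-in for a [num_examples, 128] batch of raw VGGish embeddings, e.g. the
# 'embeddings' array produced in the Usage sketch above.
raw_embeddings = np.random.rand(3, 128).astype(np.float32)

pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
postprocessed = pproc.postprocess(raw_embeddings)
# Same PCA/whitened/8-bit-quantized format as the released embeddings.
print(postprocessed.shape, postprocessed.dtype)  # (3, 128) uint8
```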

### Future Work

Below are some of the things we would like to add to this repository. We
welcome pull requests for these or other enhancements, but please consider
sending an email to the mailing list (see the Contact section) describing what
you plan to do before you invest a lot of time, to get feedback from us and the
rest of the community.

* An AudioSet classifier trained on top of the VGGish embeddings to predict all
  the AudioSet labels. This can act as a baseline for audio research using
  AudioSet.
* Feature extraction implemented within TensorFlow using the upcoming
  [tf.contrib.signal](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/api_guides/python/contrib.signal.md)
  ops.
* A Keras version of the VGGish model definition and checkpoint.
* A Jupyter notebook demonstrating audio feature extraction and model
  performance.

## Contact

For general questions about AudioSet and VGGish, please use the
[[email protected]](https://groups.google.com/forum/#!forum/audioset-users)
mailing list.

For technical problems with the released model and code, please open an issue on
the [tensorflow/models issue tracker](https://github.com/tensorflow/models/issues)
and __*assign to @plakal and @dpwe*__. Please note that because the issue tracker
is shared across all models released by Google, we won't be notified about an
issue unless you explicitly @-mention us (@plakal and @dpwe) or assign the issue
to us.

## Credits

Original authors and reviewers of the code in this package include (in
alphabetical order):

* DAn Ellis
* Shawn Hershey
* Aren Jansen
* Manoj Plakal