
Commit d1173bc

Author: Alexander Gorban (committed)
Merge branch 'master' of https://github.com/tensorflow/models into UpdateReadme
2 parents: e039cfb + 36203f0

File tree: 187 files changed (+18691, -1199 lines)


CODEOWNERS (new file, +36 lines)

adversarial_crypto/* @dave-andersen
adversarial_text/* @rsepassi
attention_ocr/* @alexgorban
audioset/* @plakal @dpwe
autoencoders/* @snurkabill
cognitive_mapping_and_planning/* @s-gupta
compression/* @nmjohn
differential_privacy/* @panyx0718
domain_adaptation/* @bousmalis @ddohan
im2txt/* @cshallue
inception/* @shlens @vincentvanhoucke
learning_to_remember_rare_events/* @lukaszkaiser @ofirnachum
lfads/* @jazcollins @susillo
lm_1b/* @oriolvinyals @panyx0718
namignizer/* @knathanieltucker
neural_gpu/* @lukaszkaiser
neural_programmer/* @arvind2505
next_frame_prediction/* @panyx0718
object_detection/* @jch1 @tombstone @derekjchow @jesu9 @dreamdragon
pcl_rl/* @ofirnachum
ptn/* @xcyan @arkanath @hellojas @honglaklee
real_nvp/* @laurent-dinh
rebar/* @gjtucker
resnet/* @panyx0718
skip_thoughts/* @cshallue
slim/* @sguada @nathansilberman
street/* @theraysmith
swivel/* @waterson
syntaxnet/* @calberti @andorardo
textsum/* @panyx0718 @peterjliu
transformer/* @daviddao
tutorials/embedding/* @zffchen78 @a-dai
tutorials/image/* @sherrym @shlens
tutorials/rnn/* @lukaszkaiser @ebrevdo
video_prediction/* @cbfinn
CONTRIBUTING.md (+1, -1)

@@ -1,7 +1,7 @@
  # Contributing guidelines

  If you have created a model and would like to publish it here, please send us a
- pull request. For those just getting started with pull reuests, GitHub has a
+ pull request. For those just getting started with pull requests, GitHub has a
  [howto](https://help.github.com/articles/using-pull-requests/).

  The code for any model in this repository is licensed under the Apache License

README.md (+2)

@@ -14,6 +14,7 @@ running TensorFlow 0.12 or earlier, please
  - [adversarial_crypto](adversarial_crypto): protecting communications with adversarial neural cryptography.
  - [adversarial_text](adversarial_text): semi-supervised sequence learning with adversarial training.
  - [attention_ocr](attention_ocr): a model for real-world image text extraction.
+ - [audioset](audioset): models and supporting code for use with [AudioSet](http://g.co/audioset).
  - [autoencoder](autoencoder): various autoencoders.
  - [cognitive_mapping_and_planning](cognitive_mapping_and_planning): implementation of a spatial memory based mapping and planning architecture for visual navigation.
  - [compression](compression): compressing and decompressing images using a pre-trained Residual GRU network.

@@ -30,6 +31,7 @@ running TensorFlow 0.12 or earlier, please
  - [next_frame_prediction](next_frame_prediction): probabilistic future frame synthesis via cross convolutional networks.
  - [object_detection](object_detection): localizing and identifying multiple objects in a single image.
  - [real_nvp](real_nvp): density estimation using real-valued non-volume preserving (real NVP) transformations.
+ - [rebar](rebar): low-variance, unbiased gradient estimates for discrete latent variable models.
  - [resnet](resnet): deep and wide residual networks.
  - [skip_thoughts](skip_thoughts): recurrent neural network sentence-to-vector encoder.
  - [slim](slim): image classification models in TF-Slim.

attention_ocr/README.md (+2, -2)

@@ -28,7 +28,7 @@ Pull requests:
  virtualenv --system-site-packages ~/.tensorflow
  source ~/.tensorflow/bin/activate
  pip install --upgrade pip
- pip install --upgrade tensorflow_gpu
+ pip install --upgrade tensorflow-gpu
  ```

  2. At least 158GB of free disk space to download the FSNS dataset:

@@ -65,7 +65,7 @@ To train a model using pre-trained Inception weights as initialization:
  ```
  wget http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz
  tar xf inception_v3_2016_08_28.tar.gz
- python train.py --checkpoint_inception=inception_v3.ckpt
+ python train.py --checkpoint_inception=./inception_v3.ckpt
  ```

  To fine tune the Attention OCR model using a checkpoint:

audioset/README.md (new file, +249 lines)

# Models for AudioSet: A Large Scale Dataset of Audio Events

This repository provides models and supporting code associated with
[AudioSet](http://g.co/audioset), a dataset of over 2 million human-labeled
10-second YouTube video soundtracks, with labels taken from an ontology of
more than 600 audio event classes.

AudioSet was
[released](https://research.googleblog.com/2017/03/announcing-audioset-dataset-for-audio.html)
in March 2017 by Google's Sound Understanding team to provide a common
large-scale evaluation task for audio event detection as well as a starting
point for a comprehensive vocabulary of sound events.

For more details about AudioSet and the various models we have trained, please
visit the [AudioSet website](http://g.co/audioset) and read our papers:

* Gemmeke, J. et al.,
  [AudioSet: An ontology and human-labelled dataset for audio events](https://research.google.com/pubs/pub45857.html),
  ICASSP 2017

* Hershey, S. et al.,
  [CNN Architectures for Large-Scale Audio Classification](https://research.google.com/pubs/pub45611.html),
  ICASSP 2017

If you use the pre-trained VGGish model in your published research, we ask that
you cite [CNN Architectures for Large-Scale Audio Classification](https://research.google.com/pubs/pub45611.html).
If you use the AudioSet dataset or the released 128-D embeddings of AudioSet
segments, please cite
[AudioSet: An ontology and human-labelled dataset for audio events](https://research.google.com/pubs/pub45857.html).

## VGGish

The initial AudioSet release included 128-dimensional embeddings of each
AudioSet segment, produced from a VGG-like audio classification model that was
trained on a large YouTube dataset (a preliminary version of what later became
[YouTube-8M](https://research.google.com/youtube8m)).

We provide a TensorFlow definition of this model, which we call __*VGGish*__, as
well as supporting code to extract input features for the model from audio
waveforms and to post-process the model embedding output into the same format as
the released embedding features.
### Installation

VGGish depends on the following Python packages:

* [`numpy`](http://www.numpy.org/)
* [`scipy`](http://www.scipy.org/)
* [`resampy`](http://resampy.readthedocs.io/en/latest/)
* [`tensorflow`](http://www.tensorflow.org/)
* [`six`](https://pythonhosted.org/six/)

These are all easily installable via, e.g., `pip install numpy` (as in the
example command sequence below).

Any reasonably recent version of these packages should work. TensorFlow should
be at least version 1.0. We have tested with Python 2.7.6 and 3.4.3 on an
Ubuntu-like system with NumPy v1.13.1, SciPy v0.19.1, resampy v0.1.5, TensorFlow
v1.2.1, and Six v1.10.0.

VGGish also requires downloading two data files:

* [VGGish model checkpoint](https://storage.googleapis.com/audioset/vggish_model.ckpt),
  in TensorFlow checkpoint format.
* [Embedding PCA parameters](https://storage.googleapis.com/audioset/vggish_pca_params.npz),
  in NumPy compressed archive format.

After downloading these files into the same directory as this README, the
installation can be tested by running `python vggish_smoke_test.py`, which
runs a known signal through the model and checks the output.

Here's a sample installation and test session:

```shell
# You can optionally install and test VGGish within a Python virtualenv, which
# is useful for isolating changes from the rest of your system. For example, you
# may have an existing version of some packages that you do not want to upgrade,
# or you want to try Python 3 instead of Python 2. If you decide to use a
# virtualenv, you can create one by running
#   $ virtualenv vggish          # For Python 2
# or
#   $ python3 -m venv vggish     # For Python 3
# and then enter the virtual environment by running
#   $ source vggish/bin/activate # Assuming you use bash
# Leave the virtual environment at the end of the session by running
#   $ deactivate

# Upgrade pip first.
$ python -m pip install --upgrade pip

# Install dependencies. Resampy needs to be installed after NumPy and SciPy
# are already installed.
$ pip install numpy scipy
$ pip install resampy tensorflow six

# Clone the TensorFlow models repo into a 'models' directory.
$ git clone https://github.com/tensorflow/models.git
$ cd models/audioset
# Download data files into the same directory as the code.
$ curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
$ curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

# Installation ready, let's test it.
$ python vggish_smoke_test.py
# If we see "Looks Good To Me", then we're all set.
```
### Usage

VGGish can be used in two ways:

* *As a feature extractor*: VGGish converts audio input features into a
  semantically meaningful, high-level 128-D embedding which can be fed as input
  to a downstream classification model. The downstream model can be shallower
  than usual because the VGGish embedding is more semantically compact than raw
  audio features.

  So, for example, you could train a classifier for 10 of the AudioSet classes
  by using the released embeddings as features. Then, you could use that
  trained classifier with any arbitrary audio input by running the audio through
  the audio feature extractor and VGGish model provided here, passing the
  resulting embedding features as input to your trained model.
  `vggish_inference_demo.py` shows how to produce VGGish embeddings from
  arbitrary audio; a condensed sketch of that flow appears after this list.

* *As part of a larger model*: Here, we treat VGGish as a "warm start" for the
  lower layers of a model that takes audio features as input and adds more
  layers on top of the VGGish embedding. This can be used to fine-tune VGGish
  (or parts thereof) if you have large datasets that might be very different
  from the typical YouTube video clip. `vggish_train_demo.py` shows how to add
  layers on top of VGGish and train the whole model.
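For orientation, here is a minimal sketch of the feature-extractor flow, condensed from what `vggish_inference_demo.py` does. It assumes the checkpoint and PCA parameter files from the Installation section are in the current directory, and that you run it from this directory so the modules listed under "About the Model" are importable; the WAV filename is a placeholder. See the demo script itself for the authoritative version.

```python
import tensorflow as tf

import vggish_input
import vggish_params
import vggish_postprocess
import vggish_slim

# Convert a WAV file into a batch of log mel spectrogram examples.
examples_batch = vggish_input.wavfile_to_examples('some_audio.wav')  # placeholder path

with tf.Graph().as_default(), tf.Session() as sess:
  # Define VGGish in inference mode and load the released checkpoint.
  vggish_slim.define_vggish_slim(training=False)
  vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')

  features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
  embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)

  # Run inference: one 128-D embedding per 0.96 s example.
  [embedding_batch] = sess.run([embedding_tensor],
                               feed_dict={features_tensor: examples_batch})

# Optionally post-process into the same PCA'd/quantized format as the
# released AudioSet embeddings.
pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
postprocessed_batch = pproc.postprocess(embedding_batch)
```

The `postprocessed_batch` can then be fed to a downstream classifier exactly as you would feed the released AudioSet embedding features.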
### About the Model

The VGGish code layout is as follows:

* `vggish_slim.py`: Model definition in TensorFlow Slim notation.
* `vggish_params.py`: Hyperparameters.
* `vggish_input.py`: Converter from audio waveform into input examples.
* `mel_features.py`: Audio feature extraction helpers.
* `vggish_postprocess.py`: Embedding postprocessing.
* `vggish_inference_demo.py`: Demo of VGGish in inference mode.
* `vggish_train_demo.py`: Demo of VGGish in training mode.
* `vggish_smoke_test.py`: Simple test of a VGGish installation.
#### Architecture

See `vggish_slim.py` and `vggish_params.py`.

VGGish is a variant of the [VGG](https://arxiv.org/abs/1409.1556) model, in
particular Configuration A with 11 weight layers. Specifically, here are the
changes we made:

* The input size was changed to 96x64 for log mel spectrogram audio inputs.

* We drop the last group of convolutional and maxpool layers, so we now have
  only four groups of convolution/maxpool layers instead of five.

* Instead of a 1000-wide fully connected layer at the end, we use a 128-wide
  fully connected layer. This acts as a compact embedding layer.

The model definition provided here defines layers up to and including the
128-wide embedding layer.
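To make the changes above concrete, here is a hedged sketch of that topology in TF-Slim (TF1-era `tf.contrib.slim`, as used elsewhere in this directory). The layer widths follow standard VGG Configuration A minus the last conv/pool group, and the layer names are illustrative assumptions; `vggish_slim.py` is the real definition.

```python
import tensorflow as tf

slim = tf.contrib.slim

def vggish_like(inputs):
  """Sketch of a VGGish-like network: inputs are [batch, 96, 64] log mel patches."""
  net = tf.reshape(inputs, [-1, 96, 64, 1])
  # Four conv/maxpool groups instead of VGG-A's five.
  net = slim.conv2d(net, 64, [3, 3], scope='conv1')
  net = slim.max_pool2d(net, [2, 2], scope='pool1')
  net = slim.conv2d(net, 128, [3, 3], scope='conv2')
  net = slim.max_pool2d(net, [2, 2], scope='pool2')
  net = slim.repeat(net, 2, slim.conv2d, 256, [3, 3], scope='conv3')
  net = slim.max_pool2d(net, [2, 2], scope='pool3')
  net = slim.repeat(net, 2, slim.conv2d, 512, [3, 3], scope='conv4')
  net = slim.max_pool2d(net, [2, 2], scope='pool4')
  # Fully connected layers, ending in the 128-wide embedding instead of a
  # 1000-wide classifier.
  net = slim.flatten(net)
  net = slim.repeat(net, 2, slim.fully_connected, 4096, scope='fc1')
  return slim.fully_connected(net, 128, scope='fc2')
```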
#### Input: Audio Features

See `vggish_input.py` and `mel_features.py`.

VGGish was trained with audio features computed as follows:

* All audio is resampled to 16 kHz mono.
* A spectrogram is computed using magnitudes of the Short-Time Fourier Transform
  with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann
  window.
* A mel spectrogram is computed by mapping the spectrogram to 64 mel bins
  covering the range 125-7500 Hz.
* A stabilized log mel spectrogram is computed by applying
  log(mel-spectrum + 0.01), where the offset is used to avoid taking a logarithm
  of zero.
* These features are then framed into non-overlapping examples of 0.96 seconds,
  where each example covers 64 mel bands and 96 frames of 10 ms each.
We provide our own NumPy implementation that produces features that are very
similar to those produced by our internal production code. This results in
embedding outputs that closely match the embeddings that we have already
released. Note that these embeddings will *not* be bit-for-bit identical to the
released embeddings due to small differences between the feature computation
code paths, and even between two different installations of VGGish with
different underlying libraries and hardware. However, we expect that the
embeddings will be equivalent in the context of a downstream classification
task.
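As a rough illustration of the resulting shapes, here is a small sketch that feeds a synthetic one-second sine wave through `vggish_input.waveform_to_examples` from this directory; the test signal and the printed shape are illustrative assumptions rather than guaranteed output.

```python
import numpy as np

import vggish_input

# One second of a 440 Hz tone, 16 kHz mono, samples in [-1.0, +1.0].
sample_rate = 16000
t = np.arange(sample_rate) / float(sample_rate)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

# Frame into log mel spectrogram examples: each example covers 0.96 s,
# i.e. 96 frames (10 ms hop) x 64 mel bands.
examples = vggish_input.waveform_to_examples(waveform, sample_rate)
print(examples.shape)  # Expect something like (1, 96, 64).
```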
#### Output: Embeddings

See `vggish_postprocess.py`.

The released AudioSet embeddings were postprocessed before release by applying a
PCA transformation (which performs both PCA and whitening) as well as
quantization to 8 bits per embedding element. This was done to be compatible
with the [YouTube-8M](https://research.google.com/youtube8m) project, which has
released visual and audio embeddings for millions of YouTube videos in the same
PCA/whitened/quantized format.

We provide a Python implementation of the postprocessing which can be applied to
batches of embeddings produced by VGGish. `vggish_inference_demo.py` shows how
the postprocessor can be run after inference.

If you don't need to use the released embeddings or YouTube-8M, then you could
skip postprocessing and use raw embeddings.
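For reference, applying the postprocessor on its own looks roughly like the sketch below. It assumes `vggish_pca_params.npz` has been downloaded as described in the Installation section, and uses a random array as a stand-in for a batch of raw VGGish embeddings.

```python
import numpy as np

import vggish_postprocess

# Stand-in for a batch of three raw 128-D float embeddings from VGGish.
raw_embeddings = np.random.rand(3, 128).astype(np.float32)

pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
quantized = pproc.postprocess(raw_embeddings)
# quantized is PCA'd, whitened, and quantized to 8 bits per element,
# matching the format of the released AudioSet embeddings.
```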
### Future Work

Below are some of the things we would like to add to this repository. We
welcome pull requests for these or other enhancements, but please consider
sending an email to the mailing list (see the Contact section) describing what
you plan to do before you invest a lot of time, to get feedback from us and the
rest of the community.

* An AudioSet classifier trained on top of the VGGish embeddings to predict all
  the AudioSet labels. This can act as a baseline for audio research using
  AudioSet.
* Feature extraction implemented within TensorFlow using the upcoming
  [tf.contrib.signal](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/api_guides/python/contrib.signal.md)
  ops.
* A Keras version of the VGGish model definition and checkpoint.
* A Jupyter notebook demonstrating audio feature extraction and model performance.
## Contact

For general questions about AudioSet and VGGish, please use the
[audioset-users](https://groups.google.com/forum/#!forum/audioset-users)
mailing list.

For technical problems with the released model and code, please open an issue on
the [tensorflow/models issue tracker](https://github.com/tensorflow/models/issues)
and __*assign it to @plakal and @dpwe*__. Please note that because the issue tracker
is shared across all models released by Google, we won't be notified about an
issue unless you explicitly @-mention us (@plakal and @dpwe) or assign the issue
to us.
## Credits

Original authors and reviewers of the code in this package include (in
alphabetical order):

* DAn Ellis
* Shawn Hershey
* Aren Jansen
* Manoj Plakal
