This repository contains the baseline system for the DCASE 2022 challenge task 6A on audio captioning.
The main model is a transformer encoder-decoder that autoregressively models captions conditioned on VGGish embeddings.
For more information, please refer to the corresponding DCASE subtask page.
The first step in running the baseline system is to clone this repository on your computer:
$ git clone git@github.com:felixgontier/dcase-2022-baseline.git
This operation will create a dcase-2022-baseline directory at the current location, with the contents of this repository. The dcase-2022-baseline directory will be referred to as the root directory in the rest of this readme.
Next, a recent version of PyTorch is required to run the baseline.
Note: The baseline system is developed with Python 3.7, PyTorch 1.7.1 and CUDA 10.1. Please refer to the PyTorch setup guide for PyTorch/CUDA compatibility information.
Other required packages can be installed using pip by running the following commands in the root directory:
$ python3.7 -m venv env/ # Optionally create a virtual environment
$ source env/bin/activate # Activate the virtual environment, if one was created
$ pip install -r requirements_pip.txt
Lastly, the caption-evaluation-tools repository is needed for evaluation.
- Download and extract the repository in the baseline root directory.
- Download the Stanford models by running:
$ cd coco_caption
$ ./get_stanford_models.sh
Note that the caption evaluation tools require that Java is installed and enabled.
The Clotho v2.1 dataset can be found on Zenodo. The test set (without captions) is available separately.
After downloading all .7z archives and .csv caption files from both repositories, the audio files should be extracted in the data directory.
Specifically, the directory structure should be as follows from the baseline root directory:
data/
| - clotho_v2/
| | - development/
| | | - *.wav
| | - validation/
| | | - *.wav
| | - evaluation/
| | | - *.wav
| | - test/
| | | - *.wav
| | - clotho_captions_development.csv
| | - clotho_captions_validation.csv
| | - clotho_captions_evaluation.csv
Pre-processing operations are implemented in clotho_preprocessing.py. The pre-processing utilities are also available as a standalone package.
Dataset preparation is done by running the following command:
$ python clotho_preprocessing.py --cfg dcb_data
The script outputs a <file_name>_<caption_id>.npy file for each ground truth caption of each audio file in the dataset. Each output file contains a Numpy record array with the following fields (an example of reading one file is shown after the list):
- file_name: Name of the source audio file.
- vggish_embeddings: VGGish embeddings extracted for 1 s audio frames with a 1 s interval.
- caption: The corresponding caption, with all punctuation removed and all lowercase.
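For illustration, one such file can be inspected with Numpy, as in the minimal sketch below (the file name is hypothetical, and allow_pickle is assumed to be needed for the string fields):
import numpy as np

# Load one pre-processed example (hypothetical file name).
rec = np.load('data/clotho_v2_vggish/development/example_file_1.npy', allow_pickle=True)

print(rec['file_name'])          # name of the source .wav file
print(rec['vggish_embeddings'])  # 128-dimensional embeddings, one per 1 s frame
print(rec['caption'])            # lowercase caption without punctuation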
Output directories follow the same structure as the inputs:
data/
| - clotho_v2_vggish/
| | - development/
| | | - *.npy
| | - validation/
| | | - *.npy
| | - evaluation/
| | | - *.npy
| | - test/
| | | - *.npy
Data pre-processing relies on settings in the data_settings/dcb_data.yaml file:
data:
root_path: 'data'
input_path: 'clotho_v2'
output_path: 'clotho_v2_vggish'
splits:
- development
- validation
- evaluation
- test
The settings are the following:
- root_path (str): Path to the root data directory.
- input_path (str): Sub-path of root_path with unprocessed data.
- output_path (str): Sub-path of root_path where pre-processed data should be saved. If it does not exist, the directory will be created.
- splits (list(str)): Data splits, each corresponding to a sub-directory of input_path and output_path.
Experiment settings are defined in a YAML file located in the exp_settings directory. The dcb.yaml file contains the parameters used to produce the reported baseline results. Specific settings are detailed below.
To run an experiment according to an <exp_name>.yaml settings file, use the following command:
$ python main.py --exp <exp_name>
After training, model weights are saved to the outputs/<exp_name>_out/ directory.
To evaluate or run inference with the provided pre-trained weights (see the settings excerpt after these steps):
- Download the pre-trained weights.
- In exp_settings/dcb.yaml, change the lm/eval_model setting to /path/to/dcase_baseline_pre_trained.bin, with the correct path to the downloaded file.
- Set workflow/train and workflow/validate to false, and workflow/evaluate and/or workflow/infer to true.
- Run the evaluation and/or inference:
$ python main.py --exp dcb
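The relevant excerpt of exp_settings/dcb.yaml for evaluation-only with the pre-trained weights would then look as follows (a sketch; the weights path is a placeholder):
lm:
  eval_model: /path/to/dcase_baseline_pre_trained.bin
workflow:
  train: false
  validate: false
  evaluate: true
  infer: false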
The experiment settings described in the exp_settings/dcb.yaml file are:
adapt:
audio_emb_size: 128
nb_layers: 1
data:
root_dir: data
features_dir: clotho_v2_vggish
input_field_name: vggish_embeddings
output_field_name: caption
max_audio_len: 32
max_caption_tok_len: 64
lm:
config: # Model parameters
activation_dropout: 0.1
activation_function: 'gelu'
attention_dropout: 0.1
classifier_dropout: 0.0
d_model: 768
decoder_attention_heads: 12
decoder_ffn_dim: 3072
decoder_layers: 6
dropout: 0.1
encoder_attention_heads: 12
encoder_ffn_dim: 3072
encoder_layers: 6
vocab_size: 50265
generation: # Generation parameters
early_stopping: true
no_repeat_ngram_size: 3
num_beams: 4
min_length: 5
max_length: 100
length_penalty: 1.0
decoding: beam
eval_model: best
eval_checkpoint: null
freeze:
all: false
attn: false
dec: false
dec_attn: false
dec_mlp: false
dec_self_attn: false
enc: false
enc_attn: false
enc_mlp: false
mlp: false
tokenizer: facebook/bart-base
pretrained: null
training:
eval_steps: 1000
force_cpu: false
batch_size: 4
gradient_accumulation_steps: 2
num_workers: 8
lr: 1.0e-05
nb_epochs: 20
save_steps: 1000
seed: 0
workflow:
train: true
validate: true
evaluate: true
infer: false
The adapt block defines a small adaptation network placed before the transformer encoder. Its aim is to adjust the dimension of the audio features to that of the transformer (the lm/config/d_model setting).
- audio_emb_size (int): Dimension of the audio features, i.e. the input dimension of the adaptation network. For VGGish embeddings, this is set to 128.
- nb_layers (int): Number of layers in the network. If set to 0, the dimension of the audio features must already equal that of the transformer. Otherwise, the network contains nb_layers dense layers with output dimension lm/config/d_model and ReLU activations; the last layer has no activation function (see the sketch after this list).
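A minimal PyTorch sketch of such an adaptation network, assuming it is a plain stack of dense layers as described above (not necessarily the exact baseline implementation):
import torch.nn as nn

def build_adaptation_network(audio_emb_size=128, d_model=768, nb_layers=1):
    # Stack nb_layers dense layers mapping audio_emb_size to d_model,
    # with ReLU activations between layers and none after the last one.
    layers = []
    in_dim = audio_emb_size
    for i in range(nb_layers):
        layers.append(nn.Linear(in_dim, d_model))
        if i < nb_layers - 1:
            layers.append(nn.ReLU())
        in_dim = d_model
    return nn.Sequential(*layers)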
The data block contains settings related to the dataset.
- root_dir (str): Path to the data root directory.
- features_dir (str): Subdirectory of root_dir where the current dataset is located.
- input_field_name (str): Name of the input field in the Numpy record arrays of data examples.
- output_field_name (str): Name of the output field in the Numpy record arrays of data examples.
- max_audio_len and max_caption_tok_len (int): The data loader pads each example's audio and tokenized caption to a fixed length for batching (see the sketch after this list). The provided values are adapted to the VGGish representation and BART tokenization of the baseline.
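As an illustration of the padding behaviour (an assumption about the data loader, not its exact code), audio features could be padded or truncated to max_audio_len frames as follows:
import torch

def pad_or_truncate_audio(emb, max_audio_len=32):
    # emb: tensor of shape (n_frames, audio_emb_size), e.g. VGGish embeddings.
    n_frames, emb_size = emb.shape
    if n_frames >= max_audio_len:
        return emb[:max_audio_len]
    padding = torch.zeros(max_audio_len - n_frames, emb_size, dtype=emb.dtype)
    return torch.cat([emb, padding], dim=0)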
The lm block contains settings related to both the encoder and decoder of the main transformer model, which is derived from BART.
The config sub-block details the model, as per the HuggingFace BART configuration. The provided settings replicate the bart-base model configuration.
Note: The vocab_size parameter depends on the pre-trained tokenizer defined by lm/tokenizer.
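A quick way to check this consistency (assuming the HuggingFace transformers package from the requirements is installed):
from transformers import AutoTokenizer

# The tokenizer named in lm/tokenizer determines the required vocabulary size.
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')
print(tokenizer.vocab_size)  # 50265, matching lm/config/vocab_size above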
The generation sub-block provides generation-specific settings (see the HuggingFace Generation documentation), illustrated below:
- decoding (str): Either beam or greedy decoding is supported.
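These settings map directly onto HuggingFace's generate() method; the following sketch shows this with a text placeholder input (the baseline instead feeds adapted audio features to the encoder):
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

input_ids = tokenizer('placeholder encoder input', return_tensors='pt').input_ids
output_ids = model.generate(
    input_ids,
    num_beams=4,             # beam decoding; num_beams=1 corresponds to greedy
    min_length=5,
    max_length=100,
    no_repeat_ngram_size=3,
    length_penalty=1.0,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))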
The freeze sub-block enables freezing different components of the transformer (attention, MLP, self-attention or cross-attention).
Other parameters are:
- eval_model (str): Model selection at evaluation/inference. best selects the best model according to validation loss during training, while checkpoint uses the specific checkpoint set by eval_checkpoint. This setting can also be set to /path/to/model.bin for custom trained model weights, e.g. the provided pre-trained weights.
- eval_checkpoint (int): Model checkpoint to use at evaluation/inference. This is ignored unless eval_model is set to checkpoint.
- tokenizer (str): Name of the HuggingFace pre-trained tokenizer.
- pretrained (str, null): If not null, name of a HuggingFace pre-trained model (e.g. facebook/bart-base). Note that this bypasses all config sub-block settings.
The training block describes parameters of the training process.
- eval_steps (int): Frequency of model validation, in training steps.
- save_steps (int): Frequency of model weight saving, in training steps. If lm/eval_model is set to best, this should be a factor of eval_steps.
- force_cpu (bool): Force all computations on the CPU, even when CUDA is available.
- batch_size (int): Batch size during model training and validation.
- gradient_accumulation_steps (int): Accumulates gradients over several steps, effectively increasing the batch size without additional memory cost; with the settings above, each optimizer update effectively uses 4 × 2 = 8 examples (see the sketch after this list). Gradient accumulation is disabled if this is set to 1.
- num_workers (int): Number of CPU workers for data loading.
- lr (float): Learning rate during training.
- nb_epochs (int): Number of training epochs.
- seed (int, null): Sets a specific torch random seed before experiments. Note that this does not ensure reproducibility when training on a GPU.
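The gradient accumulation behaviour corresponds to the following self-contained sketch (stand-in model, loss and data, not the baseline's actual training loop):
import torch
from torch import nn

model = nn.Linear(128, 768)                  # stand-in for the captioning model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0e-05)
accumulation_steps = 2                       # gradient_accumulation_steps

optimizer.zero_grad()
for step in range(8):                        # stand-in for the data loader
    batch = torch.randn(4, 128)              # batch_size examples per step
    loss = model(batch).pow(2).mean()        # stand-in loss
    (loss / accumulation_steps).backward()   # scale so accumulated gradients average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()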
The workflow block sets the operations to be conducted in the experiment.
- train performs optimization with the data in the </path/to/data>/development directory, where </path/to/data> is the concatenation of the data/root_dir and data/features_dir settings.
- validate must be set to true during training if lm/eval_model is set to best. Validation is done on the data in the </path/to/data>/validation directory.
- evaluate refers to evaluation with metrics, and outputs metrics_coco_<decoding_method>.json and generated_captions_<decoding_method>.txt files in the output/<exp_name>_out directory, where <decoding_method> is the lm/generation/decoding setting. Evaluation is done on the data in the </path/to/data>/evaluation directory.
- infer refers to caption generation without computing metrics. Inference outputs a submission-ready test_output_captions_<decoding_method>.csv file (see the example after this list). Inference is performed on the data in the </path/to/data>/test directory.
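For example, to only generate a submission file on the test split with the default beam decoding (producing test_output_captions_beam.csv), the workflow block could be set as follows, with lm/eval_model pointing to trained weights as described above:
workflow:
  train: false
  validate: false
  evaluate: false
  infer: true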