Human activity and climate change have been placing ever-increasing pressure on biodiversity. A common problem of interest in conservation biology and ecology research is detecting the presence of a wildlife species in a region. Acoustic monitoring, a technique that uses electronic recording devices to capture animal sounds, is seeing growing use in this context. It facilitates collection of wildlife data in a non-invasive manner, continuously and over large areas, while avoiding the heavy cost of employing people for manual surveys. However, the large-scale data thus generated requires intensive analysis using advanced machine learning or deep learning algorithms.
I have done this work on data collected by Juliana Velez (julianav), a PhD student in the Fieberg Lab (PI: John Fieberg (jfieberg), Department of Fisheries, Wildlife, and Conservation Biology) at the University of Minnesota - Twin Cities. I have used deep neural network (DNN) models to classify audio files as having cattle sounds present or absent. The tapir is one of the wildlife species (classified as "Endangered" by the IUCN in 1996) that Juliana has collected data for. Owing to insufficient data for tapir, I have also performed data augmentation - generating new data from an existing base dataset - both with and without DNNs. Though I do not train a model for detecting tapir in this repository, this will be done subsequently, and the generated data will be useful for that.
All coding in this repository has been performed in Python. Data augmentation has been performed using the SciPy and PyDub libraries. I have also written a variational autoencoder, a state-of-the-art deep learning algorithm, for pre-processing/augmenting tapir data using the PyTorch framework. For the supervised classification task, a convolutional neural network (CNN) has been used. To this end, I made use of the OpenSoundscape framework, which comes as an open-source library and uses PyTorch under the hood.
Acoustic data has been captured using AudioMoth devices. Each measures
There are millions of audio files, capturing wildlife as well as various disturbances such as domestic animals (cattle, dogs) and gunshots. AudioMoth devices were configured to create
In this repository, I have summarised the data in CSV files. The sub-directories cattle_pres and cattle_abs under the Cattle directory contain the metadata for labelled audio files. Tapir metadata is under its corresponding directory.
- matplotlib library: Module matplotlib.pyplot for creating, displaying, and saving figures/colour plots.
- NumPy package: For transforming the data in an acoustic signal into a numpy array, which can be used for further processing.
- OpenSoundscape framework: The corresponding library, used for coding CNN models for species classification on bio-acoustic data.
- pathlib module: Classes Path and PurePath for working with files and directories.
- pandas package: For working with dataframes.
- PyDub library: Used in data augmentation. pydub.AudioSegment for extracting samples and metadata from a sound file, splitting an audio file into parts, and merging various files together.
- python-csv library: For manipulating CSV files.
- SciPy library: Module scipy.signal for creating a spectrogram from an audio file.
- torch library: Classes and sub-modules of torch.nn used for writing the variational autoencoder; part of the PyTorch framework.
- torchvision library: Classes of the datasets and transforms modules for transforming the dataset into a form accessible to PyTorch.
The broad goal of the project is to understand the interaction between wildlife and domestic animals such as cattle and dogs, while also exploring the effect of poaching. The machine learning problem in this project pertains to identifying animal species in the acoustic data collected as explained earlier. Since deep learning algorithms require a significant amount of training data, there is also scope for generating synthetic data points for species for which there are insufficient audio clips.
Ideally, one would expect the model to identify all the species in the given data. However, for simplicity, the classification model in this repository is restricted to the presence/absence of one species, which is a binary classification problem. For this purpose, I have focused on cattle. Accordingly, the model needs to be trained with a dataset containing records for both the presence and absence of cattle. Here, a
A common approach to deep learning with sound data (e.g., in speech recognition) is to generate a spectrogram, which is a diagram showing how frequency varies with time (i.e., a frequency vs time plot) in an acoustic signal. It represents different frequencies in different colours, signifying the amplitude (or loudness) of each frequency in the signal. For the problems in this repository, though the goal is not to identify a sequence of letters from a sound (as in a speech recognition task), spectrograms offer a very useful starting point.
To obtain a spectrogram, an acoustic signal is first divided into a number of segments, and each segment is Fourier transformed. (Fourier transformation is a procedure to decompose a waveform into a linear combination of its constituent frequencies.)
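For reference (a standard textbook form, not specific to this repository), the continuous Fourier transform of a signal $x(t)$ is

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-2\pi i f t}\, dt,$$

and in practice a discrete, windowed (short-time) version of this transform is computed on each segment to build the spectrogram.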
It turns out that humans perceive only a limited range of frequencies, because of which a linear frequency vs time plot is not very informative. Further, our perception of the difference between two sounds is not in terms of the difference between their frequencies, but in terms of the logarithm of their ratio. To exemplify, the distinction between two sounds with frequencies
A mel spectrogram is most commonly generated using the librosa (through the librosa.feature.melspectrogram() method) or torchaudio (through the torchaudio.transforms.MelSpectrogram class) libraries. For the purpose of synthetic audio signal generation, spectrograms are only used for specifying the temporal location of the tapir call. Accordingly, I have created spectrograms depicting logarithmic frequency against time. I have achieved this using the spectrogram() method of the scipy.signal module. This method takes as an argument the numpy array created from the AudioSegment object for the corresponding audio file. An example can be seen in the figure below, where the frequency is in
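A minimal sketch of this step is shown below. The file name is a placeholder, and the plotting details (decibel scaling, symlog frequency axis) are my choices for illustration rather than the repository's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt
from pydub import AudioSegment
from scipy.signal import spectrogram

# Placeholder file name; any AudioMoth WAV clip would work here.
clip = AudioSegment.from_file("tapir_clip.WAV")
samples = np.array(clip.get_array_of_samples(), dtype=np.float32)
if clip.channels > 1:                          # collapse to mono if needed
    samples = samples.reshape(-1, clip.channels).mean(axis=1)

# Frequencies, times, and power spectral density of each segment.
f, t, Sxx = spectrogram(samples, fs=clip.frame_rate)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
plt.yscale("symlog")                           # logarithmic frequency axis
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.savefig("spectrogram.png")
```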
As stated before, tapir data will be relevant for future work involving ML modelling to identify tapir presence/absence in audio. Since there were only four audio files from the study site, containing five tapir calls in all, this was certainly insufficient to train the model. Further, the five tapir sounds I had were not all distinct. From the literature, I learnt that there are
The Cali Zoo has one tapir enclosure, and the data is collected using three AudioMoth devices. Though these are placed at different locations, their range is long enough to capture tapir sounds from the enclosure. Each of them records and saves audio clips at various times of the day. The clips recorded by two or more AudioMoths at a particular time of day contain the same tapir sound, but with different backgrounds on account of the location of each AudioMoth device. While this brings in much more tapir data, for the purposes of the representativeness problem discussed in the previous paragraph, the number of relevant new clips is just the 'union' of the clips from the three AudioMoths (each clip used exactly once). This increased the number of tapir calls in the base dataset (which is used to generate synthetic data) to
Juliana was recommended by the OpenSoundscape developers to use
First, I describe the procedure for generating synthetic audio clips. I have done this using what I would like to call the 'controlled' method. (The choice of this terminology will become clear soon.) To understand the mechanism, consider the spectrogram shown next.
Spectrogram of an audio clip showing $log(freq)$ vs $time$. There is a tapir call between $7s$ and $8s$
The spike in frequency, seen fairly localised between
- In the first approach, I generate records having complete silence in the background of the tapir sound. To understand the rationale for this, consider a hypothetical classification model trained with audio clips each having at least one tapir sound. During the training phase, the model will need to isolate the tapir call out of the $5$-second-long clip. Since the training data for tapir presence is sparse, it seems sensible to have complete silence in the clip other than at the instant of the tapir call. In a way, this makes it convenient for the model to isolate the tapir call from the background.
- In the second approach, I use the guiding principle that the subsequent classification model for detecting tapir absence/presence should be trained with realistic data, like what would be encountered during the testing phase. Different test files would obviously have various types of background noises in various combinations and orders. Accordingly, I generate new $5 ~ sec$ clips containing one tapir sound with forest noises in the background (see the sketch after this list). One way to make this project more relevant to the larger community would be to use background noises from different landscapes behind the tapir calls. However, given the small amount of base data, I have decided to stick with the backgrounds relevant to this project, which, as mentioned before, could be other animal/bird calls, leaves rustling, twigs snapping as creatures move around, et cetera. This still leaves the following question: how to choose the background forest noises of duration $(5 - \text{length of tapir call})$ seconds from the $10 ~ sec$ long clip? There are indeed various ways; the best and simplest one, in my opinion, is the following. Use the background from either the first $5 ~ sec$ or the last $5 ~ sec$ chunk of the $10 ~ sec$ raw clip, depending on which one the tapir call happens to be in. To exemplify, if the original $10$-second audio clip has two tapir calls - one between $3s$ and $4s$ and another between $6s$ and $7s$ - then I use the first tapir sound and the first $5$ seconds of the audio to generate $5$ clips, with the tapir sound starting at locations $0s$, $1s$, $2s$, $3s$, $4s$. And using the tapir sound between $6s$ and $7s$ in the raw clip, I use the section of the audio from $6$ to $10 ~ sec$ to create another $5$ clips.
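Here is a minimal PyDub sketch of the second approach for the example above. The file name is a placeholder, and the assumption that the call occupies exactly the $3s$ to $4s$ chunk is for illustration; this is not the exact code in this repository.

```python
from pydub import AudioSegment

# Hypothetical 10 s raw clip with a tapir call between 3 s and 4 s
# (PyDub slices in milliseconds).
raw = AudioSegment.from_file("raw_clip.WAV")
call = raw[3000:4000]                  # 1 s chunk containing the tapir call
background = raw[:5000]                # first 5 s of the raw clip

# Remove the call from the background, leaving 4 s of forest noise.
noise = background[:3000] + background[4000:5000]

# Build five 5 s clips with the call starting at 0, 1, 2, 3 and 4 s.
for start_s in range(5):
    start_ms = start_s * 1000
    clip = noise[:start_ms] + call + noise[start_ms:]
    clip.export(f"augmented_call_at_{start_s}s.wav", format="wav")
```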
Both these methods require the chunk of audio with the tapir sound to be separated out of the
As explained, I have generated synthetic audio clips in two ways. For silence in the background, I generated chunks of silence with the silent() class method of AudioSegment (a sketch follows below). For each audio file, I created an AudioSegment instance using the from_file('filename') class method, and extracted a chunk of this clip from
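A sketch of the silence-in-background variant, under the same kind of assumptions as the previous sketch (placeholder file name; a 1 s call between $7s$ and $8s$, as in the spectrogram shown earlier):

```python
from pydub import AudioSegment

# Hypothetical raw clip with a tapir call between 7 s and 8 s.
raw = AudioSegment.from_file("raw_clip.WAV")
call = raw[7000:8000]                  # 1 s chunk containing the tapir call

# Pad the call with silence so each clip is 5 s long in total.
for start_s in range(5):
    lead = AudioSegment.silent(duration=start_s * 1000, frame_rate=raw.frame_rate)
    tail = AudioSegment.silent(duration=(4 - start_s) * 1000, frame_rate=raw.frame_rate)
    clip = lead + call + tail
    clip.export(f"silent_bg_call_at_{start_s}s.wav", format="wav")
```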
At this point, it is worth noting that in this method of data augmentation, I have worked at the level of audio clips. It is equally possible to enhance the dataset using the spectrograms corresponding to the audio clips in the base dataset. Since a spectrogram is a pictorial frequency vs time representation, this involves working with images. One advantage of working with spectrograms in a controlled manner is that both the frequency band and the time duration of the tapir sound are accessible. Such an analysis can be found in the literature, and exploits both frequency masking and time masking. In this repository, I choose not to perform controlled augmentation of spectrogram data since the analysis follows similar lines to the audio augmentation. However, in the next section, I explore the viability of a state-of-the-art deep learning algorithm for enhancing tapir-present spectrogram data. This could be useful for those in the larger community who want to do modelling starting from spectrogram images.
A variational autoencoder is a neural-network-based algorithm that is generative (in that it attempts to identify the structure of the data so as to simulate the data generation process) and unsupervised (in that it doesn't require class labels for training). It was proposed by Kingma and Welling in 2013.
Consider a neural network that applies a set of non-linear transformations to the input data (to reduce its dimension) and maps it to a probability distribution, from which a latent vector is sampled. This network is an encoder. Another neural network, a decoder, then maps the latent vector back to the original input space using non-linear transformations. Essentially, the encoder compresses the data while the decoder decompresses it.
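As a brief sketch of the standard formulation from Kingma and Welling (not specific to this repository's code): the encoder outputs the parameters $\mu$ and $\sigma$ of an approximate posterior $q_\phi(z \mid x)$ over the latent variable $z$, a latent vector is sampled via the reparameterisation trick, and training maximises the evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad z = \mu + \sigma \odot \epsilon, \ \epsilon \sim \mathcal{N}(0, I).$$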
Representation of a variational autoencoder. Image sourced from the internet
Let's call the latent random variable
Both the encoder and decoder parts of the algorithm contain neural network layers. Since I am working with images, I have employed an architecture containing convolution and pooling layers for changing the spatial resolution of the data. For spatially connected data, these layers are superior to a network of fully connected layers for a number of reasons, which I discuss next.
For an image with
In the context of vision architectures, the weights in question are typically arranged in a filter or kernel, which is essentially a
The fact that the weights in the filter are learnt during training is what makes convnets so powerful. Essentially, which features are to be extracted is learnt by the algorithm. The feature maps generated by the convolution layer are further passed through a pooling layer. Its purpose is to reduce the size (or dimensionality) of the feature maps. It summarises the features in the input by averaging, or by selecting the maximum of, each section. This has a very important consequence - invariance. While in the original image the location or orientation of the subject might have been relevant, the summarised feature map is invariant to translation/deformation.
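As a generic illustration of a convolution followed by max pooling in PyTorch (not the exact layers used here; the channel counts and input size are arbitrary):

```python
import torch
import torch.nn as nn

# One convolution layer (learnable filters) followed by max pooling,
# which halves the spatial resolution of the feature maps.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(1, 1, 128, 128)    # a dummy single-channel "spectrogram"
print(block(x).shape)              # torch.Size([1, 16, 64, 64])
```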
Raw audio clips have just
The architecture I use for an encoder is a set of
The loss function I have used is the sum of binary cross entropy and KL divergence. I used the Adam optimiser, which offers a modified form of stochastic gradient descent. I experimented with learning rates of
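A minimal sketch of this objective in PyTorch; the tensor names and the commented learning rate are placeholders, not values from this repository.

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, target, mu, logvar):
    """Reconstruction (binary cross entropy) plus the KL divergence of the
    learnt latent distribution from a standard Gaussian."""
    bce = F.binary_cross_entropy(reconstruction, target, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

# optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder
```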
There are two ways to use a variational autoencoder in the validation phase. One is to use the probability distribution of the input image dataset, learnt during the training phase, to reconstruct the images. The other is to pass a Gaussian distribution to the decoder and generate synthetic images. The distribution being passed is for the latent random variable, and is exactly Gaussian, as opposed to the learnt distribution, which would only be approximately Gaussian. Accordingly, the latter approach generates images somewhat different from the original ones.
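A sketch of these two uses is below. It assumes the model exposes encode(), reparameterize() and decode() methods, a common VAE interface rather than necessarily this repository's.

```python
import torch

def reconstruct(model, image):
    """Reconstruct a real spectrogram image through the trained VAE
    (encode, sample the latent vector, then decode)."""
    model.eval()
    with torch.no_grad():
        mu, logvar = model.encode(image)
        z = model.reparameterize(mu, logvar)
        return model.decode(z)

def generate(model, latent_dim, n_samples=1):
    """Generate synthetic images by sampling the latent variable from an
    exact standard Gaussian and passing it through the decoder only."""
    model.eval()
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)
        return model.decode(z)
```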
The loss curve (a plot of average loss vs epochs) I obtained is much as expected. The error decreases quickly, which is expected from such powerful algorithms, even more so for small datasets. The reconstructed images appear to fall short of the mark, particularly for sharp, sudden sounds (which tapir calls are like), nearly represented by a delta function. I believe there are two possible reasons: a very small training dataset, and the images in the training set not being independent (since they are generated from 19 tapir sounds using the controlled method). Synthetic images are obviously more distant from the original ones, given that the latent random variable is sampled from an exactly normal distribution, which would differ from the distribution learnt from the input data.
As mentioned before, some of the raw audio files have been labelled as cattle presence/absence (summarised under the Data directory). The idea is to use this labelled audio data for training the model. There are about five times more labelled cattle-present clips than cattle-absent ones. This would seemingly lead to an imbalance in the dataset. However, from an ecological perspective, such imbalance for abundant species is useful. This can be understood from the diagram below.
Precision and recall in statistics. Image by Walber - Own work, CC BY-SA 4.0, from Wikipedia
For rare species, only a few events are expected, implying that recall should be high. Cattle, on the other hand, are ubiquitous, and many sounds (relevant events) are expected. Accordingly, precision should be maximised at the cost of recall for such commonly occurring sounds. Indeed, the training data for cattle that I have has high precision. (A reasonable metric to measure the performance of the model is
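For reference, the standard definitions underlying the diagram above are

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.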
An audio waveform is a time series. Accordingly, it is natural to consider 1D ConvNets (employing time-only convolutions) for acoustic signals. However, in the last few years, a consensus has emerged that these models are mostly inferior to (unless made very deep, in which case they can be as good as)
OpSo offers a way to write species classification models using existing convnet architectures. It abstracts most of the algorithmic and coding details away from the user. The framework takes audio clips and generates spectrogram images from them, which are then used to train the model. It uses transfer learning: by default, model architectures are initialised with weights pre-trained on the ImageNet database, a dataset of millions of labelled images. It is also possible to load weights from a path on the local machine or from a URL.
The network is conceptually divided into a feature extractor part and a classifier part. The feature extractor consists of a number of convolution and pooling layers (quite similar to the architecture I wrote for the variational autoencoder) closer to the input. These layers have filters which help extract the features of an image while reducing its dimensions. Typically, features are extracted hierarchically, with early layers extracting low-level features (like edges) while later ones extract high-level features (like complicated shapes).
That low-level features are common to most datasets forms the basis of transfer learning. It is an approach to supervised classification that is particularly relevant when there aren't copious amounts of data to train a model from scratch. Further, even if sufficient data is available, transfer learning is recommended since it significantly reduces the resources required for training without diminishing the performance of the model. Depending on how similar the dataset at hand is to ImageNet, some layers trained on it can be reused, while others need to be trained on the dataset for the problem at hand.
The classifier part consists of fully connected layers farthest from the input image. It uses the high-level feature maps generated by the layers just before it and predicts the class label for the image. In OpSo, it is possible to initialise the weights of one or both of these parts, train the two parts with different learning rates, and even freeze the feature extractor altogether (in which case its gradients are not computed). For my purpose, there are just two class labels (cattle presence/absence), while ImageNet has more than a thousand. To get around this, only the feature extractor layers should be pre-trained on ImageNet.
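To make this concrete, here is a plain-PyTorch sketch of that setup using torchvision directly (an illustration of the idea, not the OpenSoundscape calls used in this repository); the learning rate is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet18 with ImageNet weights and freeze the feature extractor.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a two-class classifier
# (cattle present / cattle absent); only this layer is trainable.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the classifier parameters are passed to the optimiser.
optimiser = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
```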
Though OpSo is a user-friendly deep learning framework, I believe it is lacking in some areas, foremost in the context of hyperparameter optimisation. While there is a default learning rate schedule (which can be modified further), there is no tuning for batch size, which I believe is also an important parameter. For the number of epochs, early stopping with validation loss as the stopping criterion is used. However, when there is insufficient training data, it may not be possible to set aside a holdout set for validation. In that case, the number of epochs may also need to be tuned.
I have used a ResNet18 network pre-trained on ImageNet with the feature extractor frozen. This is coded in the opensoundscape.torch.architectures.resnet module. I take this approach because the images appear to have only low-level features, and the ImageNet weights should provide a fairly good estimate for the feature extractor. This also makes training faster.
Important default parameters in OpSo (hence used in my code as well) are as follows. Loss function for binary classification: cross entropy loss; optimiser parameters: SGD algorithm,
I did one-hot encoding of the cattle data and split the data into training and holdout sets. I used the ResNet18 architecture to instantiate the CNN class (which contains the train() method). A batch size of
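A sketch of the label preparation described above, using pandas only; the CSV name, column names, and split fraction are assumptions, not this repository's actual files or settings.

```python
import pandas as pd

# Hypothetical metadata file with one row per audio clip and a single
# "cattle" label column (present/absent).
labels = pd.read_csv("cattle_labels.csv", index_col="filename")

# One-hot encode the label column into present/absent indicator columns.
one_hot = pd.get_dummies(labels["cattle"])

# Split into training and holdout sets (here 80/20).
train_df = one_hot.sample(frac=0.8, random_state=0)
valid_df = one_hot.drop(train_df.index)
```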