Audio Signal Feature Extraction And Clustering

A configurable feature extractor for .wav audio files storing data in buckets on disc for easier loading of subsets of classes and features during clustering.

Installation

Use python 3.6. Dependencies are listed in requirements.txt

pip install -r requirements.txt

Extracting features

Features are extracted and stored in buckets on disc according to the subfolders in the audio folder. Each subfolder is regarded as a class folder which is used to label the data for evaluation. Any folder within a class folder is merged into the same class.

python DataHandler.py -d <audio folder> -f all

Keeping features in separate buckets allows for faster testing and evaluation of newly implemented features as well as different permutations of feature sets. The same is true for class selection during clustering.

Currently, these features are available:

mfcc
decay
rolloff
brightness

After an initail feature extraction it is possible to use the -f flag for extraction of specific features. This enables faster testing of newly implemented features.

Clustering of data

Configure which classes and what features to use during clustering by modifying configuration/congif.py. By default, all avaliable features are used.

python Cluster.py -c config

Evaluation metrics

The following metrics are used to evaluate the clusters:

homogeneity score (homo) - clusters contain only data points which are members of a single class. 1.0 perfect score.
completeness score (compl) - all the data points that are members of a given class are elements of the same cluster. 1-0 perfect score.
v-measure score (v-meas) - harmonic mean between homogeneity and completeness. 1.0 perfect score.

Clusters are then relabeled according to initial classes and the accuarcy and a confusion matrix is computed.

Results

A data set consisting of 1175 drum samples were to be identified based using clustering. All drum samples had a frame rate of 44100 and a 24-bit depth. Bellow is a summary of the distribution of the data per class:

class	clap	cymbal	fx	hi-hat	kick	perc	rim	snare	tom	(total)
number of samples	15	63	11	141	277	94	252	174	148	(1175)
percentage	1.28%	5.36%	0.94%	12.00%	23.57%	8.00%	21.45%	14.81%	12.60%	(100%)

Results on a subset of 5 classes

All available features was used in this Experiment. The final result on a subset of the classes consisting of kick, snare, hi-hat, cymbal, and tom:

Algo		time	inertia	  homo	  compl	  v-meas
k-means  	1.23s	371.646	  0.814	  0.803	  0.808

Model accuracy: 0.904

The unnormalized confusion matrix shows that dispite the skeewed dataset, the clustering works rather well. This can mostly be attributed to the mfcc feature that produces clusters with 0.746 in v-measure score. To further increase the accuracy, some feature that can distinctly separate hi-hats from cymbals must be found as some of the hi-hats are currently clustered together with the cymbals.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Audio Signal Feature Extraction And Clustering

Installation

Extracting features

Clustering of data

Evaluation metrics

Results

Results on a subset of 5 classes

Files

README.md

Latest commit

History

README.md

File metadata and controls

Audio Signal Feature Extraction And Clustering

Installation

Extracting features

Clustering of data

Evaluation metrics

Results

Results on a subset of 5 classes