ML classification - Audio files

Introduction

The goal of this analysis is to use ML to automatically classify audio recordings of leaked telephone conversations.

In particular, we wanted to discard answering machines and unanswered calls, reducing the number of files a human needs to audit when searching for interesting information.

To extract image-related features we use OpenCV 3.0. To install it from source and link it with python3, follow these instructions.

To extract audio-related features we use librosa.

Feature engineering

Since our goal was to discard audio files that went to answering machines or were not answered, we started by thinking about the characteristics of those files.

The percentage of silence was one of the first things that came to mind. In addition, if we could detect the ringtones, we could use the length of the audio after the last ring as a feature for our classification.
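As a rough illustration of the silence idea, the sketch below measures the percentage of fixed-size frames whose RMS level falls below a threshold. It is not the code used in this repo: the function name, the `frame_length` of 2048 samples and the -40 dB threshold are assumptions chosen for the example. It expects samples normalised to the range [-1.0, 1.0], such as the waveform returned by `librosa.load`.

```python
import math


def silence_percentage(samples, frame_length=2048, threshold_db=-40.0):
    """Return the percentage of frames quieter than threshold_db.

    samples: sequence of floats in [-1.0, 1.0].
    frame_length and threshold_db are illustrative defaults, not tuned values.
    """
    if len(samples) == 0:
        return 0.0

    silent_frames = 0
    total_frames = 0
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        # Root mean square energy of the frame.
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Level in decibels relative to full scale; the floor avoids log10(0).
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        if level_db < threshold_db:
            silent_frames += 1
        total_frames += 1

    return 100.0 * silent_frames / total_frames
```

A file that is half digital silence and half signal yields 50.0 under these defaults. The threshold would need tuning against real recordings, since line noise can keep an unanswered call above a strict silence level.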

Audio features

  • Ring detection
  • Percentage of silence
  • Length of the chunk between the last ring and the end of the file
  • Number of rings
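Once rings have been detected, the last two features in this list follow directly from the ring positions. The sketch below assumes a hypothetical detector that outputs a list of `(start_seconds, end_seconds)` tuples. Both the function name and the fallback of treating the whole file as the tail when no ring is found are assumptions for illustration, not the behaviour of the scripts in this repo.

```python
def ring_features(ring_intervals, duration):
    """Derive ring-based features from detected ring intervals.

    ring_intervals: list of (start_seconds, end_seconds) tuples, as produced
        by some ring detector (hypothetical input format).
    duration: total length of the audio file in seconds.
    """
    number_of_rings = len(ring_intervals)

    if number_of_rings == 0:
        # Assumption: with no ring detected, the whole file counts as the tail.
        seconds_after_last_ring = duration
    else:
        last_ring_end = max(end for _, end in ring_intervals)
        # Clamp at zero in case a detected ring runs past the reported duration.
        seconds_after_last_ring = max(0.0, duration - last_ring_end)

    return {
        "number_of_rings": number_of_rings,
        "seconds_after_last_ring": seconds_after_last_ring,
    }
```

For example, `ring_features([(1.0, 2.0), (5.0, 6.0)], 30.0)` reports two rings and 24.0 seconds of audio after the last one.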

Image features

  • White proportion
    • We compute the percentage of white in each waveform image. The greater the value, the more likely the audio is a non-interesting file, or at least that is our hypothesis.
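The white proportion can be sketched as a simple pixel count over a grayscale waveform image. This is a minimal example rather than the repo's implementation: the function name and the `white_threshold` of 250 are assumptions. The input is any row-by-row collection of 8-bit grayscale values, such as the array returned by `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`.

```python
def white_proportion(pixels, white_threshold=250):
    """Return the percentage of pixels at or above white_threshold.

    pixels: rows of 8-bit grayscale values (0 = black, 255 = white).
    white_threshold: illustrative cut-off for what counts as "white".
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        return 0.0

    white = sum(
        1 for row in pixels for value in row if value >= white_threshold
    )
    return 100.0 * white / total
```

An image in which half of the pixels are pure white gives 50.0. Using a threshold slightly below 255 tolerates compression artifacts in the rendered waveform.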

Installation instructions

  1. Create a python3 virtualenv

     $ virtualenv .venv -p /usr/local/bin/python3 --no-wheel
    
  2. Install dependencies

     $ pip install -r requirements.txt
    
  3. Follow these instructions to install OpenCV 3.0.0 and link it to the virtual environment, then test it:

     $ python
     >>> import cv2
     >>> cv2.__version__
     '3.0.0'
    

Repo structure

Preprocess (Python scripts)

Scripts used to download the audio files and extract the features.

More info && usage here

Analysis (Jupyter notebooks)

Intermediate analysis to help us understand the characteristics of our dataset, evaluate the performance of our selected features, and sample the complete dataset for sharing purposes.

More info && usage here

Classification (Jupyter notebooks)

The final classification process, plus a validation notebook to manually check the overall performance of the machine learning process.

More info && usage here

Audio files

Due to legal and ethical issues, the audio files have not been made publicly available, since many of them are private phone conversations. If you want a sample to try another ML or data analysis approach, please contact us so we can check whether we can provide you with a sample index file pointing to the audio files themselves.

Authors