Benchmark of plankton image classification methods for images from multiple plankton imaging devices (ISIIS, Zooscan, FlowCam, etc.).
This tool allows you to run a comparison between a Convolutional Neural Network and a Random Forest classifier on a dataset of plankton images.
**Superseded by https://github.com/emmaamblard/plankton_classif/tree/main.**
The comparison is to be done on data from multiple plankton imaging devices:
- ISIIS (In Situ Ichthyoplankton Imaging System)
- Zooscan
- FlowCam
- IFCB (Imaging FlowCytobot)
- UVP (Underwater Vision Profiler)
Store your input data in `data/<instrument_name>`. Your data must contain an `images` folder with your images, as well as a csv file named `<instrument_name>_data.csv` with one row per object.
This csv file should contain the following columns:
- `path_to_img`: path to image
- `classif_id`: object classification
- `living`: whether the classification in `classif_id` is living or not (boolean)
- `features_1` to `features_n`: object features for the random forest fit (the names of these columns are up to you)
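As a minimal sketch of this convention, the table for one instrument can be loaded and checked with pandas (the helper name and the `min_per_class` threshold are illustrative, not part of the tool):

```python
import pandas as pd

def load_instrument_data(csv_path, min_per_class=100):
    """Load one instrument's per-object table and check the required columns.

    Returns the dataframe and a dict of classes that fall below the
    recommended minimum number of images per class.
    """
    df = pd.read_csv(csv_path)
    required = {"path_to_img", "classif_id", "living"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Classes below the recommended minimum (100 images by default)
    counts = df["classif_id"].value_counts()
    small = counts[counts < min_per_class].to_dict()
    return df, small

# Example: df, small = load_instrument_data("data/zooscan/zooscan_data.csv")
```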
It is strongly recommended that each class contain at least 100 images.
Data will be split into training, validation and testing sets.
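A three-way split like this is typically done with two calls to scikit-learn's `train_test_split`; the sketch below uses illustrative 70/15/15 proportions and stratifies on the class column (the repo's actual proportions may differ):

```python
from sklearn.model_selection import train_test_split

def split_data(df, random_state=12):
    """Stratified train/validation/test split (70/15/15, illustrative)."""
    # First split off 30% of the data, keeping class proportions
    train, rest = train_test_split(
        df, test_size=0.3, stratify=df["classif_id"], random_state=random_state)
    # Then split that 30% evenly into validation and test sets
    val, test = train_test_split(
        rest, test_size=0.5, stratify=rest["classif_id"], random_state=random_state)
    return train, val, test
```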
A convolutional neural network takes an image as input and predicts a class for this image.
The CNN backbone is a MobileNetV2 feature extractor (https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/feature_vector/4) with depth multiplier of 1.4. A classification head with the number of classes to predict is added on top of the backbone. Intermediate fully connected layers with customizable dropout rate can be inserted between both.
Input images are expected to have color values in the range [0,1] and a size of 224 x 224 pixels. If need be, images are automatically resized by the CNN DataGenerator.
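The architecture described above can be sketched in Keras. This uses `tf.keras.applications.MobileNetV2` as a stand-in for the TF Hub feature vector; the fully connected layer size and dropout rate are illustrative, and `weights=None` is used here so the sketch is self-contained (use `weights="imagenet"` for pretrained weights):

```python
import tensorflow as tf

def build_cnn(n_classes, fc_size=600, dropout=0.5, input_size=224):
    """Sketch of the CNN: MobileNetV2 backbone (depth multiplier 1.4),
    an intermediate fully connected layer with dropout, and a softmax
    classification head sized to the number of classes."""
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3),
        alpha=1.4,            # depth multiplier of 1.4
        include_top=False,    # keep only the feature extractor
        weights=None,         # "imagenet" for pretrained weights
        pooling="avg",
    )
    return tf.keras.Sequential([
        backbone,
        tf.keras.layers.Dense(fc_size, activation="relu"),
        tf.keras.layers.Dropout(dropout),  # customizable dropout rate
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```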
A random forest takes a vector of features as input and predicts a class from these values.
Settings can be customized in the `settings.yaml` file.
Reproducible results can be obtained using the `random_state` argument (`random_state = 12` for paper results).
Training is done in two phases:
- the model is optimized by training on the training set and evaluating on the validation set
- the optimized model is trained on the training set and evaluated on the test set, which has never been used before
For each step (i.e. epoch) in the training of the CNN model, the model is trained on the training data and evaluated on the validation data. It is recommended to train for a large number of epochs and decide afterwards where to stop, based on the evolution of accuracy and loss on the validation data. This process, called early stopping, is implemented in this tool: at each epoch, weights are saved if and only if the results of this epoch are better than the best so far. The last saved weights are then used to test the model on the test data.
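The save-if-better rule above can be traced in plain Python (the function name is illustrative): given the per-epoch validation accuracies, the epoch of the last save is the one whose weights end up being used on the test data.

```python
def last_saved_epoch(val_accuracies):
    """Mimic the checkpointing rule: weights are saved after an epoch only
    if its validation accuracy beats the best seen so far. Returns the
    epoch of the last save and the best accuracy reached."""
    best, saved_epoch = float("-inf"), None
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, saved_epoch = acc, epoch  # this epoch's weights are saved
    return saved_epoch, best
```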
Random Forest parameters are optimized with a gridsearch including:
- number of trees
- number of features to use to compute each split (default for classification is `sqrt(n_features)`)
- minimum number of samples required to be at a leaf node (default for classification is 5)
For each set of parameters, the model is trained on training data and evaluated on validation data. Finally, the best model is trained on training data and tested on test data.
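With scikit-learn, evaluating each parameter set on a fixed validation set (rather than cross-validation folds) can be done with `PredefinedSplit`. The grid values below are illustrative, not the repo's actual grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

def grid_search_rf(X_train, y_train, X_val, y_val, random_state=12):
    """Grid search over the three parameters above, scoring each
    candidate on the fixed validation set, then refit the best
    parameters on the training data only."""
    X = np.concatenate([X_train, X_val])
    y = np.concatenate([y_train, y_val])
    # -1 marks rows always in training, 0 marks the validation fold
    split = PredefinedSplit([-1] * len(X_train) + [0] * len(X_val))
    grid = {
        "n_estimators": [50, 100, 200],   # number of trees
        "max_features": ["sqrt", 0.5],    # features per split
        "min_samples_leaf": [2, 5, 10],   # min samples at a leaf node
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state),
        grid, cv=split, refit=False,
    )
    search.fit(X, y)
    # Train the best parameter set on training data only
    best = RandomForestClassifier(random_state=random_state,
                                  **search.best_params_)
    best.fit(X_train, y_train)
    return best, search.best_params_
```

The selected model can then be evaluated once on the held-out test set, e.g. `best.score(X_test, y_test)`.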
When you run `train_cnn.py` or `train_rf.py`, an output directory is created and results are stored in this directory.
Results for each model and dataset can be explored with the notebooks `inspect_cnn_results.ipynb` and `inspect_rf_results.ipynb`.
Comparison of results across models and datasets is implemented in the notebook `comparison.ipynb`.