INFS4203 Data Mining project @ UQ

TODO:

  • Data preprocessing
  • Adding another way of detecting anomalies for the class 'car', since its features do not follow a normal distribution.
  • Creating a k-NN classifier with k-fold CV.
  • Creating a k-means classifier with k-fold CV.
  • Creating a decision tree classifier with k-fold CV.
  • Creating a random forest classifier with k-fold CV (the shared CV pattern is sketched below).
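
The shared k-fold CV pattern could look roughly like the following scikit-learn sketch (the library choice, fold count, and hyperparameters are assumptions, not the project's actual settings):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in data shaped like the project's 128-d feature vectors, labels 0-9.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 128))
    y = rng.integers(0, 10, size=200)

    # 5-fold stratified CV for k-NN; the same pattern applies to the
    # k-means, decision tree, and random forest classifiers listed above.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
    print(f"mean CV accuracy: {scores.mean():.3f}")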

Prerequisites

  • Poetry: This project uses Poetry with Python 3.11 (via pyenv) for dependency management. Dependencies are listed in pyproject.toml.
  • Data Directory: For the get_data_dir() function to work properly, you will need a .env file in the root of your repository with the following information (a sketch of such a helper follows below):
     REPO_DATA_DIR="path to data directory where train and test.csv files are present"
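
A minimal sketch of what get_data_dir() could look like, assuming python-dotenv is used to read the .env file (the actual helper in the repo may differ):

    import os
    from pathlib import Path
    from dotenv import load_dotenv

    def get_data_dir() -> Path:
        """Resolve the data directory from REPO_DATA_DIR in the .env file."""
        load_dotenv()  # locate and read the repository's .env file
        data_dir = os.getenv("REPO_DATA_DIR")
        if data_dir is None:
            raise RuntimeError("REPO_DATA_DIR not set; create a .env file first.")
        return Path(data_dir)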
    

Getting Started

  1. Install Poetry (if you haven't already):

    Detailed instructions for various operating systems can be found at Poetry's installation documentation. The following instructions are for macOS and Linux.

    curl -sSL https://install.python-poetry.org | python3 -
  2. Clone this repository:

    git clone [email protected]:larsmoan/INFS4203.git
    cd INFS4203
  3. Install dependencies:

    poetry install

Usage

  1. Activate the virtual environment:

    poetry shell
  2. Run the main classification script:

    python classification.py
  3. Reproduce the hyperparameter search:

    python hyperparameter_tuning.py

Dataset

The data used in this project can be found in the data/ folder. It is anonymized, but is said to come from CIFAR-10 with substantial preprocessing by the teaching staff, such as removing values and adding outliers.

Each sample consists of a 128-dimensional feature vector and a label from 0-9.

    Label Number   Actual Label
    0              airplane
    1              car
    2              bird
    3              cat
    4              deer
    5              dog
    6              frog
    7              horse
    8              ship
    9              truck

Original dataset: train.csv

This dataset consists of 2180 entries
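
Loading the data could look roughly like this, assuming the CSV stores the 128 feature columns followed by a label column (the exact column layout is an assumption):

    import pandas as pd

    df = pd.read_csv("data/train.csv")
    X = df.iloc[:, :128].to_numpy()  # 128-dimensional feature vectors
    y = df.iloc[:, -1].to_numpy()    # class labels 0-9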

Secondary dataset: train2.csv

This dataset was released by the teaching staff midway through the project and consists of 1880 entries of the same format as the original dataset. However, inspecting both suggests that the second dataset has considerably more inherent complexity. This can be seen when reducing the dimensionality of the data from 128 to 2 using t-SNE (not part of the curriculum).
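
A minimal version of such a t-SNE projection, assuming scikit-learn and matplotlib with X and y loaded as above (the exact parameters behind the figures are not recorded here):

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # Project the 128-d features down to 2-d for visualisation.
    emb = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=10)
    plt.title("t-SNE projection")
    plt.show()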

[Figure: t-SNE on the original dataset]

[Figure: t-SNE on the secondary dataset]

[Figure: t-SNE on the joint dataset (original + secondary)]

Another finding when comparing the datasets is how closely each class-specific feature follows a Gaussian distribution.
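
A per-class, per-feature normality check along these lines can be done with SciPy's Shapiro-Wilk test (a sketch, not necessarily the exact procedure behind the figures below):

    from scipy.stats import shapiro

    # For each class, count how many of the 128 features pass Shapiro-Wilk
    # at the 0.05 level (p > 0.05 is consistent with normality).
    for label in range(10):
        X_c = X[y == label]
        pvals = [shapiro(X_c[:, j]).pvalue for j in range(X_c.shape[1])]
        frac = sum(p > 0.05 for p in pvals) / len(pvals)
        print(f"class {label}: {frac:.0%} of features look Gaussian")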

[Figure: Shapiro-Wilk test on the original dataset]

[Figure: Shapiro-Wilk test on the secondary dataset]

[Figure: Shapiro-Wilk test on the joint dataset]

Model results

In the initial training I tested k-NN, k-means, decision tree, and random forest classifiers. The decision tree performed so poorly that it is not included in the results.

The results below were obtained using the combination of train and train2 as the training set, with 20% held out as validation data that was not seen while the models were fitted.
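
Such a split and report can be produced along these lines (the seed and hyperparameters here are assumptions, not the values behind the numbers below):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # 80/20 train/validation split, stratified by class.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print(classification_report(y_val, clf.predict(X_val)))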

Classification reports:

Classification report for: RandomForest classifier

                  precision    recall  f1-score   support

             0.0       0.92      0.98      0.95        56
             1.0       0.99      0.99      0.99        86
             2.0       1.00      0.93      0.97        61
             3.0       0.97      0.96      0.97        81
             4.0       0.95      0.96      0.95        55
             5.0       0.88      1.00      0.94        22
             6.0       0.97      1.00      0.99        69
             7.0       1.00      0.98      0.99        65
             8.0       0.98      0.95      0.97        65
             9.0       0.98      0.94      0.96        49

        accuracy                           0.97       609
       macro avg       0.96      0.97      0.97       609
    weighted avg       0.97      0.97      0.97       609

Classification report for: KMeans classifier

                  precision    recall  f1-score   support

             0.0       0.96      0.96      0.96        56
             1.0       0.96      0.78      0.86        86
             2.0       0.84      0.95      0.89        61
             3.0       0.93      0.96      0.95        81
             4.0       0.79      0.96      0.87        55
             5.0       0.69      1.00      0.81        22
             6.0       0.94      0.70      0.80        69
             7.0       1.00      0.98      0.99        65
             8.0       0.95      0.97      0.96        65
             9.0       0.92      0.94      0.93        49

        accuracy                           0.91       609
       macro avg       0.90      0.92      0.90       609
    weighted avg       0.92      0.91      0.91       609

Classification report for: KNN classifier

                  precision    recall  f1-score   support

             0.0       0.97      1.00      0.98        56
             1.0       0.99      0.99      0.99        86
             2.0       1.00      0.93      0.97        61
             3.0       0.98      0.99      0.98        81
             4.0       0.95      0.96      0.95        55
             5.0       0.96      1.00      0.98        22
             6.0       0.99      1.00      0.99        69
             7.0       1.00      0.98      0.99        65
             8.0       0.98      0.97      0.98        65
             9.0       0.96      0.96      0.96        49

        accuracy                           0.98       609
       macro avg       0.98      0.98      0.98       609
    weighted avg       0.98      0.98      0.98       609

Over the course of the project I tried numerous train/validation splits, and RandomForest consistently performed better on completely unseen data, especially when fitted on the original dataset (train.csv) and validated on the second dataset (train2.csv). RandomForest was therefore chosen as the classifier for the test set and generated the results for the final report.
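
That cross-dataset check could look like this (a sketch; file layout as in the loading example earlier):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Fit on the original dataset, validate on the secondary one.
    df1 = pd.read_csv("data/train.csv")
    df2 = pd.read_csv("data/train2.csv")
    clf = RandomForestClassifier(random_state=42)
    clf.fit(df1.iloc[:, :128], df1.iloc[:, -1])
    print(f"accuracy on train2: {clf.score(df2.iloc[:, :128], df2.iloc[:, -1]):.3f}")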
