VIC [1] is a Cluster Validation technique that uses a set of classifiers to evaluate a given partition of a data set.
This implementation uses a custom database of scientometrics data for the QS 2019 Top 200 Universities worldwide. For more information on the data set, please check `QS_Dataset_Report.pdf`.
VIC uses an ensemble of supervised classifiers and k-fold cross-validation to evaluate a partition. The algorithm works as follows:
- Set v = 0
- For each classifier:
  - Set v' = 0
  - Split the data into k folds
  - For each fold:
    - Train on the remaining k-1 folds and compute the AUC on the current fold
    - Update v' = v' + AUC(this_fold)
  - Update v = max(v, v'/k)
- Return v
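As a concrete illustration, the loop above can be written in a few lines with scikit-learn. This is only a minimal sketch of the idea, not the code in `vic.py`; it assumes `classifiers` is a list of estimators exposing `predict_proba`, and that `X` and `y` are numpy arrays holding the features and the partition labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def vic_score(classifiers, X, y, k=5):
    """Return the VIC value of a partition: the best mean k-fold AUC over the ensemble."""
    v = 0.0
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for clf in classifiers:
        v_prime = 0.0
        for train_idx, test_idx in folds.split(X, y):
            clf.fit(X[train_idx], y[train_idx])            # train on the remaining folds
            scores = clf.predict_proba(X[test_idx])[:, 1]  # score the held-out fold
            v_prime += roc_auc_score(y[test_idx], scores)  # accumulate this fold's AUC
        v = max(v, v_prime / k)                            # keep the best average AUC
    return v
```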
The current implementation uses 6 classifiers:
- Random Forest
- Support Vector Machine
- Naive Bayes
- Linear Discriminant Analysis
- Gradient Boosting
- Logistic Regression
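For reference, one possible way to instantiate such an ensemble with scikit-learn is shown below; the hyperparameters are illustrative and may differ from the ones set in `models.py`.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

classifiers = [
    RandomForestClassifier(n_estimators=100),
    SVC(probability=True),   # probability estimates are needed to compute the AUC
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    GradientBoostingClassifier(),
    LogisticRegression(max_iter=1000),
]
# A list like this can be passed directly to the vic_score() sketch above.
```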
This repository follows Scikit-Learn's workflow, so the easiest way to add a model is to use an sklearn implementation. Add an identifier to the list on line 36 of `vic.py`, then add the corresponding line(s) that select and define the classifier in the function `train_and_test()` defined in `models.py`. If the model is not from sklearn, you must create a class that implements the methods `fit(x, y)` and `predict(x)`, which train the model and run inference over some input, respectively (a minimal sketch of this interface is shown below). Two examples of such custom classifiers are included: a logistic regression model in numpy and a multilayer perceptron in TensorFlow (to try the MLP, uncomment the corresponding class and condition in `models.py`).
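The sketch below shows the minimal interface such a custom class needs. It is a hypothetical plain-numpy logistic regression trained with gradient descent, not the exact class shipped in `models.py`.

```python
import numpy as np

class NumpyLogisticRegression:
    """Toy binary classifier exposing the fit(x, y) / predict(x) interface VIC expects."""

    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr          # learning rate
        self.n_iter = n_iter  # number of gradient-descent steps

    def fit(self, x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        self.w, self.b = np.zeros(x.shape[1]), 0.0
        for _ in range(self.n_iter):
            p = 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))   # sigmoid of the linear score
            self.w -= self.lr * x.T @ (p - y) / len(y)          # log-loss gradient w.r.t. w
            self.b -= self.lr * np.mean(p - y)                  # log-loss gradient w.r.t. the bias
        return self

    def predict(self, x):
        p = 1.0 / (1.0 + np.exp(-(np.asarray(x, dtype=float) @ self.w + self.b)))
        return (p >= 0.5).astype(int)  # hard 0/1 labels
```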
To test the current implementation, all you need is:
python vic.py
The program will then run VIC on 50 partitions of our QS Dataset, with cluster splits between rank 75 and rank 125. Two files are generated: a txt report that shows the value of v for each partition along with the best-scoring classifier, and includes the best overall result at the end; and a json file that stores the information used in `Results Analysis.ipynb`.
Two command-line options are available. If you want to try other data, use the `--clusters_path` flag to specify a directory with one or more data sets in CSV format. You can also use the `--outfile` flag to specify a different name for the generated report; the default is `./vic_report.txt`.
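For example, to score your own partitions and write the report somewhere else, you could run something like the following (the paths are placeholders):

python vic.py --clusters_path ./my_partitions --outfile ./my_report.txt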
The following figure shows the VIC score for each partition used during our experiments. The best score was achieved when the clusters were split at rank 81.
For a complete analysis of our experimental results, please check `Results Analysis.ipynb` and `Cluster_Validation_Using_VIC.pdf`.
[1] J. Rodríguez, M. A. Medina-Pérez, A. E. Gutierrez-Rodríguez, R. Monroy, H. Terashima-Marín. Cluster validation using an ensemble of supervised classifiers. Knowledge-Based Systems, Volume 145 (2018). Pages 134-144.