This project implements four clustering algorithms - KMeans, KMode, KPrototype, and DBSCAN - using both local and global parallelized approaches with PySpark. The algorithms are implemented separately and organized into different subdirectories within the repository.
- KMeans: KMeans is a popular clustering algorithm that partitions data into 'k' clusters based on feature similarity.
- KMode: KMode is a clustering algorithm specifically designed for categorical data, where clusters are formed around the mode of each category. Both local and global implementations are provided in this directory, along with an experimental notebook demonstrating how data size-up and the number of partitions affect runtime. There are also notebooks that include a step-by-step explanation of the implementation.
- KPrototype: KPrototype extends KMeans to handle both numerical and categorical data, allowing for more versatile clustering of mixed data types.
- DBSCAN: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together data points that are closely packed, while marking outliers as noise.
The provided code is an implementation of the K-means clustering algorithm using Apache Spark and NumPy. It consists of two main classes: `KMeans` and a helper class for data conversion.

- The `KMeans` class contains methods for initializing centroids, finding the closest centroid for a given data point, fitting the K-means model to the data, and transforming new data points based on the learned centroids.
- The helper class `partition_to_numpy_array` is used to convert Apache Spark's RDD partition data into NumPy arrays, which are required by the K-means implementation.
- Additionally, there are two helper functions, `get_optimal_kmeans_centroid` and `get_optimal_kmeans_centroid_with_dummy_key`. These are used to obtain the optimal centroids for each partition and to handle the key-value pair format required by Apache Spark's `reduceByKey` operation.
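To make the data flow concrete, here is a minimal, hedged sketch of one distributed K-means update using `map` and `reduceByKey`. It mirrors the shape of the classes described above, but the toy data and variable names are illustrative, not the project's exact code.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
sc = spark.sparkContext

def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` under Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

# Toy data in 2 partitions; real code would load a full dataset.
data = sc.parallelize([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]], 2)
centroids = sc.broadcast(np.array([[1.0, 2.0], [8.0, 8.0]]))

# Map: emit (cluster_id, (point, 1)) pairs so reduceByKey can merge sums and
# counts per cluster - the role played by the key-value helpers in the text.
pairs = data.map(lambda p: (closest_centroid(np.array(p), centroids.value),
                            (np.array(p), 1)))
merged = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Driver: the new centroid of each cluster is the mean of its points.
new_centroids = {cid: total / count for cid, (total, count) in merged.collect()}
print(new_centroids)
```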
The notebook includes a section that iterates over different fractions of the input data and different numbers of partitions, performing the parallel K-means clustering and reporting the results, along the lines of the sketch below. It's important to note that for small datasets, the overhead of distributed computing may outweigh the benefits of parallelization. However, this implementation can serve as a foundation for handling larger datasets more efficiently in the future. Note: please install `pyspark` and `ucimlrepo` to run this code.
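A hypothetical version of that experiment loop; the fractions, partition counts, and `fit_parallel_kmeans` helper are illustrative stand-ins for the notebook's own code.

```python
import time

# Sweep data fractions and partition counts, timing each clustering run.
for fraction in [0.25, 0.5, 1.0]:
    for num_partitions in [2, 4, 8]:
        sample = data_rdd.sample(False, fraction, seed=0).repartition(num_partitions)
        start = time.time()
        centroids = fit_parallel_kmeans(sample, k=3)   # assumed helper
        print(fraction, num_partitions, time.time() - start)
```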
During the mapping phase, centroids are broadcast to the workers and each record is assigned to its closest centroid; for every cluster, per-feature category counts are accumulated into a count-hash. After computing merged count-hashed data for each cluster, representing (cluster, merged count-hash) pairs, they are transmitted to the driver code. Subsequently, new centroids are derived by taking, for each feature, the most frequent category (the mode) from the merged counts. The iterative process continues until convergence, indicated by the Hamming distance between new and old centroids reaching the stopping criterion of 0, signifying identical centroids.
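A minimal, hedged sketch of this update, assuming records are tuples of categorical values; the helper names and pair layout are illustrative rather than the repository's exact code.

```python
import numpy as np
from collections import Counter

def hamming(a, b):
    """Number of positions where two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_and_count(record, centroids):
    """Map step: pair the record's per-feature counts with its nearest centroid."""
    cluster = int(np.argmin([hamming(record, c) for c in centroids]))
    return cluster, [Counter({v: 1}) for v in record]

def merge_counts(a, b):
    """(Mini-)reduce step: merge two per-feature count-hash lists elementwise."""
    return [ca + cb for ca, cb in zip(a, b)]

def counts_to_centroid(counts):
    """Driver step: the new centroid is the per-feature mode of the merged counts."""
    return [c.most_common(1)[0][0] for c in counts]

# One iteration over an RDD of categorical records (rdd and centroids assumed):
# merged = rdd.map(lambda r: assign_and_count(r, centroids)).reduceByKey(merge_counts)
# centroids = [counts_to_centroid(counts) for _, counts in sorted(merged.collect())]
```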
In this approach, we adopt a parallelized strategy where the sequential model learns from each data partition concurrently. Dividing the full dataset into N partitions and selecting K as the number of clusters means the mapping phase generates N×K centroids, which are transmitted to the driver. These centroids, each representing the mode of the clusters within its partition, collectively approximate the behavior of the sequential algorithm. Treating these centroids as a dataset, KMode clustering is performed in the driver code to derive the final centroids for K clusters, as sketched below.
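A hedged sketch of the local strategy; `SequentialKMode` stands in for the sequential solver in `kmode_sequential.py`, and its interface is an assumption for illustration.

```python
def fit_partition(iterator, k=3):
    """Yield the K centroids learned by a sequential KMode on one partition."""
    rows = list(iterator)
    if rows:                                    # skip empty partitions
        model = SequentialKMode(n_clusters=k)   # assumed interface
        yield from model.fit(rows).centroids_

# Mapping phase: each of the N partitions emits K local centroids (N*K total).
# candidates = rdd.mapPartitions(fit_partition).collect()
# Driver: cluster the N*K candidate centroids once more to get the final K.
# final_centroids = SequentialKMode(n_clusters=3).fit(candidates).centroids_
```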
The notebooks in this folder contain both the KMode global and local approaches. We also keep versions showing failures and mistakes to demonstrate what we learned from this project. Please refer to `kmode/experiments_notebook/K-Mode-global3.ipynb` and `kmode/experiments_notebook/K-Mode-local2.ipynb` if you want to try our implementations of the global and local approaches, respectively. The required packages to run our code are provided in `requirements.txt`.
- `kmode/experiments_notebook/K-Mode-local.ipynb`: The initial version of the local implementation, where an issue with `np.apply_along_axis` left strings with only their first character, producing incorrect centroids. This has been fixed in `kmode/experiments_notebook/K-Mode-local2.ipynb` (see the demonstration below).
- `kmode/experiments_notebook/K-Mode-global.ipynb`: The first version of the global implementation, in which we simply mapped centroids in the mapping phase, grouped all the data in the reducing phase, and sent them back to the driver to calculate the mode. This was inefficient: it involved a lot of data movement and made poor use of the reducing process.
- `kmode/experiments_notebook/K-Mode-global2.ipynb`: The second version of the global implementation, in which we counted the occurrences of elements within the reducer phase to reduce data movement from the reducers to the driver. However, data movement from the mapping phase to the reducing phase remained the same, so we implemented mini-reducers in the final version, `kmode/experiments_notebook/K-Mode-global3.ipynb`.
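The snippet below reproduces the pitfall described in the first bullet; it is our reconstruction of the likely cause, not code from the notebook.

```python
import numpy as np
from collections import Counter

# np.apply_along_axis sizes its output dtype from the FIRST result, so when
# the first column's mode is a 1-character string ('<U1'), every longer mode
# computed afterwards is silently truncated to its first character.
def mode(col):
    return Counter(col.tolist()).most_common(1)[0][0]  # returns a Python str

arr = np.array([["f", "Red"], ["f", "Red"], ["x", "Blue"]])
print(np.apply_along_axis(mode, 0, arr))  # -> ['f' 'R']: 'Red' became 'R'
```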
The final implementation has been rewritten in an object-oriented, class-based style, but we suggest testing via the notebooks for better visibility.
- `kmode/kmode_global.py`: This file contains the global implementation of KMode as a Python class.
- `kmode/kmode_local.py`: This file contains the local implementation of KMode as a Python class. In each partition, it uses a sequential KMode object from `kmode_sequential.py` to learn on the data in that partition.
The results from the global and sequential approaches are an exact match; however, the local approach provides only an approximation.
Feature | Global & Sequential Cluster 1 | Global & Sequential Cluster 2 | Global & Sequential Cluster 3 | Local Cluster 1 | Local Cluster 2 | Local Cluster 3 |
---|---|---|---|---|---|---|
cap-shape | f | x | f | x | x | x |
cap-surface | nan | y | x | nan | nan | h |
cap-color | n | n | x | n | n | n |
does-bruise-or-bleed | f | f | f | f | f | f |
gill-attachment | d | a | d | d | a | d |
gill-spacing | c | nan | c | c | nan | c |
gill-color | w | w | w | w | w | w |
stem-root | nan | nan | nan | nan | nan | nan |
stem-surface | nan | nan | nan | nan | nan | nan |
stem-color | w | w | w | w | w | w |
veil-type | nan | nan | nan | nan | nan | nan |
veil-color | nan | nan | nan | nan | nan | nan |
has-ring | f | t | f | f | f | t |
ring-type | f | f | f | f | f | e |
spore-print-color | nan | nan | nan | nan | nan | nan |
habitat | d | d | d | d | d | d |
season | u | a | u | a | a | u |
Table: Comparison of the final centroids obtained from clustering the secondary mushroom dataset using the KMode algorithm with the sequential, global, and local approaches.
Feature | Global & Sequential Cluster 1 | Global & Sequential Cluster 2 | Global & Sequential Cluster 3 | Local Cluster 1 | Local Cluster 2 | Local Cluster 3 |
---|---|---|---|---|---|---|
bin_0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_3 | F | T | F | F | F | T |
bin_4 | N | N | N | N | Y | N |
nom_0 | Red | Red | Blue | Red | Red | Blue |
nom_1 | Polygon | Trapezoid | Triangle | Triangle | Triangle | Triangle |
nom_2 | Hamster | Hamster | Dog | Axolotl | Hamster | Hamster |
nom_3 | India | Finland | Costa Rica | India | India | India |
nom_4 | Bassoon | Theremin | Theremin | Theremin | Theremin | Theremin |
nom_5 | nan | nan | nan | nan | nan | nan |
nom_6 | nan | nan | nan | nan | nan | nan |
nom_7 | nan | nan | nan | nan | nan | nan |
nom_8 | nan | nan | nan | nan | nan | nan |
nom_9 | nan | nan | nan | nan | nan | nan |
ord_0 | 2.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 |
ord_1 | Expert | Novice | Contributor | Novice | Novice | Novice |
ord_2 | Freezing | Freezing | Freezing | Freezing | Freezing | Freezing |
ord_3 | n | n | a | n | n | a |
ord_4 | Y | N | R | N | P | N |
ord_5 | nan | nan | nan | nan | nan | nan |
day | 1.0 | 6.0 | 7.0 | 3.0 | 5.0 | 3.0 |
month | 6.0 | 7.0 | 5.0 | 8.0 | 8.0 | 5.0 |
Table: Comparison of the final centroids obtained from clustering Kaggle categorical encoding competition II dataset using the KMode algorithm with sequential, global, and local approaches.
This is an implementation of the K-Prototype clustering algorithm in PySpark. The K-Prototype algorithm is an extension of the K-Means and K-Mode algorithms that can handle both numerical and categorical data.
- Load your data into a Pandas DataFrame and pass it to the `df` variable.
- Define the labels for your data in the `labels` list, with numerical labels first, followed by categorical labels.
- Specify the index at which the categorical labels start in the `categorical_labels_start_index` variable.
- Run the code (a hypothetical configuration is sketched below).
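A hypothetical configuration following the steps above; the file name and column names are illustrative, while the variable names mirror those mentioned in the text.

```python
import pandas as pd

df = pd.read_csv("mixed_data.csv")            # your data as a DataFrame
labels = ["age", "income", "color", "shape"]  # numerical first, then categorical
categorical_labels_start_index = 2            # "color" is the first categorical label
```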
The relative importance given to categorical variables is controlled by the coefficient `0.2` next to the `hamming_distance` function. Increasing this value gives more importance to categorical variables during clustering. You can use other distance metrics for numerical and categorical data by modifying the `euclidean_distance` and `hamming_distance` functions, respectively. The new distance functions should follow the same format as the existing ones, returning a NumPy array of distances between the input vector and the centroids.
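As an illustration of how these pieces fit together, here is a hedged sketch of the mixed distance; the `0.2` weight matches the coefficient mentioned above, but the function bodies are a plausible reconstruction rather than the repository's exact code.

```python
import numpy as np

def euclidean_distance(x_num, centroids_num):
    """Distances between the numerical part of a point and each centroid."""
    return np.linalg.norm(centroids_num - x_num, axis=1)

def hamming_distance(x_cat, centroids_cat):
    """Number of mismatched categorical features against each centroid."""
    return np.sum(centroids_cat != x_cat, axis=1)

def mixed_distance(x_num, x_cat, c_num, c_cat, gamma=0.2):
    """K-Prototype cost: numerical distance plus gamma-weighted categorical."""
    return euclidean_distance(x_num, c_num) + gamma * hamming_distance(x_cat, c_cat)

# Example: two centroids with 2 numerical and 2 categorical features each.
print(mixed_distance(np.array([1.0, 2.0]), np.array(["red", "round"]),
                     np.array([[1.0, 2.0], [5.0, 5.0]]),
                     np.array([["red", "square"], ["blue", "round"]])))
```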
- PySpark
- Pandas
- NumPy
This is an implementation of the DBSCAN algorithm using Apache Spark. It is tailored to handle geospatial data, which is partitioned and processed by DBSCAN clustering on each partition. The implementation also includes a strategy to merge these local clusters into global clusters efficiently.
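A hedged sketch of the per-partition step, assuming scikit-learn's DBSCAN runs inside each Spark partition on coordinate pairs; the `eps` and `min_samples` values and the merge details are illustrative, not the repository's exact code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_partition(part_id, iterator, eps=0.01, min_samples=5):
    """Run DBSCAN on one partition's points, tagging clusters by partition."""
    points = np.array(list(iterator))
    if len(points) == 0:
        return
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    for point, label in zip(points, labels):
        # Prefix local cluster ids with the partition id so they stay unique
        # across partitions until the merge step; -1 (noise) is kept as-is.
        yield point.tolist(), (-1 if label == -1 else (part_id, int(label)))

# Usage (an rdd of (lon, lat) pairs is assumed):
#   local = rdd.mapPartitionsWithIndex(dbscan_partition)
# A merge step then unions (partition, local_id) labels whose clusters share
# boundary points into global cluster ids, e.g. with a union-find structure.
```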
The geospatial data used in this project can be found at the following link:
To see the implementation in action, you can run the Kaggle notebook (DBSCAN Implementation on Kaggle), where the data is pre-loaded.
Alternatively, you can download the dataset from the provided link, set the correct path to the root directory of the data, and run the code in your local environment.
To run this project, you will need the following Python packages installed:
- PySpark
- Pandas
- NumPy
To get started with using these clustering algorithms:
- Navigate to the respective subdirectory for the algorithm you're interested in.
- Follow the instructions provided in the README file within each directory for installation, usage, and examples.
- Ensure you have PySpark installed and configured to run the algorithms in a distributed environment.