
Commit

nrep = 10 in GA
szcf-weiya committed Jan 28, 2025
1 parent 756b19d commit 7cc2186
Showing 6 changed files with 37 additions and 67 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
res/
Manifest.toml
src/main.jl
docs/build
docs/build
docs/src/benchmarks/
17 changes: 17 additions & 0 deletions README.md
@@ -0,0 +1,17 @@
# KmeansBenchmarks.jl

[![CI](https://github.com/szcf-weiya/KmeansBenchmarks.jl/actions/workflows/ci.yml/badge.svg)](https://github.com/szcf-weiya/KmeansBenchmarks.jl/actions/workflows/ci.yml)

This project seeks to systematically benchmark and compare k-means implementations across the following aspects:

- **Software ecosystem**: R (e.g., `stats`, `ClusterR`) vs Julia (e.g., `Clustering`)
- **Algorithm variants**: e.g., Lloyd’s, Hartigan-Wong
- **Initialization**: Random seeding, k-means++

We evaluate performance using three main metrics:

- Clustering accuracy
- Ratio of the between-cluster sum of squares to the total sum of squares (see the sketch below)
- Computational time

This work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements.
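
As a rough illustration of the second metric, here is a minimal Julia sketch (not the package's implementation; the column-major data layout and the helper name `bss_tss_ratio` are assumptions) of the between-to-total sum-of-squares ratio:

```julia
using Statistics

# Between-cluster SS over total SS, for points stored as columns of `X`
# with integer cluster assignments `cl` (illustrative helper, not exported).
function bss_tss_ratio(X::AbstractMatrix, cl::AbstractVector{<:Integer})
    μ = mean(X, dims = 2)                      # grand mean
    tss = sum(abs2, X .- μ)                    # total sum of squares
    bss = sum(count(==(k), cl) * sum(abs2, mean(X[:, cl .== k], dims = 2) .- μ)
              for k in unique(cl))             # size-weighted between-cluster part
    return bss / tss
end
```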
2 changes: 1 addition & 1 deletion benchmarks/benchmarks.jl
@@ -6,7 +6,7 @@ using DataFrames
plotly()

# number of repetitions of the benchmark
nrep = 2
nrep = 10
res = [benchmark(arr_data, arr_methods) for _ in 1:nrep]

df = DataFrame(hcat(repeat(repeat(collect(keys(arr_data)), inner=length(arr_methods)), outer = nrep),
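
For readers unfamiliar with nested `repeat`, the following self-contained snippet (with made-up dataset and method names) shows how the `inner`/`outer` pattern above lines up one dataset label per (dataset, method) pair within each replicate:

```julia
datasets = [:iris, :blobs]                 # hypothetical keys(arr_data)
methods  = [:lloyd, :hartigan]             # hypothetical keys(arr_methods)
nrep = 2
repeat(repeat(datasets, inner = length(methods)), outer = nrep)
# => [:iris, :iris, :blobs, :blobs, :iris, :iris, :blobs, :blobs]
```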
63 changes: 0 additions & 63 deletions docs/src/benchmarks.md

This file was deleted.

14 changes: 12 additions & 2 deletions docs/src/index.md
@@ -2,6 +2,16 @@

Clustering is a cornerstone of unsupervised machine learning, with the k-means algorithm standing as one of the most widely used methods for partitioning data into coherent groups. Its simplicity, interpretability, and adaptability have made it a staple in fields ranging from customer segmentation to bioinformatics. However, the performance and results of k-means can vary significantly depending on the implementation choices made by practitioners, including the software ecosystem (e.g., R, Julia, Python), the algorithmic variants employed (e.g., Lloyd’s algorithm, Hartigan-Wong, or scalable approximations like Mini-Batch k-means), and the initialization strategies (e.g., random seeding, k-means++, or density-based initialization). These choices impact not only computational efficiency but also the quality and stability of the resulting clusters.

This project seeks to systematically benchmark and compare k-means implementations across different frameworks—focusing on R and Julia as representative languages for statistical computing and high-performance numerical analysis, respectively—while also evaluating the interplay between initialization methods and algorithmic variants. R, with its rich ecosystem of packages (e.g., `stats`, `ClusterR`), offers user-friendly tools optimized for statistical rigor, whereas Julia (particularly the package `Clustering`), leveraging its just-in-time (JIT) compilation and parallel computing capabilities, promises faster execution for large datasets. Beyond software comparisons, the study will assess how initialization techniques (e.g., naive random centroids vs. sophisticated seeding) influence convergence rates, cluster quality metrics (e.g., silhouette score, within-cluster sum of squares), and sensitivity to local optima.
This project seeks to systematically benchmark and compare k-means implementations across the following aspects:

By quantifying trade-offs between computational speed, scalability, and cluster accuracy, this work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements. The findings will contribute to a deeper understanding of how algorithmic choices and software ecosystems shape the practical utility of this foundational clustering method.
- **Software ecosystem**: R (e.g., `stats`, `ClusterR`) vs Julia (e.g., `Clustering`)
- **Algorithm variants**: e.g., Lloyd’s, Hartigan-Wong
- **Initialization**: Random seeding, k-means++ (see the sketch below)

We evaluate performance using three main metrics:

- Clustering accuracy
- Ratio of the between-cluster sum of squares to the total sum of squares
- Computational time

This work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements.
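
On the Julia side, for example, the initialization axis can be exercised via `Clustering.kmeans`'s `init` keyword. A minimal sketch with made-up data (the dataset and `k = 3` are assumptions, not from this project):

```julia
using Clustering, Random

Random.seed!(1)
X = randn(2, 300)                      # toy data; observations are columns
for init in (:rand, :kmpp)             # random seeding vs. k-means++
    result = kmeans(X, 3; init = init, maxiter = 200)
    @info "k-means run" init totalcost = result.totalcost iterations = result.iterations
end
```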
5 changes: 5 additions & 0 deletions src/evaluate.jl
@@ -7,6 +7,11 @@ function evaluate(x::AbstractMatrix, cl::AbstractVector, f::Function, paras = Di
return acc, WSS, dt
end

"""
benchmark(arr_data::NamedTuple, arr_methods::NamedTuple)
Run the benchmark experiments for all methods in `arr_methods` on each dataset in `arr_data`.
"""
function benchmark(arr_data::NamedTuple, arr_methods::NamedTuple)
ndata = length(arr_data)
nmethod = length(arr_methods)
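
For context on the `acc` value returned by `evaluate`, one common definition of clustering accuracy is the best agreement with the true labels over all relabelings of the clusters. A self-contained sketch of that idea (not necessarily this package's definition; `Combinatorics` and the helper name are assumptions):

```julia
using Combinatorics   # assumed dependency, for `permutations`

# Accuracy up to relabeling: try every permutation of the k cluster labels
# and keep the one that agrees most with the ground truth.
function cluster_accuracy(truth::AbstractVector{<:Integer}, pred::AbstractVector{<:Integer})
    k = max(maximum(truth), maximum(pred))
    best = 0
    for perm in permutations(1:k)
        best = max(best, count(truth[i] == perm[pred[i]] for i in eachindex(truth)))
    end
    return best / length(truth)
end

cluster_accuracy([1, 1, 2, 2], [2, 2, 1, 1])   # == 1.0: labels differ only by a swap
```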

2 comments on commit 7cc2186

@szcf-weiya
Owner Author


@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/123879

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.1.0 -m "<description of version>" 7cc21868b489b67664e39c64e9241d24cb19ab9c
git push origin v0.1.0
