
Commit

nrep = 10 in GA
szcf-weiya committed Jan 28, 2025
1 parent 756b19d commit 7cc2186
Showing 6 changed files with 37 additions and 67 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
res/
Manifest.toml
src/main.jl
docs/build
docs/build
docs/src/benchmarks/
17 changes: 17 additions & 0 deletions README.md
@@ -0,0 +1,17 @@
# KmeansBenchmarks.jl

[![CI](https://github.com/szcf-weiya/KmeansBenchmarks.jl/actions/workflows/ci.yml/badge.svg)](https://github.com/szcf-weiya/KmeansBenchmarks.jl/actions/workflows/ci.yml)

This project seeks to systematically benchmark and compare k-means implementations across the following aspects:

- **Software ecosystem**: R (e.g., `stats`, `ClusterR`) vs Julia (e.g., `Clustering`)
- **Algorithm variants**: e.g., Lloyd’s, Hartigan-Wong
- **Initialization**: Random seeding, k-means++

We evaluate performance using three main metrics:

- Clustering accuracy
- Ratio of the between-cluster sum of squares to the total sum of squares (see the sketch below)
- Computational time

This work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements.
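
As a rough illustration of the second metric, here is a minimal Julia sketch (not the package's implementation; the column-major data layout and the helper name `bss_tss_ratio` are assumptions) of the between-to-total sum-of-squares ratio:

```julia
using Statistics

# Between-cluster SS over total SS, for points stored as columns of `X`
# with integer cluster assignments `cl` (illustrative helper, not exported).
function bss_tss_ratio(X::AbstractMatrix, cl::AbstractVector{<:Integer})
    μ = mean(X, dims = 2)                      # grand mean
    tss = sum(abs2, X .- μ)                    # total sum of squares
    bss = sum(count(==(k), cl) * sum(abs2, mean(X[:, cl .== k], dims = 2) .- μ)
              for k in unique(cl))             # size-weighted between-cluster part
    return bss / tss
end
```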
2 changes: 1 addition & 1 deletion benchmarks/benchmarks.jl
@@ -6,7 +6,7 @@ using DataFrames
plotly()

# number of repetitions of the benchmark
nrep = 2
nrep = 10
res = [benchmark(arr_data, arr_methods) for _ in 1:nrep]

df = DataFrame(hcat(repeat(repeat(collect(keys(arr_data)), inner=length(arr_methods)), outer = nrep),
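
For readers unfamiliar with nested `repeat`, the following self-contained snippet (with made-up dataset and method names) shows how the `inner`/`outer` pattern above lines up one dataset label per (dataset, method) pair within each replicate:

```julia
datasets = [:iris, :blobs]                 # hypothetical keys(arr_data)
methods  = [:lloyd, :hartigan]             # hypothetical keys(arr_methods)
nrep = 2
repeat(repeat(datasets, inner = length(methods)), outer = nrep)
# => [:iris, :iris, :blobs, :blobs, :iris, :iris, :blobs, :blobs]
```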
63 changes: 0 additions & 63 deletions docs/src/benchmarks.md

This file was deleted.

14 changes: 12 additions & 2 deletions docs/src/index.md
@@ -2,6 +2,16 @@

Clustering is a cornerstone of unsupervised machine learning, with the k-means algorithm standing as one of the most widely used methods for partitioning data into coherent groups. Its simplicity, interpretability, and adaptability have made it a staple in fields ranging from customer segmentation to bioinformatics. However, the performance and results of k-means can vary significantly depending on the implementation choices made by practitioners, including the software ecosystem (e.g., R, Julia, Python), the algorithmic variants employed (e.g., Lloyd’s algorithm, Hartigan-Wong, or scalable approximations like Mini-Batch k-means), and the initialization strategies (e.g., random seeding, k-means++, or density-based initialization). These choices impact not only computational efficiency but also the quality and stability of the resulting clusters.

This project seeks to systematically benchmark and compare k-means implementations across different frameworks—focusing on R and Julia as representative languages for statistical computing and high-performance numerical analysis, respectively—while also evaluating the interplay between initialization methods and algorithmic variants. R, with its rich ecosystem of packages (e.g., `stats`, `ClusterR`), offers user-friendly tools optimized for statistical rigor, whereas Julia (particularly the package `Clustering`), leveraging its just-in-time (JIT) compilation and parallel computing capabilities, promises faster execution for large datasets. Beyond software comparisons, the study will assess how initialization techniques (e.g., naive random centroids vs. sophisticated seeding) influence convergence rates, cluster quality metrics (e.g., silhouette score, within-cluster sum of squares), and sensitivity to local optima.
This project seeks to systematically benchmark and compare k-means implementations across the following aspects:

By quantifying trade-offs between computational speed, scalability, and cluster accuracy, this work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements. The findings will contribute to a deeper understanding of how algorithmic choices and software ecosystems shape the practical utility of this foundational clustering method.
- **Software ecosystem**: R (e.g., `stats`, `ClusterR`) vs Julia (e.g., `Clustering`)
- **Algorithm variants**: e.g., Lloyd’s, Hartigan-Wong
- **Initialization**: Random seeding, k-means++ (see the sketch below)

We evaluate performance using three main metrics:

- Clustering accuracy
- Ratio of the between-cluster sum of squares to the total sum of squares
- Computational time

This work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements.
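
On the Julia side, for example, the initialization axis can be exercised via `Clustering.kmeans`'s `init` keyword. A minimal sketch with made-up data (the dataset and `k = 3` are assumptions, not from this project):

```julia
using Clustering, Random

Random.seed!(1)
X = randn(2, 300)                      # toy data; observations are columns
for init in (:rand, :kmpp)             # random seeding vs. k-means++
    result = kmeans(X, 3; init = init, maxiter = 200)
    @info "k-means run" init totalcost = result.totalcost iterations = result.iterations
end
```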
5 changes: 5 additions & 0 deletions src/evaluate.jl
@@ -7,6 +7,11 @@ function evaluate(x::AbstractMatrix, cl::AbstractVector, f::Function, paras = Di
return acc, WSS, dt
end

"""
benchmark(arr_data::NamedTuple, arr_methods::NamedTuple)
Run the benchmark experiments for all methods in `arr_methods` on each dataset in `arr_data`.
"""
function benchmark(arr_data::NamedTuple, arr_methods::NamedTuple)
ndata = length(arr_data)
nmethod = length(arr_methods)
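
For context on the `acc` value returned by `evaluate`, one common definition of clustering accuracy is the best agreement with the true labels over all relabelings of the clusters. A self-contained sketch of that idea (not necessarily this package's definition; `Combinatorics` and the helper name are assumptions):

```julia
using Combinatorics   # assumed dependency, for `permutations`

# Accuracy up to relabeling: try every permutation of the k cluster labels
# and keep the one that agrees most with the ground truth.
function cluster_accuracy(truth::AbstractVector{<:Integer}, pred::AbstractVector{<:Integer})
    k = max(maximum(truth), maximum(pred))
    best = 0
    for perm in permutations(1:k)
        best = max(best, count(truth[i] == perm[pred[i]] for i in eachindex(truth)))
    end
    return best / length(truth)
end

cluster_accuracy([1, 1, 2, 2], [2, 2, 1, 1])   # == 1.0: labels differ only by a swap
```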

2 comments on commit 7cc2186

@szcf-weiya
Owner Author


@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/123879

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.1.0 -m "<description of version>" 7cc21868b489b67664e39c64e9241d24cb19ab9c
git push origin v0.1.0
