Hexaly vs Gurobi on the K-Means Clustering Problem (MSSC)
The K-Means Clustering Problem is defined as follows: given a set of points along some dimensions, find a partition of these points into k clusters. The total sum of squared Euclidean distances between each cluster’s centroid and its elements is minimized. This problem is also known as the Minimum Sum-of-Squares Clustering (MSSC).
On the K-Means Clustering Problem (MSSC), Hexaly reaches a gap to the best solutions in the literature below 1% in 1 minute of running time for almost all instances considered in the benchmark. We illustrate on this page how Hexaly outperforms traditional general-purpose optimization solvers, like Gurobi, on this challenging problem.
Input data
The instances of this benchmark come from the UCI Machine Learning Repository, gathering a wide variety of datasets used in machine learning with an arbitrary number of dimensions, and the TSPLIB, which is the reference instance repository for the Traveling Salesman Problem (TSP) and contains only 2-dimensional points. From these two repositories, we extracted the instances used in research articles that describe state-of-the-art dedicated algorithms for the K-Means Clustering Problem [1][2][3][4][5].
Mathematical models for the K-Means Clustering Problem (MSSC)
Mixed-Integer Quadratically Constrained Programming (MIQCP) model
The results reported for Gurobi are obtained with the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation presented here. Continuous variables represent each cluster center, one for each dimension, allowing the computation of Euclidean distances to each point thanks to quadratic constraints. In addition, binary variables are used to model assignments of points to clusters. Note that we had to scale all coordinates in the range [0,1] for Gurobi due to numerical issues on instances with large coordinates.
Hexaly model
The Hexaly model relies on set variables representing assignments of points to clusters. The centroid of each cluster is then computed using a lambda expression. The distance between the points assigned to a cluster and its centroid is computed following the same approach. Compared to the MIQCP model, the Hexaly model is much more compact, as the only decisions to be taken correspond to assigning points to clusters. Set variables combined with lambda expressions enable simple and intuitive problem modeling without limiting the possible extensions.
function model() {
// Set variables: clusters[c] represents the points in cluster c
clusters[1..k] <- set(nbPoints);
// Each point must belong to one and only one cluster
constraint partition[c in 1..k](clusters[c]);
for [c in 1..k] {
local cluster <- clusters[c];
local size <- count(cluster);
// Compute the centroid of the cluster
centroid[d in 0..nbDimensions-1] <- size == 0 ? 0 : sum(cluster, o => coordinates[o][d]) / size;
// Compute the variance of the cluster
squares[d in 0..nbDimensions-1] <- sum(cluster, o => pow(coordinates[o][d] - centroid[d], 2));
variances[c] <- sum[d in 0..nbDimensions-1](squares[d]);
}
// Minimize the total variance
obj <- sum[c in 1..k](variances[c]);
minimize obj;
}
Hexaly and Gurobi results on the K-Means Clustering Problem (MSSC)
We compare both solvers’ performance after 1 minute of running time. At the end of the running time, we measure the gap in % to the best solutions known in the literature, also called State Of The Art (SOTA). Each instance is solved with different values of K, the number of clusters; in the column “clusters”, we report the smallest and largest values of K considered. For each instance, the gap reported in the table corresponds to the arithmetic mean of the gaps in % between the best known solution and the solution found by the solver for each value of K. If no feasible solution is found within the time limit or the gap found is greater than 100%, then the gap is considered to be 100%.
We use Hexaly 13.0 and Gurobi 11.0, a state-of-the-art Mixed-Integer Quadratically Constrained Programming (MIQCP) solver. Both are used with default parameters. We ran them on a server equipped with an AMD Ryzen 7 7700 processor (8 cores, 3.8GHz, 8MB cache) and 32GB RAM.
Instances | Points | Dimensions | Clusters | Hexaly | Gurobi |
---|---|---|---|---|---|
german | 59 | 2 | 2-10 | 0.0 | 1.9 |
ruspini | 75 | 2 | 2-10 | 0.0 | 6.3 |
spath | 89 | 3 | 2-10 | 0.0 | 5.0 |
hatco | 100 | 7 | 2-10 | 0.0 | 5.1 |
iris | 150 | 4 | 2-10 | 0.0 | 9.1 |
gr202 | 202 | 2 | 2-10 | 0.0 | 12.8 |
glass | 214 | 9 | 15-50 | 1.7 | 100.0 |
body | 507 | 5 | 30-80 | 1.9 | 100.0 |
gr666 | 666 | 2 | 2-10 | 0.0 | 28.5 |
breast | 683 | 9 | 2-25 | 0.4 | 61.2 |
vowel | 871 | 3 | 40-100 | 3.8 | 100.0 |
concrete | 1,030 | 8 | 60-120 | 3.3 | 100.0 |
u1060 | 1,060 | 2 | 2-25 | 0.1 | 68.6 |
segment | 2,310 | 19 | 230-500 | 10.8 | 100.0 |
pr2392 | 2,392 | 2 | 2-10 | 0.0 | 74.7 |
pcb3038 | 3,038 | 2 | 2-25 | 0.1 | 83.4 |
pendigit | 10,992 | 16 | 2-25 | 0.6 | 87.1 |
d15112 | 15,112 | 2 | 2-25 | 0.2 | 89.3 |
letter | 20,000 | 16 | 2-25 | 0.2 | 87.3 |
kegg | 53,413 | 20 | 2-25 | 32.9 | 84.8 |
shuttle | 58,000 | 9 | 2-25 | 2.6 | 92.3 |
pla85900 | 85,900 | 2 | 2-25 | 0.3 | 97.7 |
Hexaly obtains solutions with a gap to the SOTA lower than 1% in 1 minute of running time for almost all instances. On the other hand, Gurobi finds decent solutions for instances with less than 200 points. Beyond this, Gurobi struggles to find feasible solutions, especially when the number of clusters exceeds 5.
Conclusion
Hexaly offers an innovative modeling approach based on set variables, making the modeling of the K-Means Clustering Problem (MSSC) much more compact than the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation handled by traditional general-purpose optimization solvers like Gurobi. Extra constraints can be added easily, offering flexibility to extend the model to suit real-world clustering needs. Beyond simplicity, Hexaly delivers state-of-the-art solutions in minutes.
You can find the complete model of the K-Means Clustering Problem (MSSC) and many other Clustering problems in Python, Java, C#, and C++ in our Example Tour.
Ready to start?
Discover the ease of use and performance of Hexaly through a free 1-month trial.
[1] D. Aloise, P. Hansen (2009). A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering. Pesquisa Operacional. 29(3), 503-516.
[2] D. Aloise, P. Hansen, L. Liberti (2012). An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming 131, pp. 195–220.
[3] A.M. Bagirov, B. Ordin, G. Ozturk, A.E. Xavier (2015). An incremental clustering algorithm based on hyperbolic smoothing. Computational Optimization and Applications 61, pp. 219–241.
[4] T. Pereira, D. Aloise, J. Brimberg, N. Mladenović (2018). Review of basic local searches for solving the minimum sum-of-squares clustering problem. In P. Pardalos, A. Migdalas (eds), Open Problems in Optimization and Data Analysis. Springer Optimization and Its Applications, vol. 141. Springer, Cham.
[5] P. Kalczynski, J. Brimberg, Z. Drezner (2019). The importance of good starting solutions in the minimum sum of squares clustering problem. Submitted on arXiv on 6 Apr 2020.