August 28, 2015

Hugo Steinhaus, or K-means clustering in French

Kernel clustering
[Modern transcription of the Hugo Steinhaus paper in 1956 (in French), at the source of k-means clustering algorithms, published first in a french-written post]

Data clustering or clustering analysis belongs to statistical data analysis methods. It aims at forming groups of objects that are similar in some way. Those groups are named clusters. The word cluster is related to clot, for thick mass of coagulated liquid or of material stuck together

The whole set of objects contains heterogeneous data, that ought to be grouped into subsets possessing a greater inner homogeneity. Such methods rely on similarity criteria or proximity measures. They are related to classification, machine learning, segmentation, pattern recognition, and have applications ranging from image processing to bioinformatics.
One of the most popular clustering method is known as K-means (k-moyennes in French). with a variation called dynamic clustering (beautifully called nuées dynamiques in French, for an application in bilogy: Kinetic transcriptome analysis reveals an essentially intact induction system in a cellulase hyper-producer Trichoderma reesei strain, Dante Poggi et al., Biotechnology and Biofuels, 2014).

An history of K-means can be found in Data Clustering: 50 Years Beyond K-Means, Anil K. Jain, Pattern recognition letters, 2010. Other historic bits can be found in Origins and extensions of the k-means algorithm in cluster analysis, Hans-Hermann Bock, Electronic Journ@l for History of Probability and Statistics, 2008. This algorithm is deeply linked to Lloyd-Max algorithm, developed by Lloyd in  1957, and rediscovered by Max three after after. It is useful for optimal scalar quantifier design.

Sur la division des corps matériels en parties (pdf)
The K-means technique is a little older. It was published in French by Hugo Steinhaus in 1956 in the Bulletin de l’Académie Polonaise des Sciences (Bulletin of the Polish Academy of Sciences). Hugo Steinhaus (1887-1972) is a Polish mathematician, sometimes known as the discover of Stefan Banach. He contributed to numerous branches of mathematics, and considered a early founder of probability theory.

He also contributed to applied mathematics, working jointly with engineers, biologists, physician, economists, geologists, and "even lawyers". Lacking of trustworthy information during World War II, he invented a statistical tool to estimate German losses, using necrologic news from German soldiers on the front. He notably used the mention that the soldier killed was the first, second or third child from a family. He thus is a precursor of data science.
This paper is called "Sur la division des corps matériels en parties" (On the division of material bodies into parts). The first to explicitely  formulate in finite dimension the principle of k-mean clustering. It is much more constructive that the Banach-Tarski paradoxal theorem which delas with cutting a ball into two different balls, doubling in volume. This paper is pleasant to read, and evokes practical uses from type classification in anthropology to industrial object normalization.

Being in French, in a journal whose archives cannot be accessed easily, it has not been read as much as it deserved. . Here is it transduced by Maciej Denkowski, transmitted by Jérôme Bolte, and transcribed in LaTeX by  myself, with some effort to preserve the original typesetting and composition.

  Title                    = {Sur la division des corps mat\'eriels en parties},
  Author                   = {Steinhaus, H.},
 File                     = {Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means.pdf:Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means:PDF},
  Journal                  = {Bulletin de l’Acad\'emie Polonaise des Sciences},
  Number                   = {12},
  Pages                    = {801--804},
  Volume                   = {Cl. {III} --- Vol. {IV}},
  Year                     = {1956},

  Owner                    = {duvall},
  Timestamp                = {2015.}