April 11, 2017

BRANE Clust: cluster-assisted gene regulatory network inference refinement

The joint GRN inference and clustering algorithm BRANE Clust has just been published as BRANE Clust: cluster-assisted gene regulatory network inference refinement, in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.

Alternative versions are available as a preprint on bioRxiv, in HAL, and on a dedicated page with software. It is another brick in the wall of the BRANE series, a set of bioinformatics tools based on graphs and optimization, dedicated to -omics gene expression data for Gene Regulatory Network (GRN) inference.

While traditional Next-generation sequencing (NGS) pipelines often combine motley assumptions (correlation, normalization, clustering, inference), this work is a first step toward gracefully combining network inference and clustering.

BRANE Clust works as a post-processing tool refining classical network thresholding. Starting from a complete weighted network (obtained from any network inference method), it favors edges that both have higher weights (as in standard thresholding) and link nodes belonging to the same cluster. It relies on an optimization procedure that jointly computes an optimal gene clustering (via a random walker algorithm) and an optimal edge selection. Introducing a clustering step in the edge selection process improves gene regulatory network inference. This is demonstrated on both synthetic (the five networks of DREAM4 and network 1 of DREAM5) and real (network 3 of DREAM5) data. These conclusions are drawn by comparing classical thresholding of CLR and GENIE3 networks to our proposed post-processing. Significant improvements in terms of Area Under the Precision-Recall curve are obtained. The predictive power on real data also yields promising results: predicted links specific to BRANE Clust reveal plausible biological interpretations. Any GRN approach that produces a complete weighted network to prune could benefit from BRANE Clust post-processing.

Escherichia coli network built using BRANE Clust on GENIE3 weights, containing 236 edges. Large dark gray nodes refer to transcription factors (TFs). Inferred edges also reported in the ground truth are colored black, while predictive edges are light gray. Dashed edges correspond to links inferred by both BRANE Clust and GENIE3, while solid links refer to edges specifically inferred by BRANE Clust.
Discovering meaningful gene interactions is crucial for the identification of novel regulatory processes in cells.
Accurately building the related graphs remains challenging due to the large number of possible solutions from available data. Nonetheless, enforcing an a priori on the graph structure, such as modularity, may reduce network indeterminacy issues. BRANE Clust (Biologically-Related A priori Network Enhancement with Clustering) refines gene regulatory network (GRN) inference thanks to cluster information. It works as a post-processing tool for inference methods (e.g. CLR, GENIE3). In BRANE Clust, the clustering is based on the inversion of a system of linear equations involving a graph-Laplacian matrix promoting a modular structure. Our approach is validated on DREAM4 and DREAM5 datasets with objective measures, showing significant comparative improvements. We provide additional insights on the discovery of novel regulatory or co-expressed links in the inferred Escherichia coli network, evaluated using the STRING database. The comparative pertinence of clustering is discussed computationally (SIMoNe, WGCNA, X-means) and biologically (RegulonDB). BRANE Clust software is available at:
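To picture the clustering ingredient, here is a minimal random-walker computation on a toy weighted gene graph: seed nodes are fixed and the label probabilities of the remaining nodes are obtained by solving a linear system in the graph Laplacian. The six-node graph, weights and seeds below are illustrative assumptions, not BRANE Clust code or data:

```python
import numpy as np

# Toy random-walker clustering on a six-gene weighted graph: two dense
# modules (nodes 0-2 and 3-5) linked by one weak inter-module edge.
n = 6
W = np.zeros((n, n))
for i, j, w in [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 0.9),
                (3, 4, 1.0), (4, 5, 1.0), (3, 5, 0.9),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W      # combinatorial graph Laplacian

seeds = {0: 0, 5: 1}                # node -> cluster label (assumed seeds)
unseeded = [i for i in range(n) if i not in seeds]

# Harmonic extension: solve L_u x = -B m, where x is the probability that a
# random walker starting from each unseeded node reaches a cluster-0 seed first.
Lu = L[np.ix_(unseeded, unseeded)]
B = L[np.ix_(unseeded, list(seeds))]
m = np.array([1.0 if lab == 0 else 0.0 for lab in seeds.values()])
x = np.linalg.solve(Lu, -B @ m)

labels = dict(seeds)
labels.update({i: (0 if p > 0.5 else 1) for i, p in zip(unseeded, x)})
print(labels)   # {0: 0, 5: 1, 1: 0, 2: 0, 3: 1, 4: 1}
```

The weak inter-module edge is correctly cut: nodes 0-2 end up in one cluster, nodes 3-5 in the other.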

March 17, 2017

Co-simulation, state-of-the-art by Claudio Gomes

A few days ago, we had a seminar given by Claudio Gomes (University of Antwerp, Belgium). He recently produced a research report (turned into a paper), Co-simulation: State of the Art, with C. Thule, D. Broman, P. G. Larsen and H. Vangheluwe (arxiv). It is an impressive body of work on tools enabling experts in different disciplines to collaborate more efficiently in the development of ever more complex systems. It overviews "co-simulation approaches, research challenges, and research opportunities" and "summarizes, bridges, and enhances future research in this multidisciplinary area".

The attendant slides nicely summarize, in a didactic way, the main issues pertaining to coupled systems, via the composition of sub-system simulations. They deal with Terminology, Simulation units, Input extrapolation techniques, Orchestration algorithms, Algebraic loops, Convergence and Stability.

It is essential to find new ways of enabling experts in different disciplines to collaborate more efficiently in the development of ever more complex systems, under increasing market pressures. One possible solution for this challenge is to use a heterogeneous model-based approach where different teams can produce their conventional models and carry out their usual mono-disciplinary analysis, but in addition, the different models can be coupled for simulation (co-simulation), allowing the study of the global behavior of the system. Due to its potential, co-simulation is being studied in many different disciplines but with limited sharing of findings. Our aim with this work is to summarize, bridge, and enhance future research in this multidisciplinary area. We provide an overview of co-simulation approaches, research challenges, and research opportunities, together with a detailed taxonomy of the different aspects of the state of the art of co-simulation and a classification of the past five years. The main research needs identified are: finding generic approaches for modular, stable and accurate coupling of simulation units; and expressing the adaptations required to ensure that the coupling is correct.

This opportunity was initiated through a very open discussion around extrapolation for co-simulation in cyber-physical systems, in CHOPtrey: contextual online polynomial extrapolation for enhanced multi-core co-simulation of complex systems.

February 25, 2017

BARCHAN: Blob Alignment for Robust CHromatographic ANalysis (GCxGC)

In 1987, G. P. Barchan et al. wrote a paper called: Gas chromatographic method of determining carbon monoxide and dioxide. But barkhan, or barchan, means more outside gas chromatography: it refers to crescent-shaped sand dunes, modeled by the wind.

BARCHAN: crescent-shaped sand dunes

They inspired our image registration tool for comprehensive 2D chromatography peak alignment (GCxGC or GC2D), just published as BARCHAN: Blob Alignment for Robust CHromatographic ANalysis (GCxGC), Journal of Chromatography A, February 2017. For given 2D chromatogram areas of interest, baseline removal (BEADS) is applied, peaks are detected, and the resulting blobs are registered with a mixed rigid/non-rigid transformation based on the Coherent Point Drift technique.

A pair of GCxGC chromatogram areas of interest.
BARCHAN registration, example.

The preprint is here, with HAL and arXiv versions. The abstract is next:

Abstract: Comprehensive two-dimensional gas chromatography (GCxGC) plays a central role in the elucidation of complex samples. The automation of the identification of peak areas is of prime interest to obtain a fast and repeatable analysis of chromatograms. To determine the concentration of compounds or pseudo-compounds, templates of blobs are defined and superimposed on a reference chromatogram. The templates then need to be modified when different chromatograms are recorded. In this study, we present a chromatogram and template alignment method based on peak registration called BARCHAN. Peaks are identified using a robust mathematical morphology tool. The alignment is performed by a probabilistic estimation of a rigid transformation along the first dimension, and a non-rigid transformation in the second dimension, taking into account noise, outliers and missing peaks in a fully automated way. Resulting aligned chromatograms and masks are presented on two datasets. The proposed algorithm proves to be fast and reliable. It significantly reduces the time to results for GCxGC analysis.


• BARCHAN: 2D chromatogram and template alignment based on peak registration.
• The alignment is performed by probabilistic estimation with a Gaussian Mixture Model.
• It combines a rigid and a non-rigid transformation for complex samples analysis.
• The method accounts for noise, outliers and missing peaks in an automated way.
• BARCHAN significantly reduces the time to results for GC×GC analysis.
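As a toy flavour of the registration step (soft Gaussian correspondences driving a transformation estimate), the sketch below estimates only a global 2D shift between two small peak sets, in a heavily simplified CPD spirit; the peak coordinates, sigma and the update rule are illustrative assumptions, not the BARCHAN implementation:

```python
import numpy as np

# Simplified CPD-flavoured alignment: estimate a single global shift between
# a reference peak set and a moving peak set via Gaussian soft assignments.
def estimate_shift(ref, moving, sigma=1.0, iters=20):
    shift = np.zeros(ref.shape[1])
    for _ in range(iters):
        moved = moving + shift
        d2 = ((moved[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
        P = np.exp(-d2 / (2 * sigma ** 2))      # soft correspondences
        P /= P.sum(axis=1, keepdims=True)
        target = P @ ref                        # soft matching targets
        shift = shift + (target - moved).mean(axis=0)
    return shift

ref = np.array([[1.0, 1.0], [3.0, 2.0], [5.0, 1.5]])   # reference peaks
moving = ref - np.array([0.4, 0.3])                    # same peaks, shifted
shift = estimate_shift(ref, moving)
print(np.round(shift, 2))   # close to [0.4, 0.3]
```

The real method handles outliers, missing peaks and a non-rigid second-dimension deformation, which this toy deliberately ignores.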

This work is a follow-up of preceding chromatography papers:
  1. Chromatogram baseline estimation and denoising using sparsity (BEADS)
  2. Comprehensive Two-Dimensional Gas Chromatography for Detailed Characterisation of Petroleum Products
  3. Characterization of middle-distillates by comprehensive two-dimensional gas chromatography (GCxGC): a powerful alternative for performing various standard analysis of middle distillates
  4. Comparison of conventional gas chromatography and comprehensive two-dimensional gas chromatography for the detailed analysis of petrochemical samples

December 22, 2016

Signal and image classification with invariant descriptors (scattering transforms): Internship

[Internship position closed]

Application and additional details


The field of complex data analysis (data science) is interested in the extraction of suitable indicators used for dimension reduction, data comparison or classification. Initially based on application-dependent, physics-based descriptors or features, novel methods employ more generic and potentially multiscale descriptors, which can be used for machine learning or classification. Examples are found in SIFT-like (scale-invariant feature transform) techniques (ORB, SURF) and in unsupervised or deep learning. The present internship focuses on the framework of the scattering transform (S. Mallat et al.) and the associated classification techniques. It yields signal, image or graph representations with invariance properties relative to data-modifying transformations: translation, rotation, scale… Its performance has been well studied for classical data (audio signals, image databases, handwritten digit recognition).
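The translation-invariance property at stake can be checked in a few lines: a first-order scattering-like coefficient (modulus of a band-pass filtering, then averaging) is unchanged when the signal is shifted. The Gabor filter parameters below are illustrative, not Mallat's actual filter bank:

```python
import numpy as np

# First-order scattering-like coefficient: band-pass filter (Gabor), modulus,
# then global averaging. Circular FFT convolution makes the translation
# invariance exact in this periodic toy setting.
def scatter1(x, xi=0.3, sigma=8.0):
    n = len(x)
    t = np.arange(n) - n // 2
    psi = np.exp(-t**2 / (2 * sigma**2)) * np.exp(2j * np.pi * xi * t)
    y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(psi))  # circular convolution
    return np.abs(y).mean()                           # low-pass: full average

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
print(abs(scatter1(x) - scatter1(np.roll(x, 37))))    # ~0: shift-invariant
```

Cascading such modulus/filter layers, with localized rather than global averaging, gives the actual scattering transform.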

This internship aims at dealing with lesser-studied data: identification of the closest match to a template image inside a database of underground models, and extraction of suitable fingerprints from 1D spectrometric signals of complex chemical compounds, for macroscopic chemometric property learning.

The challenge in the first case resides in the different scale and nature of the template and model images, the latter being sketch- or cartoon-like versions of the templates. In the second case, signals are composed of a superposition of hundreds of (positive) peaks; their nature differs from the standard information processed by scattering transforms. A focus on one of the proposed applications can be considered, depending on successes or difficulties met.


December 17, 2016

Kultur Pop 36 : Rebirth

Volume 36 (Rebirth) of Kultur Pop, the compilation series of Radio France theme tunes, has just been released. On the program:
01. Zapateado Opus 23 (Pablo de Sarasate) [France Culture, Culture matin], Itzhak Perlman & Sam Sanders
02. Quatre danseries : L'échappée [France Culture, Etat d'alerte (fin)], Jean-Philippe Goude
03. Changanya [France Culture, La matinale du samedi], Lakuta
04. La lune rousse [France Culture, Backstage], Fakear
05. Satta [France Culture, Notre époque], Boozoo Bajou
06. Siegfried [France Culture, Interlude nuits], Erik Truffaz
07. El condor pasa [France Culture, Paso doble, début], Paul Desmond
08. Soleil Rouge [France Culture, Interlude nuits], Jean-Louis Matinier

November 18, 2016

Utilitarian scientific research? Jean Perrin and the scientific method

Particularly moved to pronounce these words in a corner of this garden that Marie Curie loved [...] I simply want to draw a lesson, and show you by their example how every novelty truly useful to mankind can only be obtained through the discovery of unknown things, pursued without any utilitarian preoccupation. It is not by desiring to fight cancer that Marie Curie and Pierre Curie made their immortal discoveries [...]. Thus in every domain, to acquire power, to reduce those chores that must not be confused with noble work, to push back old age and death itself, to finally break the narrow frame in which our destiny seemed forever enclosed, we must facilitate disinterested scientific research. All of you who will listen to me by the tens of thousands, you who see me without my seeing you, hear my call, and contribute with all your influence to facilitating this conquering research that will bring happiness and freedom to mankind.
This text was read by Jean Perrin, founder of the CNRS, in 1938, in the small garden of the Institut du Radium, which still stands not far from the Institut de biologie physico-chimique, rue Pierre et Marie Curie. The excerpt appears at 2'30" in the montage Jean Perrin et la réalité moléculaire, made on the occasion of the 40th anniversary of the discovery of radium, during the international week against cancer.

This excerpt was broadcast during the program La méthode scientifique, on France Culture, with Alain Fuchs: Quelle politique de la recherche en France ? (What research policy for France?)

Its theme tune is on Kultur Pop: Leonard Nimoy, Music to watch space girls by.

October 31, 2016

CHOPtrey: contextual online polynomial extrapolation for enhanced multi-core co-simulation of complex systems

XKCD. My hobby: extrapolating

CHOPtrey. A long, extrapolated and progressive scientific story. Ace, deuce, trey... ready?

It all started with Cyril Faure, then a PhD student with Nicolas Pernet in real-time computing. He used to stop by my office. We had coffee, discussed books (from Descartes to sci-fi) and music (mostly Loudblast, Slayer, Pink Floyd, Manowar, Bolt Thrower). We exchanged ideas and concerns. One day, he told me about a worry in his thesis. Caveat: I am very bad at computer science and advanced programming, and had only faint hints about partitioned/slackened real-time co-simulation systems.

So this was not about programming, but about simulation and co-simulation. Big physical systems (engines, aircraft, boats) are complicated to simulate. Protocols and methods include the FMI standard (Functional Mockup Interface) and FMU (Functional Mockup Unit). Partitioning them into subsystems may help the simulation, but the split discrete subsystems must communicate: often enough to be precise, rarely enough for speed-ups.
Hoher, 2011, A contribution to the real-time simulation...
So when simulated subsystems communicate at regular communication times, and one wants a value at a fractional time in between, one generally uses the last known value. This is called zeroth-order hold (ZOH).
So I wondered: "this sounds a little bit like aliasing, in a causal setting; let's interpolate, say, with a line or a parabola". Not so easy: in that domain, interpolation is "known" to be unstable, even with FOH (first-order hold) or SOH (second-order hold).
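In code, the three holds amount to fitting a polynomial of degree 0, 1 or 2 through the last communicated samples and evaluating it ahead of time. A small numpy sketch (illustrative only, not the xMOD implementation):

```python
import numpy as np

# Zeroth-, first- and second-order hold: fit a degree-0/1/2 polynomial through
# the last few communicated samples, then evaluate past the last point.
def hold_extrapolate(t_past, y_past, t_query, order=0):
    k = order + 1                                 # samples for an exact fit
    coeffs = np.polyfit(t_past[-k:], y_past[-k:], order)
    return np.polyval(coeffs, t_query)

t = np.array([0.0, 0.1, 0.2])   # past communication times
y = t ** 2                       # subsystem output sampled so far

zoh = hold_extrapolate(t, y, 0.3, order=0)   # holds the last value: 0.04
foh = hold_extrapolate(t, y, 0.3, order=1)   # linear continuation: 0.07
soh = hold_extrapolate(t, y, 0.3, order=2)   # parabola, exact here: 0.09
```

On this parabolic signal the true value at t = 0.3 is 0.09: ZOH lags the most, SOH is exact, which is precisely why higher-order extrapolation is tempting despite its stability pitfalls.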

A few implementations (Matlab prototypes, C++ and a final embedding in the xMOD software) and some tuning later, we produced a hierarchical scheme (with Abir, Cyril, Daniel and Mongi), based on morphological contexts, like in lossless image compression (PNG for instance). It proved very cheap, quite robust, and apparently stable. We called it CHOPtrey, from the old French words "ace, deuce, trey" (1, 2, 3, referring to sides of dice), in reference to a chop tray (or cutting board): it allows one to chop a big simulated system into several smaller ones, and it is composed of three parts:
  • CHOPred: a Computationally Hasty Online Prediction system (improving the trade-off between integration speed-ups, needing large communication steps, and simulation precision)
  • CHOPoly: Causal Hopping Oblivious Polynomials, whose smoothed adaptive forward prediction improves co-simulation accuracy (they are similar to Savitzky-Golay filters and to LOESS/LOWESS regression methods)
  • CHOPatt: a Contextual & Hierarchical Ontology of Patterns, where data behavior is segmented into different classes to handle the discontinuities of the exchanged signals
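The CHOPoly ingredient can be sketched as a past-weighted least-squares polynomial forecast; the degree and forgetting factor below are illustrative choices, not the paper's tuned values:

```python
import numpy as np

# Past-weighted least-squares polynomial forecast, in the spirit of CHOPoly.
def weighted_poly_forecast(y, degree=2, forget=0.8, horizon=1):
    n = len(y)
    t = np.arange(n, dtype=float)
    w = forget ** (n - 1 - t)       # recent samples weigh more
    # np.polyfit applies w to the residuals, so pass sqrt(w) to weight
    # the squared errors by w
    coeffs = np.polyfit(t, y, degree, w=np.sqrt(w))
    return np.polyval(coeffs, n - 1 + horizon)

y = np.arange(6.0) ** 2                      # exactly parabolic history
print(weighted_poly_forecast(y))             # close to 36, the next value
```

When the signal really is a low-degree polynomial the weighting is harmless (the fit is exact); on noisy or drifting signals the forgetting factor trades smoothing against reactivity.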
During this work, I learned a lot about co-simulation, parallel computing, multi-core architectures, etc. But I also learned a lot about least-squares parabolic regression, which I thought I had known for decades. I had not. Not the deepest theory ever, but one of my nicest experiences of cross-domain collaboration (along with bioinformatics and our BRANE work on gene regulatory networks): some genuine computer engineering, with a final packaged product (implemented in xMOD). Here it is:
CHOPtrey: contextual online polynomial extrapolation for enhanced multi-core co-simulation of complex systems 
Abstract: The growing complexity of Cyber-Physical Systems (CPS), together with increasingly available parallelism provided by multi-core chips, fosters the parallelization of simulation. Simulation speed-ups are expected from co-simulation and parallelization based on model splitting into weak-coupled sub-models, as for instance in the framework of Functional Mockup Interface (FMI). However, slackened synchronization between sub-models and their associated solvers running in parallel introduces integration errors, which must be kept inside acceptable bounds. CHOPtrey denotes a forecasting framework enhancing the performance of complex system co-simulation, with a trivalent articulation. First, we consider the framework of a Computationally Hasty Online Prediction system (CHOPred). It allows improving the trade-off between integration speed-ups, needing large communication steps, and simulation precision, needing frequent updates for model inputs. Second, smoothed adaptive forward prediction improves co-simulation accuracy. It is obtained by past-weighted extrapolation based on Causal Hopping Oblivious Polynomials (CHOPoly). And third, signal behavior is segmented to handle the discontinuities of the exchanged signals: the segmentation is performed in a Contextual & Hierarchical Ontology of Patterns (CHOPatt). Implementation strategies and simulation results demonstrate the framework's ability to adaptively relax data communication constraints beyond synchronization points, which sensibly accelerates simulation. The CHOPtrey framework extends the range of applications of standard Lagrange-type methods, often deemed unstable. The embedding of predictions in lag-dependent smoothing and discontinuity handling demonstrates its practical efficiency.
The full set of numbers for the six sides of a die are ace, deuce, trey, cater, cinque, sice. They are from Old French (cf un, deux, trois, quatre, cinq, six of modern French). Ace is originally from the Latin for 'unit'.

June 12, 2016

Pêcheurs de perles, une parabole (in French, BEADS and CHOPtrey)

This post is about pearls found in apparently simple yet complex industrial-type questions, and a handful of parabolas. Two practical applications are found in analytical chemistry, with BEADS: Baseline Estimation And Denoising w/ Sparsity, and in cyber-physical system co-simulation, with CHOPtrey: Contextual Polynomial extrapolation for real-time forecasting. The whole stuff is just a parabola, or a parable.

I was not at my best level of confidence in this talk, even in French. I had to completely change the talk a couple of days before. Politics... The best dwells in the Dave Gilmour (or Pink Floyd) parts:
The talk, in French, is about pearls found in questions both simple and complex, and about a handful of parabolas, with two applications: baseline filtering in physico-chemical analysis, and co-simulation of complex systems with contextual polynomial extrapolations. It was given at Maths en mouvement. The related works are:

Pêcheurs de perles, par Laurent Duval from Contact FSMP on Vimeo.

I just needed some parabolic relaxation.
Parabolic relaxation

April 30, 2016

Trainlets: cropped wavelet decomposition for high-dimensional learning

It's been a long time: element 120 of the aperiodic table of wavelets is the trainlet, from Jeremias Sulam, Boaz Ophir, Michael Zibulevsky, and Michael Elad, Trainlets: Dictionary Learning in High Dimensions:
Abstract: Sparse representations have been shown to be a very powerful model for real world signals, and have enabled the development of applications with notable performance. Combined with the ability to learn a dictionary from signal examples, sparsity-inspired algorithms are often achieving state-of-the-art results in a wide variety of tasks. Yet, these methods have traditionally been restricted to small dimensions mainly due to the computational constraints that the dictionary learning problem entails. In the context of image processing, this implies handling small image patches. In this work we show how to efficiently handle bigger dimensions and go beyond the small patches in sparsity-based signal and image processing methods. We build our approach based on a new cropped wavelet decomposition, which enables a multi-scale analysis with virtually no border effects. We then employ this as the base dictionary within a double sparsity model to enable the training of adaptive dictionaries. To cope with the increase of training data, while at the same time improving the training performance, we present an Online Sparse Dictionary Learning (OSDL) algorithm to train this model effectively, enabling it to handle millions of examples. This work shows that dictionary learning can be up-scaled to tackle a new level of signal dimensions, obtaining large adaptable atoms that we call trainlets.
They offer a base dictionary used within a double sparsity model to enable the training of adaptive dictionaries. The associated package is here, from Michael Elad's software page. The cropped wavelet decomposition enables a multi-scale analysis with virtually no border effects. An entry for trainlets has been added to WITS, the aperiodic table of wavelets.

But things always end up with a song! Two of my favorite train songs, by Porcupine Tree (Trains) and the Nits (The Train).

April 24, 2016

M-band 2D dual-tree (Hilbert) wavelet multicomponent image denoising

The toolbox implements a parametric nonlinear estimator that generalizes several wavelet shrinkage denoising methods. Dedicated to additive Gaussian noise, it adopts a multivariate statistical approach to take into account both the spatial and the inter-component correlations existing between the different wavelet subbands, using a Stein Unbiased Risk Estimator (SURE) principle to derive optimal parameters. The chosen wavelet is a slightly redundant multi-band geometrical dual-tree wavelet frame. Experiments on multispectral remote sensing images outperform conventional wavelet denoising techniques (including curvelets). Since they are based on MIMO (multi-input, multi-output) filter banks, in a multi-band fashion, we can quite safely call them MIMOlets. The dual-tree wavelet consists of two directional wavelet trees, displayed below for a 4-band filter bank:
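To give the flavour of the SURE principle in its simplest scalar form (the toolbox uses a multivariate estimator over dual-tree subbands), here is a sketch that picks a soft threshold by minimizing Stein's unbiased risk estimate on unit-variance Gaussian-noisy coefficients; the sparse test signal is an illustrative assumption:

```python
import numpy as np

# Scalar SURE threshold selection for soft thresholding, assuming unit-variance
# Gaussian noise (plain SureShrink-style flavour, for illustration only).
def sure_soft_threshold(coeffs):
    a = np.sort(np.abs(coeffs))
    n = len(a)
    # SURE(t) = n - 2 #{|x| <= t} + sum min(|x|, t)^2, minimized over t
    risks = np.array([n - 2 * np.sum(a <= t) + np.sum(np.minimum(a, t) ** 2)
                      for t in a])
    return a[np.argmin(risks)]

rng = np.random.default_rng(0)
clean = np.zeros(1024)
clean[::32] = 8.0                                  # sparse "detail" spikes
noisy = clean + rng.standard_normal(1024)
t = sure_soft_threshold(noisy)
denoised = np.sign(noisy) * np.maximum(np.abs(noisy) - t, 0.0)
```

On such a sparse subband the data-driven threshold removes most of the noise floor while keeping the spikes, which is the behavior the multivariate estimator generalizes across components.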

4-band directional dual-tree wavelets

The demonstration script is Init_Demo.m, and the set of M-band dual-tree wavelet functions is provided in the directory TOOLBOX_DTMband_solo. For instance, the clean multispectral image (port of Tunis, only one channel):

The (very) noisy version:

The denoised one:

November 10, 2015

BRANE Cut: Biologically-Related Apriori Network Enhancement with Graph cuts

[BRANE Cut featured on RNA-Seq blog][Omic tools][bioRxiv preprint][PubMed/Biomed Central][BRANE Cut code]

Gene regulatory networks are somehow difficult to infer. This first work from an on-going series (termed BRANE *, for Biologically Related Apriori Network Enhancement) introduces an optimization method (based on Graph cuts, borrowed from computer vision/image processing) to infer graphs based on biologically-related a priori constraints (including sparsity). It is successfully tested on DREAM challenge data and an Escherichia coli network, with specific work to derive optimization parameters from gene network cardinality properties. And it is quite fast.

Background: Inferring gene networks from high-throughput data constitutes an important step in the discovery of relevant regulatory relationships in organism cells. Despite the large number of available Gene Regulatory Network inference methods, the problem remains challenging: the underdetermination in the space of possible solutions requires additional constraints that incorporate a priori information on gene interactions.

Methods: Weighting all possible pairwise gene relationships by a probability of edge presence, we formulate the regulatory network inference as a discrete variational problem on graphs. We enforce biologically plausible coupling between groups and types of genes by minimizing an edge labeling functional coding for a priori structures. The optimization is carried out with Graph cuts, an approach popular in image processing and computer vision. We compare the inferred regulatory networks to results achieved by the mutual-information-based Context Likelihood of Relatedness (CLR) method and by the state-of-the-art GENIE3, winner of the DREAM4 multifactorial challenge.

Results: Our BRANE Cut approach more accurately infers the five DREAM4 in silico networks (with improvements from 6 % to 11 %). On a real Escherichia coli compendium, an improvement of 11.8 % compared to CLR and 3 % compared to GENIE3 is obtained in terms of Area Under the Precision-Recall curve. Up to 48 additional verified interactions are obtained over GENIE3 for a given precision. On this dataset involving 4345 genes, our method achieves a performance similar to that of GENIE3, while being more than seven times faster. The BRANE Cut code is available at: http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-cut.html.

Conclusions: BRANE Cut is a weighted graph thresholding method. Using biologically sound penalties and data-driven parameters, it improves three state-of-the-art GRN inference methods. It is applicable as a generic network inference post-processing, due to its computational efficiency.
Keywords: Network inference, Reverse engineering, Discrete optimization, Graph cuts, Gene expression data, DREAM challenge.
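The optimization engine behind such binary edge labeling is a max-flow/min-cut computation. A self-contained toy Edmonds-Karp sketch (the four-node graph and its capacities are illustrative, not a gene network):

```python
from collections import deque

# Toy Edmonds-Karp max-flow: the min-cut it certifies is the kind of binary
# separation that graph-cut labeling methods compute.
def max_flow(cap, s, t):
    n = len(cap)
    residual = [row[:] for row in cap]
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:       # BFS for a shortest augmenting path
            u = q.popleft()
            for v in range(n):
                if residual[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:                # no augmenting path: flow is maximal
            return flow
        aug, v = float("inf"), t           # bottleneck capacity on the path
        while v != s:
            aug = min(aug, residual[parent[v]][v])
            v = parent[v]
        v = t                              # push flow, update residual graph
        while v != s:
            residual[parent[v]][v] -= aug
            residual[v][parent[v]] += aug
            v = parent[v]
        flow += aug

cap = [[0, 3, 1, 0],     # source 0, sink 3
       [0, 0, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]
print(max_flow(cap, 0, 3))   # 4 (equals the minimum cut capacity)
```

By max-flow/min-cut duality, the value returned equals the cheapest set of edges separating source from sink; in a labeling functional, those cut edges encode the optimal binary decision.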

Additional information on the BRANE page

September 19, 2015

Big data, fishes and cooking: fourteen shades of "V"

[In this short post, you can access the 14 "V"s often glued to Big Data, including vacuity]

To Lao Tzu is often attributed (I cannot access the original meaning):
Govern a great nation as you would cook a small fish. Do not overdo it.
Today's wisdom could be:
Deal with Big data as you would process a small signal. Do not over-expect from it, do not over-fit it, do not-overinterpret it.

Luckily, Big data does not exist, where Making The Most Of Small Data is advocated. This is a bit like teenage sex:
“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

In the What exactly is Big Data StackExchange question, I have listed the 14 "V"s that could describe big data, including... vacuity. They are: Validity, Value, Variability/Variance, Variety, Velocity, Veracity/Veraciousness, Viability, Virtuality, Visualization, Volatility, Volume and Vacuity.

August 28, 2015

Hugo Steinhaus, or K-means clustering in French

Kernel clustering
[Modern transcription of the 1956 Hugo Steinhaus paper (in French), at the source of k-means clustering algorithms, first published in a French-written post]

Data clustering or clustering analysis belongs to statistical data analysis methods. It aims at forming groups of objects that are similar in some way; those groups are named clusters. The word cluster is related to clot, a thick mass of coagulated liquid or of material stuck together.

The whole set of objects contains heterogeneous data, that ought to be grouped into subsets possessing a greater inner homogeneity. Such methods rely on similarity criteria or proximity measures. They are related to classification, machine learning, segmentation, pattern recognition, and have applications ranging from image processing to bioinformatics.
One of the most popular clustering methods is known as K-means (k-moyennes in French), with a variation called dynamic clustering (beautifully named nuées dynamiques in French); for an application in biology, see Kinetic transcriptome analysis reveals an essentially intact induction system in a cellulase hyper-producer Trichoderma reesei strain, Dante Poggi et al., Biotechnology and Biofuels, 2014.

A history of K-means can be found in Data Clustering: 50 Years Beyond K-Means, Anil K. Jain, Pattern recognition letters, 2010. Other historic bits can be found in Origins and extensions of the k-means algorithm in cluster analysis, Hans-Hermann Bock, Electronic Journ@l for History of Probability and Statistics, 2008. This algorithm is deeply linked to the Lloyd-Max algorithm, developed by Lloyd in 1957 and rediscovered by Max three years later. It is useful for optimal scalar quantizer design.
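For reference, the K-means (Lloyd-style) iteration discussed above alternates an assignment step and a re-centering step; a minimal numpy sketch, with a simplistic spread-out initialization (real implementations prefer k-means++ or random restarts):

```python
import numpy as np

# Minimal Lloyd-style k-means: assign each point to its nearest center,
# then move each center to the mean of its assigned points.
def kmeans(X, k, iters=20):
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                        # assignment step
        centers = np.array([X[labels == j].mean(axis=0)  # update step
                            for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),    # blob around (0, 0)
               rng.normal(10, 0.5, (20, 2))])  # blob around (10, 10)
labels, centers = kmeans(X, 2)
```

Each iteration decreases the within-cluster sum of squares, which is exactly the objective Steinhaus formulated for dividing "material bodies" into parts.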

Sur la division des corps matériels en parties (pdf)
The K-means technique itself is a little older: it was published in French by Hugo Steinhaus in 1956 in the Bulletin de l'Académie Polonaise des Sciences (Bulletin of the Polish Academy of Sciences). Hugo Steinhaus (1887-1972) was a Polish mathematician, sometimes known as the discoverer of Stefan Banach. He contributed to numerous branches of mathematics, and is considered an early founder of probability theory.

He also contributed to applied mathematics, working jointly with engineers, biologists, physicians, economists, geologists, and "even lawyers". Lacking trustworthy information during World War II, he invented a statistical tool to estimate German losses, using obituary notices of German soldiers killed at the front. He notably used the mention that the killed soldier was the first, second or third child of a family. He thus is a precursor of data science.
This paper is called "Sur la division des corps matériels en parties" (On the division of material bodies into parts). It is the first to explicitly formulate, in finite dimension, the principle of k-means clustering. It is much more constructive than the paradoxical Banach-Tarski theorem, which deals with cutting a ball into two balls, doubling its volume. The paper is pleasant to read, and evokes practical uses from type classification in anthropology to industrial object normalization.

Being in French, in a journal whose archives cannot be accessed easily, it has not been read as much as it deserved. Here it is, transcribed by Maciej Denkowski, transmitted by Jérôme Bolte, and typeset in LaTeX by myself, with some effort to preserve the original typesetting and composition.

@Article{Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means,
  Title   = {Sur la division des corps mat\'eriels en parties},
  Author  = {Steinhaus, H.},
  Journal = {Bulletin de l'Acad\'emie Polonaise des Sciences},
  Number  = {12},
  Pages   = {801--804},
  Volume  = {Cl. {III} --- Vol. {IV}},
  Year    = {1956}
}

Hugo Steinhaus : classification par k-moyennes, nuées dynamiques

Partitionnement à noyau
[Mise à disposition de l'article de Hugo Steinhaus de 1956, à l'origine de l'algorithme de partitionnement par les k-moyennes (available in English)]

Data clustering (or clustering analysis) is a "statistical" data-analysis method that aims, within a set of heterogeneous data, at grouping subsets of these data into more homogeneous clusters or bundles. Each subset should thus exhibit similar characteristics, quantified by similarity criteria or various proximity measures. These techniques belong to the families of classification, machine learning, and segmentation, and are used in a phenomenal number of applications, from image processing to bioinformatics.

One of the most popular clustering or aggregation methods is k-means (K-means), a combinatorial optimization problem one version of which bears the pretty name of nuées dynamiques, or dynamic clusters (for an application of interest to me: Kinetic transcriptome analysis reveals an essentially intact induction system in a cellulase hyper-producer Trichoderma reesei strain, Dante Poggi et al., Biotechnology and Biofuels, 2014).

A history of k-means is available in Data Clustering: 50 Years Beyond K-Means, Anil K. Jain, Pattern Recognition Letters, 2010. Another point of view can be found in Origins and extensions of the k-means algorithm in cluster analysis, Hans-Hermann Bock, Journ@l Electronique d'Histoire des Probabilités et de la Statistique. This algorithm is deeply related to the so-called Lloyd-Max algorithm, developed by Lloyd in 1957 and rediscovered by Max three years later. It notably makes it possible to build an optimal scalar quantizer.
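The k-means iteration behind this family of methods, popularized as Lloyd's algorithm, can be sketched in a few lines (a minimal illustration, not Steinhaus's 1956 formulation; function and variable names are mine):

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Alternate two steps: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # initialize centers on k distinct data points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its cluster
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated groups the iteration converges in a handful of steps to the group means; each step can only decrease the total within-cluster squared distance, which is why the loop terminates.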

Sur la division des corps matériels en parties (pdf)
This technique, however, has a slightly earlier source, published in French by Hugo Steinhaus in 1956 in the Bulletin de l'Académie Polonaise des Sciences. Hugo Steinhaus (1887-1972) was a Polish mathematician who contributed to many branches of mathematics and is considered one of the precursors of probability theory. He also worked in applied mathematics, collaborating with engineers, geologists, economists, physicists, and biologists. Lacking reliable information on the course of World War II, he "invented" a statistical tool to estimate German losses from sporadic death announcements, starting from the relative frequency of obituary notices of fallen soldiers mentioning whether they were the first, second, third, etc. son of a family. He is thus a precursor of data science.

This paper is titled "Sur la division des corps matériels en parties" (On the division of material bodies into parts), and is the first to formulate explicitly, in finite dimension, the k-means clustering problem. It is therefore more constructive than the paradoxical Banach-Tarski theorem, which deals with cutting a ball into two balls of doubled total volume. Its writing is pleasant, and it aims at practical uses ranging from type classification in anthropology to the standardization of industrial objects.

Being in French, in a journal whose archives are scarcely available online, this paper has not had the readership it deserved. Here it is, transcribed by Maciej Denkowski, transmitted by Jérôme Bolte, and typeset in LaTeX by yours truly, with a major effort to preserve the original pagination.

@Article{Steinhaus_H_1956,
  Title   = {Sur la division des corps mat\'eriels en parties},
  Author  = {Steinhaus, H.},
  Journal = {Bulletin de l'Acad\'emie Polonaise des Sciences},
  Number  = {12},
  Pages   = {801--804},
  Volume  = {Cl. {III} --- Vol. {IV}},
  Year    = {1956}
}

An English version of this post is titled: Hugo Steinhaus, or K-means clustering in French.

July 8, 2015

Sparse seismic data restoration: a PhD defense

Smoothed $\ell_1/\ell_2$ function for a sparse $\ell_0$ surrogate
Mai Quyen PHAM will defend her PhD thesis on July 15th, 2015 at 10:00 am, on the topic of "Seismic wave field restoration using sparse representations and quantitative analysis" (manuscript in pdf), at Université Paris-Est, bâtiment Copernic, amphithéâtre Maurice Gross, 5 boulevard Descartes (RER A, Noisy-Champs), 77420 Champs-sur-Marne.

The thesis has a twofold focus:
1) sparse adaptive filtering with approximate templates in redundant and geometric wavelet frames (akin to echo cancellation in speech);
2) sparse blind deconvolution of parsimonious reflectivity signals with an l1/l2 norm-ratio penalty.

This work has notably been published in two journal papers:
*Euclid in a Taxicab: Sparse Blind Deconvolution with Smoothed l_1/l_2
Audrey Repetti, Mai Quyen-Pham, Laurent Duval, Émilie Chouzenoux, Jean-Christophe Pesquet
IEEE Signal Processing Letters, May 2015, Volume 22, Number 5, pages 539-543.
Abstract: The l1/l2 ratio regularization function has shown good performance for retrieving sparse signals in a number of recent works, in the context of blind deconvolution. Indeed, it benefits from a scale invariance property much desirable in the blind context. However, the l1/l2 function raises some difficulties when solving the nonconvex and nonsmooth minimization problems resulting from the use of such a penalty term in current restoration methods. In this paper, we propose a new penalty based on a smooth approximation to the l1/l2 function. In addition, we develop a proximal-based algorithm to solve variational problems involving this function and we derive theoretical convergence results. We demonstrate the effectiveness of our method through a comparison with a recent alternating optimization strategy dealing with the exact l1/l2 term, on an application to seismic data blind deconvolution.
*A Primal-Dual Proximal Algorithm for Sparse Template-Based Adaptive Filtering: Application to Seismic Multiple Removal
Mai-Quyen Pham, Laurent Duval, Caroline Chaux, Jean-Christophe Pesquet
IEEE Transactions on Signal Processing, August 2014, Volume 62, Issue 16, pages 4256-4269
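The ℓ1/ℓ2 ratio penalty from the first paper measures sparsity in a scale-invariant way. A minimal sketch of one possible smoothing is below; the constants alpha and eta are illustrative choices of mine, not the exact surrogate used in the paper:

```python
import numpy as np

def smoothed_l1_l2(x, alpha=1e-3, eta=1e-3):
    """Smooth, everywhere-differentiable surrogate of ||x||_1 / ||x||_2."""
    # smoothed l1: sqrt(x_i^2 + alpha^2) rounds off the kink of |x_i| at 0
    l1_smooth = np.sum(np.sqrt(x ** 2 + alpha ** 2))
    # smoothed l2: the added eta^2 keeps the denominator positive at x = 0
    l2_smooth = np.sqrt(np.sum(x ** 2) + eta ** 2)
    return l1_smooth / l2_smooth
```

The ratio is small for sparse vectors and large for spread-out ones of the same energy, and away from zero it barely changes under rescaling x → c·x, the scale-invariance property highlighted in the abstract.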

PhD Committee:
Reviewer Jean-François Aujol Prof. Université de Bordeaux
Reviewer Mauricio D. Sacchi Prof. University of Alberta
Examiner Jérôme Mars Prof. Grenoble-INP
Examiner Mai K. Nguyen Prof. Université de Cergy-Pontoise
PhD supervisor Jean-Christophe Pesquet Prof. Université Paris-Est Marne-la-Vallée
PhD co-supervisor Laurent Duval Dr. IFP Energies nouvelles (IFPEN)
PhD co-supervisor Caroline Chaux CNRS researcher, I2M, Aix-Marseille Université

Map to Copernic building

Abstract: This thesis deals with two different problems within the framework of convex and nonconvex optimization. The first is an application to multiple removal in seismic data with adaptive filters; the second is a blind deconvolution problem aimed at recovering characteristics closest to the Earth's layers. More precisely: Unveiling meaningful geophysical information from seismic data requires dealing with both random and structured "noises". As their amplitude may be greater than the signals of interest (primaries), additional prior information is especially important in performing efficient signal separation. We address here the problem of multiple reflections, caused by wave-field bouncing between layers. Since only approximate models of these phenomena are available, we propose a flexible framework for time-varying adaptive filtering of seismic signals, using sparse representations, based on inaccurate templates. We recast the joint estimation of adaptive filters and primaries in a new convex variational formulation. This approach allows us to incorporate plausible knowledge about noise statistics, data sparsity and slow filter variation in parsimony-promoting wavelet transforms. The designed primal-dual algorithm solves a constrained minimization problem that alleviates standard regularization issues in finding hyperparameters. The approach demonstrates significantly good performance in low signal-to-noise ratio conditions, both for simulated and real field seismic data. In seismic exploration, a seismic signal (e.g. the primary signal) is often represented as the result of a convolution between the "seismic wavelet" and the reflectivity series. The second goal of this thesis, presented in Chapter 6, is to deconvolve them from the seismic signal. The main idea of this work is to use the additional premise that reflections occur sparsely, which motivates a study of "sparsity measures".
Some well-known methods in this category have been proposed, such as (Sacchi et al., 1994; Sacchi, 1997). We propose a new penalty based on a smooth approximation of the ℓ1/ℓ2 function, which leads to a difficult nonconvex minimization problem. We develop a proximal-based algorithm to solve variational problems involving this function and we derive theoretical convergence results. We demonstrate the effectiveness of our method through a comparison with a recent alternating optimization strategy dealing with the exact ℓ1/ℓ2 term.
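The convolution model in the abstract (trace = wavelet * reflectivity) can be illustrated numerically; this is a toy sketch, with wavelet shape and spike values made up for illustration:

```python
import numpy as np

# sparse reflectivity series: a few isolated spikes
reflectivity = np.zeros(100)
reflectivity[[20, 45, 70]] = [1.0, -0.7, 0.4]

# a short symmetric pulse standing in for the "seismic wavelet"
t = np.linspace(-1.0, 1.0, 21)
wavelet = (1.0 - 10.0 * t ** 2) * np.exp(-5.0 * t ** 2)

# observed trace: convolution of the two (noise would be added in practice);
# blind deconvolution tries to recover both factors from the trace alone
trace = np.convolve(reflectivity, wavelet, mode="same")
```

Because the spikes here are separated by more than the wavelet length, each spike amplitude survives at the wavelet's peak position in the trace; real reflectivities overlap, which is what makes the inverse problem hard.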

June 2, 2015

Facebook FAIR(ies) in Paris

Paris is buzzing about the announcement of Facebook's new European research center in Paris. This was already known around April/May. Six "fairies" are said to have joined Facebook FAIR, the Facebook Artificial Intelligence (AI) Research center. They will do some magical data science (or dédoménologie), under the guidance of Yann LeCun. He was the host of the "Data Science and Massive Data Analysis" day on the campus of ESIEE Paris, École des Ponts ParisTech and Université Paris-Est Marne-la-Vallée (greater Paris) on June 12th, 2014. It's not eerie, it's ESIEE.

After Menlo Park and New York, this center, the third overall and the first outside the US, has been attracted to the City of Light, where they will bring their TORCH for our enlightenment. Attracted, like moths to a flame, by the local talents and excellent education facilities in artificial intelligence and computer science, and probably by substantial financial incentives. They are also near Google's headquarters in Paris, whose inauguration enjoyed the presence of former rightist président Nicolas Sarkozy. Will there be a personal appearance by leftist François Hollande? Internet giants (GAFA, BATX) are waging wars over our data. Remember Bertrand Russell, for whom "War does not determine who is right - only who is left".

Camille Couprie (NYU, on leave from IFPEN)
Florent Perronnin (formerly Panasonic and Xerox)
Hervé Jégou (INRIA)
Holger Schwenk (DeepLingo, université du Maine). Nota: the Facebook link is not active (201506032340), and points to G. Synnaeve's page, as if Holger were not in charge yet. His other page.
Gabriel Synnaeve (Ecole normale supérieure)
So a large portion of the fairies are male, which may not be fair to the magical creatures, knowing that female fairies usually have more power. Two more handfuls of talents should join by the end of the year, 25 to 50 in the years to come. Figures vary with sources, depending on whether they count permanent positions or PhD students and postdocs. That is big data.

Google news full coverage
Facebook opens an artificial intelligence research lab in Paris
Intelligence artificielle : Facebook écrit une partie du futur à Paris.
Facebook mise sur Paris 
Intelligence Artificielle : Facebook ouvre un centre de recherche à Paris

April 10, 2015

Dédoménologie: the science of data processing (signals, images, etc.)

[In which we propose the neologism dédoménologie to name the technique, the practice, the science of signal processing and image analysis, at the heart of the emerging field of data science, by way of Euclid]
It is words that exist; what has no name does not exist. The word light exists, light does not exist. (Francis Picabia, or Francis-Marie Martinez de Picabia, Écrits)
What image analyst, what signal processor has never had trouble describing their job? Not in detail, of course:
Actually, I study the cyclostationary properties of the wavelet coefficients of fractional Brownian motions in texture images. Well, when I say cyclostationary, I mean periodically correlated, mind you; I am not talking about almost-periodic processes.

Signal, image, or data processing very often calls for periphrases. Some telling examples:
You know Photoshop?
Very bad example. The listener immediately pictures a day of "work" spent pushing the mouse around to change a hue, or selecting objects with the magic wand.
You know mp3? The JPEG format?
And then one dives into the sordid details of redundant digital data and quantization, from which only the useful, perceptible part is to be extracted. With the risk of misplaced remarks:
mp3 sound? I find it lousy compared to vinyl! [An uphill battle...]
The signal analyst, the image processor: what noise do they make? It is often a story of noise, in fact. A clean signal, a sharp image, obvious data: nobody ever asks us to analyze or process those. A bit like physicians in the West: they see few healthy people, apart from hypochondriacs, who of course do have a small health problem, at another level. Signal processing and data analysis are not taught in primary school, at least not directly. They are not on the baccalauréat. So most people never had to endure this subject, and, having little contact with experimental measurements, few have ever had to admit they did not know how to process dearly acquired samples. Yet digital data is at the heart of the real world, and it may well only get worse.
Image processing for materials physics

Ah, mathematics, we can picture that. Physics, biology, more or less clear. Chemistry, obvious. From afar, one senses it is complicated, but that these people understand one another. There are seats for them at the Académie des sciences. Sometimes they get Nobel prizes. Not in mathematics, but everyone knows there is a kind of Nobel prize for mathematics, the Fields medal. And yet the meaning of these words is poorly known: mathematics comes from Greek through Latin, with an original meaning of "science, knowledge". Physics comes from the "knowledge of nature". In biology there is the living, organic products... Chemistry is a bit more complicated (mixtures, black magic, Arabic and Greek). But in statistics one finds the trace of the State: the science whose purpose is to make known the extent, the population, and the agricultural and industrial resources of a State. Electronics, one can easily picture the little electrons moving in the wires.

Analysis of a two-dimensional chromatographic signal
And yet scientific reality is far more scattered: a biologist studying a bacterium often has little to share with a specialist of fungi. Nothing in common, like a dolphin specialist and an earthworm farmer. Cédric Villani, who has been making a great outreach effort lately (Les mathématiques sont un art comme les autres), and who will be a guest at the next congress of the French-speaking signal and image processing research community (GRETSI 2015) in Lyon, shows a natural caution when asked about mathematics other than his own. Particle physicists and fluid mechanicists rarely meet. The map of science is far more complex than taxpayers generally think.

A graph-based map of science

Like physicians, who do not only treat their own scrapes, signal processors often treat the problems of other disciplines. Images for materials physics, chromatography signals for analytical chemists, seismic reflections for geophysicists. Signal processors and image analysts must understand a bit of these disciplines, know how to use various techniques (spectral analysis, statistics, linear algebra, parametric modeling, optimization, normalization), and put them into practice through algorithms and programs, even hardware.
Blind deconvolution of seismic reflections

There is thus practice (praxis) and technique (tekhnē) in this composite discipline, at the frontier of other sciences and disciplines. Experimental measurements had to come first, before they could be processed. Personally, I feel like a practitioner of certain types of data. I know their basic pathologies, and I have a few treatments that sometimes work. Data technician, sample practitioner: that would give job titles like infopractors or infopractitioners, infotechnicians. A priori, I like the idea of restoring nobility to the terms praxis and tekhnē. So "infopraxy" or "infotechny" as the name of the discipline? But -techny and -praxy sound like bonesetters or car mechanics. And "info" sounds too much like computer science. Datalogy could have been nice: a composite Latin-Greek root that reflects well the interdisciplinary aspect of this science. Unfortunately, it is already taken, again by computer science. So something more serious is needed.

And by a twist of serendipity, I stumbled upon a text by Euclid, Dedomena, whose title was translated into Latin as "data". It is a text on the nature and implications of given information for solving a geometric problem. There we are. It defines how an object is described in shape, in position, in magnitude. These criteria are very clearly the ones we extract daily from digital data. And Euclid makes a fine homage: Euclidean spaces underlie many algorithmic concepts, and minimizing the Euclidean distance is our daily bread. Just watch this interlude by Martin Vetterli, From Euclid to Hilbert, Digital Signal Processing.

So I propose to rename signal or image processing, and data processing in general, dédoménologie (dedomenology): the art of those who speak about, and analyze, data. This places signal and image processing squarely at the center of data science in general. It would look sharp on a business card, wouldn't it? A French mnemonic for the term: des domaines aux logis (from domains to dwellings).

From there, the vocabulary can be extended to dédoménotaxy for data sorting and classification operations (to put Euclid in a taxicab, it is here), to dédoménotropy for data flows, and to dédoménomancy for prediction (in the style of predictive analytics). Dédoménonomy, on the model of astronomy, may be hard to carry. So much for the science. As for societal aspects, the current fashion is to fear opinion manipulation and increased governance through data: dédoménodoxy and dédoménocracy. Beware, then, of dédoménophobia.

P.S.: after further searching, the English word dedomenology for the concept of data science seems to have already been proposed, without much success.
P.S. 2: someone (Hervé T.) points out that the homophony with demonology is suspicious. First, the devil is in the details, in data science too. Second, Maxwell's demon, undoubtedly a filter of quality, is the paragon of sorting the wheat from the chaff. Third, in computing, a daemon is:
A daemon (pronounced /ˈdiːmən/ or /ˈdeɪmən/, from the Greek δαιμων), or démon in French, is a type of computer program, a process or set of processes that runs in the background rather than under the direct control of a user.
That says it all.