November 29, 2014

Big data and Data science: LIX colloquium 2014

Sketch of the Hype Cycle for Emerging Technologies
Data science and Big data are two concepts at the tip of the tongue and the top of the Gartner Hype Cycle for Emerging Technologies, close to the peak of inflated expectations. The Data science LIX colloquium 2014 at Ecole Polytechnique, organized by Michalis Vazirgiannis from DaSciM, was held yesterday on the Plateau de Saclay, a location which may have prevented some from attending the event. Fortunately, it was webcast.

The talks covered a wide range of topics pertaining to Datacy (data + literacy). The keynote on community detection in graphs (with a survey) promoted local optimization (OSLOM, based on order statistics). It was said that "We should define validation procedures even before starting developing algorithms", including negative tests: on random graphs, a clustering method should find no prominent cluster (except the whole graph), in other words no signal in noise. But there was no mention of phase transitions in clustering. The variety of text data (SMS, tweets, chats, emails, news web pages, books, 100 languages, OCR and spelling mistakes) was discussed and its veracity was questioned, with Facebook estimating that between 5% and 11% of accounts are fake, and 68.8 percent of the Internet is spam (how did they get the three-figure precision?). News-hungry people would be interested in EMM News, a media watch tool aggregating 10000 RSS feeds and 4000 news aggregators. With all these sources, some communities are concerned with virtual ghost town effects, and look for ways to spark discussions (retweets and the likes) to keep social activity alive. Choosing between flat approaches and hierarchical grouping is still a debated challenge in large-scale classification and web-scale taxonomies. Potentially novel graph structures (hypernode graphs, related to hypergraphs or maybe n-polygraphs) with convex stability and spectral theory were also proposed in the first part of the colloquium.
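The negative test mentioned in the keynote can be sketched in a few lines. This is only an illustration under stated assumptions: label propagation stands in for the clustering method (it is not OSLOM), modularity stands in for a "prominence" score, and the graph sizes, edge probabilities and the 0.3 threshold are arbitrary choices of mine, not values from the talk.

```python
import random
from collections import Counter


def block_graph(sizes, p_in, p_out, rng):
    """Random graph with planted blocks; p_in == p_out gives a pure random graph."""
    block = [b for b, size in enumerate(sizes) for _ in range(size)]
    n = len(block)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block[u] == block[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def label_propagation(adj, rng, max_rounds=100):
    """Stand-in clustering method: each node adopts a most frequent neighbor label."""
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(max_rounds):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Keep the current label when it is already among the best ones.
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)
                changed = True
        if not changed:
            break
    return labels


def modularity(adj, labels):
    """Newman modularity Q = sum_c [L_c / m - (d_c / 2m)^2]."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    if two_m == 0:
        return 0.0
    intra = Counter()   # twice the number of intra-community edges
    degree = Counter()  # total degree per community
    for v, nbrs in adj.items():
        degree[labels[v]] += len(nbrs)
        intra[labels[v]] += sum(1 for u in nbrs if labels[u] == labels[v])
    return sum(intra[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def prominence(sizes, p_in, p_out, seed=0):
    """Score of the partition found by the stand-in method on one random graph."""
    rng = random.Random(seed)
    adj = block_graph(sizes, p_in, p_out, rng)
    return modularity(adj, label_propagation(adj, rng))


THRESHOLD = 0.3  # illustrative cut-off, an assumption

noise_q = prominence([60], 0.3, 0.3)              # negative test: pure noise
signal_q = prominence([20, 20, 20], 0.9, 0.02)    # positive control: planted groups
print("random graph:", round(noise_q, 3), "passes negative test:", noise_q < THRESHOLD)
print("planted graph:", round(signal_q, 3), "finds structure:", signal_q > THRESHOLD)
```

Note that the whole graph taken as a single cluster has a modularity of exactly zero, which matches the "except the whole graph" clause: a method that returns everything in one group on noise has found no signal.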

Big Data Crap Gap: the space between all and relevant data
While the Paris-Saclay Center for Data Science has opened its website, the issue of unbalanced data was exposed around the HiggsML data-driven challenge: fewer than 100 Higgs bosons are expected to be detected among 10^10 events yearly. Big-data analogs of the Greek Pythia, as well as efficient indexing and mining methods, would be necessary to harness the data beast. More industry-oriented talks concluded the colloquium, given by AXA, Amazon and Google representatives. I could not attend them, and left with the so-called "crap gap" in mind, i.e. the gap between Relevant Data and Big Data.
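To get a feeling for how unbalanced the HiggsML figures are, here is a back-of-the-envelope sketch. It takes the 10^10 figure as a yearly count of events (an assumption about the unit), and the detector rates below are purely hypothetical numbers chosen for illustration, not values from the talk.

```python
signal = 100        # upper bound on expected Higgs bosons per year, as quoted
total = 10**10      # yearly count, assumed here to be a number of events

prevalence = signal / total

# A "classifier" that always answers "background" is almost always right,
# yet it never finds a single boson.
trivial_accuracy = 1 - prevalence
trivial_recall = 0.0

# Hypothetical detector (assumed rates, for illustration only).
true_positive_rate = 0.9
false_positive_rate = 1e-6

true_positives = signal * true_positive_rate
false_positives = (total - signal) * false_positive_rate
precision = true_positives / (true_positives + false_positives)

print("prevalence:", prevalence)
print("accuracy of the trivial classifier:", trivial_accuracy)
print("precision of the hypothetical detector:", round(precision, 4))
```

Even with a false positive rate of one in a million, the hypothetical detector would flag mostly background, which is why accuracy is a poor yardstick for this kind of data.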

Innovation driven by large data sets still requires at least vague goals in mind. In Latin, "Ignoranti quem portum petat nullus suus uentus (ventus) est", wrote Seneca in his 71st letter to Lucilius. A possible translation in English: "When a man does not know what harbour he is making for, no wind is the right wind". In German, "Dem weht kein Wind, der keinen Hafen hat, nach dem er segelt". And "Il n'y a point de vent favorable pour celui qui ne sait dans quel port il veut arriver" in French.

All of the information, and possibly the information you need, may be found in the following program and videos. As the videos are not split into talks, the time codes are provided, thanks to the excellent suggestion (and typo corrections) by Igor Carron.

LIX colloquium 2014 on Data Science LIVE part 1
  • 00:00:00 > 00:22:22: Introduction and program
  • 00:22:22 > 01:22:18: Keynote speech: Community detection in networks, Santo Fortunato, Aalto University
  • 01:22:18 > 01:57:30: Text and Big Data, Gregory Grefenstette, Inria Saclay - Île de France
  • 01:57:30 > 02:29:23: Accessing Information in Large Document Collections: classification in web-scale taxonomies, Eric Gaussier, Université Joseph Fourier (Grenoble I)
  • 02:29:23 > 03:01:32: Shaping Social Activity by Incentivizing Users, Manuel Gomez Rodriguez, Max Planck Institute for Software Systems
  • 03:01:32 > 03:38:00: Machine Learning on Graphs and Beyond, Marc Tommasi, Inria Lille

LIX colloquium 2014 on Data Science LIVE part 2
  • 00:00:00 > 00:33:57: Learning to discover: data science in high-energy physics and the HiggsML challenge, Balázs Kégl, CNRS
  • 00:34:11 > 01:06:15: Big Data on Big Systems: Busting a Few Myths, Peter Triantafillou, University of Glasgow
  • 01:06:15 > 01:38:29: Big Sequence Management, Themis Palpanas, Paris Descartes University

LIX colloquium 2014 on Data Science LIVE part 3 
  • 00:00:00 > 00:38:11: Understanding Videos at YouTube Scale, Richard Washington, Google
  • 00:38:11 > 01:05:48: AWS's big data/HPC innovations, Stephan Hadinger, Amazon Web Services
  • 01:05:48 > 02:02:41: Big Data in Insurance - Evolution or disruption? Stéphane Guinet, AXA 
  • 02:02:41 > 02:06:35: Closing words on a word cloud (time, series, graph and classification are the big four)