Wednesday, January 13, 2016

Big data demystified

On 13th December 2013, I received the book "Big Data" by Viktor Mayer-Schönberger and Kenneth Cukier as a Christmas present from my supervisor. The authors are, respectively, professor of internet governance at the University of Oxford and data editor at The Economist. The concept was still in its infancy in those days and had not yet become the broadly pervasive buzzword that it is right now. The hustle and bustle of doctoral studies left no free time for reading the book until recently. By the time I started reading it, I had already listened to a handful of presentations focused on or touching upon the concept, so I did not have very high expectations. Surprisingly, I found the contents still fresh and informative. The book actually helped me reassemble all the scattered pieces of knowledge about big data I had acquired by then. Below is a summary of the contents along with some reflections of my own:

Big data is, in the words of the authors, a paradigm shift in statistical analysis. The rationale is that immense quantitative changes enable qualitative ones: the dramatic increase in the amount of digital data we can process has brought about a change of state. The three major shifts characterizing the big data phenomenon are:
  • the shift from small sample sets augmented through extrapolation to big amounts of information sourced from the entire population;
  • the shift from exactitude to messiness; and - probably the most important one - 
  • the shift from causality to correlation.
We can no longer afford to lose valuable nuances as a side effect of using samples instead of the entire population under study. Moreover, the astonishingly large datasets of today seldom reside in one tidy place, as conventional statistical analysis methods would require. Thus, we have to shed our fear of messiness and step into the big data realm in pursuit of the subtle yet significant insights we have been discarding all along. And finally, in the big data realm it is no longer essential to investigate why things happen; it is often good enough to predict what will happen next. The short sketch below illustrates the first of these shifts.
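To make the sample-versus-population point a bit more concrete, here is a small toy sketch of my own in Python (not from the book): it contrasts a 1,000-record random sample with a full scan of a simulated one-million-record "population" in which a rare segment makes up roughly 0.5 percent of the records. The segment names and numbers are invented purely for illustration.

    import random
    from collections import Counter

    random.seed(42)

    # Hypothetical "population" of one million purchase records,
    # where a rare segment (~0.5%) buys a niche product.
    population = [
        "niche" if random.random() < 0.005 else "mainstream"
        for _ in range(1_000_000)
    ]

    # Traditional approach: draw a small random sample and extrapolate.
    sample = random.sample(population, 1_000)
    sample_counts = Counter(sample)

    # "N = all" approach: process every single record.
    full_counts = Counter(population)

    print("Sample estimate of niche share:",
          sample_counts["niche"] / len(sample))
    print("Full-population niche share:   ",
          full_counts["niche"] / len(population))

Run repeatedly with different seeds, the sample-based estimate of the rare segment swings noticeably from one draw to the next, while the full scan pins it down; that is precisely the kind of nuance the authors argue we should stop throwing away.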

The importance of data had long been recognized in logic, where inductive reasoning emerged as a counterpart to the traditional deductive school. However, thanks to the abundance of data and of data-crunching technologies, data has recently become even more valuable and is often prioritized over theoretical reasoning. In light of the capabilities brought about by big data technologies, we can now contemptuously describe traditional statistical analysis methods as inadequate, stochastic, hypothesis-driven trial and error! As big data technologies mature, further aspects of the phenomenon, such as the secondary uses (also called the option value) of data, are revealed and incorporated into the business models and vision statements of pioneering information management firms.

Like any other emerging technology, big data has its downsides. The fact that data anonymization in a big data world is nothing but a blatant myth poses a serious threat to our privacy. It also calls into question the decency and legitimacy of using data for commercial and even research purposes. Eventually, we will face unprecedented cases like this: imagine a real-life situation in which the probability that a suspect will commit a serious crime at a specific moment, as predicted by statistical analysis of his or her behavior, is so high that it could serve as fairly good evidence for the police to detain him or her. At that stage, predictions could be so accurate that it would no longer seem prudent to postpone preventive measures until human lives are lost and irreversible damage is done by the statistically identified potential criminal.