Big Data is a buzz phrase at the moment. Every week, I read about another company out to solve the Big Data issue.
Wikipedia defines it as:

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

It has become closely associated with the theory that if only we can process all this data, then the actions and explanations will become clear. Of the places I have worked, I heard this line of thinking most at Google, and it has become very popular in recent years. The difficulty is that I think it is deeply mistaken.

More data does not necessarily lead to a better chance of understanding. In fact, adding irrelevant variables obscures the matter. Imagine a scenario where you have only 3 relevant variables; it would be much easier to understand than one where you had added another 97 irrelevant ones.

When you add more data, you invariably end up with many false correlations, and it becomes difficult to work out which of them reflect genuine causation. This is especially so because the space of possible relationships grows exponentially, not linearly, as variables are added: n variables already give n(n−1)/2 pairwise correlations to check, and far more once combinations of variables are considered.
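To make that concrete, here is a minimal sketch in Python with NumPy (the sample and variable counts are illustrative assumptions of mine): one outcome series plus 100 variables of pure noise, counting how many of the noise variables look usefully correlated with the outcome by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 30     # a modest number of observations
n_variables = 100  # pure-noise variables, none related to the outcome

outcome = rng.normal(size=n_samples)
noise_vars = rng.normal(size=(n_variables, n_samples))

# Pearson correlation of each noise variable with the outcome
correlations = [np.corrcoef(v, outcome)[0, 1] for v in noise_vars]

# Count how many look "meaningful" despite being random noise
spurious = sum(abs(r) > 0.3 for r in correlations)
print(f"{spurious} of {n_variables} pure-noise variables show |r| > 0.3")
```

On a typical run, around a tenth of the pure-noise variables clear that bar, and every one of those correlations is spurious.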

Let’s step back and use a real-life example. There was something called the Redskins Rule, which stated:

“If the Redskins win their last home game before the election, the party that won the previous election wins the next election and that if the Redskins lose, the challenging party’s candidate wins.”

This rule held true in every American presidential election from 1940 to 2000. Did anyone actually believe the football result influenced the elections? I hope not, but it shows the difficulty that arises when there is too much information. There are so many variables in life that statistics like this are not just likely but certain to come up once you add enough data.
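The arithmetic bears this out. There were 16 presidential elections from 1940 to 2000, so a single random yes/no indicator matches all of them with probability 1/2^16, roughly 1 in 65,000. Scan a large enough pool of candidate statistics and perfect “rules” are all but guaranteed. A quick simulation sketch in Python (the pool of one million indicators is an illustrative assumption, not a count of anything real):

```python
import numpy as np

rng = np.random.default_rng(1)

n_elections = 16          # presidential elections, 1940-2000 inclusive
n_indicators = 1_000_000  # hypothetical pool of candidate statistics

outcomes = rng.integers(0, 2, size=n_elections, dtype=np.int8)
predictors = rng.integers(0, 2, size=(n_indicators, n_elections),
                          dtype=np.int8)

# Each random indicator matches all 16 outcomes with probability 1/2**16,
# so out of a million we expect roughly fifteen perfect "rules"
perfect = int(np.all(predictors == outcomes, axis=1).sum())
print(f"{perfect} of {n_indicators} random indicators called all "
      f"{n_elections} elections correctly")
```

On a typical run it finds about fifteen perfect predictors, all of them meaningless.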

Big Data is a buzz phrase that cannot be ignored, but it shouldn’t be seen as a panacea. Human intervention and interpretation will always be needed; it will never be a matter of feeding all the information into a tool that tells you what to do. The great thing the Big Data movement can bring is better analysis tools to support those human decisions. That is why Bayes’ theorem is as popular as it is, and will continue to be: it states the analyst’s belief up front, so real-world thinking can be applied, rather than a lemming-like faith in the data alone.
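For reference, this is Bayes’ theorem in its standard form; the prior P(H) is exactly where the analyst’s up-front belief enters the calculation:

\[
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}
\]

Here H is a hypothesis, D the observed data, P(H) the analyst’s prior belief before seeing any data, and P(H | D) the updated belief afterwards.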

It will continue to be a matter of a trained analyst helping to filter down the variables, using the tools at their disposal, and then interpreting the data. The human race has not been replicated yet.