data, of considerable size

Data science, big data, data mining, MapReduce: buzzwords. Data is being generated at an increasing pace, and the amount of it keeps growing exponentially. Problems arise when handling such large amounts of information and interpreting what it all means. Drawing misleading conclusions from that data can happen by accident or on purpose.


Data is being generated by both people and machines. As the adoption of one technology reaches saturation, new technologies and new types of information gathering become available. Meanwhile, even after adoption saturates, the amount of data collected keeps growing well past that point. Machines collect data about people and about each other, and generate more data based on what was previously collected. It all gets stored somewhere, where no one is likely to ever review it. Luckily, the cost of storing data seems to decline at a comparable rate. The costs of managing and understanding it, however, might not be as affordable.

It’s somewhat comforting to know that much of the existing data is junk. There’s kittens, dick pics, b-roll footage, duplicates and backups. Storage space being so cheap, people can’t be bothered to go through that stuff themselves.

 big data and them scientists

Data science, to my understanding, consists of the collection, processing, normalisation and interpretation of data. For now, some of that has to be done manually, while most can be automated. The more standardised and consistent a stream of data is, the easier it is to automate its processing into valuable insights, estimations and predictions. Automation mainly helps with extracting meaningful data from large sets, or rather with crunching large amounts of numbers and calculating whatever figures statisticians like to juggle.
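To make those stages a bit more concrete, here is a toy sketch of the collection-to-normalisation part of such a pipeline: messy, inconsistently formatted records get coerced into one shape before any interpretation happens. The field names and records are invented for illustration.

```python
# Toy "collection → processing → normalisation" step: inconsistent
# records are coerced into a uniform shape before interpretation.
raw = [
    {"name": "Alice ", "age": "34"},   # trailing space, age as string
    {"NAME": "bob",    "AGE": 29},     # different key casing
]

def normalise(record):
    # lowercase the keys, strip whitespace, coerce age to an int
    r = {k.lower(): v for k, v in record.items()}
    return {"name": str(r["name"]).strip().lower(), "age": int(r["age"])}

clean = [normalise(r) for r in raw]
print(clean)  # uniform records, ready for aggregation
```

Only once the stream is this consistent does the fully automated number crunching become feasible.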

Traditionally, much of the available information could be handled within a computer’s memory, and a suitably competent person could directly query the wanted data out of it. Interpreting the information and drawing logical conclusions required, and arguably still requires, a human. Now artificial intelligence, as in algorithms and self-learning programs, can make some of those decisions. But as always, having people in the process can make the end result susceptible to abuse, ignorance or mental biases.

 correlation != causation

 Or absence of evidence does not equal evidence of absence.

Mistaking correlation for causation can happen fast and feel uncomfortably intuitive. Historically, and therefore biologically, people have had to deal with simple structures of cause and effect. Now that the world is all the more complex and advances accumulate on themselves, the mistake is all too easy to make. Things happen simultaneously and are interrelated, and it is increasingly difficult to establish the actual cause and effect, let alone isolate the two and be certain that one always follows the other.

Most errors happen due to ignorance. Many wrong conclusions are made, for example, based on luck and the skewness of the available information. Where failures are erased and silenced, successes are praised and celebrated. More importantly, successes get attention and occupy our minds. When we fail to acknowledge all or most of the possible outcomes, the one that happened by chance receives too much credit.

Sometimes we see what we want to see. Spinning numbers and looking for causation in a huge dataset can become self-fulfilling.

 deliberate misinterpretation

What is more unsettling is the deliberate misinterpretation of data. As the field gets more complex, it becomes far more difficult for “regular” people to spot manipulation behind the end results and “smart” conclusions.

Even if the data is correct and honest, it can still be presented in a misleading fashion. Large numbers are unintuitive to most people, as are statistics in general. Sometimes we get confused about percentages, or the difference between millions and billions.

I suspect numbers, graphs and pie charts often appear in an article to support and validate the ready-made opinion of the writer. Displaying data gives the message a somewhat scientific feel, and likely increases its believability.

Differentiating correlation from causation is important, as is being aware of the source and source material. The existence of information mined from large datasets is not automatically a sign of honest or smart interpretation of the original information.

