We live in an era of big data. Indeed, that moniker scarcely seems to do justice to the amount of data we produce and have available for making decisions. It scarcely seems believable that in such an era we could be suffering from a shortage of data, yet that’s precisely what a new book by David Hand, Emeritus Professor of Mathematics and Senior Research Investigator at Imperial College London, argues.
Dark Data explores the data we don’t have, and the complications this brings to our ability to make effective decisions. Hand argues that the phrase big data can lull us into a false sense of security: rather than overcoming the shortcomings of ‘small’ data, big data often suffers from just the same challenges.
While data is undoubtedly powerful, there are considerable problems when key data is missing, and this is far more common than we like to think. Hand describes fifteen different types of dark data, including the kind we know we’re missing and the kind we’re oblivious to.
The elephant in the room
Hand uses the famous example of a man laying down powder on the road to keep elephants away to illustrate his point. Obviously there are no elephants, so the man believes his powder is working, but the reality is he’s simply missing vital information on what would happen if he didn’t put powder down.
Similarly, Hurricane Sandy was lauded as the first social media natural disaster in history, with many excitedly believing that tweets would play a huge part in relief and rescue efforts. Of course, while the 20 million tweets sent during Sandy were significant, they were concentrated in a small, densely populated part of the total area affected by the storm, so they presented a wholly incomplete picture. What’s more, the people most in need of help would probably have been unable to tweet in the first place.
The book details the various ways in which dark data exists and the various circumstances in which it can arise. Through this, Hand hopes that organizations will be better able to detect and mitigate the impact of dark data on their decision making.
Anyone with an interest in data, statistics or AI should find something of value in the book, and it’s one I recommend.