Making it easier to clean big data

Big data is great, but it’s only really useful if you can derive insights from it. Because much of the data we harvest is messy and unstructured, organizations often spend far more time tidying it up than they do gaining insights from it.

I’ve written before about automated approaches to this problem, such as ActiveClean, a project from researchers at Columbia University and the University of California at Berkeley. It uses prediction models to test datasets, then uses the results to identify the fields that need cleaning while updating the models at the same time.

“Big data sets are still mostly combined and edited manually, aided by data-cleaning software like Google Refine and Trifacta or custom scripts developed for specific data-cleaning tasks,” the researchers say. “The process consumes up to 80 percent of analysts’ time as they hunt for dirty data, clean it, retrain their model and repeat the process. Cleaning is largely done by guesswork.”

Error spotting

Whilst sorting out what is largely inconsequential data takes up a major part of an analyst’s time, there is also the sizeable task of cleaning up erroneous data that can skew datasets.

A new tool called Vizier, developed by researchers at the University at Buffalo, aims to help by proactively catching data errors. The tool lets users work with datasets interactively, cleaning, curating and visualizing data in what the team hope are meaningful ways.

The tool is intended for very large datasets with millions of data points.

“We are creating a tool that’ll let you work with the data you have, and also unobtrusively make helpful observations like ‘Hmm… have you noticed that two out of a million records make a 10 percent difference in this average?’” the team say.
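To make that kind of observation concrete, here is a minimal Python sketch of how a handful of records with an outsized effect on an average might be flagged. It is not Vizier's code or method: the function name `most_influential_records` and the leave-one-out ranking are assumptions chosen purely for illustration.

```python
from typing import List, Tuple


def most_influential_records(values: List[float], k: int = 2) -> Tuple[List[int], float]:
    """Hypothetical helper (not part of Vizier).

    Returns the indices of the k records whose individual removal shifts
    the mean the most, plus the relative change in the mean when all k
    of those records are dropped together.
    """
    n = len(values)
    if k < 1 or n <= k:
        raise ValueError("need at least one flagged record and more records than k")

    total = sum(values)
    mean = total / n
    if mean == 0:
        raise ValueError("relative change is undefined for a zero mean")

    def leave_one_out_shift(i: int) -> float:
        # How far the mean moves if record i alone is removed.
        return abs((total - values[i]) / (n - 1) - mean)

    # Rank every record by its individual effect and keep the top k.
    flagged = sorted(range(n), key=leave_one_out_shift, reverse=True)[:k]

    # Recompute the mean without the flagged records and compare.
    remaining_mean = (total - sum(values[i] for i in flagged)) / (n - k)
    relative_change = abs(remaining_mean - mean) / abs(mean)
    return sorted(flagged), relative_change


if __name__ == "__main__":
    # Illustrative data only: 998 ordinary readings plus two extreme ones.
    data = [100.0] * 998 + [50_000.0, 60_000.0]
    indices, change = most_influential_records(data, k=2)
    print(f"Records {indices} change the average by {change:.1%}")
```

Ranking records one at a time is a deliberate simplification. It will not always find the set of records with the largest combined effect, but it is cheap to compute and is enough to surface the sort of "two out of a million" anomaly the researchers describe.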

The hope is that these kinds of automated tools will make it easier for a wider range of organizations to start using data, as they won’t have to invest huge sums in building the kind of teams required to clean up the data they’re collecting.

This is especially relevant as a growing number of government agencies release data, and open data is rapidly becoming the de facto way of operating for governments across the Western world.

Vizier is an open source tool, and its developers hope it will play an active and positive role in this ecosystem.

“We want to make it easier for data scientists — and eventually data hobbyists — to discover and communicate not only what the data says, but why the data says that,” the team say.
