Big data is all the rage, but there is much to suggest that few organizations are truly using it in their decision making. The reasons are many, but they include the grueling and thankless task of cleaning up the data we hold, which is often in such a messy state that quick insights are out of reach. There are also difficulties in knowing which parts of our data are actually useful for making predictions.
It’s on this latter task that a team of MIT researchers has developed an automated tool. The researchers recently published a pair of papers on the process, covering both the preparation of data and the creation of problem specifications.
“The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,” the authors say. “[Data scientists want to know], ‘Why don’t you show me the top 10 things that I can do the best, and then I’ll dig down into those?’ So [these methods are] shrinking the time between getting a data set and actually producing value out of it.”
Real-world problems
The researchers aimed to keep their work as grounded in real-world challenges as possible; indeed, the genesis of their study was the frequent complaints brought to them by industry researchers. For instance, data scientists would commonly take months to define a prediction problem, even when the data was ready and available.
The researchers, who are bringing their tool to market via their company Feature Labs, developed a new programming language, called Trane, to reduce the time data scientists spend defining prediction problems from months to days. The team is confident that similar improvements can be made for label-segment featurize (LSF) processes.
The system was tested on real-world questions posed by data scientists working with around 60 datasets. Even with this relatively small sample, the system was able to devise not only all of the questions the data scientists had posed, but also many they hadn’t considered.
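To make the idea concrete, here is a minimal, purely illustrative sketch of how a tool might enumerate candidate prediction problems by composing simple operations over a table schema. The schema, the operation vocabulary, and the function names below are all hypothetical, chosen only to show the enumeration idea; they do not reflect Trane’s actual API.

```python
# Illustrative sketch (not Trane's real API): generate candidate
# prediction problems by combining columns with type-appropriate
# aggregation and threshold operations.
from itertools import product

# Hypothetical schema for a ride-sharing table: column name -> type
schema = {"trip_duration": "numeric", "fare": "numeric", "cancelled": "boolean"}

# A small operation vocabulary, keyed by column type
AGGREGATIONS = {"numeric": ["sum", "mean", "max"], "boolean": ["any", "count"]}
THRESHOLD_OPS = {"numeric": ["greater_than"], "boolean": ["equals_true"]}

def enumerate_problems(schema):
    """Return human-readable candidate prediction problems."""
    problems = []
    for col, col_type in schema.items():
        for agg, op in product(AGGREGATIONS[col_type], THRESHOLD_OPS[col_type]):
            problems.append(
                f"Predict whether {agg}({col}) over the next time window "
                f"satisfies {op}"
            )
    return problems

for p in enumerate_problems(schema):
    print(p)
```

Even this toy version produces eight distinct problems from a three-column table, which hints at how exhaustive enumeration can surface questions the analysts themselves hadn’t thought to ask.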
The work represents a significant step toward letting data scientists represent prediction problems more efficiently, so that those problems can be shared more easily between data analysts and domain experts, an area of real difficulty at the moment.
It’s likely to be one of many tools that emerge to make us more effective at working with the big data that our organizations are generating.