Organizations have a rapidly expanding amount of data to work with, but deriving insight from that data is a challenge, not least because of persistent skills shortages in the industry. Data analysis typically begins with the identification of so-called ‘features’: data points believed to have predictive power. Identifying these features is usually something that requires a degree of experience.
A team from MIT have developed a new tool, called FeatureHub, that they believe will make feature identification easier. The tool, which is documented in a recently published paper, takes a collaborative approach to the task, with data scientists working together to review a problem and propose features, which are then tested by the software against target data to gauge their usefulness.
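The core loop described above (a human proposes a feature, the software scores it against target data) can be sketched in a few lines. The paper does not specify FeatureHub's scoring internals; a real system would use cross-validated model performance, so the simple correlation-with-target score below, and all of the names and toy data, are purely illustrative:

```python
# Minimal sketch of scoring a candidate feature against target data.
# A crude stand-in for "usefulness": the absolute correlation between
# the feature's values and the prediction target.

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def score_feature(extract, rows, target):
    """Apply a proposed feature function to raw rows and score it."""
    values = [extract(row) for row in rows]
    return abs(pearson(values, target))

# Two candidate features over toy "customer" records.
rows = [{"visits": v, "age": a} for v, a in
        [(1, 60), (3, 42), (5, 35), (7, 30), (9, 22)]]
target = [0, 0, 1, 1, 1]  # e.g. whether the customer made a purchase

for name, fn in [("visits", lambda r: r["visits"]),
                 ("age", lambda r: r["age"])]:
    print(name, round(score_feature(fn, rows, target), 2))
```

On a collaborative platform, many such candidate functions would arrive from different contributors and be ranked by score before the best ones are combined into a final model.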
When the software was tested, a team of 32 data scientists each spent five hours with it before tackling a couple of data-science challenges. The features proposed through the system were compared with those submitted by the community at Kaggle, with each rated on a 100-point scale. Interestingly, on both challenges the software's suggestions came within 3-5 points of the winning Kaggle entries.
Efficient solutions
Where the software comes into its own is in the timeliness of its suggestions. Whereas a high-performing entry on Kaggle would usually take at least a few weeks of work, FeatureHub returned a suggestion within days.
The team hope that the platform will eventually reach a scale similar to that of its namesake and inspiration, GitHub, the huge platform and repository for open-source programming projects.
“I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention,” they say. “I think that the concept of massive and open data science can be really leveraged for areas where there’s a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses.”
The project builds upon previous work by the team, which I documented last year. That work was chronicled in a couple of papers, covering the automated preparation of data and even the creation of problem specifications.
“The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,” the authors say. “[Data scientists want to know], ‘Why don’t you show me the top 10 things that I can do the best, and then I’ll dig down into those?’ So [these methods are] shrinking the time between getting a data set and actually producing value out of it.”
The researchers, who are bringing their tool to market via their Feature Labs company, developed a new programming language, called Trane, to reduce the time data scientists spend on defining prediction problems to days rather than months. The team are confident that similar improvements can be made for label-segment featurize (LSF) processes.
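The label-segment-featurize idea can be illustrated on a toy event log: segment a history into an observation window and a label window, compute the prediction target from the latter, and compute features from the former. This is plain Python, not Trane's actual language, and the function names and problem definition are hypothetical:

```python
# Illustrative label-segment-featurize (LSF) sketch over a toy event log:
# (day, amount) purchase records for a single customer.

events = [(1, 10.0), (4, 25.0), (9, 5.0), (12, 40.0), (15, 15.0)]

def segment(events, cutoff_day):
    """Split history into an observation window and a label window."""
    past = [e for e in events if e[0] <= cutoff_day]
    future = [e for e in events if e[0] > cutoff_day]
    return past, future

def label(future):
    """Prediction target: does the customer purchase again?"""
    return len(future) > 0

def featurize(past):
    """Simple hand-built features over the observation window."""
    amounts = [a for _, a in past]
    return {"n_purchases": len(amounts),
            "total_spend": sum(amounts),
            "mean_spend": sum(amounts) / len(amounts)}

past, future = segment(events, cutoff_day=10)
x, y = featurize(past), label(future)
print(x, y)
```

Enumerating many such cutoffs, label definitions, and feature sets is what makes defining prediction problems so time-consuming by hand, and is the step the researchers aim to accelerate.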
Suffice it to say, the project is at an early stage, but given the challenges inherent in making sense of data, it's an interesting area to follow.