Despite the vast quantity of information online, answering specific questions from it remains difficult. This is in large part because computers struggle to classify plain text, a problem that AI researchers are still working to solve.
A recent paper highlights a new approach to the extraction of this information that promises to turn conventional machine learning on its head.
A fresh approach
The conventional approach typically involves training an algorithm on data labeled by human annotators, so that it can learn to search for patterns that match those labels. The general feeling is that the more data the algorithm is fed, the better equipped it will be to deal with new challenges.
The paper argues for the opposite approach, however, with the algorithms trained on minimal data. This option is often forced on researchers because of a paucity of appropriate data.
“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” the authors say. “That’s very different from what you or I would do. When you’re reading an article that you can’t understand, you’re going to go on the web and find one that you can understand.”
So that’s what the algorithm was programmed to do. It begins by assigning each classification a confidence score that indicates how accurate the system believes it to be. If this score is too low, it automatically queries a search engine on the topic to draw upon other texts about it.
It looks at each search result in turn, continuously re-evaluating the confidence score based upon the new information, and returning to the knowledge pool whenever the score remains too low.
“The base extractor isn’t changing,” the team say. “You’re going to find articles that are easier for that extractor to understand. So you have something that’s a very weak extractor, and you just find data that fits it automatically from the web.”
The interesting thing is that all of this takes place autonomously, from the evaluation of weaknesses to the construction of the search query to the whole cycle starting over again.
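The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the names `Extraction`, `extract_with_fallback`, `toy_extractor` and `toy_search`, the 0.8 threshold, and the rule of simply keeping the highest-confidence result are all assumptions made for the example, and the search engine is replaced by a stub that returns canned text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Extraction:
    value: str         # the extracted answer (empty if nothing was found)
    confidence: float  # how much the extractor trusts this answer, 0.0 to 1.0


def extract_with_fallback(
    article: str,
    query: str,
    extractor: Callable[[str], Extraction],
    search: Callable[[str], Iterable[str]],
    threshold: float = 0.8,   # hypothetical cut-off for "confident enough"
    max_results: int = 10,    # hypothetical cap on documents consulted
) -> Tuple[Extraction, int]:
    """Run a fixed extractor, consulting search results while confidence is low."""
    best = extractor(article)
    consulted = 0
    if best.confidence >= threshold:
        return best, consulted

    # The extractor itself never changes; only the text it is given does.
    for document in search(query):
        if consulted >= max_results:
            break
        consulted += 1
        candidate = extractor(document)
        # Assumed rule: keep whichever extraction has the highest confidence.
        if candidate.confidence > best.confidence:
            best = candidate
        if best.confidence >= threshold:
            break
    return best, consulted


def toy_extractor(text: str) -> Extraction:
    """A deliberately weak extractor that only understands one phrasing."""
    match = re.search(r"located in (\w+)", text)
    if match:
        return Extraction(match.group(1), 0.9)
    return Extraction("", 0.2)


def toy_search(query: str) -> Iterable[str]:
    """Stand-in for a search engine; returns canned documents."""
    return [
        "No useful detail here.",
        "The plant is located in Ohio.",
        "This document is never reached.",
    ]


result, consulted = extract_with_fallback(
    "The facility, somewhere in the Midwest, reopened.",
    "facility reopened location",
    toy_extractor,
    toy_search,
)
print(result.value, result.confidence, consulted)
```

In this toy run the original article is phrased in a way the weak extractor cannot handle, so the loop consults the stubbed search results and stops at the second one, which uses a phrasing the extractor does understand.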
In initial experiments where the system was fed around 300 documents to begin with, the algorithm proved adept at determining appropriate search terms to beef up its knowledge, and then mining on average around 10 articles from the web in order to do so.
When this was compared against more traditional machine-learning approaches on the same task, the new method outperformed them by around 10%.