Amid the ongoing fake news debate, the news media has come under considerable scrutiny in recent years. A big part of this trend has been the enormous growth in content, with digital channels churning out material around the clock. Whilst some of this content might be somewhat ‘fake’, there is also a good deal of ‘me too’ content that adds little to the discourse.
Research led by the University of Pennsylvania uses an AI tool to autonomously rank content according to its ‘content density’. The system was able to accurately sort and classify news stories across a range of domains by comparing each piece of content with articles that had already been correctly classified.
The algorithm was trained on a batch of around 50,000 articles from the New York Times linguistic dataset. This contains not only the original articles, but also their metadata and a short summary of each piece. The leading paragraph of each story was then compared with its summary, and the difference between the two was used as an indicator of the information richness of the piece. The summary is incredibly dense by design, which makes it a good benchmark to compare against.
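The write-up doesn't specify how the lead paragraph and summary were actually scored against each other, so the following is only a minimal sketch of the idea, measuring lexical overlap between the two; the function names and the cosine-overlap metric are illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    # Lowercased bag-of-words term frequencies.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def density_proxy(lead_paragraph: str, summary: str) -> float:
    # Rough content-density signal: how much of the (deliberately
    # dense) summary's information the lead paragraph carries.
    # A low score suggests the lead conveys little key information.
    return cosine_similarity(tf_vector(lead_paragraph), tf_vector(summary))
```

Whatever measure the researchers actually used, the contrast between a dense, editor-written summary and a possibly padded lead paragraph is the core of the signal.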
Rating the news
The content density is therefore the difference between these two measures: the information in the lead paragraph and the information in the summary. The stories were initially rated by a combination of recruits from Mechanical Turk, the research team, and the algorithm they’d developed. These articles, together with their content-density ratings, were then fed to the algorithm so that it could develop an understanding of what is and is not content dense.
As you might expect, what counts as content dense varies significantly depending on the topic of the story. Sports stories, for instance, tend to sit towards the less content-dense end of the spectrum.
The algorithm was put through its paces against a subset of articles that had already been accurately labeled. It matched the human-assigned density ratings in around 80% of instances, which is a reasonable starting point.
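The article doesn't name the model the team trained, so the following is only a sketch of the train-and-evaluate loop described above, assuming scikit-learn, a TF-IDF bag-of-words representation, and logistic regression; `train_density_classifier` and its parameters are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_density_classifier(leads, labels, test_size=0.2, seed=0):
    """Fit a bag-of-words classifier on human-labeled lead paragraphs
    (1 = content dense, 0 = not) and report accuracy on a held-out,
    already-labeled subset. In the study, the labels came from
    Mechanical Turk recruits and the research team."""
    train_x, test_x, train_y, test_y = train_test_split(
        leads, labels, test_size=test_size, random_state=seed)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000))
    model.fit(train_x, train_y)
    accuracy = accuracy_score(test_y, model.predict(test_x))
    return model, accuracy
```

Evaluating against a held-out, already-labeled subset, as sketched here, is what produces an agreement figure like the roughly 80% reported: the model's predictions are simply compared against human ratings it never saw during training.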
“We have confirmed that the automatic annotation of data captures distinctions in informativeness as perceived by people,” the authors say. “We also show proof-of-concept experiments that show how the approach can be used to improve single-document summarization of news and the generation of summary snippets in news-browsing applications. In future work the task can be extended to more fine-grained levels, with predictions on sentence level and the predictor will be integrated in a fully functioning summarization system.”