Fake news has been one of the defining topics of recent years, and has prompted at least a modicum of action from social networks and other intermediaries. Perhaps the most visible example was Facebook's announcement that it would hire moderators to police the content shared on its platform, but given the scale and scope of the 'fake news industry', that was always going to be a losing battle.
A team from the University of Michigan believe they've developed an algorithm that can do a better job autonomously. The system, documented in a recently published paper, looks for linguistic cues to identify stories that appear to be fake news.
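The paper's exact feature set isn't reproduced here, but the general flavor of linguistic-cue analysis can be sketched in a few lines of Python. The features below (sentence length, punctuation density, and so on) are illustrative assumptions rather than the team's published list.

```python
# Illustrative sketch only: the paper describes linguistic-cue features,
# but these particular cues are assumptions, not the team's exact set.
import re

def linguistic_features(text: str) -> dict:
    """Extract simple stylistic cues from a news story."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return {
        "avg_sentence_length": n_words / n_sents,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "exclamation_density": text.count("!") / n_words,
        "question_density": text.count("?") / n_words,
        "all_caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
    }
```

A classifier trained on labeled stories can then learn which combinations of such cues tend to accompany fabricated content.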
During testing, the system correctly spotted fake news stories 76% of the time, which compares favorably with a human success rate of just 70%. What's more, the team believe their linguistics-based approach allows fake news to be spotted even before it has been debunked on fact-checking sites.
Turning the tide
Most social media and news aggregation sites today rely on human editors to keep on top of the influx of news stories. These editors often depend on external fact-verification services, which themselves struggle to keep up with the very latest stories, meaning the damage may already be done by the time a fake story is detected. By using linguistic analysis, the team hope to significantly increase both the speed and the accuracy of checking.
“You can imagine any number of applications for this on the front or back end of a news or social media site,” the researchers say. “It could provide users with an estimate of the trustworthiness of individual stories or a whole news site. Or it could be a first line of defense on the back end of a news site, flagging suspicious stories for further review. A 76 percent success rate leaves a fairly large margin of error, but it can still provide valuable insight when it’s used alongside humans.”
Such linguistic algorithms are increasingly deployed in the analysis of speech, but the challenge here was finding the right kind of data to train the system successfully. The very nature of fake news can make it hard to collect, as stories can rise and fall incredibly quickly. It's also spread across multiple genres, which complicates collection further.
In the end, this data-collection challenge led the team to build their own dataset via crowdsourcing, paying people to turn genuine news stories into fakes. The team suggest this mirrors how most fake news is actually generated, with humans producing stories in return for a financial bounty.
This produced a collection of some 500 labeled stories on which the algorithm was trained to distinguish fake stories from real ones. It was then tested on a second dataset to see how accurately it could spot the fakes.
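The paper's precise classifier isn't detailed here, so as a rough illustration, a standard supervised train-then-test workflow of the kind described above might look like the following. The TF-IDF features, the linear SVM, and the toy stand-in data are all assumptions for the sketch, not the team's published setup.

```python
# Hedged sketch of the supervised train/test workflow described above. The
# TF-IDF features and linear SVM are illustrative assumptions; the team's
# actual linguistic feature set and classifier may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Toy stand-in for the ~500 crowdsourced stories (1 = fake, 0 = real).
train_texts = ["Scientists confirm a modest result after review.",
               "SHOCKING cure that doctors are hiding from you!!!"]
train_labels = [0, 1]
test_texts = ["Officials report quarterly figures on schedule.",
              "You won't BELIEVE what happened next!!!"]
test_labels = [0, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word and bigram features
    LinearSVC(),                          # linear classifier over those features
)
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(f"accuracy: {accuracy_score(test_labels, predictions):.0%}")
```

Evaluating on a held-out second dataset, as the team did, is what gives the 76% figure meaning: it measures how well the learned cues generalize to stories the algorithm has never seen.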
The team plan to make the details of their system freely available online so that news sites, social media networks, and aggregators can build on it to create their own autonomous fake news detection systems. They believe the system could be made even more robust if metadata, such as links and comments, were incorporated into the analysis.
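As a loose illustration of that idea, metadata signals could simply be appended to the linguistic features before classification. The field names below are assumptions based on the examples the team mention, not a published schema.

```python
# Hedged sketch: folding metadata counts into the feature set. The field
# names ("links", "comments") are assumptions drawn from the examples the
# team mention, not a published schema.
def combined_features(text_features: dict, metadata: dict) -> dict:
    """Append simple metadata counts to existing linguistic features."""
    features = dict(text_features)
    features["link_count"] = metadata.get("links", 0)
    features["comment_count"] = metadata.get("comments", 0)
    return features

# Example: merge with the linguistic cues from the first sketch.
features = combined_features(
    {"avg_sentence_length": 18.5},   # placeholder linguistic features
    {"links": 3, "comments": 42},    # hypothetical story metadata
)
```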