Exploring The Language Of Fake News

Recently I looked at work done by a team from the Fraunhofer Institute in Germany to identify fake news, both from the content itself and from the metadata associated with it.  Such projects are proliferating at quite a pace, the latest coming from an MIT-based team, who have documented their work in a recently published paper.

The paper highlights how there are subtle yet consistent differences between real and fake news stories, and how machine-learning algorithms can be trained to spot them.  The researchers developed a deep-learning model that learns to distinguish the language patterns of real and fake news.

The model was tested on a new topic that it hadn’t encountered in training, which required it to classify each article based purely on the language patterns it observed.  The authors believe that in this sense it more realistically replicates how humans consume the news.

“In our case, we wanted to understand what was the decision-process of the classifier based only on language, as this can provide insights on what is the language of fake news,” the researchers say.  “A key issue with machine learning and artificial intelligence is that you get an answer and don’t know why you got that answer, and showing these inner workings takes a first step toward understanding the reliability of deep-learning fake-news detectors.”

The anatomy of fake news

The analysis surfaced a number of words that are more likely to appear in either real or fake news, which in turn allowed the researchers to pinpoint subtle differences in the language used in each form of content.
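The article doesn’t say how these words were surfaced, but one simple way to get a feel for vocabulary that skews towards one class is a smoothed log-odds comparison of word frequencies across the two corpora.  The sketch below is illustrative only, not a reconstruction of the paper’s analysis, and assumes fake_articles and real_articles are plain Python lists of article strings.

```python
# A minimal sketch (not the paper's method): rank words by smoothed log-odds
# of appearing in the fake-news corpus versus the real-news corpus.
from collections import Counter
import math
import re

def word_counts(texts):
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def skewed_words(fake_texts, real_texts, top_n=20, alpha=1.0):
    fake, real = word_counts(fake_texts), word_counts(real_texts)
    fake_total, real_total = sum(fake.values()), sum(real.values())
    vocab = set(fake) | set(real)
    scores = {
        w: math.log((fake[w] + alpha) / (fake_total + alpha * len(vocab)))
         - math.log((real[w] + alpha) / (real_total + alpha * len(vocab)))
        for w in vocab
    }
    ranked = sorted(scores, key=scores.get)
    # Highest scores lean towards fake news, lowest towards real news.
    return ranked[-top_n:], ranked[:top_n]

# fake_articles and real_articles are placeholder lists of article strings:
# leans_fake, leans_real = skewed_words(fake_articles, real_articles)
```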

The system was trained on a sample of around 12,000 fake news articles, drawn from 244 different websites, that were collated on the data-science site Kaggle.  These were then contrasted with around 2,000 real articles from the New York Times and 9,000 or so from the Guardian.
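For readers who want to experiment, a corpus along these lines can be assembled in a few lines of pandas.  This is only a hedged sketch: the file names and column names below are assumptions rather than the actual schema of the datasets the team used.

```python
# A hedged sketch of how such a corpus might be assembled and labelled;
# the file names and column names here are assumptions, not the paper's.
import pandas as pd

# Kaggle-style fake-news dump: assume one article per row with a "text" column.
fake = pd.read_csv("fake_news_sample.csv")[["text"]]
fake["label"] = 1  # 1 = fake

# Real articles exported from the New York Times and the Guardian, same shape assumed.
real = pd.concat([
    pd.read_csv("nyt_articles.csv")[["text"]],
    pd.read_csv("guardian_articles.csv")[["text"]],
])
real["label"] = 0  # 0 = real

corpus = pd.concat([fake, real], ignore_index=True).sample(frac=1, random_state=0)
print(corpus["label"].value_counts())  # roughly 12,000 fake vs 11,000 real
```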

The model was first put to the test in the traditional way, with training and test articles drawn from the same pool of topics, to see whether it could identify the fake stories.  The researchers worried, however, that this might build an inevitable bias into the system, as some topics are more prone to fake stories than others.

So the researchers also trained their model on less common topics, holding out a prominent one, Donald Trump: any story containing the word “Trump” was excluded from training, and the model was then tested only on stories that did contain it.  The results revealed that the traditional approach achieved an accuracy of around 93%, while the held-out-topic approach performed at 87%.
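The two evaluation set-ups are easy to mimic.  The sketch below uses a simple TF-IDF and logistic-regression stand-in rather than the team’s deep-learning model, and assumes the labelled corpus DataFrame from the earlier sketch.

```python
# A hedged sketch of the two evaluation set-ups described above, using a
# TF-IDF + logistic-regression stand-in for the paper's deep-learning model.
# `corpus` is assumed to be a DataFrame with "text" and "label" columns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                    LogisticRegression(max_iter=1000))

# 1. Traditional split: train and test articles come from the same pool of topics.
train_x, test_x, train_y, test_y = train_test_split(
    corpus["text"], corpus["label"], test_size=0.2, random_state=0
)
clf.fit(train_x, train_y)
print("mixed-topic accuracy:", clf.score(test_x, test_y))

# 2. Topic hold-out: keep every "Trump" story out of training, test only on those.
is_trump = corpus["text"].str.contains("Trump", case=False)
clf.fit(corpus.loc[~is_trump, "text"], corpus.loc[~is_trump, "label"])
print("held-out-topic accuracy:",
      clf.score(corpus.loc[is_trump, "text"], corpus.loc[is_trump, "label"]))
```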

To understand matters further, the researchers traced the model’s steps backwards, so that each time a prediction was made they could identify the precise word or part of the story that triggered it.  The work is far from complete, and the team admit that they need to polish the system a lot more before it can be of real value to readers.  For instance, they may choose to combine the model with automated fact-checkers to help readers combat misinformation.
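One common way to trace a prediction back to individual words, which may or may not match the team’s own method, is occlusion: remove each word in turn and measure how much the predicted fake-news probability moves.  A minimal sketch, reusing the fitted clf pipeline from the previous example:

```python
# A hedged sketch of occlusion-based attribution: drop each word in turn and
# see how much the fake-news probability changes. `clf` is the fitted
# pipeline from the previous sketch; the paper's own attribution may differ.
def word_influence(clf, article, top_n=10):
    words = article.split()
    base = clf.predict_proba([article])[0][1]  # probability of "fake" (label 1)
    influence = []
    for i, word in enumerate(words):
        occluded = " ".join(words[:i] + words[i + 1:])
        influence.append((word, base - clf.predict_proba([occluded])[0][1]))
    # Words whose removal lowers the fake probability most pushed it up the most.
    return sorted(influence, key=lambda pair: pair[1], reverse=True)[:top_n]

# Example usage with a placeholder article string:
# for word, delta in word_influence(clf, some_article):
#     print(f"{word}: {delta:+.3f}")
```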

“If I just give you an article, and highlight those patterns in the article as you’re reading, you could assess if the article is more or less fake,” they explain. “It would be kind of like a warning to say, ‘Hey, maybe there is something strange here.'”
