How We Feel When AI Flags Toxic Speech

Toxic speech has sadly become all too common online, and the scale of many online communities makes human policing of it incredibly difficult for platform owners.  As such, they have turned to AI to help identify toxic speech in their communities.  New research from Stanford University explores how we feel about such tools being used in this way.

The study finds that despite these AI tools performing well in technical tests, they often frustrate users, because identifying hate speech and misinformation is nothing if not messy.

“It appears as if the models are getting almost perfect scores, so some people think they can use them as a sort of black box to test for toxicity,” the researchers say. “But that’s not the case. They’re evaluating these models with approaches that work well when the answers are fairly clear, like recognizing whether ‘java’ means coffee or the computer language, but these are tasks where the answers are not clear.”

Stopping hate speech

Facebook alone has said that its AI tools successfully removed around 27 million pieces of hate speech in just a few months during 2020.  In nearly all of these cases the tools acted before humans had flagged the content as an issue.

This represents considerable progress, and it is being mirrored by the other major platforms, all of which have deployed AI-based tools of their own, as the vast quantity of content posted online makes human moderation impossible.

The study, however, highlights the gap between how well developers think these tools are performing and the reality, in the hope that more sophisticated and nuanced tools can be developed in the future.

Prickly problem

Suffice it to say, it’s a problem with no straightforward solution, not least because there is no unanimous agreement on what are often contested issues.  What’s more, people often react in distinctly different ways to the same piece of content.

For instance, the researchers highlight how difficult it was for human annotators to agree on tweets containing words from a lexicon of hate speech.  Just 5% of those tweets received majority agreement, and just 1.3% unanimous agreement.  Similar findings have emerged when trying to understand what is and is not misinformation.

And yet, despite this, AI models consistently score around 0.95 on the ROC AUC metric, a scale on which 1.0 represents perfect performance and 0.5 is no better than guesswork.  These are lab-based tests, however, and the Stanford research finds that in the real world the score falls to a maximum of 0.73.
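For readers unfamiliar with the metric, the short sketch below shows how a ROC AUC score is computed using scikit-learn.  The labels and classifier scores are made up purely for illustration and are not drawn from the Stanford study; the point is simply that a model that ranks every toxic item above every non-toxic one scores 1.0, while random scoring hovers around 0.5.

```python
# A minimal sketch of how a ROC AUC score is computed, using scikit-learn.
# The labels and classifier scores below are made up for illustration only.
import random

from sklearn.metrics import roc_auc_score

# Hypothetical "ground truth" labels: 1 = toxic, 0 = not toxic
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Probabilities a hypothetical classifier assigns to the "toxic" class
y_scores = [0.92, 0.10, 0.85, 0.60, 0.30, 0.45, 0.75, 0.20]

# Every toxic item is ranked above every non-toxic one, so the score is 1.0
print(roc_auc_score(y_true, y_scores))

# A classifier that assigns scores at random hovers around 0.5
random_scores = [random.random() for _ in y_true]
print(roc_auc_score(y_true, random_scores))
```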

The researchers believe their work gives us a more nuanced insight into the performance of these tools by fully accounting for what people really believe and the level of disagreement among us.

Real-world views

They did this by developing an algorithm that filtered out noise such as misunderstanding, ambivalence, and inconsistency from the way in which we label content such as hate speech.  This provided them with a better estimate of the genuine level of disagreement.

Of particular importance, for instance, was whether an annotator labeled the same kind of language consistently.  These consistent judgments became “primary labels”, which were then used as a more precise dataset to capture the true range of opinions we hold about hateful content.
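To make the idea concrete, here is a deliberately simplified sketch of that intuition in Python.  The annotator names, labels, and the consistency rule (keep an annotator’s label only when they labeled repeated copies of the same item identically) are invented for the example and should not be read as the researchers’ actual algorithm.

```python
# A simplified, hypothetical illustration of the "primary labels" idea:
# keep only annotators who label repeated copies of the same item consistently,
# then measure how much genuine disagreement remains per item.
from collections import defaultdict

# (annotator, item, label) triples; items annotated twice by the same annotator
# act as a consistency check.  Labels: 1 = toxic, 0 = not toxic.
annotations = [
    ("ann_a", "tweet_1", 1), ("ann_a", "tweet_1", 1),   # consistent
    ("ann_a", "tweet_2", 0), ("ann_a", "tweet_2", 0),
    ("ann_b", "tweet_1", 1), ("ann_b", "tweet_1", 0),   # inconsistent -> dropped
    ("ann_c", "tweet_1", 0), ("ann_c", "tweet_1", 0),
    ("ann_c", "tweet_2", 1), ("ann_c", "tweet_2", 1),
]

# Group each annotator's labels per item
per_annotator = defaultdict(list)
for annotator, item, label in annotations:
    per_annotator[(annotator, item)].append(label)

# Keep a "primary label" only when the annotator was self-consistent
primary_labels = defaultdict(list)
for (annotator, item), labels in per_annotator.items():
    if len(set(labels)) == 1:
        primary_labels[item].append(labels[0])

# Disagreement per item: share of primary labels differing from the majority
for item, labels in sorted(primary_labels.items()):
    majority = max(set(labels), key=labels.count)
    disagreement = sum(l != majority for l in labels) / len(labels)
    print(item, labels, f"disagreement={disagreement:.2f}")
```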

This was then used to refine the datasets used to train the AI models to spot toxic content, misinformation, and so on.  The exercise highlighted a significant reduction in the performance of the AI tools, with a ROC AUC score of 0.73 for toxicity and just 62% accuracy in spotting misinformation.  Even the score for identifying pornography fell to just 0.79.

Knowing the limits

It is unrealistic to expect AI models to make decisions that satisfy everyone, especially on something as contentious as hate speech, which always generates disagreement.  Indeed, the situation may not even be improved by more precise definitions of hate speech, as users may simply suppress their true opinions in order to provide a societally acceptable answer.

If the platforms have a more realistic picture of what users actually believe, however, as well as which groups hold particular views, then it becomes easier for them to design systems that can make more informed decisions.  However such systems work, value judgments will be made that inevitably generate a degree of controversy.

“Is this going to resolve disagreements in society? No,” the researchers conclude. “The question is what can you do to make people less unhappy. Given that you will have to make some people unhappy, is there a better way to think about whom you are making unhappy?”
