Opinion: can we use machine learning to automate fact-checking and verification in the battle against fake news?

Newspapers containing fake headlines and stories displayed on a New York newsstand by the Columbia Journalism Review to educate news consumers about the dangers of misinformation. Photo: Michael Brochstein/SOPA Images/LightRocket via Getty Images

Since its creation, the internet has been evolving and working its way into every aspect of human life, and news is no exception. In fact, by reading this article in a browser rather than on printed paper, you are embracing this process. Recent studies show that approximately 65 percent of the US adult population accesses news through social media, and more than a billion people worldwide are active on Facebook alone every day.

This creates a whole new paradigm in which blogs, forums and social networking websites are not subject to traditional journalistic standards, resulting in lower-quality information being consumed by readers. When presented with a false statement and a true one and asked to indicate which is the fake, humans do just four percent better than chance, and readers typically find only a third of all text-based deceptions. This reflects the so-called "truth bias", the notion that people are more apt to judge communications as truthful.

Recently, innocent people were attacked by mobs in India after the attackers were led by rumours spread over the WhatsApp mobile messaging application. The recent presidential campaign in Brazil also saw a great deal of fake news spread through groups on the same platform. The fact that these messages are encrypted makes the conventional process of fact-checking and dealing with misinformation even less effective.

For more than a decade, agencies such as Snopes, PolitiFact and FactCheck.org have been working to prevent the spread of false news, hoaxes and incomplete or misleading information. Many press companies, websites and journalistic groups work on the hard tasks of monitoring social media, identifying potential false claims and debunking or confirming them.

Unfortunately, manual fact-checking is an intellectually demanding and laborious process. It takes an average of 13 hours for the true version to be shared after a rumour "peaks", so the correct account often receives less attention and rarely reaches those who first believed the lie. As Jonathan Swift observed, "falsehood flies, and truth comes limping after it".

As if the situation wasn't bad enough, humans have an automatic, unconscious response when exposed to evidence that contradicts what they already believe to be true. Sometimes, debunking a fake news story paradoxically not only fails to dislodge the false idea but strengthens people's confidence in it. This is known as the "backfire effect".

But with the aid of artificial intelligence, data scientists have been making good progress on the task of automatically detecting fake news or deception in a text. Just as some people display visual tells when lying, humans generally use spoken or written language differently when trying to deceive others, and these characteristics can be used to train an artificial intelligence algorithm to identify possible fake news.

Psycholinguistic studies have already shown the relationship between a number of linguistic aspects and the presence or absence of deception in a given piece of text. Applying these findings to the domain of news articles might be a way to increase people's awareness. For example, it has been shown that false news is, on average, less objective than documents describing real events. Other semantic aspects to take into account are emotiveness, affect, moral bias and formality. Syntactic characteristics have aided many natural language processing tasks and are also a great means of helping computers make sense of language.
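
To give a flavour of how such aspects can be measured, here is a minimal Python sketch using the off-the-shelf TextBlob library. Its subjectivity and polarity measures are crude stand-ins for the objectivity and emotiveness measures used in the research, not the instruments those studies actually describe.

    from textblob import TextBlob

    def linguistic_scores(text):
        """Approximate two deception-related aspects with scores in [0, 1]."""
        sentiment = TextBlob(text).sentiment
        objectivity = 1.0 - sentiment.subjectivity  # 1 = fully objective
        emotiveness = abs(sentiment.polarity)       # 1 = strongly emotive
        return {"objectivity": objectivity, "emotiveness": emotiveness}

    print(linguistic_scores("BREAKING: you will not believe this outrageous scandal!"))
    print(linguistic_scores("The committee published its annual report on Tuesday."))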

Data is the bottleneck. As with most machine learning models, the amount of data available for training is often what differentiates a good model from a bad one. In general terms, the more data you feed a machine, the better it gets at what it is supposed to do. Crawlers, a type of bot that navigates the web and saves its content, allow the collection of news articles to be automated and performed continuously. This is in itself not a simple process, and the wrong parts of an article are captured more often than one would like. After sufficient data has been collected and used to train a deep learning classifier, the model can be used to predict whether a document is fake or not, which is why it is called a classifier.
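
The crawling step, for instance, can be as small as the sketch below, which uses the requests and BeautifulSoup libraries to download a single page and keep only its paragraph text. The URL is a placeholder, and a real crawler would also follow links, queue pages and filter out the navigation menus and advertisements that so often get captured by mistake.

    import requests
    from bs4 import BeautifulSoup

    def fetch_article_text(url):
        """Download a page and return its paragraph text as one string."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Keep only <p> elements; menus and ads usually live elsewhere,
        # but some unwanted content will still slip through.
        paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
        return " ".join(paragraphs)

    text = fetch_article_text("https://example.com/some-news-story")  # placeholder URL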

There are many types of machine learning model, but the principle is roughly the same: finding correlations between the training data and the input to determine the prediction. The greater the similarity between the input and the fake news used to train the model, the higher the probability it will assign.
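
The sketch below illustrates this principle with the scikit-learn library. A TF-IDF text representation and a logistic regression stand in for the deep learning model mentioned above, and the four-article training set is only a toy, but the final predict_proba call is exactly the probability assignment being described.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labelled corpus: 1 = fake, 0 = real. A usable model needs thousands of examples.
    texts = [
        "SHOCKING miracle cure the doctors do not want you to know about!",
        "Aliens secretly control the government, insider reveals all!",
        "The central bank held interest rates steady on Thursday.",
        "City council approves budget for new public library.",
    ]
    labels = [1, 1, 0, 0]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # Probability that an unseen document is fake.
    print(model.predict_proba(["You will not BELIEVE this one weird trick!"])[0][1])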

The initial step with a piece of text is to measure each linguistic aspect by assigning it a continuous score. These scores range from zero to one, depending on how strongly the aspect being measured is present in the given text. For example, an article with a formality score of 0.84 is much more formal than one with a score of 0.22. After all the linguistic scores have been calculated, they serve as input for the machine, along with any other information it might take into account.
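
Feeding those scores to a model is then just a matter of lining them up in a fixed order, as in this sketch. The numbers and the tiny classifier are purely illustrative; in practice, the vector would be built from many more aspects and combined with other features before training.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per article; columns are hypothetical linguistic scores in [0, 1]:
    # [formality, objectivity, emotiveness, moral_bias]
    X = np.array([
        [0.84, 0.75, 0.10, 0.20],  # formal, objective piece
        [0.22, 0.30, 0.90, 0.70],  # informal, emotive piece
    ])
    y = [0, 1]  # 0 = real, 1 = fake (toy labels)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[0.50, 0.40, 0.60, 0.55]])[0][1])  # probability of fake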

With enough data, a good selection of linguistic aspects and a well-tuned machine learning model, a simple and automatic way to detect fake news might be closer than expected.


The views expressed here are those of the author and do not represent or reflect the views of RTÉ