DataHack Radio #21: Detecting Fake News using Machine Learning with Mike Tamir, Ph.D.

Pranav Dar 13 Jun, 2019 • 5 min read

Introduction

Fake news is one of the biggest scourges in our digitally connected world. That is no exaggeration. It is no longer limited to little squabbles – fake news spreads like wildfire and is impacting millions of people every day.

How do you deal with such a sensitive issue? Millions of articles are churned out on the internet every day – how do you tell real from fake? It’s not as easy as turning to a simple fact checker, since fact checkers are typically built on a story-by-story basis. Can we turn to machine learning?

It’s a prevalent and pressing issue – and hence we invited Mike Tamir, Ph.D., as our guest on DataHack Radio. Mike has been working on a project called FakerFact that aims to identify and separate truth from fiction. His team’s approach is based on using machine learning algorithms of the Natural Language Processing (NLP) variety.

In this episode, Kunal and Mike discuss several aspects of the FakerFact algorithms, including:

  • The idea behind FakerFact
  • How Mike and his team collect data for training the FakerFact NLP algorithms
  • The importance of updating existing datasets and retraining these algorithms
  • Dealing with biases in the data

And much, much more. I would recommend this podcast to EVERY data scientist – it touches on a critical issue plaguing our society.


All our DataHack Radio podcast episodes are available on the major podcast platforms – subscribe today!

I have summarized the episode discussion in this article. Happy listening!

 

The Idea Behind FakerFact

“The challenge of misinformation has been prevalent for years now and we still haven’t got our arms around it as a society.”

If you’ve worked on NLP projects, you know how difficult it is to detect intent in text. The sheer number of layers in human language feels overwhelming! Making a machine understand it takes a lot of effort.

Things have been improving in the last few years, however. There has been a huge leap in NLP frameworks, and we have covered the ground-breaking developments here. In short, NLP techniques can now parse through a given text and perform all sorts of human-level tasks.

FakerFact is a project Mike Tamir started with a few fellow researchers a couple of years ago. Most fact checkers available online tend to be black and white – they attempt to tell you whether a given piece of information is real or fake. FakerFact takes a different angle to fact-checking:

“Can we teach machine learning algorithms to tell the difference between bits of text that are just about education, reporting, etc. versus bits of text that are presenting opinions, using satire, are filled with hate speech, have a hidden agenda, etc.?”

You can read more here about how FakerFact works and how you can use it in your browser.
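To make this concrete, here’s a minimal sketch of what intent-style classification looks like in code. This is not FakerFact’s actual model – the labels and training snippets below are hypothetical – but it shows the shape of the task: predicting an intent class (journalism, satire, agenda, etc.) rather than issuing a binary real/fake verdict.

```python
# A minimal sketch of intent-style article classification.
# NOT FakerFact's actual model: labels and training texts are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each article gets an "intent" label rather than a real/fake verdict.
texts = [
    "The city council approved the budget on Tuesday after a public hearing.",
    "Wake up! They don't want you to know the truth about this scandal!",
    "Local man heroically finishes entire pizza, declares diet 'postponed'.",
]
labels = ["journalism", "agenda", "satire"]

# TF-IDF features plus logistic regression: a simple baseline classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# Probabilities per intent class instead of a single true/false answer.
probs = clf.predict_proba(["Officials confirmed the quarterly report today."])[0]
print(dict(zip(clf.classes_, probs.round(2))))
```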

 

Collecting Data for Training the FakerFact Algorithms and Combating Bias

“That’s one of the hardest challenges in data science.”

Mike and his team start with top-level domains. They use different algorithms to perform a reverse bootstrapping process, which helps the team drill down from the domain level to the individual article level for training.
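The episode doesn’t spell out the exact mechanics of that reverse bootstrapping, but carrying domain-level judgments down to individual articles looks a lot like weak supervision. Here’s a hedged illustration of that general idea – the domains, labels, and URLs below are made up:

```python
# A hedged illustration of weak labeling from domain-level judgments.
# This shows the general idea, not the team's actual procedure; the
# domains, labels, and articles are all hypothetical.
from urllib.parse import urlparse

# Domain-level labels, assigned by reviewing whole sites.
domain_labels = {
    "example-news.com": "journalism",
    "example-satire.com": "satire",
}

articles = [
    {"url": "https://example-news.com/story/123", "text": "..."},
    {"url": "https://example-satire.com/post/9", "text": "..."},
]

# Carry each domain's label down to its articles as a noisy (weak)
# training label; later passes can re-score articles and filter outliers.
weakly_labeled = [
    {**article, "label": domain_labels[urlparse(article["url"]).netloc]}
    for article in articles
    if urlparse(article["url"]).netloc in domain_labels
]
print(weakly_labeled[0]["label"])  # -> journalism
```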

One of the most important things they have to pay attention to is stratification. This is pretty understandable – you don’t want the model to be biased by the samples it sees, right? Mike illustrated this point using a brilliant example of right-wing vs. left-wing articles.

As a data scientist, you are going to love this section of the podcast. It’s really important for us to understand and mitigate bias right at the start of our data collection process. You can imagine how critical that is for a fact-based application like FakerFact.

Most of the fake news datasets we see online are based on certain events, like the 2016 US elections. That’s a very specific sample and can lead to serious bias in the model if used exclusively. It’s important to diversify using different domains and time periods.
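As a rough sketch of that stratification idea, here’s one way to balance a corpus across time periods so that no single news cycle dominates training. The period values and the tiny corpus are hypothetical:

```python
# A minimal sketch of stratified sampling across time periods.
# The "period" values and the tiny corpus are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "text":   ["t1", "t2", "t3", "t4", "t5", "t6"],
    "label":  ["journalism", "agenda"] * 3,
    "period": ["2016-election", "2016-election", "2016-election",
               "2018-midterms", "2018-midterms", "pre-2015"],
})

# Sample the same number of articles from each period so the model doesn't
# just memorize the vocabulary of a single event like the 2016 US elections.
balanced = df.groupby("period", group_keys=False).sample(n=1, random_state=0)
print(balanced["period"].value_counts())
```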

Separating truth from fiction is what FakerFact aims to do, and part of that involves relying on the audience to tell the algorithms whether a particular article is credible. But can you rely entirely on your audience to generate that insight? No! The FakerFact team has several strategies in place to mitigate any bias that might come from user feedback.

 

Updating the Datasets to Keep Up with the Growing Number of Articles

Collecting the data once isn’t going to cut it, given how quickly information spreads in today’s connected world and the sheer number of articles being churned out. Mike and his team constantly update their datasets – they’re on their fifth iteration right now.

“We are constantly scraping data. We have millions and millions of articles that are fed into our dataset.”

Of course, this means that with each update, the team needs to re-run their baseline checks. Are the models performing at the same level? Does the architecture need to change? Questions like these are essential to keep FakerFact at the top of its game.
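One way to operationalize that re-checking – and this is an assumption about how such a gate might look, not the team’s actual pipeline – is a baseline regression test that runs after every dataset refresh. The metric names and tolerance below are made up:

```python
# A hedged sketch of a regression check after each dataset refresh, not the
# FakerFact team's actual pipeline. Metric names and tolerance are assumptions.
def check_against_baseline(new_metrics: dict, baseline: dict,
                           tolerance: float = 0.02) -> bool:
    """Return True if every metric is within `tolerance` of the baseline."""
    regressions = {
        name: (baseline[name], value)
        for name, value in new_metrics.items()
        if value < baseline.get(name, 0.0) - tolerance
    }
    if regressions:
        # A drop this large suggests the new data shifted the task:
        # time to revisit features, labels, or the architecture itself.
        print(f"Regressions found: {regressions}")
        return False
    return True

baseline = {"accuracy": 0.91, "macro_f1": 0.88}     # from the last release
new_metrics = {"accuracy": 0.92, "macro_f1": 0.84}  # after retraining
assert not check_against_baseline(new_metrics, baseline)
```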

 

Dealing with Unknown Biases in the Data

Anyone who’s worked on even a slightly complicated NLP project knows that building a model is never smooth sailing. There will be obstacles along the way. You might overlook something, or an unknown bias might creep in that no one would have thought of in a million years.

Mike shared two examples his team encountered while building the FakerFact model. The first was about authors promoting themselves on Twitter.

But it’s the second example that really stood out for me. On a certain website (named in the podcast), the FakerFact algorithm consistently flagged the articles. The team couldn’t figure out why – the articles looked like typical journalistic pieces. Can you guess what the issue was?

The algorithms were parsing the comments section of each article along with the article itself, and the comments invariably skewed the results (that’s the state of comment sections on most political and journalistic websites). A classic example of unknown bias.
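A common fix for this kind of contamination is to strip comment sections before the text ever reaches the model. The heuristic below is a sketch, not the team’s actual solution – the class and id markers are common web conventions, not guaranteed for any particular site:

```python
# A heuristic sketch of stripping comment threads before classification,
# not the FakerFact team's actual fix. The class/id markers are common
# web conventions, not guaranteed for any particular site.
from bs4 import BeautifulSoup

def article_text_only(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose class or id suggests a comment section.
    for selector in ('[class*="comment"]', '[id*="comment"]',
                     '[class*="discussion"]', '[id*="disqus"]'):
        for tag in soup.select(selector):
            if not tag.decomposed:   # its parent may already be gone
                tag.decompose()
    return soup.get_text(separator=" ", strip=True)

html = ('<article><p>The council met on Tuesday.</p>'
        '<div class="comments"><p>WAKE UP, SHEEPLE!</p></div></article>')
print(article_text_only(html))  # -> "The council met on Tuesday."
```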

 

Mike Tamir’s Industry Experience

Changing gears, Kunal asked Mike to touch on his rich industry experience, especially his previous role at Uber as the Head of Data Science. I have summarized this part of the podcast below:

  • Creating simulations for autonomous vehicles: At Uber, Mike’s team did open research on adaptive stress testing using Q-learning. In fact, some of that work will soon be published!
  • Mike’s other roles involved working on spot pricing, recommendations for the Uber Eats application, and various other aspects of autonomous vehicles

 

The Near Future of Natural Language Processing (NLP)

And finally – where does Mike Tamir see NLP heading in the next few years?

“It’s safe to say we’ll continue to see dramatic improvements in how we are able to work on text.”

2018 was a breakthrough year for NLP. We saw models and frameworks like BERT, ULMFiT and Transformer-XL, among others. But the foundation for that leap was laid in 2017. Going forward, perhaps in the next 2-3 years, Mike said he could see these techniques merging.

It’s already happening in 2019 and should continue to pick up the pace going forward. Really interesting times lie ahead!

 

End Notes

Fake news is no laughing matter anymore. It has transformed quickly from being a mere nuisance to costing lives around the world. Any step towards dealing with it in the right way is a welcome sight. I personally quite like FakerFact’s approach to this.

I loved Mike’s ability to explain complicated concepts and tie them together in an understandable format. It certainly helps to know how FakerFact functions under the hood and the different ways the team mitigates bias. It’s a goldmine of information for those of us working in NLP.

Pranav Dar 13 Jun 2019

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.


Responses From Readers


Andrew Morris 11 Apr, 2019

Fake news certainly exists, but this is a problem which applies as much to the mainstream media as to any other source of news or opinion. Fake news detection techniques can be divided into those based on style and those based on content, or fact checking. Too often it is assumed that bad style (bad spelling, bad punctuation, limited vocabulary, using terms of abuse, ungrammaticality, etc.) is a safe indicator of fake news. This is to skew judgement in favour of professional journalists, who are not paid to check facts and are most likely to be producing stylistically correct opinions according to the agenda of their rich establishment employer, whose news sources are most likely fed, selected and censored by government agencies.

So how about fact checking? This has two main problems. One is that, besides trivial facts like names and dates, most of the "facts" which need to be checked concern human affairs. These are not true facts, from hard science or mathematics, which everyone would agree on, but are better described as opinions, which are neither objectively true nor false. The other problem with fact checking is that, partly due to legitimate privacy restrictions, few of the trivial facts mentioned above are yet searchable.

This is not to say that automatic fact checking is not possible to some extent. It is just to warn that the fake news detection techniques currently in use are highly suspect from a technical point of view. As with lie detection, there is a known strong tendency to give computer-generated fake news detection more credit than it deserves. More than ever, this is a case where the machine's opinion must be backed up by clear and fully verifiable indications of the basis for its decision, in terms of the facts checked and the authority by which the truth of each fact was determined.