Who is the world cheering for? 2014 FIFA WC winner predicted using Twitter feed (in R)
Sports are filled with emotions! Cheering of audience, reactions to events on various media channels are some of the factors, which make a huge impact on the mind of the players.
If people support you, your chances to win are greatly enhanced. Live example of this fact, are the statistics of Indian cricket team playing in India and abroad. The win rate of Indian cricket team in India is approximately twice the win rate abroad.
Football is again a game driven largely by emotions. Cards (Yellow/Red), have been kept in the game to limit these emotions. If you think about places, where people express their emotions, Facebook and Twitter come out on top of the list. In this article we will make a prediction using a simplistic algorithm on the winner of 2014 FIFA world cup. This prediction will be based on the emotions expressed by people on Twitter. The entire code has been shared on github. We will just keep bits and pieces of this code as a reference in this article.
The first step is to fetch the tweets, which have a reference to both FIFA and a team (out of 4). We have done this using hashtags on twitter. In this case we have put an upper threshold of 1000 tweets because of constrained hardware resources. You can use the following code to fetch the same :
Once we have all the tweets, we need to clean the tweets and then check the sentiment of these tweets. Following is the code, I have used to pull out cleaned words and map them to a positive and negative strings.
Once we have a well defined function which can score the tweets individually, we now score out tweets after converting them to factors (Refer to the github code).
Once we have a sentiment score against each tweet, we now try to summarize the score and fetch useful information from the same. You can use the following code to do the summarization
As you can clearly see, we have a clear winner from the graphs i.e. Argentina. Let’s summarize this dataset into a cross tab.
#Tweets with a +ve/-ve score for each team
%Tweets with a +ve/-ve score for each team
We can use different parameters to come up with the rank ordering. I have considered following to rank order teams :
Criterion 1 : %positive tweet – %negative tweets
Criterion 2 : weighted score
Criterion 3 : Fixture of matches
Using all three, we see our clear winner is Argentina and Brazil is clearly on rank 4. But Germany and Netherland are really close. But using the 3rd criterion, we see Germany is the one competing against Brazil. Hence, we have a clear rank order. The prediction can be seen in the 1st picture of the article.
I am not a follower of the sport: football, but this analysis has excited me enough to compare my prediction to the actuals. I see myself as an unbiased analyst to make this prediction. The technique used in this article, is an over simplistic model to make such a strong prediction, but a good point to start one.
An actual model should have all kinds of input like past performance of each team versus each other, venue, players injured and finally the sentiment feed. I will like to hear more inputs to make this model more accurate. These recommendation can be used to either enhance the sentiment analysis algorithm or to include new types of input variables. People who follow the sport will be the best ones to make recommendations on this.
Here is another application – if you feel strongly against this prediction, or have your own algorithm to predict a winner – you can use the difference in the two models to create a nice betting strategy!
Did you find the article useful? Have you worked on similar objective before? How can we enhance this code to make more accurate predictions? Share with us similar kinds of post to enable us make a even stronger model.
If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.