Bring it on! Analytics Vidhya Author identification challenge

Kunal Jain 18 Apr, 2015 • 3 min read
What is the best form of analytics learning? Applying it to practical problems! This is exactly what led us to create this interesting problem, solving which would be a lot of fun! The challenge should be a good combo of basic text mining and predictive modeling. If you haven’t got your hands dirty learning these techniques, now is the time to do it!
Microsoft-predictive-analytics

Background

It’s family time for Kunal and Tavish! Both of them are on a break and have decided to stay away from any email / phone communications. Other Analytics Vidhya team members are not only filling in their shoes, but also have their own tasks to be completed!
Meet Navnit, our Tech Lead and web developer – who has decided that this is probably the best time to change the CMS (content management system) for our site. He creates a backup of data in our database, moves it over to a DVD and starts the migration.
Due to a mismatch in data models – a few fields got lost in the transition. Navnit, realized this only after he has deleted the entire data on previous CMS. When he realized this, he thought – nothing to worry, he has the backup, he can put up the lost fields through the DVD back. He checks that the DVD is on his desk and plans to restore the data first thing tomorrow morning.
jenika
Enter Jenika, Kunal’s lovely year old daughter – who feels that entire Analytics Vidhya office, is her playground. During her visit, she comes across this shiny blue disc, which she has not seen before! Nice toy, dad has in his office – she might have thought! In the hour, she had before Navnit is back in office, she tried eating her new toy, sliding it on the the ground and what not!
Poor Navnit, his only source of missing fields can not be used now! He checked and the last backup was taken on 6th July 2014,11:59 p.m. On comparing, one of the most critical fields lost is the author name. So, he can not identify the author for articles posted after 6th July 2014. He decides to finally learn and apply some predictive analytics for his work!
It’s your turn to help Navnit get back the data, before Kunal or Tavish are back in office (1st September 2014)!

Problem statement:

Classify all the articles written by Tavish or Kunal on analyticvidhya.com by the author’s name.
What Data you need to use for training your model?
All the articles written by Tavish and Kunal before 7th July 2014 can be used to train the model. You can use the date of article publish, day of article publish, tags of article publish and the content of the article. The data needs to be scrapped out of the website and used on local server.
What Data you need to score your model?
All the articles (excluding this article) written by Tavish and Kunal after 6th July 2014 need to be scored using your model.
Help : You can take reference from this video to start this analysis
What is the evaluation metric?
Average mis-classification rate of both training and scoring will be taken as the evaluation metric. For example 5 out of 10 in scoring and 50 out of 50 in training were found to be correct classes in training and scoring respectively. The average mis-classification rate will be 0.5 * (5/10 + 0/50) = 0.25. Hence, your score is 75%. You need to build model which has high predictive power and also stable over populations.

End Notes:

  1. The aim of this challenge is to foster analytical thinking in our reader’s mind and have some fun with practical machine learning / analytics challenges!
  2. We will give the winner of this challenge a chance to blog about his solution on Analytics Vidhya. Of course, he takes away all the visibility, which comes on the platform!
  3. Last but not the least, the entire story presented before is hypothetical. It was created with the sole aim to create this challenge. All our data is secure and darling Jenika understands that she can’t play around with Dad’s stuff in office!

Happy learning!

Bonus:

If you want to foster discussion on any aspect of this problem, please feel free to do this through comments below. This is your chance to engage in community learning!

If you like what you just read & want to continue your analytics learning, subscribe to our emailsfollow us on twitter or like our facebook page.

Kunal Jain 18 Apr 2015

Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in field of Data Science. His work experience ranges from mature markets like UK to a developing market like India. During this period he has lead teams of various sizes and has worked on various tools like SAS, SPSS, Qlikview, R, Python and Matlab.

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers

Clear

Ashvi
Ashvi 15 Aug, 2014

Hi I am a final year B.com(H) student at SRCC and I want to build my career in analytic. I really like your website a lot. Its a great platform for a fresher like me to get a kick start to my career! :) For solving this problem, you said that we need to prepare a model on a programming tool "R". I have never worked on this tool before. in fact never even heard of it. can you please provide me with any information about the basics to use the tool. I am a beginner in analytic so please bear with me! Regards Ashvi Mittal

Manju
Manju 15 Aug, 2014

Hi Kunal, Its interesting. Do we have any timeline for completion? Regds, Manju

Sandeep
Sandeep 15 Aug, 2014

I am New to the industry, still acquiring knowledge. I really don't know how to solve this problem. But am really excited to discover that to what extent analytics can help!

guindilla
guindilla 29 Aug, 2014

This video is the companion video of the text mining recommendation above. It presents the concepts used in the code, and I found it interesting: https://www.youtube.com/watch?v=2Znairz-hvU Cheers.

Sagar
Sagar 03 Sep, 2014

Hie.. Kunal sir, I am completed my B.E in computer engg.in 2013 but now I want to move in analytics so I joined Base SAS but confused about what was the my next step because I don't have any idea about this industry.so after base SAS certification Advance SAS or BI certification which one help me to enter in industry. please help me .