Pranav Dar — July 14, 2018


  • Microsoft has released a ton of free datasets on their ‘Microsoft Research Open Data’ platform
  • These datasets currently span 8 categories; all of them are from research Microsoft has been involved in recently
  • There is a heavy focus on natural language processing, but you also have options in computer vision, geospatial analysis, etc.



Open source has always been the overriding theme of the data science and machine learning community. It’s heartening to see so many big research and tech companies like Google, Uber, Facebook, NVIDIA, and Intel open source their research to a wider audience. It not only benefits the companies, but also expands the reach of ML to an already growing and thriving user base.

I have also seen Microsoft make their shift to artificial intelligence and ML in recent years with their flagship product – Azure ML. So it comes as welcome news that Microsoft’s research arm has launched ‘Microsoft Research Open Data’, a platform that hosts a large collection of free datasets. These datasets span a variety of domains and categories, so you are free to take your pick.

The different categories currently available are listed below:

  • Biology
  • Computer Science
  • Engineering
  • Environmental Science
  • Information Science
  • Mathematics
  • Physics
  • Social Science

These are all active research areas for Microsoft. The data they have provided has been curated and collected over a number of years for different studies and activities they have been involved in.

Of course, since this is Microsoft’s platform, they have provided us with an option to either download the dataset directly or use a virtual machine (VM) powered by Azure. This VM is preloaded with popular development tools to make it a seamless experience for the user. Their blog post includes an image showing these development tools.


Our take on this

I love this move by Microsoft! They are not often spoken about when it comes to open source research, but they have certainly made their mark with this release. There’s a little something for every data scientist – natural language processing, computer vision, image processing, and geospatial analysis, among others.

There is a very heavy focus on NLP in this collection. That might be because of Microsoft’s emphasis on improving their voice assistant and other chatbot-like applications. Whatever the case, everyone wins! I highly recommend browsing these datasets and downloading one to get started on your own project. Working with real research data like this is a good way to polish your skills and get noticed.

Use the comments section below to ask any questions in case you get stuck anywhere or are unable to download any dataset.




About the Author

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
