10 Datasets by INDIAai for your Next Data Science Project

Pankaj Singh 10 May, 2024

Did you know India is among the top nations investing in and leveraging AI? India ranks fifth worldwide in AI investment.

Per Statista, the artificial intelligence market in India is projected to grow at an annual rate of 28.63% (2024-2030), resulting in a market volume of US$28.36bn in 2030.
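As a rough sanity check on the Statista figures, the short sketch below backs out the 2024 market size they would imply. It assumes the 28.63% is a compound annual rate applied over six years (2024 to 2030); that reading is an assumption, and the function name is purely illustrative.

```python
def implied_base_value(final_value: float, annual_rate: float, years: int) -> float:
    """Back out the starting value implied by compound annual growth."""
    return final_value / (1.0 + annual_rate) ** years


# Figures quoted from Statista in the text above.
FINAL_2030_BN = 28.36   # projected market volume in 2030, US$ billions
ANNUAL_RATE = 0.2863    # 28.63%, assumed here to be a compound annual rate
YEARS = 6               # assumption: six compounding years from 2024 to 2030

base_2024_bn = implied_base_value(FINAL_2030_BN, ANNUAL_RATE, YEARS)
print(f"Implied 2024 market size: about US${base_2024_bn:.2f}bn")
```

If the growth figure were meant as a total over the period rather than an annual rate, the implied starting value would be much higher, so it is worth checking the definition on the Statista page before quoting either number.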

Quite impressive, right? AI is clearly booming, and India is doing its part to take it to the next level with INDIAai.

But what exactly is INDIAai? 

It is a knowledge portal, a research organization, and an ecosystem-building initiative that aims to unite and promote collaborations with various entities in India’s AI ecosystem.

What else does it provide?

If you are in your final year and looking for a data science project, INDIAai will help you with the required datasets.

Access to high-quality datasets is indispensable in data science for fostering innovation and driving impactful research. Initiatives like INDIAai contribute significantly to this by curating and disseminating diverse datasets that cater to various domains and research interests. Among the many datasets offered by INDIAai, the following 10 are intriguing options for aspiring data scientists and researchers.

Overview of 10 Datasets

The 10 datasets curated by INDIAai encompass various data sources spanning multiple domains and use cases. They are meticulously curated, annotated, and accessible to researchers, practitioners, and enthusiasts alike. Whether you’re interested in natural language processing, computer vision, healthcare analytics, or socioeconomic research, the datasets offer you an opportunity for exploration and discovery.

Datasets by INDIAai for Your Data Science Projects

Here are datasets by INDIAai for your data science projects:

Global Youth Tobacco Survey (GYTS-4)

The International Institute for Population Sciences (IIPS), operating under the Ministry of Health and Family Welfare, conducted the Global Youth Tobacco Survey (GYTS-4) in 2019. This comprehensive survey aimed to assess tobacco usage among schoolchildren aged 13-15 across various states and union territories (UTs). It delved into demographic factors such as gender, school location (rural or urban), and school administration type (public or private) to provide a nuanced understanding of tobacco consumption patterns among this demographic group.
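A first pass over survey data like this is often a breakdown of reported tobacco use by each demographic factor. The sketch below shows one way to do that with the standard library. The rows and column names are made up for illustration and are not real GYTS-4 records or the dataset's actual schema.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; not real GYTS-4 records or column names.
SAMPLE_CSV = """gender,school_location,school_type,uses_tobacco
female,rural,public,0
male,rural,public,1
female,urban,private,0
male,urban,private,0
male,rural,private,1
female,urban,public,1
"""


def prevalence_by(rows, factor):
    """Return the share of respondents reporting tobacco use per level of `factor`."""
    users = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        level = row[factor]
        totals[level] += 1
        users[level] += int(row["uses_tobacco"])
    return {level: users[level] / totals[level] for level in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for factor in ("gender", "school_location", "school_type"):
    print(factor, prevalence_by(rows, factor))
```

With the real file, you would replace the inline sample with the downloaded data and its actual column names. A proper analysis of a survey like this would likely also need to apply the survey's sampling weights, which this sketch ignores.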

Download Link: Global Youth Tobacco Survey (GYTS-4)

National Financial and Economic Data

The Department of Economic Affairs meticulously compiles comprehensive national financial and economic data. This invaluable repository encompasses critical metrics such as external debt, central government borrowing, monthly economic reports, and succinct national summary data pages, providing a robust foundation for informed decision-making and strategic planning at both macro and micro levels.

Download Link: National Financial and Economic Data

Indian Census Data

The census digital library offers census tables, reports, and other digital files spanning 1991 to 2011, all available for download. Researchers, policymakers, and curious readers can use the collection to study demographic trends, conduct historical research, or build data-driven solutions.

Download Link: Indian Census Data

Herbarium Dataset of the Wildlife Institute of India (WII)

The Wildlife Institute of India recently unveiled its Wildlife Herbarium Dataset, comprising 4591 specimens. The collection of plant specimens has been cataloged and digitized for scientific exploration. Through the Global Biodiversity Information Facility (GBIF) network, these digital specimens are readily accessible to researchers worldwide.

This invaluable resource serves as a cornerstone for conservation efforts and ecological research. Scientists and conservationists can harness the power of this dataset to monitor biodiversity trends, track endangered species, and devise effective conservation strategies. By analyzing the information contained within these specimens, researchers can unravel ecological mysteries, identify critical habitats, and safeguard vulnerable ecosystems.
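Because the specimens are published through GBIF, one way to explore them programmatically is GBIF's public occurrence search API. The sketch below only builds the request URL and does not make a network call. The dataset key is a placeholder, not the real key for the WII herbarium; the real one would need to be copied from the dataset's GBIF page.

```python
from urllib.parse import urlencode

# GBIF's public occurrence search endpoint.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_occurrence_query(dataset_key: str, limit: int = 20, offset: int = 0) -> str:
    """Build a GBIF occurrence search URL restricted to one dataset."""
    params = {"datasetKey": dataset_key, "limit": limit, "offset": offset}
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# Placeholder only: replace with the dataset key shown on the dataset's GBIF page.
PLACEHOLDER_DATASET_KEY = "00000000-0000-0000-0000-000000000000"

url = build_occurrence_query(PLACEHOLDER_DATASET_KEY, limit=5)
print(url)
```

Fetching the URL with any HTTP client should return a JSON page of occurrence records; increasing `offset` in steps of `limit` walks through the remaining pages.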

Download Link: Herbarium Dataset of the Wildlife Institute of India (WII)

Voice Call Quality Customer Experience

Voice Call Quality Customer Experience data collected by the Ministry of Communications, Department of Telecommunications (DOT), and the Telecom Regulatory Authority of India (TRAI) is a vital barometer of telecommunications performance in India. This comprehensive dataset encapsulates the nuanced quality metrics of voice calls across diverse regions, telecom operators, and technological infrastructures.

The collaboration between the Ministry of Communications and TRAI ensures the meticulous gathering, analysis, and dissemination of data, fostering transparency and accountability within the telecommunications sector. By assessing various parameters such as call drops, call setup success rates, voice clarity, and network coverage, this data empowers stakeholders to make informed decisions and drive continuous improvement in service delivery.

Download Link: Voice Call Quality Customer Experience

List of MSME Registered Units

The dataset contains comprehensive information regarding Micro, Small, and Medium Enterprises (MSMEs) registered under the Udyog Aadhaar Memorandum. It encompasses many details concerning these registered units, ranging from demographic information to operational specifics.

Download Link: MSME Registered Units

Local Government Directory (LGD) – Local Bodies with PIN Codes

The Local Government Directory (LGD) – Urban dataset, provided by the Ministry of Panchayati Raj, is a comprehensive resource for urban governance. It encompasses a wide array of information crucial for effective administration and planning at the local level, particularly focusing on areas within urban jurisdictions.

This dataset includes detailed information on various facets of urban governance, ranging from administrative structures to demographic profiles. It offers insights into the organizational hierarchy, delineating the roles and responsibilities of different administrative units within urban local bodies. Moreover, it provides data on key infrastructure facilities, such as healthcare, education, transportation, and sanitation, essential for sustainable urban development.

Download Link: Local Government Directory (LGD) – Local Bodies with PIN Codes

The Lemur Project: ClueWeb09 Dataset

The ClueWeb09 dataset, created by the Language Technologies Institute at Carnegie Mellon University, is incredibly important for advancing research in information retrieval and language technologies. It contains a massive collection of 1 billion web pages gathered in early 2009, offering a diverse range of online content in ten different languages. This dataset is highly valued in the academic community and is used in various parts of the prestigious TREC conference. Its extensive coverage and size make it an essential tool for scholars and researchers, allowing them to make significant discoveries and advancements in search technology and related fields.

Download Link: The Lemur Project: ClueWeb09 Dataset

The 20 Newsgroups Datasets

The 20 Newsgroups dataset is a cornerstone of machine learning. It comprises around 20,000 documents drawn from an eclectic array of newsgroups, partitioned nearly evenly across 20 categories. It is thought to have been originally collected by Ken Lang, probably for his Newsweeder work, although he does not explicitly mention this collection.
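Before training a classifier, it can be worth confirming the near-even split across categories. The sketch below computes a simple balance ratio from a list of labels. The toy labels are for illustration only; with the real corpus, the labels could come from, for example, the `target` array returned by scikit-learn's `fetch_20newsgroups`.

```python
from collections import Counter


def balance_ratio(labels):
    """Ratio of the smallest to the largest class count; 1.0 means perfectly even."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Toy labels for illustration; real labels would come from the downloaded corpus.
toy_labels = ["sci.space"] * 5 + ["rec.autos"] * 5 + ["talk.politics.misc"] * 4

print(f"Balance ratio: {balance_ratio(toy_labels):.2f}")
```

A ratio close to 1.0 suggests plain accuracy is a reasonable first metric, while a low ratio would point toward per-class metrics instead.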

Download Link: The 20 Newsgroups data sets

Reuters Corpora (RCV1, RCV2, TRC2)

In 2000, Reuters Ltd introduced the Reuters Corpus, Volume 1 (RCV1), a significant advancement for natural language processing and machine learning. This expansive collection of English-language Reuters news stories surpassed previous datasets in size and scope and covers a diverse range of topics, while the companion RCV2 corpus adds multilingual stories. RCV1 quickly became a cornerstone for researchers and developers, driving innovation in text classification and analysis. Over the years, it has remained a vital resource, facilitating breakthroughs in sentiment analysis and topic modeling. RCV1’s legacy underscores the importance of carefully curated datasets in advancing the field of natural language processing.

Download Link: Reuters Corpora (RCV1, RCV2, TRC2)

For more datasets, refer to: Datasets by INDIAai


These 10 datasets curated by INDIAai represent a goldmine of opportunities for researchers, data scientists, and enthusiasts alike. They offer a rich tapestry of information for exploration and analysis, covering diverse domains such as public health, economics, biodiversity, telecommunications, governance, and language technologies. Whether you are looking for a data science project for a college internship or want to practice, these datasets are useful.
