5 Popular NoSQL Databases Every Data Science Professional Should Know About

Last Updated : 21 Sep, 2020

Overview

NoSQL databases are ubiquitous in the industry – a data scientist is expected to be familiar with these databases
Here, we will see what a NoSQL database is and why you should learn about it
We will also look at the features of 5 different NoSQL databases

Introduction

Here’s a piece of advice I wish someone had given me when I was starting out in data science – learn as much as you can about working with databases.

Here’s a quick look at where your database knowledge will come into play:

You will face questions about databases in your data science interview
You’ll be working extensively with databases in your role as a data scientist, data analyst, business analyst, etc.
You’ll be leaning on your database knowledge to collect data for your data science project

And a whole lot more!

The incontrovertible truth is that we are generating data at an unprecedented pace and scale right now. More than 8,500 Tweets and 900 Instagram photos are uploaded every single second. It boggles the mind: how are modern-day databases coping with such volumes of data?

To handle this much data, we need a distributed database system that runs on multiple nodes and is partition tolerant. This means that even if some nodes go down or lose contact with each other for any reason, the system should keep working. So partition tolerance is a must-have. Now, according to the CAP theorem, a distributed system cannot guarantee partition tolerance, availability, and consistency all at the same time.

So we have to trade off availability against consistency. For example, in a banking application, a customer should see the correct balance regardless of where they access it from. The results can be a few seconds late, but they must be consistent.
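
To make the trade-off concrete, here is a minimal sketch in plain Python. It is a toy model, not how any real database is implemented: the names Replica, Unavailable, and write_balance are hypothetical, and the "partition" is just a flag. During a partition, the consistency-first (CP) mode refuses the write, while the availability-first (AP) mode accepts it and lets the two replicas disagree.

```python
class Unavailable(Exception):
    """Raised when the CP-style mode refuses a request during a partition."""


class Replica:
    def __init__(self, balance):
        self.balance = balance


def write_balance(local, remote, new_balance, partitioned, mode):
    """Write a new balance to the local replica and sync it to the remote one.

    mode="CP": refuse the write if the replicas cannot talk to each other.
    mode="AP": accept the write locally even if the remote copy goes stale.
    """
    if partitioned and mode == "CP":
        raise Unavailable("cannot reach the other replica")
    local.balance = new_balance
    if not partitioned:
        remote.balance = new_balance


# Availability first: the write succeeds, but the replicas now disagree.
ap_a, ap_b = Replica(100), Replica(100)
write_balance(ap_a, ap_b, 80, partitioned=True, mode="AP")

# Consistency first: the write is rejected, so both replicas still agree.
cp_a, cp_b = Replica(100), Replica(100)
try:
    write_balance(cp_a, cp_b, 80, partitioned=True, mode="CP")
    cp_write_accepted = True
except Unavailable:
    cp_write_accepted = False
```

A banking application would lean towards the second behavior: it is better to reject a request for a moment than to show two different balances.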

In this article, we will see different types of NoSQL databases, their features, and when to use each database type.

Table of Contents

What is a NoSQL Database?
Types of NoSQL Databases
1. Document-Based NoSQL Databases
2. Key-Value Databases
3. Wide Column-Based Databases
4. Graph-Based Databases
List of the Different NoSQL Databases
1. MongoDB
2. Cassandra
3. Elasticsearch
4. Amazon DynamoDB
5. HBase

What is a NoSQL Database?

So what is a NoSQL database?

You might have heard people say that a NoSQL database is any non-relational database that doesn’t store any relationships between the data. Well, that’s not completely true. NoSQL databases can also store relationships between data, just in a different way.

“NoSQL” is usually read as “Not Only SQL”. Here, data does not have to be split across multiple tables; data that is related in any way can be kept together in a single data structure. As a result, when you work with a huge amount of data, many queries can skip the expensive joins, which helps keep performance in check. NoSQL databases are highly scalable and reliable, and they are designed to work in a distributed environment.

Types of NoSQL Databases

Now that we know what a NoSQL database is, let’s explore the different types of NoSQL databases in this section.

1. Document-Based NoSQL Databases

Document-based databases store data as JSON-like documents. Each document is a structure of key-value pairs, and the values can themselves be nested objects or lists.

Document-based databases are easy for developers to work with because documents map directly to objects in application code, and JSON is a very common data format among web developers. They are also very flexible and allow us to modify the structure at any time.
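
Here is a small sketch of what such a document might look like, using only Python's built-in json module. The customer record and its field names are made up for illustration; they are not taken from any particular database.

```python
import json

# A hypothetical customer record stored as a single document.
# Related data (the orders) is nested inside it instead of living in another table.
customer = {
    "_id": 1,
    "name": "Asha",
    "email": "asha@example.com",
    "orders": [
        {"item": "keyboard", "qty": 1},
        {"item": "monitor", "qty": 2},
    ],
}

# Flexible structure: add a new field to this document only,
# with no schema migration and no change to other documents.
customer["loyalty_tier"] = "gold"

# The document maps directly to and from JSON text.
serialized = json.dumps(customer)
restored = json.loads(serialized)

# Reading nested data needs no join.
total_items = sum(order["qty"] for order in restored["orders"])
```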

Some examples of document-based databases are MongoDB, OrientDB, and BaseX.

2. Key-Value Databases

As the name suggests, these databases store data as key-value pairs. Keys and values can be anything: strings, integers, or even complex objects. They are easy to partition and scale horizontally very well. They are especially useful in session-oriented applications, where we try to capture the behavior of a customer in a particular session.
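
The sketch below shows one simple way keys can be spread across nodes, which is what makes this model easy to scale horizontally. It is an illustration under stated assumptions: PartitionedKVStore is a hypothetical class, each "node" is just an in-memory dict, and the hash-modulo placement is the simplest possible scheme (real systems typically use more elaborate ones).

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store that spreads keys across several in-memory 'nodes'."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key so that the same key always lands on the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = PartitionedKVStore(num_nodes=3)

# Session data: the key is a session id, the value is an arbitrary object.
store.put("session:42", {"user": "asha", "cart": ["keyboard"]})
session = store.get("session:42")
```

Because every lookup touches exactly one node, adding more nodes adds capacity without making individual reads or writes more complicated.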

Some examples are DynamoDB, Redis, and Aerospike.

3. Wide Column-Based Databases

This type of database stores data in rows, much like a relational database, but each row can hold a very large number of dynamic columns. The columns are grouped logically into column families.

For example, where a relational database would use multiple tables, a wide-column database uses multiple column families.
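
A rough way to picture this layout is as nested dictionaries: row key, then column family, then column name. The sketch below is only a mental model in plain Python, with made-up row keys and column names; it does not reflect how Cassandra or HBase store data on disk.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    table[row_key][family][column] = value


def columns(row_key, family):
    """Return the column names present in one family of one row."""
    return sorted(table[row_key][family])


# Two column families, "profile" and "activity", take the place of two tables.
put("user1", "profile", "name", "Asha")
put("user1", "profile", "city", "Pune")
put("user1", "activity", "2020-09-21:login", "mobile")

# Rows do not need the same columns: user2 has no city but more activity columns.
put("user2", "profile", "name", "Ravi")
put("user2", "activity", "2020-09-20:purchase", "keyboard")
put("user2", "activity", "2020-09-21:login", "web")
```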

Popular examples of these types of databases are Cassandra and HBase.

4. Graph-Based Databases

They store the data in the form of nodes and edges. The nodes store information about the main entities, such as people, places, and products, and the edges store the relationships between them. Graph databases work best when you need to find relationships or patterns among your data points, as in social networks and recommendation engines.

Some examples are Neo4j and Amazon Neptune.
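
As a quick illustration of the nodes-and-edges idea, here is a tiny "people you may know" sketch in plain Python. The friends dictionary and the recommend function are hypothetical, and a real graph database would express this as a query rather than hand-written loops.

```python
from collections import Counter

# Nodes are people; each edge is a mutual friendship.
friends = {
    "asha": {"ravi", "meera"},
    "ravi": {"asha", "john"},
    "meera": {"asha", "john", "li"},
    "john": {"ravi", "meera"},
    "li": {"meera"},
}


def recommend(person):
    """Suggest friends-of-friends, ranked by the number of mutual friends."""
    counts = Counter()
    for friend in friends[person]:
        for candidate in friends[friend]:
            if candidate != person and candidate not in friends[person]:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = recommend("asha")
```

Here john is suggested ahead of li because asha shares two friends with john and only one with li.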

Now, let’s have a look at some of the NoSQL databases and their features.

List of the Different NoSQL Databases

1. MongoDB

MongoDB is the most widely used document-based database. It stores data as JSON-like documents (internally in a binary format called BSON).

According to the website stackshare.io, more than 3400 companies are using MongoDB in their tech stack. Uber, Google, eBay, Nokia, and Coinbase are some of them.

When to use MongoDB?

If you are planning to integrate hundreds of different data sources, MongoDB's document-based model is a great fit, as it can provide a single unified view of the data
When you expect a lot of read and write operations from your application and can tolerate losing a small amount of data if a server crashes
When you want to store clickstream data and use it for customer behavioral analysis

2. Cassandra

Cassandra is an open-source, distributed database system that was initially built by Facebook (and inspired by Google’s Bigtable). It is highly available and quite scalable. It can handle petabytes of information and thousands of concurrent requests per second.

Again, according to stackshare.io, more than 400 companies are using Cassandra in their tech stack. Facebook, Instagram, Netflix, Spotify, and Coursera are some of them.

When to use Cassandra?

When your use case involves more writes than reads
In situations where you need availability more than consistency. For example, you can use it for social networking websites, but it is a poor fit for banking
When your queries need few joins or aggregations
Health trackers, weather data, order tracking, and time series data are some good use cases for Cassandra
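
To see why time series data fits this style of database, here is a rough sketch in plain Python. It assumes the common modeling approach of grouping readings by a partition key (here a device id) and keeping them sorted by timestamp within each partition. The function names and the sample readings are made up, and nothing here uses Cassandra's own API.

```python
import bisect
from collections import defaultdict

# partition key (device id) -> list of (timestamp, value), kept sorted by timestamp
partitions = defaultdict(list)


def write(device_id, timestamp, value):
    """Insert a reading into its partition, keeping the partition sorted."""
    bisect.insort(partitions[device_id], (timestamp, value))


def read_range(device_id, start, end):
    """Return readings with start <= timestamp <= end (integer timestamps)."""
    rows = partitions[device_id]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end + 1,))
    return rows[lo:hi]


# Readings can arrive out of order; timestamps are seconds since some start point.
write("tracker-1", 30, 21.5)
write("tracker-1", 10, 20.0)
write("tracker-1", 20, 20.7)
write("tracker-2", 15, 18.0)

recent = read_range("tracker-1", 10, 20)
```

A range query touches only one partition and reads a contiguous slice of it, with no joins or aggregations involved.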

3. Elasticsearch

This is also an open-source, distributed NoSQL database system. It is highly scalable and consistent. You can also call it an analytics engine: it can store, search, and analyze huge volumes of data with ease.

If full-text search is part of your use case, Elasticsearch will be a strong fit for your tech stack. It even supports search with fuzzy matching.

More than 3000 companies are using Elasticsearch in their tech stack, including Slack, Udemy, Medium, and Stack Overflow.

When to use Elasticsearch?

If your use case requires full-text search, Elasticsearch will be the best fit
If your use case involves chatbots that resolve most user queries, there is a high chance of spelling mistakes in what people type. You can make use of Elasticsearch's built-in fuzzy matching to handle them
Elasticsearch is also useful for storing and analyzing log data
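
Fuzzy matching in Elasticsearch is based on Levenshtein edit distance, that is, the number of single-character edits needed to turn one word into another. The sketch below shows the idea in plain Python. It is not the Elasticsearch API: edit_distance, fuzzy_search, and the sample terms are hypothetical names used only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_search(query, terms, max_edits=2):
    """Return the terms within max_edits of the query, closest first."""
    query = query.lower()
    scored = sorted((edit_distance(query, term.lower()), term) for term in terms)
    return [term for distance, term in scored if distance <= max_edits]


# A chatbot user types "pasword" instead of "password".
terms = ["refund", "invoice", "delivery", "password"]
matches = fuzzy_search("pasword", terms)
```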

4. Amazon DynamoDB

It is a key-value based distributed database system created by Amazon, and it is highly scalable. Unfortunately, it is not open-source. It can easily handle 10 trillion requests per day, so you can see why it is so widely used!

More than 700 companies are using DynamoDB in their tech stack including Snapchat, Lyft, and Samsung.

When to use DynamoDB?

If you are looking for a database that can handle simple key-value queries, but a very large number of them
If you are working with an OLTP workload, like online ticket booking or banking, where the data needs to be highly consistent

5. HBase

It is also an open-source, highly scalable distributed database system. HBase is written in Java and runs on top of the Hadoop Distributed File System (HDFS).

More than 70 companies are using HBase in their tech stack, such as Hike, Pinterest, and HubSpot.

When to use HBase?

When you have a very large volume of data to process, on the order of petabytes. If your data volume is small, you will not get the desired results
If your use case requires random, real-time access to the data, then HBase will be the appropriate option
When you want to easily store real-time messages for billions of people

End Notes

This is by no means an exhaustive list. There are more NoSQL databases out there but these are the most widely used in the industry.

If you have worked with any of these databases or any other NoSQL database, let me know in the comments section below. I would love to hear about your experience!
