- NoSQL databases are ubiquitous in the industry – a data scientist is expected to be familiar with these databases
- Here, we will see what a NoSQL database is and why you should learn about it
- We will also look at the features of 5 different NoSQL databases
Here’s a piece of advice I wish someone had given me when I was starting out in data science – learn as much as you can about working with databases.
Here’s a quick look at where your database knowledge will come into play:
- You will face questions about databases in your data science interview
- You’ll be working extensively with databases in your role as a data scientist, data analyst, business analyst, etc.
- You’ll be leaning on your database knowledge to collect and gather data for your data science project
And a whole lot more!
The incontrovertible truth is that we are generating data at an unprecedented pace and scale right now. More than 8,500 Tweets and 900 Instagram photos are uploaded every single second, which blows my mind. How are modern-day databases coping with such volumes of data?
To handle this much data, we need a distributed database system that runs on multiple nodes and is partition tolerant. That means the system should keep working even if the network link between some nodes fails or one of the nodes goes down for any reason. So partition tolerance is a must-have. Now, according to the CAP theorem, we cannot guarantee partition tolerance, availability, and consistency all at the same time.
So when a partition happens, we have to trade off availability against consistency. For example, in a banking application, a customer should see the correct balance regardless of where they access it from. The results can be a few seconds late, but they must be consistent.
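To make the trade-off concrete, here is a minimal toy sketch in Python. It is not how any real database is implemented; the `Replica` and `TinyCluster` classes and the "CP"/"AP" mode flag are names I made up for illustration. It simply shows what each choice means when two replicas cannot talk to each other.

```python
class Replica:
    """One node holding a copy of a single account balance."""

    def __init__(self, balance):
        self.balance = balance


class TinyCluster:
    """Two replicas plus a flag that simulates a network partition (hypothetical toy model)."""

    def __init__(self, balance, mode):
        self.a = Replica(balance)
        self.b = Replica(balance)
        self.mode = mode            # "CP" favours consistency, "AP" favours availability
        self.partitioned = False

    def write(self, replica, new_balance):
        if self.partitioned:
            if self.mode == "CP":
                # The other replica is unreachable, so refuse rather than diverge.
                raise RuntimeError("write rejected during partition")
            # AP: accept the write locally; the replicas disagree until they reconcile.
            replica.balance = new_balance
            return
        # Healthy network: apply the write to both replicas.
        self.a.balance = self.b.balance = new_balance


cp = TinyCluster(100, mode="CP")
cp.partitioned = True
try:
    cp.write(cp.a, 50)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False

ap = TinyCluster(100, mode="AP")
ap.partitioned = True
ap.write(ap.a, 50)

print(cp_write_accepted, cp.a.balance, cp.b.balance)  # False 100 100
print(ap.a.balance, ap.b.balance)                     # 50 100
```

The consistent cluster stays correct but turns the customer away, while the available cluster keeps serving and ends up with two different balances. A bank would usually pick the first behaviour; a social feed can often live with the second.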
In this article, we will see different types of NoSQL databases, their features, and when to use each database type.
Table of Contents
- What is a NoSQL Database?
- Types of NoSQL Databases
- Document-Based Database
- Key-Value Database
- Wide Column Based Database
- Graph-Based Database
- Different NoSQL Databases
- MongoDB
- Cassandra
- ElasticSearch
- Amazon DynamoDB
- HBase
What is a NoSQL Database?
So what is a NoSQL database?
You might have heard people say that a NoSQL database is any non-relational database with no relationships between the data. Well, that’s not completely true. NoSQL databases can also store relationships between the data, just in a different way.
We can say that “NoSQL” stands for “Not Only SQL”. Here, data does not have to be split across multiple tables, because related data can be kept together in a single data structure. When you work with a huge amount of data, this means many queries can skip the expensive joins, which helps avoid performance lags. NoSQL databases are also highly scalable and reliable, and they are designed to work in a distributed environment.
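Here is a small sketch of that idea using plain Python data structures, with made-up customer and order data. The first half mimics two relational tables that need a join-style lookup; the second half keeps the related data together in one record.

```python
# Relational style: two "tables" linked by customer_id.
customers = [{"customer_id": 1, "name": "Asha"}]
orders = [
    {"order_id": 10, "customer_id": 1, "item": "keyboard"},
    {"order_id": 11, "customer_id": 1, "item": "mouse"},
]


def orders_with_names(customers, orders):
    """Join-like lookup: match every order to its customer row."""
    by_id = {c["customer_id"]: c for c in customers}
    return [(by_id[o["customer_id"]]["name"], o["item"]) for o in orders]


# NoSQL style: the orders are embedded in the customer record itself.
customer_record = {
    "customer_id": 1,
    "name": "Asha",
    "orders": [
        {"order_id": 10, "item": "keyboard"},
        {"order_id": 11, "item": "mouse"},
    ],
}

joined = orders_with_names(customers, orders)
embedded = [(customer_record["name"], o["item"]) for o in customer_record["orders"]]

print(joined)              # [('Asha', 'keyboard'), ('Asha', 'mouse')]
print(joined == embedded)  # True
```

Both approaches return the same answer, but the embedded version reads a single record instead of combining two collections.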
Types of NoSQL Databases
Now that we know what a NoSQL database is, let’s explore the different types of NoSQL databases in this section.
1. Document-Based NoSQL Databases
Document-based databases store the data as JSON-like objects. Each document is a structure made up of key-value pairs.
Document-based databases are easy for developers to work with because a document maps directly to an object in application code, and JSON is a very common data format among web developers. They are also very flexible and allow us to modify the structure of a document at any time.
Some examples of document-based databases are MongoDB, OrientDB, and BaseX.
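To give you a feel for what a document looks like, here is a minimal sketch using only Python's built-in `json` module. The field names (`_id`, `skills`, `address`, and so on) are invented for illustration and do not come from any particular database.

```python
import json

# A single document: nested key-value pairs, written as JSON.
user_json = """
{
  "_id": "u1001",
  "name": "Asha",
  "email": "asha@example.com",
  "skills": ["python", "sql"],
  "address": {"city": "Pune", "country": "India"}
}
"""
user = json.loads(user_json)  # the document maps straight onto a Python dict

# Flexible structure: add a field to this document only.
user["last_login"] = "2020-01-15"

# Documents in the same collection do not need to share the same fields.
collection = [user, {"_id": "u1002", "name": "Ben"}]
names_with_city = [
    (doc["name"], doc.get("address", {}).get("city")) for doc in collection
]

print(names_with_city)  # [('Asha', 'Pune'), ('Ben', None)]
```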
2. Key-Value Databases
As the name suggests, a key-value database stores the data as key-value pairs. Keys and values can be anything: strings, integers, or even complex objects. These databases are highly partitionable and scale horizontally very well. They can be really useful in session-oriented applications where we try to capture the behavior of the customer in a particular session.
Some examples are DynamoDB, Redis, and Aerospike.
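Why are key-value stores so easy to partition? Because the key alone decides where a record lives. Here is a rough sketch of that idea, assuming simple hash-based partitioning. The `PartitionedKVStore` class is hypothetical, and real systems use more sophisticated schemes (consistent hashing, replication, and so on).

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store that spreads keys across partitions by hashing the key."""

    def __init__(self, num_partitions):
        # Each partition is a plain dict standing in for a separate node.
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash of the key picks the partition.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        return self.partitions[self._partition_for(key)].get(key, default)


store = PartitionedKVStore(num_partitions=4)
store.put("session:abc123", {"user": "asha", "cart": ["keyboard"]})
store.put("session:xyz789", {"user": "ben", "cart": []})

print(store.get("session:abc123")["cart"])  # ['keyboard']
print(store.get("session:missing"))         # None
```

Every read and write touches exactly one partition, which is why adding more nodes is a natural way to scale this kind of database.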
3. Wide Column-Based Databases
This type of database stores the data in rows, similar to a relational database, but each row can hold a very large number of dynamic columns. The columns are grouped logically into column families.
For example, where a relational database would have multiple tables, a wide-column database has multiple column families instead.
Popular examples of these types of databases are Cassandra and HBase.
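The row / column family / column layout can be sketched with nested dictionaries. This is only a mental model under my own assumptions (the `WideColumnTable` class and its methods are hypothetical), not the storage format or API of Cassandra or HBase.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are dynamic.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Return a copy of one column family for one row (empty if absent).
        return dict(self.rows.get(row_key, {}).get(family, {}))


users = WideColumnTable(["profile", "activity"])
users.put("user#1", "profile", "name", "Asha")
users.put("user#1", "activity", "2020-01-15", "login")
users.put("user#1", "activity", "2020-01-16", "purchase")
users.put("user#2", "profile", "name", "Ben")
users.put("user#2", "profile", "city", "Pune")  # a column only this row has

print(users.get_family("user#1", "activity"))
# {'2020-01-15': 'login', '2020-01-16': 'purchase'}
print(users.get_family("user#2", "profile"))
# {'name': 'Ben', 'city': 'Pune'}
```

Notice that each row carries its own set of columns inside a family, so one user can have thousands of activity columns while another has none.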
4. Graph-Based Databases
They store the data in the form of nodes and edges. The nodes store information about the main entities like people, places, products, etc., and the edges store the relationships between them. These databases work best when you need to find relationships or patterns among your data points, as in social networks, recommendation engines, etc.
Some examples are Neo4j and Amazon Neptune.
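Here is a tiny sketch of the nodes-and-edges idea in plain Python, using a made-up friendship graph. The "friends of friends" suggestion is the kind of relationship query that graph databases are built for; a real graph database would express it in its own query language rather than with Python sets.

```python
from collections import defaultdict

# Nodes: the main entities, each with a few properties.
nodes = {
    "asha": {"type": "person"},
    "ben": {"type": "person"},
    "chen": {"type": "person"},
    "dina": {"type": "person"},
}

# Edges: "is friends with" relationships between nodes.
edges = [("asha", "ben"), ("ben", "chen"), ("ben", "dina"), ("asha", "dina")]

# Build an undirected adjacency map from the edge list.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def suggest_friends(person):
    """Friends of friends who are not already direct friends."""
    direct = adjacency[person]
    candidates = set()
    for friend in direct:
        candidates |= adjacency[friend]
    return sorted(candidates - direct - {person})


print(suggest_friends("asha"))  # ['chen']
print(suggest_friends("chen"))  # ['asha', 'dina']
```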
Now, let’s have a look at some of the NoSQL databases and their features.
List of the Different NoSQL Databases
1. MongoDB
MongoDB is the most widely used document-based database. It stores documents as JSON-like objects.
According to the website stackshare.io, more than 3,400 companies are using MongoDB in their tech stack. Uber, Google, eBay, Nokia, and Coinbase are some of them.
When to use MongoDB?
- If you are planning to integrate hundreds of different data sources, the document-based model of MongoDB will be a great fit as it can provide a single unified view of the data
- When you expect a lot of read and write operations from your application and can tolerate losing some data in a server crash
- You can use it to store clickstream data and use that data for customer behavioral analysis
2. Cassandra
Cassandra is an open-source, distributed database system that was initially built by Facebook (and inspired by Google’s Bigtable). It is highly available and quite scalable. It can handle petabytes of information and thousands of concurrent requests per second.
Again, according to stackshare.io, more than 400 companies are using Cassandra in their tech stack. Facebook, Instagram, Netflix, Spotify, and Coursera are some of them.
When to use Cassandra?
- When your use case requires more write operations than read operations
- In situations where you need availability more than consistency. For example, you can use it for social networking websites, but it is not a good fit for banking purposes
- When your queries to the database require few joins and aggregations
- Health trackers, weather data, order tracking, and time series data are some good use cases for Cassandra
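To see why time series data suits this kind of database, here is a rough sketch of one common way to lay it out: group readings into partitions by sensor and day, and keep each partition sorted by timestamp. The helper names and the sensor data are hypothetical, and this is a plain-Python mental model, not Cassandra's actual API or storage engine.

```python
import bisect
from collections import defaultdict

# Partition key: (sensor_id, day). Rows inside a partition stay sorted by timestamp.
partitions = defaultdict(list)


def record_reading(sensor_id, timestamp, value):
    """Write path: insert the reading into its partition, keeping time order."""
    day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO-style timestamp
    bisect.insort(partitions[(sensor_id, day)], (timestamp, value))


def readings_between(sensor_id, day, start, end):
    """Read path: one partition lookup, then a slice of the sorted rows."""
    rows = partitions.get((sensor_id, day), [])
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_right(rows, (end, float("inf")))
    return rows[lo:hi]


record_reading("s1", "2020-01-15T08:00", 21.5)
record_reading("s1", "2020-01-15T09:00", 22.1)
record_reading("s1", "2020-01-15T07:00", 20.9)  # arrives out of order
record_reading("s1", "2020-01-16T07:00", 19.8)  # lands in a different partition

print(readings_between("s1", "2020-01-15", "2020-01-15T07:30", "2020-01-15T09:00"))
# [('2020-01-15T08:00', 21.5), ('2020-01-15T09:00', 22.1)]
```

A time-range query only touches one partition and reads a contiguous slice of it, with no joins or aggregations across tables.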
3. ElasticSearch
ElasticSearch is also an open-source, distributed NoSQL database system. It is highly scalable and consistent. You can also call it an analytics engine. It can easily store, search, and analyze huge volumes of data.
If full-text search is a part of your use case, ElasticSearch will be the best fit for your tech stack. It even allows search with fuzzy matching.
More than 3,000 companies are using ElasticSearch in their tech stack, including Slack, Udemy, Medium, and Stack Overflow.
When to use ElasticSearch?
- If your use case requires full-text search, ElasticSearch will be the best fit
- If your use case involves chatbots that resolve most user queries, there is a high chance of spelling mistakes in what people type. You can make use of the built-in fuzzy matching of ElasticSearch to handle them
- ElasticSearch is also useful for storing log data and analyzing it
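To get an intuition for full-text search with fuzzy matching, here is a minimal sketch using only the Python standard library. It builds a tiny inverted index (term to document IDs) and uses `difflib` similarity as a stand-in for fuzzy matching. ElasticSearch's own fuzzy queries work on edit distance and are far more capable, so treat this purely as an illustration; the sample documents and the `fuzzy_search` helper are made up.

```python
import difflib
from collections import defaultdict

docs = {
    1: "reset your password from the account settings page",
    2: "track your order status in the orders page",
    3: "contact support to cancel a subscription",
}

# Inverted index: each term points to the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def fuzzy_search(query, cutoff=0.75):
    """Return IDs of documents matching any query word, tolerating small typos."""
    results = set()
    for word in query.lower().split():
        # Pick the single closest indexed term, if it is similar enough.
        matches = difflib.get_close_matches(word, list(index), n=1, cutoff=cutoff)
        if matches:
            results |= index[matches[0]]
    return sorted(results)


print(fuzzy_search("pasword"))      # [1]
print(fuzzy_search("subscripton"))  # [3]
```

Even with the misspellings, the right help articles come back, which is exactly what you want when a chatbot user types in a hurry.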
4. Amazon DynamoDB
It is a key-value based distributed database system created by Amazon, and it is highly scalable. It can handle around 10 trillion requests per day, so you can see why it is popular for large-scale workloads. Unfortunately, it is not open-source.
More than 700 companies are using DynamoDB in their tech stack, including Snapchat, Lyft, and Samsung.
When to use DynamoDB?
- If you are looking for a database that can handle simple key-value queries in very large numbers
- If you are working with an OLTP workload like online ticket booking or banking, where the data needs to be highly consistent
5. HBase
HBase is also an open-source, highly scalable, distributed database system. It is written in Java and runs on top of the Hadoop Distributed File System (HDFS).
More than 70 companies are using HBase in their tech stack, such as Hike, Pinterest, and HubSpot.
When to use HBase?
- When you have very large volumes of data (think petabytes) to process. If your data volume is small, HBase is unlikely to give you the desired results
- If your use case requires random and real-time access to the data, then HBase will be the appropriate option
- When you need to store real-time messages for billions of people
This is by no means an exhaustive list. There are more NoSQL databases out there but these are the most widely used in the industry.
If you have worked with any of these databases or any other NoSQL database, let me know in the comments section below. I would love to hear about your experience!