SQL vs NoSQL Databases – A Key Concept Every Data Engineer Should Know
- Understand what SQL and NoSQL databases are.
- Go through the prominent difference between SQL and No SQL Databases.
- This is not an exhaustive list. Feel free to add any other difference between SQL and NoSQL in the comments
You can’t get away from learning about databases in data science. In fact, we need to become quite familiar with how to handle databases, how to quickly execute queries, etc. as data science professionals. There’s just no way around it!
There are two things you should know – learn all you can about database management and then figure out how to efficiently go about it. Trust me, you will go a long way in the data science domain.
As a Data Engineer, you are bound to work with all kinds of databases, especially SQL and NoSQL. However, most of us already have some considerable experience with SQL databases. Where we falter is when we have to transition to NoSQL databases, and it can be a bit intimidating at first, to be honest – the beginning is always the hardest.
So, to flatten the obstacle for you, we will talk about some key differences between these two kinds of databases in this article. This will give you an overview of the two and will make it easier for you to begin your journey. Let’s begin!
Table of Contents
- What are SQL databases?
- What are NoSQL databases?
- Difference between SQL and NoSQL databases
- Schema design
- Data Structure
- Guiding Principle
What are SQL databases?
SQL is Standard Query Language that aids in querying relational databases. Hence, these databases are also often referred to as SQL databases.
The major advantage of databases over normal file storage systems is that it reduces data redundancy to a large extent, facilitates sharing of data among various users, and ensures the security of data which may be of immense importance to an organization.
Each database contains multiple tables, containing data in the form of rows and columns. And each table is related to a number of other tables within the database.
What are NoSQL databases?
NoSQL or Not only SQL came to the picture in the late 2000s. These are flexible, scalable, cost-efficient, and schema-less databases.
They were born out of the need to handle huge amounts of data we generate in today’s world, which comes in different varieties and generated at a high pace.
In comparison with SQL databases, they are of multiple types: document-based, key-value based, wide column-based, graph-based. Each has its own pros and cons.
Now let’s deep dive and look at some of the key differences between SQL and NoSQL databases.
Difference between SQL and NoSQL databases
SQL databases are relational databases that store data in multiple related tables. These tables are relations. Each relation is organized into rows and columns. Each row is a tuple and holds a record, and each column is an attribute for which each record usually holds a value. Tables in the database are related using the SQL keys.
The columns in the table hold a certain type of data. If a record contains data with any other data type, then the database will throw an error. Also, a record needs to contain the same number of values as the number of columns in the table or needs to provide a NULL value explicitly. The most popular examples of SQL databases are MySQL, PostgreSQL, and Oracle.
There are 4 types of NoSQL databases: document-based, key-value based, wide column-based, graph-based.
These databases store data in JSON-like documents. Each document has a key-value format, which means the data is semi-structured. Even if there is a missing value within a document for a key, the database will not throw an error. A popular example is MongoDB.
These databases store data in key-value format. Both keys and values can be anything, from string to complex values. The keys are stored in efficient index structures and can quickly and uniquely locate the values. This makes them ideal for applications that require fast retrieval of data. Amazon DynamoDB is an example of such databases.
Wide column-based databases
This database stores the data in records similar to any relational database but it has the ability to store very large numbers of dynamic columns. Meaning, the number of column values for rows can vary in such databases. It groups the columns logically into column families. Cassandra is a popular example.
They use nodes to store data entities like places, products, etc. and edges to store the relationship between them. There is no limit to the number and kind of relationships a node can have. Neo4j is an example of such databases.
Determining the structure or schema of the database before adding any data is a pre-requisite for SQL databases. This means that this kind of database can only store structured data. This makes it very inflexible to handle real-world data that is streaming at a ferocious pace. Updating the schema here would take a lot of time and effort and would need updating a lot of relations.
NoSQL databases on the other hand have no fixed structure. They can handle any kind of data: structured, semi-structured, or unstructured data. This means that even if the incoming data has a different number of attributes, the database will be able to handle it without any error. This makes NoSQL databases very popular because we can easily change the schema without much interruption.
There is no real difference between the two when it comes to speed. Both will perform equally well under most scenarios. However, you can notice some differences when it comes to handling complex queries and large datasets.
SQL databases require the storage of the data to be in the normalized form to prevent redundancy of data. Although this reduces the amount of storage required by the database and ensures easy updating of records, it can have some effect on the querying of the database. For example, performing complex queries like joins on a database containing multiple tables can prove to be quite taxing, especially when the size of data becomes quite significant. NoSQL databases overcome this disadvantage.
NoSQL databases do not care whether there is a duplication of data because storage is not an issue with NoSQL databases. Data in NoSQL databases are typically stored in a way that is optimized for queries. This means you can store data in the same way as you would require it after performing a query. This throws the whole problem of joins out for a toss and makes the task of querying much faster.
For example, SQL databases require you to maintain two separate tables for employee information and department information, linking them with a foreign key, maybe department id.
However, in the case of NoSQL databases, like MongoDB, you can store the complete information about the employee, including department information, within the same document, although you can perform some nesting of values if you wish.
Note: You can still perform joins in NoSQL databases.
SQL databases run on traditional machines. This means they run on a single server. Now, if you exceed the current capacity of your server, you would have to use a more powerful CPU, add more RAM, stack up the storage, etc. This is vertical scaling. This can turn out to be quite expensive, especially if you have to deal with Big Data (in the order of TB, GB, PB, etc.)
NoSQL databases on the other hand offer horizontal scaling. This means if you run out of capacity, you can simply add a machine to the cluster (a bunch of machines working together). These machines are usually much cheaper and are aptly referred to as commodity hardware. This ability of NoSQL databases has another major advantage besides the cheaper capacity building is the distribution of data.
NoSQL databases usually run on a number of interconnected machines which is known as a cluster. The data is distributed amongst the machines within the cluster. Each machine will store some portion of the data.
Now you must be wondering how is this beneficial?
Well, distributing data offers us the capability to replicate data and offer fault tolerance. Meaning, a portion of data might be replicated and stored on multiple machines.
If a machine fails, the data on it will be present on some other machine in the cluster and can be used without the user’s knowledge, thereby offering fault tolerance. This obviously is not possible with SQL databases because the storage of all the data is on the same machine.
A major benefit of SQL databases is their ability to handle transactional processing. These processes modify the contents of a database. The ACID properties of SQL databases govern:
- Atomicity – Transactions are performed one at a time or they don’t happen at all.
- Consistency – This ensures that the database is not left in half a complete state. If an error occurs, then it ensures that the roll back changes takes place.
- Isolation – Transactions occur independently. No transaction has access to any other transaction.
- Durability – The changes made to the database through transactions on completion are committed to the database and the updates are not lost.
NoSQL databases on the other hand do not provide ACID properties entirely. Instead, the CAP theorem governs them:
- Consistency – This means that the user should be able to see the same data no matter which node/machine they connect to on the system/cluster. So if a data has been written to one node, it should be replicated to all its replicas.
- Availability – This means that every request from the user should elicit a response from the system. Whether the user wants to read or write, the user should get a response even if the operation was unsuccessful.
- Partition Tolerance – Partition occurs when a node cannot receive any messages from another node in the system. It could have been because of network failure, server crash, or any other reason. Therefore, partition tolerance will ensure that the system should still be able to work even if there is a partition in the system.
However, the NoSQL databases have to make a tradeoff between Consistency and Availability when a Partition occurs. This is because, in a real-world system, the partition is bound to occur due to network failure or some other reason. Therefore, when a partition happens, a NoSQL database has to either give up on Consistency or Availability. Hence, a NoSQL distributed database is either characterized as CP or AP.
Note: NoSQL databases are not as rigid when it comes to CAP. Most do provide options to balance between Consistency and Availability. Therefore, the choice isn’t always as black and white.
The ACID property makes SQL databases of massive importance in the fields where transactions are extremely important. Banking is one such example where money transactions need to be handled properly, especially in case of a failed transfer, failure of which can cost a fortune.
Also, if your data is going to be structured and unchanging, there is no reason to use NoSQL databases. You can always harness the capabilities of your SQL databases and of course your stellar SQL knowledge!
However, if you are looking to work with a large volume of data with no set structure, NoSQL databases are the best choice! But even NoSQL databases can have a wide-ranging use case depending on the inherent structure and their preference of the CAP theorem properties.
While on one hand, ElasticSearch stores log data, Cassandra on the other hand is used by many social network websites. Nevertheless, all of these at the end of the day is aiding in handling the volume, velocity, and variety of Big Data!
In this article, we dicussed the major differences between SQL and NoSQL databases. This by no means is an exhaustive list of differences between the two databases. But hopefully, you got a good overview of the two!
Going forward I recommend you to try out the SQL course for Data Science and the following articles on SQL and NoSQL: