A Comprehensive Guide on Neo4j
This article was published as a part of the Data Science Blogathon.
Today, organizations are investing more than ever in graph analytics to extract valuable insights from massive, complex volumes of data.
For those who don’t know, Neo4j is one of the most popular graph databases. It gives developers and data scientists mature, trusted tools to quickly build intelligent applications and smart workflows.
This comprehensive guide is intended to help beginners understand Neo4j, from the basics to practical concepts. It explains the definitions and terms involved so that you can build up your expertise.
It also demonstrates how to install Neo4j and get started with it on your system.
Note: Before moving on with this guide, readers are strongly advised to have basic knowledge of database systems and graph theory.
Table Of Contents
- What is a Graph Database?
- Why Graph Database?
- Relational Database vs. Graph Database
- Advantages Of Neo4j
- Features Of Neo4j
- Neo4j Property Graph Data Model
- How To Install Neo4j On Ubuntu
- Neo4j CQL: Clauses, Functions, Datatypes, and Operators
- About Author
What is a Graph Database?
A graph database is a database purpose-built to store and navigate relationships. Instead of tables or documents, it stores nodes and relationships in a graph structure that supports semantic queries. The nodes store data entities (users, companies, or any other data an organization decides to record). The edges store the relationships between those entities, which makes the connections easy to understand. Data is stored without being restricted to a pre-defined model, which allows a very flexible way of analyzing and using it.
A node in a graph database can have any number and kind of relationships. An edge, in contrast, always has a start node, an end node, a type, and a direction. An edge can represent, for example, a parent-child relationship, an action, or ownership. Many organizations use graph databases to pull precise, in-depth data and to answer complex queries about their customers and users.
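To make the node and edge vocabulary concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j stores data internally; the `Graph` and `Edge` classes, the node ids, and the relationship types are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """An edge always has a start node, an end node, and a type.
    Its direction runs from start to end."""
    start: str
    end: str
    rel_type: str


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> data entity
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **data):
        self.nodes[node_id] = data

    def add_edge(self, start, end, rel_type):
        # Both endpoints must exist before they can be connected.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("unknown start or end node")
        self.edges.append(Edge(start, end, rel_type))

    def outgoing(self, node_id, rel_type=None):
        """Follow edges that start at node_id, optionally filtered by type."""
        return [
            e.end
            for e in self.edges
            if e.start == node_id and (rel_type is None or e.rel_type == rel_type)
        ]


# Hypothetical sample data: two users and a company.
g = Graph()
g.add_node("alice", kind="user")
g.add_node("bob", kind="user")
g.add_node("acme", kind="company")
g.add_edge("alice", "acme", "OWNS")
g.add_edge("alice", "bob", "PARENT_OF")

print(g.outgoing("alice"))               # ['acme', 'bob']
print(g.outgoing("alice", "PARENT_OF"))  # ['bob']
```

A node here can take part in any number of relationships, while each edge carries exactly one start, one end, and one type, mirroring the description above.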
Why Graph Database?
Most companies in this rapidly evolving world of technology and the internet struggle to deal with large volumes of data. It is crucial to generate insight from existing data, and the connections between items are often as important as the items themselves. So how do we solve this problem? To leverage data relationships, companies need a database technology that stores relationship details as a first-class entity.
While relational databases can store these relationships, they often perform poorly when queries must follow many of them, because each hop requires a join. A graph database addresses this need: it stores relationships natively alongside the nodes (data elements) in a more flexible format. Graph databases are optimized for traversing data quickly, and they make it easier to extend a data model or adapt it to changing business demands.
Relational Database vs. Graph Database
If you are still wondering how a relational database differs from a graph database, the comparison table below will help:
|Aspect|Relational Database|Graph Database|
|---|---|---|
|Format|It has tables with rows and columns.|It has nodes and edges that show the relationships among the data.|
|Relationships|Relationships span tables and are established using foreign keys.|Relationships are represented by edges between nodes.|
|Complex Queries|Complex queries require joins between tables.|Queries traverse relationships directly and do not require joins.|
|Use Cases|Widely adopted for transactional applications such as online transactions and accounting.|Mainly used for relationship-heavy use cases, including fraud detection and recommendation engines.|
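The difference between joining on foreign keys and traversing relationships can be sketched in plain Python. This is only an illustration with made-up data (`customers`, `orders`, and `adjacency` are hypothetical); real databases use indexes and far more sophisticated storage.

```python
# Relational style: two "tables" (lists of rows) linked by a foreign key.
customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]
orders = [
    {"id": 10, "customer_id": 1, "item": "laptop"},
    {"id": 11, "customer_id": 1, "item": "mouse"},
    {"id": 12, "customer_id": 2, "item": "desk"},
]


def items_via_join(customer_name):
    """Match rows from both tables on the foreign key (a nested-loop join)."""
    return [
        o["item"]
        for c in customers
        if c["name"] == customer_name
        for o in orders
        if o["customer_id"] == c["id"]
    ]


# Graph style: each node keeps its relationships next to it.
adjacency = {
    "Alice": {"ORDERED": ["laptop", "mouse"]},
    "Bob": {"ORDERED": ["desk"]},
}


def items_via_traversal(customer_name):
    """Follow the ORDERED relationships stored with the node; no join needed."""
    return adjacency[customer_name]["ORDERED"]


print(items_via_join("Alice"))       # ['laptop', 'mouse']
print(items_via_traversal("Alice"))  # ['laptop', 'mouse']
```

Both functions return the same answer. The join has to compare rows from two tables, while the traversal starts at one node and follows the relationships attached to it.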
Advantages Of Neo4j
Graph databases in general help enterprises move quickly on mission-critical, connected-data problems. Neo4j in particular offers the following benefits:
- It has a simple, flexible, and robust data model that can be adjusted easily as your application needs and business demands change.
- It delivers results based on real-time data.
- It offers high availability for large enterprise real-time applications, with transactional guarantees.
- Neo4j is a schema-free database and provides a straightforward representation of connected and semi-structured data.
- Using Neo4j, you can represent and retrieve (traverse/navigate) connected data easily, and typically faster than with non-graph databases.
- Neo4j provides a declarative query language, Cypher Query Language (CQL), which represents graph patterns visually using ASCII-art syntax. Its commands are human-readable and straightforward to understand.
- Neo4j stays fast on highly interconnected data because relationships are stored directly and are straightforward to retrieve and navigate.
- Neo4j does not need complicated joins to retrieve connected or related data. It is straightforward to retrieve a node’s neighboring nodes or relationship attributes without joins or indexes.
- Neo4j provides higher vertical scaling, improved operational characteristics, higher concurrency, and simplified tuning.
Features Of Neo4j
So far, we have seen what Neo4j is, why graph databases are popular, how a graph database differs from a relational database management system, and the advantages of Neo4j as a widely used graph database across enterprises and businesses.
In this section, we will examine a list of significant features of Neo4j:
1. Data Model With Flexible Schema
Neo4j follows a data model called the property graph model. The graph contains nodes (entities), and the nodes are connected by relationships. Nodes and relationships hold data in key-value pairs, called properties. There is no requirement to follow a fixed schema, so you can add or remove properties as needed. Neo4j also offers optional schema constraints.
2. ACID Properties
Neo4j transactions support the full set of ACID properties:
- A: Atomicity
- C: Consistency
- I: Isolation
- D: Durability
3. Scalability & Reliability
Neo4j allows you to scale the database to handle a larger number of read/write operations and a larger data volume without impacting query processing speed or data integrity. It also supports replication for data protection and reliability.
4. Built-in Web applications
Neo4j also offers a built-in web application, Neo4j Browser, which can be used to create and query your graph data.
5. Indexing & GraphDB
Neo4j supports indexes using Apache Lucene and follows the property graph data model.
6. Some Other General Features
Some of the additional features of Neo4j are:
- It provides a REST API that can be used from languages and frameworks such as Java, Spring, and Scala.
- It supports UNIQUE constraints and uses native graph storage along with a native GPE (Graph Processing Engine).
- It can be accessed from JavaScript, for example from Node.js applications and UI MVC frameworks.
- Cypher API and Native Java API are the two types of Java API supported by Neo4j. Using these APIs, you can develop robust Java applications.
- In addition, Neo4j supports exporting query data to JSON and XLS formats so that it can be used with other databases such as MongoDB or Cassandra.
Neo4j Property Graph Data Model
As we saw in the features section, Neo4j follows a property graph data model to store and manipulate its data. This section will discuss some of the critical features and central building blocks of the property graph data model, which are:
- First, it is essential to know that a graph data structure consists of nodes (data entities or discrete objects) connected by relationships.
- Nodes can have zero or more labels that classify them, and they are drawn as circles.
- Relationships represent a link between a source node and a target node. Every relationship has a direction and is drawn as an arrow from the source to the target. A two-way connection can be modeled with one relationship in each direction or by ignoring the direction when querying.
- Each relationship must have exactly one type that classifies it, along with a start node (the “from” node) and an end node (the “to” node).
- Nodes and relationships can have properties, which are key-value pairs that further describe them.
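The building blocks listed above can be sketched as two small Python classes, together with a helper that renders them in Cypher's ASCII-art pattern style. The class names, labels, and property values are hypothetical. The rendering relies on Python's `repr`, which is only adequate for the simple strings and integers used here; real applications should pass values as query parameters instead of building query text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: tuple                                   # zero or more labels
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Relationship:
    start: Node                                     # "from" node
    end: Node                                       # "to" node
    rel_type: str                                   # exactly one type
    properties: dict = field(default_factory=dict)


def _props(properties):
    """Render a property map such as {name: 'Alice'} (simple values only)."""
    if not properties:
        return ""
    body = ", ".join(f"{key}: {value!r}" for key, value in properties.items())
    return " {" + body + "}"


def node_pattern(node):
    labels = "".join(f":{label}" for label in node.labels)
    return f"({labels}{_props(node.properties)})"


def relationship_pattern(rel):
    arrow = f"-[:{rel.rel_type}{_props(rel.properties)}]->"
    return node_pattern(rel.start) + arrow + node_pattern(rel.end)


alice = Node(("Person",), {"name": "Alice"})
bob = Node(("Person", "Employee"), {"name": "Bob", "age": 30})
knows = Relationship(alice, bob, "KNOWS", {"since": 2020})

print(relationship_pattern(knows))
# (:Person {name: 'Alice'})-[:KNOWS {since: 2020}]->(:Person:Employee {name: 'Bob', age: 30})

# No fixed schema: a property can be added to one node without touching others.
alice.properties["city"] = "Toronto"
print(node_pattern(alice))
# (:Person {name: 'Alice', city: 'Toronto'})
```

The parentheses stand for the circles used to draw nodes, and the `-[...]->` arrow shows the relationship's type and direction.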
How To Install Neo4j On Ubuntu
This section explains how to install and configure Neo4j on an Ubuntu 20.04 server.
For setting up Neo4j, the following setup is recommended:
- 1 GB of RAM, 15 GB of free storage, and a single-core server
- All commands should be run as root. If you are not logged in as root, commands must be preceded by sudo. In this tutorial, we will use sudo for convenience.
- Ubuntu 20.04
The official Ubuntu package repositories do not include Neo4j. To install the upstream-supported package from Neo4j, we will add a package source that points to the Neo4j repository, add the GPG key from Neo4j to verify the packages, and then install Neo4j.
We start by updating the existing list of packages.
This step installs a few prerequisite packages needed for HTTPS connections, which secure the installation. These packages may already be installed on your system by default, but it is safe to install them again anyway.
In this step, we add the GPG security key for the official Neo4j package repository. This key confirms that what you are installing comes from the official Neo4j upstream repository.
Next, add the Neo4j repository to your system’s package manager list.
The final phase is to install the Neo4j package and all of its dependencies. This installation also downloads and installs a compatible Java package to work with Neo4j, so enter “Y” when prompted to accept the software install. If your system already has a compatible Java version installed, the installer will skip this stage.
Start Neo4j Service
After the installation, Neo4j should be running. However, we need to enable it as a “neo4j.service” service to set it to start on a reboot of the system.
Next, examine Neo4j’s status using the “systemctl” command. This step is essential to verify that everything is working as expected.
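The installation and service steps described above can be collected into a small dry-run script that only prints the shell commands in order and executes nothing. The commands are an assumption based on the commonly documented Debian/Ubuntu setup for Neo4j, not something stated in this article. In particular, the repository line, its `latest` release component, and the key URL should be checked against the official Neo4j installation documentation for the version you want.

```python
import shlex

# Assumptions (not taken from this article): repository line and key URL as
# commonly documented for Neo4j's Debian/Ubuntu packages. Verify both against
# the official installation docs before running anything.
NEO4J_REPO = "deb https://debian.neo4j.com stable latest"
NEO4J_KEY_URL = "https://debian.neo4j.com/neotechnology.gpg.key"

# (description, shell command) pairs, in the order the text describes them.
STEPS = [
    ("Update the package list", "sudo apt update"),
    ("Install HTTPS prerequisites",
     "sudo apt install apt-transport-https ca-certificates curl "
     "software-properties-common"),
    ("Add the Neo4j GPG key",
     f"curl -fsSL {NEO4J_KEY_URL} | sudo apt-key add -"),
    ("Add the Neo4j repository",
     f"sudo add-apt-repository {shlex.quote(NEO4J_REPO)}"),
    ("Install Neo4j", "sudo apt install neo4j"),
    ("Enable the service at boot", "sudo systemctl enable neo4j.service"),
    ("Check the service status", "sudo systemctl status neo4j.service"),
]


def render_plan(steps=STEPS):
    """Return the steps as numbered text; nothing is executed."""
    return "\n".join(
        f"{number}. {description}\n   $ {command}"
        for number, (description, command) in enumerate(steps, start=1)
    )


print(render_plan())
```

Printing the plan first makes it easy to review each command before typing it into a terminal on the server.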
Testing and Working With Neo4j
Now that you have Neo4j and its dependencies installed on your system and its services started, you are all set to test the DB connection and configure the admin user.
To interact with the Neo4j database on the command line, we will launch the internal utility using the “cypher-shell” command.
At first, you will be asked to enter a username and password, which by default are both “neo4j”. Once you are authenticated, you will be prompted to change the administrator password to one of your choice.
Once you enter the new password, you will be connected to the interactive “neo4j” prompt. This is where you interact with the Neo4j database by inserting and querying nodes.
After setting an administrator password and testing the connection to Neo4j, use the exit command to leave the shell.
Neo4j CQL: Clauses, Functions, Datatypes, and Operators
As discussed in the earlier sections, Neo4j uses CQL (Cypher Query Language) as its query language. Now we will look at some of the clauses, functions, data types, and operators supported in CQL.
Neo4j CQL Clauses
There are read, write, and a few general clauses of Neo4j CQL.
- MATCH: is used to search the data with a specified pattern.
- OPTIONAL MATCH: works the same as MATCH, but uses null for missing parts of the pattern.
- WHERE: is used to add constraints (filter conditions) to CQL queries.
- START: is used to locate the starting points through legacy indexes. It is a legacy clause, and recent Neo4j versions no longer support it.
- LOAD CSV: is employed to import data from a CSV file stored locally in the system.
- CREATE: is used to create nodes, properties, and relationships in the DB.
- SET: is employed to update labels on nodes and properties on nodes and relationships.
- MERGE: is used to verify whether the specified pattern exists in the graph. If not, then it creates the pattern.
- DELETE: is used to delete nodes, relationships, and paths from the DB.
- REMOVE: is used to remove properties from nodes and relationships, and labels from nodes.
- FOREACH: is used to update the data within a list.
- CREATE UNIQUE: is used to get a unique pattern by matching as much of the existing pattern as possible and creating only the missing parts. It is a legacy clause that MERGE has largely replaced.
- RETURN: is used to define what to have in the query result set.
- ORDER BY: is used along with RETURN or WITH to arrange the output of a query in order.
- LIMIT: is used to limit the results’ rows to a specific value.
- SKIP: is used to define from which row to start including rows in the output.
- WITH: is used to chain the query parts together.
- UNWIND: is used to expand a list into a series of rows.
- UNION: is employed to join the outcomes of multiple queries.
- CALL: is used to invoke a procedure deployed in the database.
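To give a feel for what some of these clauses do, the sketch below mimics MERGE and the ORDER BY / SKIP / LIMIT combination on a plain Python list. It only imitates the behavior described above and does not talk to a Neo4j database; the `people` list and both helper functions are hypothetical. The Cypher shown in the docstrings is for reference.

```python
people = []  # stands in for the nodes labeled Person


def merge_person(name):
    """Mimic: MERGE (p:Person {name: $name})
    Match the node if it exists; otherwise create it."""
    for person in people:
        if person["name"] == name:
            return person, False   # matched an existing node
    person = {"name": name}
    people.append(person)
    return person, True            # created a new node


def page_of_names(skip, limit):
    """Mimic: MATCH (p:Person) RETURN p.name
              ORDER BY p.name SKIP $skip LIMIT $limit"""
    ordered = sorted(person["name"] for person in people)  # ORDER BY
    return ordered[skip:skip + limit]                      # SKIP, then LIMIT


for name in ["Dana", "Alice", "Carol", "Bob", "Alice"]:
    merge_person(name)

print(len(people))          # 4: the second "Alice" was matched, not created
print(page_of_names(1, 2))  # ['Bob', 'Carol']
```

Because MERGE matches before it creates, running the same statement twice does not produce duplicates, which is the main reason to prefer it over CREATE when loading data that may already exist.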
Neo4j CQL Functions
The following categories of functions are frequently used in Neo4j CQL queries:
- String: is used while working with string literals.
- Aggregation: is used to conduct some aggregation operations on CQL query results.
- Relationship: is used to fetch details of relationships such as start node, end node, etc.
Neo4j CQL Data Types
Most Neo4j data types are similar to the Java language data types. They are used to define the properties of a node or a relationship.
- boolean: is used to define boolean values (true, false).
- byte: is used to describe an 8-bit integer.
- short: is used to describe a 16-bit integer.
- int: is used to describe a 32-bit integer.
- long: is used to describe a 64-bit integer.
- float: is used to describe a 32-bit floating-point number.
- double: is used to express a 64-bit floating-point number.
- char: is used to express a 16-bit character.
- String: is used to represent a literal string.
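The bit widths above determine the range of values each integer type can hold. As a small sketch, assuming the signed two's-complement ranges that the corresponding Java types use, the limits can be computed like this (the helper names are hypothetical):

```python
# Bit widths of the integer types listed above.
INTEGER_BITS = {"byte": 8, "short": 16, "int": 32, "long": 64}


def signed_range(bits):
    """Smallest and largest value of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(value):
    """Pick the narrowest listed type that can hold the value."""
    for name, bits in INTEGER_BITS.items():
        low, high = signed_range(bits)
        if low <= value <= high:
            return name
    raise OverflowError(f"{value} does not fit in a 64-bit integer")


for name, bits in INTEGER_BITS.items():
    print(name, signed_range(bits))

print(smallest_type_for(127))    # byte
print(smallest_type_for(128))    # short
print(smallest_type_for(40000))  # int
```

For example, a `byte` covers -128 to 127, so a value such as 128 already needs a `short`.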
Neo4j CQL Operators
Here are the operators supported by Neo4j CQL.
- Mathematical Operators:- [ +, -, *, /, %, ^ ]
- Comparison Operators:- [ =, <>, <, >, <=, >= ]
- Boolean Operators:- [ AND, OR, XOR, NOT ]
- String Operators:- [ + ]
- List Operators:- [ +, IN, [X], [X..Y] ]
- Regular Expression:- [ =~ ]
- Matching String:- [ STARTS WITH, ENDS WITH, CONTAINS ]
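For readers who know Python better than CQL, the sketch below pairs several of these operators with their closest Python analogue. It runs no Cypher; it is only a rough guide to their meaning, and the sample values are made up. One difference worth noting in the comments: the Cypher `^` operator is exponentiation, whereas in Python `^` is bitwise XOR.

```python
import re

name = "Neo4j Graph Database"
numbers = [0, 1, 2, 3, 4, 5]

analogues = {
    # String matching
    "STARTS WITH": name.startswith("Neo4j"),
    "ENDS WITH": name.endswith("Database"),
    "CONTAINS": "Graph" in name,
    # =~ has to match the whole string, so fullmatch is the closest analogue
    "=~": re.fullmatch(r"Neo4j.*", name) is not None,
    # List operators
    "IN": 3 in numbers,
    "[X]": numbers[2],
    "[X..Y]": numbers[1:4],   # start inclusive, end exclusive
    "+ (list)": numbers + [6],
    # String and mathematical operators
    "+ (string)": "Neo" + "4j",
    "^": 2 ** 3,              # Cypher ^ is exponentiation
    "%": 7 % 3,
    # Boolean operators
    "XOR": True != False,
    "<>": 1 != 2,
}

for operator, result in analogues.items():
    print(f"{operator:12} -> {result}")
```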
Neo4j was built to help users solve many different kinds of business and technical problems. It is simple to use and fits a wide range of use cases, whether you depend on graph transactions, market analysis, operational optimization, or anything else. It is also designed to integrate with the other tools in your existing system.
Here are a few resources to support your further journey into this tool:
- Here are resources for the drivers available for popular programming languages. These drivers allow developers to create applications and integrations using the programming language of their preference.
- Here are resources for extensions and integrations to expand the technological ecosystem and developers’ needs using Neo4j.
- Here are the resources to deploy and run Neo4j in production environments, local or cloud environments, and anything in between.
- Here is the resource for official reference documentation, including tutorials, guides, code examples, and much more.
- Here is the resource to get yourself certified in Neo4j. This link provides training classes online and in the classroom across the globe. You can learn everything from the basics to advanced CQL, whatever your skill level.
- Here is the resource to contribute to Neo4j. You are welcome to contribute no matter what level of experience you have.
I am Mrinal, a Data Scientist with a Bachelor’s degree in computer science, specializing in Machine Learning, Artificial Intelligence, and Computer Vision. I am also a freelance blogger, author, and geek with five years of experience. Having worked across most areas of computer science, I am currently pursuing a Master’s in Applied Computing with a specialization in AI at the University of Windsor, and I work as a freelance content writer and content analyst.
Connect with me on my social media profiles and follow me for a quick virtual cup of coffee.
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.