Exploring GraphRAG from Theory to Implementation

Nibedita Dutta Last Updated : 25 Nov, 2024
8 min read

GraphRAG adopts a more structured and hierarchical approach to Retrieval Augmented Generation (RAG) than traditional RAG methods, which rely on basic semantic searches of unorganized text snippets. The process begins by converting raw text into a knowledge graph, organizing the data into a community structure, and summarizing these groupings. GraphRAG then leverages this organized information during retrieval, delivering more precise and context-aware results in RAG-based tasks.

Learning Objectives

  • Understand what GraphRAG is, why it matters, and how it improves upon traditional Naive RAG models.
  • Gain a deeper understanding of Microsoft’s GraphRAG, particularly its application of knowledge graphs, community detection, and hierarchical structures. Learn how both global and local search functionalities operate within this system.
  • Participate in a hands-on Python implementation of Microsoft’s GraphRAG library to get a practical understanding of its workflow and integration.
  • Compare and contrast the outputs produced by GraphRAG and traditional RAG methods to highlight the improvements and differences.
  • Identify the key challenges faced by GraphRAG, including resource-intensive processes and optimization needs in large-scale applications.

This article was published as a part of the Data Science Blogathon.

What is GraphRAG?

Retrieval-Augmented Generation (RAG) is a novel methodology that integrates the power of pre-trained large language models (LLMs) with external data sources to create more precise and contextually rich outputs. The synergy of state-of-the-art LLMs with contextual data enables RAG to deliver responses that are not only well-articulated but also grounded in factual and domain-specific knowledge.

GraphRAG (Graph-based Retrieval Augmented Generation) is an advanced extension of standard or traditional RAG that leverages knowledge graphs to improve information retrieval and response generation. Unlike standard RAG, which relies on simple semantic search over plain text snippets, GraphRAG organizes and processes information in a structured, hierarchical format.

Why GraphRAG over Traditional/Naive RAG?

Struggles with Information Scattered Across Different Sources. Traditional Retrieval-Augmented Generation (RAG) faces challenges when it comes to synthesizing information scattered across multiple sources. It struggles to identify and combine insights linked by subtle or indirect relationships, making it less effective for questions requiring interconnected reasoning.

Lacks in Capturing Broader Context. Traditional RAG methods often fall short in capturing the broader context or summarizing complex datasets. This limitation stems from a lack of the deeper semantic understanding needed to extract overarching themes or accurately distill key points from intricate documents. When we execute a query like “What are the main themes in the dataset?”, it becomes difficult for traditional RAG to identify relevant text chunks unless the dataset explicitly defines those themes. In essence, this is a query-focused summarization task rather than an explicit retrieval task, and it is precisely the kind of task traditional RAG struggles with.

Limitations of RAG addressed by GraphRAG

We will now look into the limitations of RAG addressed by GraphRAG:

  • By leveraging the interconnections between entities, GraphRAG refines its ability to pinpoint and retrieve relevant data with higher precision.
  • Through the use of knowledge graphs, GraphRAG offers a more detailed and nuanced understanding of queries, aiding in more accurate response generation.
  • By grounding its responses in structured, factual data, GraphRAG significantly reduces the chances of producing incorrect or fabricated information.

How Does Microsoft’s GraphRAG Work?

GraphRAG extends the capabilities of traditional Retrieval-Augmented Generation (RAG) by incorporating a two-phase operational design: an indexing phase and a querying phase. During the indexing phase, it constructs a knowledge graph, hierarchically organizing the extracted information. In the querying phase, it leverages this structured representation to deliver highly contextual and precise responses to user queries.

Indexing Phase

The indexing phase comprises the following steps (a minimal conceptual sketch in Python follows the list):

  • Split input texts into smaller, manageable chunks.
  • Extract entities and relationships from each chunk.
  • Summarize entities and relationships into a structured format.
  • Construct a knowledge graph with nodes as entities and edges as relationships.
  • Identify communities within the knowledge graph using community-detection algorithms such as Leiden.
  • Summarize individual entities and relationships within smaller communities.
  • Create higher-level summaries for aggregated communities hierarchically.
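
Below is a minimal conceptual sketch of these indexing steps in Python. The extraction and summarization callables stand in for LLM-backed steps, the function names are not part of the graphrag API, and Louvain is used here purely as a readily available stand-in for the Leiden algorithm that GraphRAG uses.

# Conceptual sketch of the indexing phase; illustrative only, not the graphrag implementation.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def build_index(documents, extract_relations, summarize_community, chunk_size=1200):
    """extract_relations(chunk) returns (source, relation, target) triples and
    summarize_community(subgraph) returns a summary string; both would be
    LLM-backed in practice and are assumptions for this sketch."""
    # 1. Split input texts into smaller, manageable chunks.
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]

    # 2-4. Extract entities and relationships per chunk, then build the knowledge graph.
    graph = nx.Graph()
    for chunk in chunks:
        for source, relation, target in extract_relations(chunk):
            graph.add_edge(source, target, relation=relation)

    # 5. Detect communities (Louvain shown as a stand-in for Leiden).
    communities = louvain_communities(graph)

    # 6-7. Summarize each community; higher levels would aggregate these summaries.
    summaries = [summarize_community(graph.subgraph(c)) for c in communities]
    return graph, summaries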

Querying Phase

Equipped with the knowledge graph and detailed community summaries, GraphRAG can then respond to user queries accurately using one of the two search modes of the querying phase.

Global Search – For inquiries that demand a broad analysis of the dataset, such as “What are the main themes discussed?”, GraphRAG utilizes the compiled community summaries. This approach enables the system to integrate insights across the dataset, delivering thorough and well-rounded answers.

Local Search – For queries targeting a specific entity, GraphRAG leverages the interconnected structure of the knowledge graph. By navigating the entity’s immediate connections and examining related claims, it gathers pertinent details, enabling the system to deliver accurate and context-sensitive responses.
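
As a rough illustration of the global path, the map-reduce idea behind it can be sketched as follows. The function and the prompts are our own simplification, not the library's actual implementation; llm stands for any callable that takes a prompt string and returns a completion.

# Minimal sketch of global search as a map-reduce over community summaries (illustrative only).
def global_search(query: str, community_summaries: list[str], llm) -> str:
    # Map step: answer the query from each community summary independently.
    partial_answers = [
        llm(f"Answer the question using only this summary.\n"
            f"Summary:\n{summary}\n\nQuestion: {query}")
        for summary in community_summaries
    ]
    # Reduce step: merge the partial answers into one final response.
    joined = "\n\n".join(partial_answers)
    return llm(f"Combine these partial answers into a single coherent answer "
               f"to the question '{query}':\n{joined}")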

Python Implementation of Microsoft’s GraphRAG

Let us now walk through the Python implementation of Microsoft’s GraphRAG in the detailed steps below.

Step1: Creating Python Virtual Environment and Installation of Library

Make a folder and create a Python virtual environment in it. We create the folder GRAPHRAG, as shown below. Within the created folder, we then install the graphrag library using the command “pip install graphrag”.
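
For reference, a typical sequence of commands might look like the following; the virtual environment name venv is our own choice, and the activation command differs by operating system.

mkdir GRAPHRAG
cd GRAPHRAG
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate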

pip install graphrag

Step2: Generation of settings.yaml File

Inside the GRAPHRAG folder, we create an input folder and put some text files in it. We have used this txt file and kept it inside the input folder. The text of the article has been taken from this news website.

From the folder that contains the input folder, run the following command:

python -m graphrag.index --init --root .

This command leads to the creation of a .env file and a settings.yaml file.


In the .env file, enter your OpenAI key, assigning it to the GRAPHRAG_API_KEY variable. This key is then used by the settings.yaml file under the “llm” fields. Other parameters such as model name, max_tokens, and chunk size, among many others, can be defined in the settings.yaml file. We have used the “gpt-4o” model and defined it in the settings.yaml file.

GRAPHRAG_API_KEY=<your-openai-api-key>
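
For reference, the relevant portion of the generated settings.yaml looks roughly like the excerpt below. The exact keys and defaults can differ between graphrag versions, so treat this as an illustration rather than a verbatim copy of the file.

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gpt-4o
  max_tokens: 4000

chunks:
  size: 1200
  overlap: 100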

Step3: Running the Indexing Pipeline

We run the indexing pipeline using the following command from inside the “GRAPHRAG” folder.

python -m graphrag.index --root .

All the steps defined in the previous section under Indexing Phase take place in the backend as soon as we execute the above command.

Prompts Folder

To execute all the steps of the indexing phase, such as entity and relationship detection, knowledge graph creation, community detection, and summary generation of different communities, the system makes multiple LLM calls using prompts defined in the “prompts” folder. The system generates this folder automatically when you run the indexing command.


Adapting the prompts to the specific domain of your documents is essential for improving results. For example, in the entity_extraction.txt file, you can include examples of entities relevant to the domain of your text corpus to get more accurate results from GraphRAG.
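
Besides editing the prompt file itself, the versions of graphrag we have worked with also expose the entity extraction configuration in settings.yaml, where you can tailor the entity types to your domain. The keys below are illustrative and may differ slightly in your version of the library.

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1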

Embeddings Stored in LanceDB

Additionally, LanceDB is used to store the embeddings data for each text chunk.
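
To peek at these embeddings, you can open the store with the lancedb Python client. The path below is an assumption about where the indexing run placed the database, so adjust it to match your own output folder.

import lancedb

# The path is an assumption; locate the "lancedb" directory created by your indexing run.
db = lancedb.connect("./lancedb")
print(db.table_names())                     # list the stored tables
table = db.open_table(db.table_names()[0])  # open the first table
print(table.to_pandas().head())             # inspect a few embedding rows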

Parquet Files for Graph Data

The output folder stores a number of parquet files containing the graph and related data, such as the extracted entities, relationships, and community summaries (see the example below for inspecting them).

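These parquet files can be inspected directly with pandas. The path and file name below are hypothetical examples of what an indexing run may produce; replace them with the actual names in your own output folder.

import pandas as pd

# Hypothetical path; replace with an actual parquet file from your output folder.
entities = pd.read_parquet("output/artifacts/create_final_entities.parquet")
print(entities.columns)
print(entities.head())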

Step4: Running a Query

In order to run a global query like “top themes of the document”, we can run the following command from the terminal within the GRAPHRAG folder.

python -m graphrag.query --root . --method global "What are the top themes in the document?"

A global query uses the generated community summaries to answer the question: intermediate answers are produced from the relevant summaries and then combined into the final answer.

The GraphRAG output for our txt file is shown below:

(Figure: Response of GraphRAG for global search)

Comparison with Output of Naive RAG:

The code for Naive RAG can be found in my Github.

1. The integration of SAP and Microsoft 365 applications
2. The potential for a seamless user experience
3. The collaboration between SAP and Microsoft
4. The goal of maximizing productivity
5. The preview at Microsoft Ignite
6. The limited preview announcement
7. The opportunity to register for the limited preview.

In order to run a local query relevant to our document like “What is Microsoft and SAP collaboratively working towards?”, we can run the following command from the terminal within the GRAPHRAG folder. The command below specifically designates the query as a local query, ensuring that the execution delves deeper into the knowledge graph instead of relying on the community summaries used in global queries.

python -m graphrag.query --root . --method local "What is SAP and Microsoft collaboratively working towards?"

Output of GraphRAG

(Figure: Response from GraphRAG for local search)

Comparison with Output of Naive RAG:

The code for Naive RAG can be found in my Github.

Microsoft and SAP are working towards a seamless integration of their AI copilots, Joule and Microsoft 365 Copilot, to redefine workplace productivity and allow users to perform tasks and access data from both systems without switching between applications.

As observed from both the global and local outputs, the responses from GraphRAG are much more comprehensive and explainable as compared to responses from Naive RAG.

Challenges of GraphRAG

There are certain challenges that GraphRAG struggles with, listed below:

  • Multiple LLM calls: Owing to the multiple LLM calls made in the process, GraphRAG can be expensive and slow. Cost optimization is therefore essential in order to ensure scalability.
  • High Resource Consumption: Constructing and querying knowledge graphs involves significant computational resources, especially when scaling for large datasets. Processing large graphs with many nodes and edges requires careful optimization to avoid performance bottlenecks.
  • Complexity in Semantic Clustering: Identifying meaningful clusters using algorithms like Leiden can be challenging, especially for datasets with loosely connected entities. Misidentified clusters can lead to fragmented or overly broad community summaries.
  • Handling Diverse Data Formats: GraphRAG relies on structured inputs to extract meaningful relationships. Unstructured, inconsistent, or noisy data can complicate the extraction and graph-building process.

Conclusion

GraphRAG demonstrates significant advancements over traditional RAG by addressing its limitations in reasoning, context understanding, and reliability. It excels in synthesizing dispersed information across datasets by leveraging knowledge graphs and structured entity relationships, enabling a deeper semantic understanding.

Microsoft’s GraphRAG enhances traditional RAG by combining a two-phase approach: indexing and querying. The indexing phase builds a hierarchical knowledge graph from extracted entities and relationships, organizing data into structured summaries. In the querying phase, GraphRAG leverages this structure for precise and context-rich responses, catering to both global dataset analysis and specific entity-based queries.

However, GraphRAG’s benefits come with challenges, including high resource demands, reliance on structured data, and the complexity of semantic clustering. Despite these hurdles, its ability to provide accurate, holistic responses establishes it as a powerful alternative to naive RAG systems for handling intricate queries.

Key Takeaways

  • GraphRAG enhances RAG by organizing raw text into hierarchical knowledge graphs, enabling precise and context-aware responses.
  • It employs community summaries for broad analysis and graph connections for specific, in-depth queries.
  • GraphRAG overcomes limitations in context understanding and reasoning by leveraging entity interconnections and structured data.
  • Microsoft’s GraphRAG library supports practical application with tools for knowledge graph creation and querying.
  • Despite its precision, GraphRAG faces hurdles such as resource intensity, semantic clustering complexity, and handling unstructured data.
  • By grounding responses in structured knowledge, GraphRAG reduces inaccuracies common in traditional RAG systems.
  • Ideal for complex queries requiring interconnected reasoning, such as thematic analysis or entity-specific insights.

Frequently Asked Questions

Q1. Why is GraphRAG preferred over traditional RAG for complex queries?

A. GraphRAG excels at synthesizing insights across scattered sources by leveraging the interconnections between entities, unlike traditional RAG, which struggles with identifying subtle relationships.

Q2. How does GraphRAG create a knowledge graph during the indexing phase?

A. It processes text chunks to extract entities and relationships, organizes them hierarchically using algorithms like Leiden, and builds a knowledge graph where nodes represent entities and edges indicate relationships.

Q3. What are the two key search methods in GraphRAG’s querying phase?

A. Global Search: Uses community summaries for broad analysis, answering queries like “What are the main themes discussed?”.
Local Search: Focuses on specific entities by exploring their direct connections in the knowledge graph.

Q4. What challenges does GraphRAG face?

A. GraphRAG encounters issues like high computational costs due to multiple LLM calls, difficulties in semantic clustering, and complications with processing unstructured or noisy data.

Q5. How does GraphRAG enhance context understanding in response generation?

A. By grounding its responses in hierarchical knowledge graphs and community-based summaries, GraphRAG provides deeper semantic understanding and contextually rich answers.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
