PII Detection and Masking in RAG Pipelines

Sukanya Last Updated : 05 Apr, 2024

Introduction

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) is paramount. PII encompasses data like names, addresses, phone numbers, and financial records, vital for individual identification. With the rise of artificial intelligence and its vast data processing capabilities, protecting PII while harnessing its potential for personalized experiences is crucial. Retrieval Augmented Generation (RAG) emerges as a solution, blending information retrieval with advanced language generation models. These systems sift through extensive data repositories to extract relevant information, refining AI-generated outputs for precision and context.

Yet, the utilization of user data poses risks of unintentional PII exposure. PII detection technologies mitigate this risk, automatically identifying and concealing sensitive data. With stringent privacy measures, RAG models leverage user data to offer tailored services while upholding privacy standards. This integration underscores the ongoing endeavor to balance personalized data usage with user privacy, prioritizing data confidentiality as AI technology advances.

Learning Objectives

The article delves into developing a potent PII detection tool with the Llama Index and Presidio, a Microsoft anonymization library.
Presidio swiftly detects and anonymizes sensitive personal data, offering users customizable PII detection tools with advanced techniques like NER, Regular Expressions, and checksum algorithms.
Users can customize the anonymization process with Presidio’s flexible framework, enhancing control.
Llama Index seamlessly integrates Presidio’s functionality for an accessible solution.
The article compares Presidio with NER PII post-processing tools, showcasing Presidio’s superiority and practical benefits.

Table of contents

Introduction
Hands-on PII detection using Llama Index Post-processing tools
Assessing the Limitations of NERPIINodePostprocessor and Introduction to Presidio
Analyzing PII Masking with Presidio
Applications and Limitations
Conclusion

This article was published as a part of the Data Science Blogathon.

Hands-on PII detection using Llama Index Post-processing tools

Let’s start our exploration with the NERPIINodePostprocessor tool from Llama Index. For that, we will need to install a few necessary packages.

The required packages are listed below:

llama-index==0.10.22
llama-index-agent-openai==0.1.7
llama-index-cli==0.1.11
llama-index-core==0.10.23
llama-index-indices-managed-llama-cloud==0.1.4
llama-index-legacy==0.9.48
llama-index-multi-modal-llms-openai==0.1.4
llama-index-postprocessor-presidio==0.1.1
llama-parse==0.3.9
llamaindex-py-client==0.1.13
presidio-analyzer==2.2.353
presidio-anonymizer==2.2.353
pydantic==2.5.3
pydantic_core==2.14.6
spacy==3.7.4
torch==2.2.1+cpu
transformers==4.39.1

To test the tool, we need dummy data containing PII. For this experiment we use a handwritten text with fabricated names, a credit card number, bank details, an email address, an IP address, and a URL. Any text of your choice works equally well, or you can ask an LLM such as GPT to generate one. We use the following text:

text = """
Hi there! You can call me Max Turner. Reach out at [email protected],
and you'll find me strolling the streets of Vienna. My plastic friend, the 
Mastercard, reads 5300-1234-5678-9000. Ever vibed at a gig by Zsofia Kovacs? 
I'm curious. As for my card, it has a limit I'd rather not disclose here; 
however, my bank details are as follows: AT611904300235473201. Turner is the 
family name. Tracing my roots, I've got ancestors named Leopold Turner and
Elisabeth Baumgartner. Also, a quick FYI: I tried to visit your website, but 
my IP (203.0.113.5) seems to be barred. I did, however, manage to post a 
visual at this link: http://MegaMovieMoments.fi.
"""

Step 1: Initializing the Tool and Importing Dependencies

With the packages installed and the sample text prepared, we can use the NERPIINodePostprocessor. Import it from Llama Index along with the TextNode and NodeWithScore schema classes. These are needed because the postprocessor operates on node objects rather than raw strings.

Below is the code snippet for imports:

from pprint import pprint  # used below to print the results

from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.core.schema import NodeWithScore, TextNode

Step 2: Creating TextNode Objects

Following the imports, we proceed to create a TextNode object using our sample text.

text_node = TextNode(text=text)

Step 3: Post-processing Sensitive Entities

Subsequently, we create a NERPIINodePostprocessor object and apply it to our TextNode object to post-process and mask the sensitive entities.

processor = NERPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Step 4: Reviewing Post-Processed Text and PII Entity Mapping

After completing the post-processing of our text, we can now examine the post-processed text alongside the PII entity mapping.

pprint(new_nodes[0].node.get_content())

# OUTPUT
# 'Hi there! You can call me [PER_26]. Reach out at [email protected], '
# "and you'll find me strolling the streets of [LOC_122]. My plastic friend, "
# 'the [ORG_153], reads 5300-1234-5678-9000. Ever vibed at a gig by [PER_215]? '
# "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. [PER_367] is '
# "the family name. Tracing my roots, I've got ancestors named Leopold "
# '[PER_367] and [PER_456]. Also, a quick FYI: I tried to visit your website, '
# 'but my IP (203.0.113.5) seems to be barred. I did, however, manage to post a '
# 'visual at this link: [ORG_627].fi.')

pprint(new_nodes[0].node.metadata)

# OUTPUT
# {'__pii_node_info__': {'[LOC_122]': 'Vienna',
#                        '[ORG_153]': 'Mastercard',
#                        '[ORG_627]': 'MegaMovieMoments',
#                        '[PER_215]': 'Zsofia Kovacs',
#                        '[PER_26]': 'Max Turner',
#                        '[PER_367]': 'Turner',
#                        '[PER_437]': 'Leopold Turner',
#                        '[PER_456]': 'Elisabeth Baumgartner'}}

Assessing the Limitations of NERPIINodePostprocessor and Introduction to Presidio

Reviewing the results, the postprocessor clearly fails to mask highly sensitive values such as the credit card number, the bank account number (IBAN), the IP address, and the email address. This falls short of our goal of masking every sensitive entity, including names, addresses, credit card numbers, and email addresses.

The NERPIINodePostprocessor does mask named entities such as person, location, and organization names, replacing each with its entity type and a numeric suffix (the suffixes in the output above look like text positions rather than running counts). However, it is inadequate for text containing highly sensitive values that are not named entities. Having seen how the NERPIINodePostprocessor works and where it falls short, let’s assess Presidio on the same text. We’ll explore Presidio’s own functionality first and then use Llama Index’s Presidio integration.
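To illustrate why NER alone misses these values, here is a minimal sketch of pattern-based detection using only the standard library. The regular expressions, the labels, and the helper name find_pattern_pii are assumptions made for illustration; they are far looser than what a production recognizer would use.

```python
import ipaddress
import re

# Illustrative patterns only; real recognizers are stricter.
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_pattern_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for card-like numbers and IPv4 addresses."""
    found = [("CARD_LIKE", m.group()) for m in CARD_RE.finditer(text)]
    for m in IPV4_RE.finditer(text):
        try:
            ipaddress.IPv4Address(m.group())  # rejects octets above 255
        except ValueError:
            continue
        found.append(("IP_ADDRESS", m.group()))
    return found


sample = (
    "the Mastercard, reads 5300-1234-5678-9000. "
    "my IP (203.0.113.5) seems to be barred."
)
print(find_pattern_pii(sample))
```

Values like these have a fixed shape but no linguistic context, which is why a pattern matcher finds them while a named-entity model does not.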

Importing Essential Packages for Presidio Integration

To begin, import the required packages: the AnalyzerEngine and AnonymizerEngine from Presidio, and the PresidioPIINodePostprocessor, which is Llama Index’s integration of Presidio.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor

Initializing and Analyzing Text with the Analyzer Engine

Next, initialize the Analyzer Engine with the list of supported languages, here a list containing ‘en’ for English. This tells Presidio which languages the engine should be prepared to handle; the language of a given text is passed explicitly when calling analyze. Then use the analyzer instance to analyze the text.

analyzer = AnalyzerEngine(supported_languages=["en"])

results = analyzer.analyze(text=text, language="en")

Each result describes one detected PII entity: its entity type, its start and end index in the string, and a confidence score.
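The sketch below shows how such results can be read back against the text. Span is a hypothetical stand-in that only mirrors the four fields described above, and the scores are made up; with Presidio, a similar loop would presumably run over the results list instead.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in mirroring the fields described above."""
    entity_type: str
    start: int
    end: int
    score: float


def describe(text: str, spans: list[Span]) -> list[str]:
    """Format each span together with the exact substring it covers."""
    ordered = sorted(spans, key=lambda s: s.start)
    return [
        f"{s.entity_type:<12} [{s.start}:{s.end}] "
        f"score={s.score:.2f} -> {text[s.start:s.end]!r}"
        for s in ordered
    ]


demo_text = "Call me Max Turner, IP 203.0.113.5"
demo_spans = [
    Span("IP_ADDRESS", 23, 34, 0.95),  # made-up score
    Span("PERSON", 8, 18, 0.85),       # made-up score
]
for line in describe(demo_text, demo_spans):
    print(line)
```

Slicing the text with the start and end index is a quick way to check what a detector actually matched, which helps when spotting partial matches.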

Initializing the Anonymizer Engine

After initializing the Analyzer Engine, proceed to initialize the Anonymizer Engine. This component will anonymize the original text based on the results obtained from the Analyzer Engine.

engine = AnonymizerEngine()

new_text = engine.anonymize(text=text, analyzer_results=results)
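Conceptually, this step is span replacement. The following sketch is not Presidio's implementation; it is a simplified illustration that assumes non-overlapping spans given as hypothetical (start, end, entity_type) tuples.

```python
def mask_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with <ENTITY_TYPE>."""
    # Work from the last span to the first so that earlier offsets stay
    # valid even though placeholders differ in length from the originals.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


demo = "Call me Max Turner, IP 203.0.113.5"
print(mask_spans(demo, [(8, 18, "PERSON"), (23, 34, "IP_ADDRESS")]))
```

Replacing from right to left is the key design choice here: a left-to-right pass would shift every later offset as soon as the first placeholder is inserted.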

Below is the output from the anonymizer engine, showcasing the original text with masked PII entities.

pprint(new_text.text)

# OUTPUT
#  "Hi there! You can call me <PERSON>. Reach out at <EMAIL_ADDRESS>, and you'll "
#  'find me strolling the streets of <LOCATION>. My plastic friend, the '
#  "<IN_PAN>, reads <IN_PAN>5678-9000. Ever vibed at a gig by <PERSON>? I'm "
#  "curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON> and "
#  '<PERSON>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL>.'
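Two values in the output above stand out: the bank details are not masked at all, and the card number is only partly covered. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that checksum-validating recognizers reject fabricated numbers that fail their checksum. The sketch below implements the Luhn check for card numbers and the mod-97 check for IBANs (country-specific length rules omitted) and applies both to the sample values.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum used for payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97 check; country-specific length rules are omitted."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps digits to 0-9 and letters A-Z to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("5300-1234-5678-9000"))   # card number from the sample text
print(iban_valid("AT611904300235473201"))  # bank details from the sample text
```

Both calls print False, so neither sample value passes its checksum. When testing a PII detector, it is therefore worth using fabricated values that are still checksum-valid.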

Analyzing PII Masking with Presidio

Presidio masks most of the PII entities, replacing each with its entity type enclosed in ‘<’ and ‘>’. The result is not perfect, though: the IBAN is left untouched, ‘Mastercard’ is mislabeled as <IN_PAN>, and the card number is only partially masked. The masking also lacks unique identifiers for individual entities. This is where the Llama Index integration helps. Its Presidio postprocessor returns the masked text with a numbered placeholder per entity and also provides a deanonymizer map for restoring the original values. Let’s explore how to use these features.

First create a TextNode object using the input text.

text_node = TextNode(text=text)

Next, create an instance of PresidioPIINodePostprocessor and run the postprocessor on the TextNode.

processor = PresidioPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Finally, we get the masked text from the anonymizer along with the deanonymizer map.

pprint(new_nodes[0].node.get_content())

# OUTPUT
#  'Hi there! You can call me <PERSON_5>. Reach out at <EMAIL_ADDRESS_1>, and '
#  "you'll find me strolling the streets of <LOCATION_1>. My plastic friend, the "
#  '<IN_PAN_2>, reads <IN_PAN_1>5678-9000. Ever vibed at a gig by <PERSON_4>? '
#  "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON_3> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON_2> and "
#  '<PERSON_1>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS_1>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL_1>.'


pprint(new_nodes[0].node.metadata)

# OUTPUT
# {'__pii_node_info__': {'<EMAIL_ADDRESS_1>': '[email protected]',
#                        '<IN_PAN_1>': '5300-1234-',
#                        '<IN_PAN_2>': 'Mastercard',
#                        '<IP_ADDRESS_1>': '203.0.113.5',
#                        '<LOCATION_1>': 'Vienna',
#                        '<PERSON_1>': 'Elisabeth Baumgartner',
#                        '<PERSON_2>': 'Leopold Turner',
#                        '<PERSON_3>': 'Turner',
#                        '<PERSON_4>': 'Zsofia Kovacs',
#                        '<PERSON_5>': 'Max Turner',
#                        '<URL_1>': 'MegaMovieMoments.fi'}}

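The masked text generated by PresidioPIINodePostprocessor covers the same entities as the plain Presidio run, so the same gaps remain (the IBAN and part of the card number are still visible), but each placeholder now carries its entity type and a running number. The deanonymizer map stored in the node metadata makes it possible to restore the original values later.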
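As a minimal sketch of that deanonymization step, the function below substitutes placeholders back using a map shaped like the `__pii_node_info__` dictionary shown above. The helper name deanonymize and the shortened example text are assumptions for illustration, not part of the Llama Index API.

```python
import re


def deanonymize(masked_text: str, pii_map: dict[str, str]) -> str:
    """Substitute every placeholder with its original value in a single pass."""
    if not pii_map:
        return masked_text
    # A single regex pass means restored values are never re-scanned, so an
    # original value that happens to look like a placeholder stays untouched.
    keys = sorted(pii_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: pii_map[m.group()], masked_text)


pii_map = {
    "<PERSON_5>": "Max Turner",
    "<LOCATION_1>": "Vienna",
    "<IP_ADDRESS_1>": "203.0.113.5",
}
masked = "You can call me <PERSON_5>, strolling the streets of <LOCATION_1>."
print(deanonymize(masked, pii_map))
```

Because each placeholder is unique within a node, the mapping is unambiguous; that is the practical benefit of numbered placeholders over the bare <PERSON> tags.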

Applications and Limitations

By leveraging the PresidioPIINodePostprocessor tool, we can seamlessly anonymize information within our RAG pipeline, prioritizing user data privacy. Within the RAG pipeline, it can serve as a data anonymizer during data ingestion, effectively masking sensitive information. Similarly, in the query pipeline, it can function as a deanonymizer, allowing authenticated users to access sensitive information while maintaining privacy. The deanonymizer map can be securely stored in a protected location, ensuring the confidentiality of sensitive data throughout the process.

The PII anonymizer tool finds utility in RAG pipelines dealing with financial documents or sensitive user/organization information, necessitating protection from unidentified or unauthorized access. It ensures secure storage of anonymized document contents within the vector store, even in the event of a data breach. Additionally, it proves valuable in RAG pipelines involving organization or personal emails, where sensitive data like addresses, password change URLs, and OTPs are prevalent, necessitating ingestion in an anonymized state.
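One practical detail follows from the earlier output: the deanonymizer map lives in the node metadata under `__pii_node_info__`. If metadata is persisted alongside the node text (an assumption about the ingestion setup), the map would travel into the vector store with it. The sketch below separates the two before indexing; the function names, the node ID, and the in-memory vault are hypothetical stand-ins for a real protected store and access check.

```python
PII_KEY = "__pii_node_info__"  # metadata key shown in the output above


def split_pii_map(node_id: str, metadata: dict, vault: dict) -> dict:
    """Move the deanonymizer map into `vault`; return metadata safe to index."""
    safe = {k: v for k, v in metadata.items() if k != PII_KEY}
    if PII_KEY in metadata:
        vault[node_id] = dict(metadata[PII_KEY])
    return safe


def lookup_map(node_id: str, vault: dict, authorized: bool) -> dict:
    """Hand the map back only to authorized callers."""
    if not authorized:
        raise PermissionError("deanonymization requires authorization")
    return vault.get(node_id, {})


vault: dict[str, dict[str, str]] = {}  # stand-in for a protected store
meta = {"source": "email.txt", PII_KEY: {"<PERSON_1>": "Elisabeth Baumgartner"}}
safe_meta = split_pii_map("node-1", meta, vault)
print(safe_meta)
```

At query time, only a caller that passes the authorization check gets the map needed to restore the original values.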

Limitations

While PII detection is useful in RAG pipelines, integrating it comes with some limitations:

Adding PII detection and masking can introduce additional processing time to the RAG pipeline, which may impact the overall performance and latency of the system, especially with large datasets or when real-time processing is required.
No PII detection tool is perfect; there can be instances of false positives, where non-PII data is mistakenly masked, or false negatives, where actual PII is not detected. Both scenarios can have implications for user experience and data protection efficacy.
Presidio may struggle with context and nuance across different languages, which can reduce its effectiveness at identifying PII in multilingual datasets.
While the PII anonymization tool can mask sensitive information accurately, the initial ingestion of data still requires careful handling. If a breach occurs before the data is anonymized, sensitive information could be exposed.
In cases where anonymization needs to be reversible, maintaining secure and controlled access to deanonymization keys or maps is critical, and failure to do so could compromise the integrity of the anonymization process.
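The false-positive trade-off above can be made concrete with a small filtering sketch. Detection is a hypothetical record type, the scores are made up, and the threshold and allow-list are illustrative knobs rather than a description of any specific library's options.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detector output: label, matched text, confidence."""
    entity_type: str
    text: str
    score: float


def filter_detections(detections, min_score=0.5, allow_list=()):
    """Drop low-confidence hits and terms explicitly marked as non-PII."""
    allowed = {term.lower() for term in allow_list}
    return [
        d for d in detections
        if d.score >= min_score and d.text.lower() not in allowed
    ]


hits = [
    Detection("PERSON", "Max Turner", 0.85),  # scores are made up
    Detection("IN_PAN", "Mastercard", 0.60),
    Detection("IN_PAN", "5300-1234-", 0.05),
]
kept = filter_detections(hits, min_score=0.4, allow_list=["Mastercard"])
print([d.text for d in kept])
```

Raising the threshold removes spurious matches but also risks dropping genuine PII, so any such setting should be tuned against representative data.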

Conclusion

In conclusion, the incorporation of PII detection and masking tools like Presidio into RAG pipelines marks a notable stride in AI’s capacity to handle sensitive data while upholding individual privacy. Through the utilization of advanced techniques and customizable features, Presidio elevates the security and adaptability of text generation, meeting the escalating need for data privacy in the digital era. Despite potential challenges such as latency and accuracy, the advantages of safeguarding user data with sophisticated anonymization tools are undeniable, positioning it as a crucial element for responsible AI development and deployment.

Key Takeaways

With the increasing use of AI and big data, the need to protect Personally Identifiable Information (PII) in any system that processes user data is critical.
Retrieval Augmented Generation (RAG) systems, which combine information retrieval with language generation, can potentially expose PII. Therefore, incorporating PII detection and masking mechanisms is essential to maintain privacy standards.
Microsoft’s Presidio offers robust PII detection and anonymization capabilities, making it a suitable choice for integrating into RAG pipelines. It provides predefined and customizable PII detectors, leveraging NER, Regular Expressions, and checksum.
Presidio is preferred over basic NER PII post-processing tools due to its sophisticated anonymization features, flexibility, and higher accuracy in detecting a wide range of PII entities.
The PII anonymization tool is particularly useful in RAG pipelines dealing with financial documents, sensitive organizational data, and emails, ensuring that private information is not exposed to unauthorized users.



