Doctran and LLMs: A Powerful Duo for Analyzing Consumer Complaints

Vikas Verma 23 Oct, 2023


Introduction

In today’s highly competitive market, businesses strive to understand and resolve consumer complaints effectively. Consumer complaints can shed light on a wide range of issues, from product defects and poor customer service to billing errors and safety concerns. They play a crucial role in the feedback loop between businesses and their customers regarding products, services, and experiences. Analysing and understanding these complaints can provide valuable insights into product or service improvements, customer satisfaction, and overall business growth. In this article, we will explore how to leverage the Doctran Python library to analyse consumer complaints, extract insights, and make data-driven decisions.

Learning Objectives

In this article, you will:

Learn about the doctran Python library and its key features
Learn about the role of doctran and LLMs in document transformation and analysis
Explore six types of document transformations supported by doctran, including extraction, redaction, interrogation, refinement, summarization, and translation
Gain an overall understanding of converting raw textual data from consumer complaints into actionable insights
Understand doctran’s document data structure and the ExtractProperty class for defining a schema to extract properties

This article was published as a part of the Data Science Blogathon.

Doctran
- Installation
- Loading the Complaint as a Doctran document
DocTransformers
Frequently Asked Questions

Doctran

Doctran is a Python library designed for document transformation and analysis. It provides a set of functions to pre-process text data, extract key information, categorize/classify, interrogate, summarize the information, and translate text into other languages. Doctran uses LLMs (Large Language Models), such as OpenAI’s GPT-based models, together with open-source NLP libraries to dissect textual data.

It supports the following six types of document transformations:

Extract: Extracts useful features/properties from a document.
Redact: Removes Personally Identifiable Information (PII) such as names, email IDs, and phone numbers from a document before the data is sent to OpenAI. Internally, it uses the spaCy library to remove the sensitive information.
Interrogate: Converts the document into a question-and-answer format.
Refine: Eliminates any content from a document that does not pertain to a predefined set of topics.
Summarize: Represents the document as a concise, comprehensive, and meaningful summary.
Translate: Translates the document into other languages.
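
To make the idea of redaction concrete, here is a minimal standard-library sketch that masks email addresses and phone-like numbers before the text leaves the machine. It only illustrates the concept: the regular expressions, the redact_pii name, and the placeholder labels are assumptions made for this example, and this is not how doctran (which relies on spaCy, as mentioned in the list) implements the Redact transformation.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs far more care.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_PATTERN.sub("<EMAIL>", text)
    text = PHONE_PATTERN.sub("<PHONE>", text)
    return text


sample = "Contact me at jane.doe@example.com or +91 98765 43210."
print(redact_pii(sample))
```

Names and postal addresses cannot be caught reliably with patterns like these, which is one reason a library would lean on an NLP model for entity recognition instead.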

The integration is also available in the LangChain framework, inside its document_transformers module. LangChain is a framework for building LLM-powered applications.

LangChain provides the flexibility to explore and use a wide range of open-source and closed-source LLMs. It makes it straightforward to connect to diverse external data sources such as PDFs, text files, Excel spreadsheets, and PPTs. It also lets you experiment with different prompts, engage in prompt engineering, leverage built-in chains and agents, and more.

Within the document_transformers module of LangChain, there are three implementations: DoctranPropertyExtractor, DoctranQATransformer, and DoctranTextTranslator. These are used for the Extract, Interrogate, and Translate document transformations, respectively.

Installation

Doctran can be easily installed using the pip command.

pip install doctran

Now that we know what the doctran library is, let’s explore the different types of document transformations it offers, using the consumer complaint below, enclosed in triple backticks (```).

```

November 26, 2021

The Manager

Customer Service Department

Taurus Shop

New Delhi – 110023

Subject: Complaint about defective ‘VIP’ washing machine

Dear Sir,

I had purchased an automatic washing machine on 15 July 2022, model no. G 24 and the invoice no. is 1598.

Last week, the machine stopped working abruptly and has not been working since then despite all our efforts. The machine stops running after the rinsing process is completed, causing a lot of problems. Moreover, the machine since the last day or so has also started making loud noises, creating inconvenience for us.

Please send your technician to repair it and if needed get it replaced within the following week.

Hoping for an early response

Yours truly

```

Loading the Complaint as a Doctran document

To perform document transformation using doctran, we first need to convert the raw text into a doctran document. A doctran document is a fundamental data type that is optimized for vector search. It represents a piece of unstructured data and consists of raw content and associated metadata.

Instantiate a Doctran object by passing your OpenAI API key in the openai_api_key parameter. Next, parse the raw content into a doctran document by calling the parse() method on the Doctran object.

sample_complain = """

November 26, 2021

The Manager
Customer Service Department
Taurus Shop
New Delhi – 110023

Subject: Complaint about defective ‘VIP’ washing machine


Dear Sir,

I had purchased an automatic washing machine on 15 July 2022, 
model no. G 24 and the invoice no. is 1598.

Last week, the machine stopped working abruptly and has not been working 
since then despite all our efforts. 
The machine stops running after the rinsing process is completed, 
causing a lot of problems. 
Moreover, the machine since the last day or so has also started making loud noises, 
creating inconvenience for us.

Please send your technician to repair it and if needed get it replaced within the following week.

Hoping for an early response

Yours truly
"""

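import os
from doctran import Doctran

# Read the API key from the environment instead of hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content=sample_complain)
print(document.raw_content)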


DocTransformers

1. Extract

One of the primary functions of doctran is to extract key properties from a document. Internally, it makes use of OpenAI function calling to extract properties (data points) from a document. It uses the OpenAI GPT-4 model with a token limit of 8,000 tokens.

GPT-4, short for Generative Pre-trained Transformer 4, is a multimodal large language model developed by OpenAI. In comparison to its predecessors, GPT-4 demonstrates an enhanced capability to tackle complex tasks. Additionally, it can accept visual inputs (such as images, charts, and memes) alongside text. The model has achieved human-level performance on a variety of professional and academic benchmarks, including the Uniform Bar Exam.

We need to define a schema by instantiating the ExtractProperty class for each property that we want to extract. The schema comprises several key elements: a property name, a description, a data type, a list of selectable values, and a required flag, which is a boolean indicator.

Here, we have specified four properties – Category, Sentiment, Aggressiveness and Language.

from doctran import ExtractProperty

properties = [
    ExtractProperty(
        name="Category",
        description="What type of consumer complaint this is",
        type="string",
        enum=["Product or Service", "Wait Time", "Delivery", "Communication Gap", "Personnel"],
        required=True,
    ),
    ExtractProperty(
        name="Sentiment",
        description="Assess the polarity/sentiment",
        type="string",
        enum=["Positive", "Negative", "Neutral"],
        required=True,
    ),
    ExtractProperty(
        name="Aggressiveness",
        description="""describes how aggressive the complaint is, 
        the higher the number the more aggressive""",
        type="number",
        enum=[1, 2, 3, 4, 5],
        required=True,
    ),
    ExtractProperty(
        name="Language",
        description="source language",
        type="string",
        enum=["English", "Hindi", "Spanish", "Italian", "German"],
        required=True,
    ),
]
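
Because extraction relies on OpenAI function calling, a schema like this ultimately has to be expressed as JSON Schema, which is the format function calling uses to describe parameters. The sketch below shows one plausible mapping using plain dictionaries as stand-ins for two of the properties. The build_function_parameters helper is hypothetical and written only for illustration; it is an assumption about the general shape of such a conversion, not doctran's actual internal code.

```python
import json

# Plain-dict stand-ins for two of the properties defined above.
property_specs = [
    {
        "name": "Sentiment",
        "description": "Assess the polarity/sentiment",
        "type": "string",
        "enum": ["Positive", "Negative", "Neutral"],
        "required": True,
    },
    {
        "name": "Aggressiveness",
        "description": "How aggressive the complaint is (higher is more aggressive)",
        "type": "number",
        "enum": [1, 2, 3, 4, 5],
        "required": True,
    },
]


def build_function_parameters(specs):
    """Hypothetical helper: fold property specs into one JSON Schema object."""
    schema = {"type": "object", "properties": {}, "required": []}
    for spec in specs:
        field_schema = {"type": spec["type"], "description": spec["description"]}
        if spec.get("enum"):
            # Restrict the model to the listed values
            field_schema["enum"] = spec["enum"]
        schema["properties"][spec["name"]] = field_schema
        if spec.get("required"):
            schema["required"].append(spec["name"])
    return schema


parameters = build_function_parameters(property_specs)
print(json.dumps(parameters, indent=2))
```

Seen this way, the enum list is what keeps the model's answers within a fixed set of labels, and the required flag decides whether a property may be omitted from the result.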

To retrieve the properties, call the extract() function on the document, passing the properties as a parameter.

extracted_doc = await document.extract(properties=properties).execute()

The extract operation returns a new document, with the extracted values available in its extracted_properties attribute.

print(extracted_doc.extracted_properties)

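
Once properties have been extracted for many complaints, they can be aggregated into simple business metrics. The sketch below assumes each complaint yields a dictionary keyed by the property names defined earlier. The records are hand-written placeholders rather than real doctran output, and the escalation threshold of 4 is an arbitrary choice for this example.

```python
from collections import Counter
from statistics import mean

# Hand-written placeholder records, NOT real doctran output.
extracted_records = [
    {"Category": "Product or Service", "Sentiment": "Negative", "Aggressiveness": 2, "Language": "English"},
    {"Category": "Delivery", "Sentiment": "Negative", "Aggressiveness": 4, "Language": "English"},
    {"Category": "Product or Service", "Sentiment": "Neutral", "Aggressiveness": 1, "Language": "Hindi"},
]

# How often each complaint category occurs
category_counts = Counter(record["Category"] for record in extracted_records)

# Average aggressiveness score across all complaints
average_aggressiveness = mean(record["Aggressiveness"] for record in extracted_records)

# Complaints at or above an assumed escalation threshold
escalations = [record for record in extracted_records if record["Aggressiveness"] >= 4]

print(category_counts.most_common())
print(average_aggressiveness)
print(len(escalations))
```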

2. Interrogation

Doctran allows us to convert the content within a document into a Q&A format. User queries are typically phrased as questions. So, to improve search results when using a vector database, it can be helpful to transform the information into questions. Creating indexes from these questions allows for better context retrieval compared to indexing the original text.

To interrogate the document, use the built-in interrogate() function. It returns a new document, and the generated set of Q&A pairs is available under the questions_and_answers key of its extracted_properties attribute.

interrogated_doc = await document.interrogate().execute()
print(interrogated_doc.extracted_properties['questions_and_answers'])

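
To illustrate why indexing questions can help retrieval, here is a toy sketch that matches a user query against the questions and returns the paired answer. It assumes the Q&A pairs are dictionaries with "question" and "answer" keys, and the pairs themselves are hand-written placeholders based on the sample complaint, not real doctran output. Plain word overlap stands in for the embedding similarity that a vector database would use.

```python
import re

# Hand-written placeholder pairs; the "question"/"answer" keys are an assumed shape.
qa_pairs = [
    {"question": "What product is the complaint about?",
     "answer": "An automatic washing machine, model no. G 24."},
    {"question": "What problem does the customer report?",
     "answer": "The machine stops after rinsing and makes loud noises."},
    {"question": "What does the customer want the shop to do?",
     "answer": "Send a technician to repair or replace the machine."},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_answer(query, pairs):
    """Return the answer whose question shares the most words with the query."""
    query_tokens = tokenize(query)
    best = max(pairs, key=lambda pair: len(query_tokens & tokenize(pair["question"])))
    return best["answer"]


print(best_answer("what problem is reported by the customer", qa_pairs))
```

Because the query is itself phrased as a question, it overlaps more with the stored questions than it would with the raw complaint text, which is the intuition behind indexing the generated questions.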

3. Summarization

Using doctran, we can also generate a concise and meaningful summary of the original text. Invoke the summarize() function to summarize the document. Additionally, specify the token_limit parameter to configure the size of the summary.

summarized_doc = await document.summarize(token_limit=30).execute()
print(summarized_doc.transformed_content)


4. Translation

Translating documents into other languages can be helpful especially when users are expected to query the knowledge base in different languages, or when state-of-the-art embedding models are not available for a given language.

For our consumer complaints use case, language translation can be useful for global businesses with multilingual customer bases. Using the built-in translate() function, we can translate the information into other languages such as Hindi, Spanish, Italian, and German.

translated_doc = await document.translate(language="hindi").execute()
print(translated_doc.transformed_content)

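
The snippets in this section use await at the top level, which works in environments such as Jupyter notebooks but not in a plain Python script. The sketch below shows the usual pattern for a script: wrap the calls in an async function and start it with asyncio.run(). The fake_transform coroutine is a stand-in written for this example so that the code runs without an API key; in practice it would be replaced by the awaited doctran calls shown above.

```python
import asyncio


async def fake_transform(name: str) -> str:
    """Stand-in for an awaited doctran call such as document.summarize().execute()."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return f"{name} done"


async def main() -> list:
    # Independent transformations can be awaited together with asyncio.gather,
    # which returns the results in the order the coroutines were passed in.
    return await asyncio.gather(
        fake_transform("summarize"),
        fake_transform("translate"),
    )


results = asyncio.run(main())
print(results)
```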

Conclusion

In the era of data-driven decision-making, consumer complaint analysis is a vital process that can lead to improved products and services and ultimately result in higher customer satisfaction. Using LLMs and advanced NLP tools, we can convert raw textual data into actionable insights that drive business growth and improvement. In this article, we discussed doctran and the different types of document transformations it supports, using a consumer complaint as a running example.

Key Takeaways

Consumer complaints are not just grievances but also valuable sources of feedback that can provide crucial insights for businesses.
The doctran Python library, along with Large Language Models (LLMs) like GPT-4, offers a powerful toolset for transforming and analyzing documents. It supports various transformations such as extraction, redaction, interrogation, summarization, and translation.
Doctran’s extraction capabilities using OpenAI’s GPT-4 model can help businesses extract key properties from documents.
Converting document content into a question-and-answer format using doctran’s interrogation feature improves context retrieval. This approach is valuable for building effective search indexes and facilitating better search results.
Businesses with a global customer base can benefit from doctran’s language translation capabilities, making information accessible in multiple languages. Additionally, it provides the ability to generate concise and meaningful summaries of textual content.

Frequently Asked Questions

Q1: What is the main purpose of the Doctran Python library?

A: The primary purpose of the doctran Python library is to perform document transformation and analysis. It offers a set of functions to pre-process text data, extract valuable information, categorize and classify content, and translate text into different languages. It uses Large Language Models (LLMs) like OpenAI’s GPT-based models to dissect textual data.

Q2: How can you use Doctran to extract key properties from documents, and what are some examples of the properties it can extract?

A: Doctran can extract key properties from documents by using OpenAI’s GPT-4 model. These properties are defined in a schema and can be retrieved using the extract() function. Some examples are extracting category, sentiment, aggressiveness, language from the raw text.

Q3: What benefits does converting document content into a question-and-answer format provide, and how is this achieved using Doctran?

A: Converting document content into a question-and-answer format using Doctran’s interrogation feature improves information retrieval. It allows for better context retrieval compared to indexing the original text, making it more suitable for search engines. The built-in interrogate() function transforms the document into a Q&A format, enhancing search results.

Q4: Why is language translation important in consumer complaint analysis, and how does Doctran support this feature?

A: Language translation is crucial in consumer complaint analysis, particularly for businesses with multilingual customer bases. This feature ensures that information is accessible to a global audience. Doctran supports language translation using the built-in translate() function, enabling documents to be translated into various languages such as Hindi, Spanish, Italian, German, and more.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
