
Complete tutorial on Text Classification using Conditional Random Fields Model (in Python)

Introduction

The amount of text data being generated in the world is staggering. Google processes more than 40,000 searches EVERY second! According to a Forbes report, every single minute we send 16 million text messages and post 510,000 comments on Facebook. For a layman, it is difficult to even grasp the sheer magnitude of data out there.

News sites and other online media alone generate tons of text content on an hourly basis. Analyzing patterns in that data can become daunting if you don’t have the right tools. Here we will discuss one such tool: entity recognition using a model called Conditional Random Fields (CRF).

This article explains the concept and Python implementation of conditional random fields on a self-annotated dataset. This is a really fun concept and I’m sure you’ll enjoy taking this ride with me!

 

Table of contents

  1. What is Entity Recognition?
  2. Case Study Objective and Understanding Different Approaches
  3. Formulating Conditional Random Fields (CRFs)
  4. Annotating Training Data
    • Annotations using GATE
  5. Building and Training a CRF Module in Python

 

What is Entity Recognition?

Entity recognition has seen a recent surge in adoption alongside the growing interest in Natural Language Processing (NLP). An entity can generally be defined as a part of text that is of interest to the data scientist or the business. Examples of frequently extracted entities are names of people, addresses, account numbers, locations, etc. These are only simple examples and one could come up with one’s own entity for the problem at hand.

To take a simple application of entity recognition, if there’s any text with “London” in the dataset, the algorithm would automatically categorize or classify that as a location (you must be getting a general idea of where I’m going with this).

Let’s take a simple case study to understand our topic in a better way.

 

Case Study Objective & Understanding Different Approaches

Suppose that you are part of an analytics team in an insurance company where each day, the claims team receives thousands of emails from customers regarding their claims. The claims operations team goes through each email and updates an online form with the details before acting on them.


You are asked to work with the IT team to automate the process of pre-populating the online form. For this task, the analytics team needs to build a custom entity recognition algorithm.

To identify entities in text, one must be able to identify the pattern. For example, if we need to identify the claim number, we can look at the words around it such as “my id is” or “my number is”, etc. Let us examine a few approaches mentioned below for identifying the patterns.

  1. Regular expressions: Regular expressions (RegEx) are a form of finite state automaton. They are very helpful in identifying patterns that follow a certain structure. For example, email IDs, phone numbers, etc. can be identified well using RegEx. However, the downside of this approach is that one needs to know in advance all the exact words that can occur before the claim number. This is not a learning approach, but rather a brute force one.
  2. Hidden Markov Model (HMM): This is a sequence modelling algorithm that identifies and learns the pattern. Although HMM considers the future observations around the entities for learning a pattern, it assumes that the features are independent of each other. This approach is better than regular expressions as we do not need to model the exact set of word(s). But in terms of performance, it is not known to be the best method for entity recognition.
  3. MaxEnt Markov Model (MEMM): This is also a sequence modelling algorithm. It does not assume that features are independent of each other, but it also does not consider future observations for learning the pattern. In terms of performance, it is not known to be the best method for entity recognition either.
  4. Conditional Random Fields (CRF): This is also a sequence modelling algorithm. It does not assume that features are independent of each other, and it also considers future observations while learning a pattern. This combines the best of both HMM and MEMM. In terms of performance, it is considered to be the best of these methods for entity recognition problems.
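To make the limitation of the first approach concrete, here is a minimal sketch of a regular expression for the claim number. The cue phrases ("my id is", "my number is") come from the example above; the pattern name, the helper function and the assumed shape of a claim number (letters and digits) are illustrative assumptions.

```python
import re

# Hypothetical pattern: a known cue phrase followed by an alphanumeric token.
# Every cue phrase has to be listed by hand, which is the weakness described above.
CLAIM_ID_PATTERN = re.compile(
    r"\bmy\s+(?:id|number)\s+is\s+([A-Za-z0-9]+)", re.IGNORECASE
)

def find_claim_numbers(text):
    """Return every token that directly follows one of the known cue phrases."""
    return CLAIM_ID_PATTERN.findall(text)

email = "My id is abc123 and I claimed it on 1st January 2018."
print(find_claim_numbers(email))  # ['abc123']

# A phrasing the pattern has never seen is silently missed
print(find_claim_numbers("The claim reference for my policy: abc123"))  # []
```

The second call shows why a learning approach is attractive: the claim number is present, but the wording around it was not anticipated.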

 

Formulating Conditional Random Fields (CRF)

The bag of words (BoW) approach works well for many text classification problems. This approach assumes that the presence or absence of word(s) matters more than the sequence of the words. However, there are problems such as entity recognition and part-of-speech tagging where word sequences matter as much, if not more. Conditional Random Fields (CRF) comes to the rescue here as it uses word sequences as opposed to just words.

Let us now understand how CRF is formulated.

In the CRF formula, Y is the hidden state (for example, part of speech) and X is the observed variable (in our example this is the entity or other words around it).
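For reference, the linear-chain CRF is commonly written as follows. This is the standard form given by Sutton and McCallum (reference 1), with θ_k as the weights and f_k as the feature functions; T (sequence length) and K (number of features) are notation assumed here.

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, x_t) \right\},
\qquad
Z(x) = \sum_{y'} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y'_t, y'_{t-1}, x_t) \right\}
```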

Broadly speaking, there are 2 components to the CRF formula:

  1. Normalization: You may have observed that there are no probabilities on the right side of the equation where we have the weights and features. However, the output is expected to be a probability and hence there is a need for normalization. The normalization constant Z(x) is the sum of the unnormalized scores of all possible state sequences, so dividing by it makes the probabilities of all sequences add up to 1. You can find more details in the reference section of this article to understand how we arrived at this value.
  2. Weights and Features: This component can be thought of as the logistic regression formula with weights and the corresponding features. The weights are estimated by maximum likelihood estimation and the features are defined by us.
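The two components can be sketched on a tiny example. The code below scores every possible label sequence for four words by brute force and divides by the total to get probabilities. The feature names and weight values are hypothetical and set by hand purely for illustration; a real CRF learns the weights and uses dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["NA", "claim_number"]

# Hypothetical hand-set weights; a trained CRF would estimate these
# from annotated data by maximum likelihood.
WEIGHTS = {"after_is": 2.0, "has_digit": 1.5, "na_to_na": 0.5}

def features(y_t, y_prev, words, t):
    """Return the names of the feature functions that fire at position t."""
    fired = []
    if y_t == "claim_number" and t > 0 and words[t - 1] == "is":
        fired.append("after_is")
    if y_t == "claim_number" and any(ch.isdigit() for ch in words[t]):
        fired.append("has_digit")
    if y_t == "NA" and y_prev == "NA":
        fired.append("na_to_na")
    return fired

def unnormalized_score(words, labels):
    """exp of the summed weights of all features that fire along the sequence."""
    total = 0.0
    y_prev = None  # there is no previous label at the first position
    for t, y_t in enumerate(labels):
        total += sum(WEIGHTS[name] for name in features(y_t, y_prev, words, t))
        y_prev = y_t
    return math.exp(total)

def sequence_probabilities(words):
    """Brute-force p(y | x): score every label sequence, then divide by Z(x)."""
    candidates = list(itertools.product(LABELS, repeat=len(words)))
    scores = {labels: unnormalized_score(words, labels) for labels in candidates}
    z = sum(scores.values())  # the normalization constant Z(x)
    return {labels: score / z for labels, score in scores.items()}

words = ["my", "id", "is", "abc123"]
probs = sequence_probabilities(words)
best = max(probs, key=probs.get)
print(best)                 # the most probable label sequence
print(sum(probs.values()))  # probabilities over all sequences add up to 1
```

With these weights the most probable sequence labels only the last word as claim_number, because that is where both the "previous word is 'is'" and the "contains a digit" features fire.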

 

Annotating training data

Now that you are aware of the CRF model, let us curate the training data. The first step to doing this is annotation. Annotation is the process of tagging the word(s) with the corresponding tag. For simplicity, let us suppose that we only need 2 entities to populate the online form, namely the claimant name and the claim number.

The following is a sample email, shown as received. Such emails need to be annotated so that the CRF model can be trained. The annotated text needs to be in an XML format. Although you may choose to annotate the documents in your own way, I’ll walk you through the use of the GATE architecture to do the same.

 

Email received:

“Hi,

I am writing this email to claim my insurance amount. My id is abc123 and I claimed it on 1st January 2018. I did not receive any acknowledgement. Please help.

Thanks,

randomperson”

 

Annotated Email:

<document>Hi, I am writing this email to claim my insurance amount. My id is <claim_number>abc123</claim_number> and I claimed it on 1st January 2018. I did not receive any acknowledgement. Please help. Thanks, <claimant>randomperson</claimant></document>
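To see what the model will eventually consume, the annotated email can be turned into (token, label) pairs. This is a minimal sketch using only the standard library and whitespace splitting; the function name is hypothetical, and the walkthrough later in this article uses BeautifulSoup and NLTK tokenization instead.

```python
from xml.etree import ElementTree

annotated = (
    "<document>Hi, I am writing this email to claim my insurance amount. "
    "My id is <claim_number>abc123</claim_number> and I claimed it on 1st January 2018. "
    "I did not receive any acknowledgement. Please help. "
    "Thanks, <claimant>randomperson</claimant></document>"
)

def labelled_tokens(xml_text):
    """Split an annotated document into (token, label) pairs; untagged text gets 'NA'."""
    root = ElementTree.fromstring(xml_text)
    # Text before the first annotation
    pairs = [(tok, "NA") for tok in (root.text or "").split()]
    for child in root:
        # Text inside an annotation takes the tag name as its label
        pairs += [(tok, child.tag) for tok in (child.text or "").split()]
        # Text between this annotation and the next one is unlabelled
        pairs += [(tok, "NA") for tok in (child.tail or "").split()]
    return pairs

pairs = labelled_tokens(annotated)
print([pair for pair in pairs if pair[1] != "NA"])
# [('abc123', 'claim_number'), ('randomperson', 'claimant')]
```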

 

Annotations using GATE

Let us understand how to use the General Architecture for Text Engineering (GATE). Please follow the below steps to install GATE.

  • Install the GATE platform by executing the downloaded installer and following the installation steps
  • After installation, run the application executable
  • Once the application opens, load the emails one at a time into the language resources by right-clicking “Language Resources” > New > GATE Document. Give each email a name, set the encoding to “utf-8” so that we have no issues in Python, and select the email to be annotated by clicking on the icon in the sourceUrl section

 

  1. Open one email at a time and start the annotation exercise. There are 2 options for building annotations.
    a. Load an existing annotation XML into GATE and use it
    b. Create annotations on the fly and use them. In this article, we will demonstrate this approach.
  2. Click on the email in the Language Resources section to open it. Click on “Annotation Sets”, then select a word or words and hover the cursor over the selection for a couple of seconds. A pop-up window for annotation comes up; type the annotation in place of “_NEW_” and hit Enter to create the new annotation. Repeat this exercise for all the annotations in each email

 

 

  • Once all the training emails are annotated, create a corpus for ease of use by navigating to Language Resources > New > GATE Corpus
  • Give the new corpus a name for your own reference, click on the navigation icon and add each email that is loaded into the Language Resources
  • Save the corpus as inline XML in a folder on your machine by right-clicking on the corpus and navigating to “Inline XML (.xml)”
  • In the next pop-up window, select the annotation types that are pre-populated and remove them. Manually type your own annotations and add them in place of the pre-populated ones. Set the “includeFeatures” option to false by clicking on it and type “document” into the rootElement box. Once all these changes are made, save the file to a folder on your machine by clicking on the “Save To” icon.




  • The above process will save all the annotated emails in one folder.

 

 

Building and Training a CRF Module in Python

  • First install the python-crfsuite package, which is imported in Python as pycrfsuite. For pip installation, the command is “pip install python-crfsuite” and for conda installation, the command is “conda install -c conda-forge python-crfsuite”
  • If the above installation doesn’t work, download the relevant python-crfsuite build from https://anaconda.org/conda-forge/python-crfsuite/files. If you have a 64-bit Windows machine with Python 2.7, then use this file: win-64/python-crfsuite-0.9.2-py27_vc9_0.tar.bz2
  • Extract the pycrfsuite and python_crfsuite-0.9.2-py2.7.egg-info files and place them in the folder where the rest of the packages are present. For example, if you use Anaconda, then these files can be placed in the anaconda>lib>site-packages folder

Once the installation is complete, you are ready to train and build your own CRF module. Let’s do this!

#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report

 

Let’s define and build a few functions.

#this function appends all annotated files
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # encoding="unicode" returns str rather than bytes, so the pieces can be concatenated
        temp = ElementTree.tostring(data, encoding="unicode")
        new_data += temp
    return new_data

#this function removes special characters and punctuations
def remov_punct(withpunct):
    # raw string so that the backslash is kept as a literal character
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

# functions for extracting features in documents
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

def get_labels(doc):
    return [label for (token, postag, label) in doc]

 

Now we will import the annotated training data.

files_path = "D:/Annotated/"

allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

#identify the tagged element
docs = []

for d in soup.find_all("document"):
    sents = []  # reset for every document so that tokens do not leak across emails
    for wrd in d.contents:
        # Plain text nodes have no tag name and get the label 'NA';
        # annotated elements use their tag name (e.g. claim_number) as the label
        if wrd.name is None:
            label, text = 'NA', str(wrd)
        else:
            label, text = wrd.name, wrd.get_text()
        withoutpunct = remov_punct(text)
        for token in word_tokenize(withoutpunct):
            sents.append((token, label))
    docs.append(sents) #appends all the individual documents into one list

 

Generate features. These are the default features that the NER algorithm in nltk uses. You can modify them for your own use case.

data = []

for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]    
    tagged = nltk.pos_tag(tokens)    
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features

 

Now we’ll build the features and split the data into training and test sets.

X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

 

Let’s test our model by loading the trained model file (crf.model) and tagging the test sequences.

tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]

 

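You can inspect the predictions for any test document by selecting the corresponding row number “i”.

i = 0

# The second feature of every token is 'word.lower=<token>', so splitting on "=" recovers the word
for pred_label, word in zip(y_pred[i], [feats[1].split("=")[1] for feats in X_test[i]]):
    print("%s (%s)" % (word, pred_label))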

 

Check the performance of the model.

# Create a mapping of labels to indices (one index per label, in the same order as target_names below)
labels = {"claim_number": 0, "claimant": 1, "NA": 2}

# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

 

Print out the classification report. Based on the model performance, build better features to improve the performance.

# Passing labels explicitly keeps the report aligned with target_names
# even if a label does not occur in the test set
print(classification_report(
    truths, predictions,
    labels=[0, 1, 2],
    target_names=["claim_number", "claimant", "NA"]))
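To make the numbers in such a report concrete, per-label precision and recall can also be computed by hand. The sketch below uses hypothetical tag sequences rather than real model output, and the function name is an assumption.

```python
def per_label_scores(y_true, y_pred):
    """Token-level precision and recall for each label, from nested tag sequences."""
    true_flat = [tag for row in y_true for tag in row]
    pred_flat = [tag for row in y_pred for tag in row]
    scores = {}
    for label in sorted(set(true_flat) | set(pred_flat)):
        # Tokens where the prediction and the truth both equal this label
        hits = sum(1 for t, p in zip(true_flat, pred_flat) if t == label and p == label)
        predicted = pred_flat.count(label)
        actual = true_flat.count(label)
        precision = hits / predicted if predicted else 0.0
        recall = hits / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

# Hypothetical truth and predictions for two short documents
toy_true = [["NA", "NA", "claim_number", "NA"], ["NA", "claimant"]]
toy_pred = [["NA", "NA", "claim_number", "claim_number"], ["NA", "NA"]]

for label, (precision, recall) in per_label_scores(toy_true, toy_pred).items():
    print("%s precision=%.2f recall=%.2f" % (label, precision, recall))
```

In this toy case claim_number has perfect recall but a precision of 0.5, because one ordinary word was wrongly tagged as a claim number, while claimant is missed entirely.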

 

#predict new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")

docs = []

for d in soup_test.find_all("document"):
    sents = []  # tokens of the current document only
    for wrd in d.contents:
        # Plain text nodes have no tag name and get the label 'NA';
        # annotated elements use their tag name as the label
        if wrd.name is None:
            label, text = 'NA', str(wrd)
        else:
            label, text = wrd.name, wrd.get_text()
        withoutpunct = remov_punct(text)
        for token in word_tokenize(withoutpunct):
            sents.append((token, label))
    docs.append(sents) #appends all the individual documents into one list

data_test = []

for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]    
    tagged = nltk.pos_tag(tokens)    
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

# Let's check predicted data
i = 0
for pred_label, word in zip(newdata_pred[i], [feats[1].split("=")[1] for feats in data_test_feats[i]]):
    print("%s (%s)" % (word, pred_label))

By now, you will have understood how to annotate training data, how to use Python to train a CRF model, and finally how to identify entities in new text. Although this walkthrough uses only a basic set of features, you can come up with your own features to improve the accuracy of the model.

 

End Notes

To summarize, here are the key points that we have covered in this article:

  • Entities are parts of text that are of interest for the business problem at hand
  • The sequence of words or tokens matters in identifying entities
  • Pattern recognition approaches such as Regular Expressions or graph-based models such as Hidden Markov Model and Maximum Entropy Markov Model can help in identifying entities. However, Conditional Random Fields (CRF) is a popular and arguably a better candidate for entity recognition problems
  • CRF is an undirected graph-based model that considers not only the words that occur before the entity but also those after it
  • The training data can be annotated by using GATE architecture
  • The Python code provided helps in training a CRF model and extracting entities from text
  • In conclusion, this article should give you a good starting point for your business problem

 

References

  1. An Introduction to Conditional Random Fields by Charles Sutton & Andrew McCallum. (http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf).
  2. Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing by Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag). (http://people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdf).
  3. Performing Sequence Labelling using CRF in Python by Albert Au Yeung. (http://www.albertauyeung.com/post/python-sequence-labelling-with-crf/).
  4. Using GATE as an Annotation Tool by Tom Kenter, Diana Maynard. (https://gate.ac.uk/sale/am/annotationmanual-gate2.pdf)

 

About the Author

Sidharth Macherla – Independent Researcher, Natural Language Processing

Sidharth Macherla has over 12 years of experience in data science and his current area of focus is Natural Language Processing. He has worked across Banking, Insurance, Investment Research and Retail domains.