Shivam Bansal – Published On April 4, 2017 and Last Modified On June 25th, 2019

Introduction

Natural Language Processing is one of the principal areas of Artificial Intelligence. NLP plays a critical role in many intelligent applications such as automated chatbots, article summarizers, multi-lingual translation and opinion identification from data. Every industry that uses NLP to make sense of unstructured text data demands not just accuracy, but also speed in obtaining results.

Natural Language Processing is a broad field; some of the tasks in NLP are text classification, entity detection, machine translation, question answering, and concept identification. In one of my previous articles, I discussed various tools and components that are used in the implementation of NLP. Most of the components in that article were described using the venerable NLTK library (Natural Language Toolkit).

In this article, I will share my notes on spaCy, one of the most powerful and advanced libraries used to implement NLP.

 

Table of Contents

  1. About spaCy and Installation
  2. SpaCy Pipeline and Properties
    • Tokenization
    • Part of Speech Tagging
    • Entity Detection
    • Dependency Parsing
    • Noun Phrases
  3. Word Vectors
  4. Machine Learning with Text using SpaCy
  5. Comparison with NLTK and CoreNLP

 

1. About spaCy and Installation

1.1 About

spaCy is written in Cython (a C extension of Python designed to give C-like performance to Python programs), and is hence quite a fast library. spaCy provides a concise API to access its methods and properties, governed by trained machine (and deep) learning models.

 

1.2 Installation

spaCy, its data, and its models can be easily installed using the Python package index and setup tools. Use the following command to install spaCy on your machine:

sudo pip install spacy

In case of Python 3, replace "pip" with "pip3" in the above command.

Alternatively, download the source from here and run the following command after unzipping:

python setup.py install

To download all the data and models, run the following command after the installation:

python -m spacy.en.download all

(In more recent versions of spaCy, the equivalent command is python -m spacy download en.)

You are now all set to explore and use spaCy.
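A quick sanity check (a minimal sketch, assuming the English model downloaded above) confirms that everything is wired up:

import spacy
nlp = spacy.load('en')
print nlp(u'This is a sentence.')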

 

2. SpaCy Pipeline and Properties

Implementation of spaCy and access to its different properties is initiated by creating pipelines. A pipeline is created by loading a model. The package provides different types of models which contain information about a language – vocabularies, trained vectors, syntaxes and entities.

We will load the default model, which is english-core-web.

import spacy 
nlp = spacy.load('en')

The object "nlp" is used to create documents, access linguistic annotations and different NLP properties. Let's create a document by loading text data into our pipeline. I am using reviews of a hotel obtained from TripAdvisor's website. The data file can be downloaded here.

# read the raw text (in Python 3, simply: document = open(filename).read())
document = unicode(open(filename).read().decode('utf8')) 
document = nlp(document)

The document is now part of spaCy's English model class and is associated with a number of properties. The properties of a document (or its tokens) can be listed by using the following command:

dir(document)
>> [ 'doc', 'ents', … 'mem']

This outputs a wide range of document properties such as tokens, tokens' reference indexes, part-of-speech tags, entities, vectors, sentiment, vocabulary etc. Let's explore some of these properties.

 

2.1 Tokenization

Every spaCy document is tokenized into sentences and further into tokens, which can be accessed by iterating over the document:

# first token of the doc 
document[0] 
>> Nice

# fifth token from the end of the doc  
document[len(document)-5]
>> boston 

# List of sentences of our doc 
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]

 

2.2 Part of Speech Tagging

Part-of-speech tags are properties of a word that are determined by the word's usage in a grammatically correct sentence. These tags can be used as text features in information filtering, statistical models, and rule-based parsing.

Let's check all the POS tags of our document:

# get all tags
all_tags = {w.pos: w.pos_ for w in document}
>> {97:  u'SYM', 98: u'VERB', 99: u'X', 101: u'SPACE', 82: u'ADJ', 83: u'ADP', 84: u'ADV', 87: u'CCONJ', 88: u'DET', 89: u'INTJ', 90: u'NOUN', 91: u'NUM', 92: u'PART', 93: u'PRON', 94: u'PROPN', 95: u'PUNCT'}

# all tags of first sentence of our document 
for word in list(document.sents)[0]:  
    print word, word.tag_
>> ( Nice, u'JJ') (place, u'NN') (Better, u'NNP') (than, u'IN') (some, u'DT') (reviews, u'NNS') (give, u'VBP') (it, u'PRP') (credit, u'NN') (for, u'IN') (., u'.')
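Note that spaCy exposes both a coarse-grained tag (".pos_", used in the dictionary above) and a fine-grained tag (".tag_", used here) for every token. A quick check on the first token makes the distinction clear:

# coarse-grained vs fine-grained tag of the first token
token = document[0]
print token.pos_, token.tag_
>> ADJ JJ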

Let's explore some of the top unigrams of the document. I have created basic preprocessing and text cleaning functions.

#define some parameters  
noisy_pos_tags = ['PROP']  # note: spaCy's actual tag for proper nouns is 'PROPN'
min_token_length = 2

#Function to check if the token is a noise or not  
def isNoise(token):     
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True 
    elif token.is_stop == True:
        is_noise = True
    elif len(token.string) <= min_token_length:
        is_noise = True
    return is_noise 
def cleanup(token, lower = True):
    if lower:
       token = token.lower()
    return token.strip()

# top unigrams used in the reviews 
from collections import Counter
cleaned_list = [cleanup(word.string) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [( u'hotel', 683), (u'room', 652), (u'great', 300),  (u'sheraton', 285), (u'location', 271)]

 

2.3 Entity Detection

spaCy includes a fast entity recognition model which is capable of identifying entity phrases in the document. Entities can be of different types, such as person, location, organization, dates, numerals, etc. These entities can be accessed through the ".ents" property.

Let's find all the types of named entities present in our document.

labels = set([w.label_ for w in document.ents]) 
for label in labels: 
    entities = [cleanup(e.string, lower=False) for e in document.ents if label==e.label_] 
    entities = list(set(entities)) 
    print label,entities
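Each entity span also exposes its text and label directly, so individual entities can be inspected without any grouping; a minimal sketch on the same document:

# print a few entities along with their labels
for ent in list(document.ents)[:5]:
    print ent.string.strip(), '-', ent.label_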

 

2.4 Dependency Parsing

One of the most powerful features of spaCy is its extremely fast and accurate syntactic dependency parser, which can be accessed via a lightweight API. The parser can also be used for sentence boundary detection and phrase chunking. The relations can be accessed through the properties ".children", ".root", ".ancestors" etc.

# extract all review sentences that contains the term - hotel
hotel = [sent for sent in document.sents if 'hotel' in sent.string.lower()]

# create dependency tree
sentence = hotel[2]
for word in sentence:
    print word, ': ', str(list(word.children))
>> A :  []
cab :  [A, from]
from :  [airport, to]
the :  []
airport :  [the]
to :  [hotel]
the :  []
hotel :  [the]
can :  []
be :  [cab, can, cheaper, .]
cheaper :  [than]
than :  [shuttles]
the :  []
shuttles :  [the, depending]
depending :  [time]
what :  []
time :  [what, of]
of :  [day]
the :  []
day :  [the, go]
you :  []
go :  [you]
. :  []
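Besides ".children", the ".root" and ".ancestors" accessors mentioned above walk the tree in the other direction; a minimal sketch on the same sentence:

# the syntactic head of the whole sentence span
print sentence.root
# each token's chain of heads up to the root
for word in sentence:
    print word, '<-', list(word.ancestors)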

Let's parse the dependency trees of all the sentences which contain the term "hotel" and check which adjectival tokens are used with it. I have created a custom function that parses a dependency tree and extracts tokens with the relevant POS tag.

# check all adjectives used with a word 
def pos_words(doc, token, ptag):
    sentences = [sent for sent in doc.sents if token in sent.string]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.string:
                pwrds.extend([child.string.strip() for child in word.children
                              if child.pos_ == ptag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', u'ADJ')
>> [(u'other', 20), (u'great', 10), (u'good', 7), (u'better', 6), (u'nice', 6), (u'different', 5), (u'many', 5), (u'best', 4), (u'my', 4), (u'wonderful', 3)]
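The same helper can be pointed at any term or POS tag; for example, the verbs used around the word "hotel":

pos_words(document, 'hotel', u'VERB')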

 

2.5 Noun Phrases

Dependency trees can also be used to generate noun phrases:

# Generate Noun Phrases 
doc = nlp(u'I love data science on analytics vidhya') 
for np in doc.noun_chunks:
    print np.text, np.root.dep_, np.root.head.text
>> I nsubj love
   data science dobj love
   analytics pobj on
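The same accessor works on the full review document loaded earlier; for example, to peek at its first few noun chunks:

# noun chunks from the review document
for np in list(document.noun_chunks)[:5]:
    print np.text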

3. Word Vectors

spaCy also provides built-in integration of dense, real-valued vectors representing distributional similarity information. It generates these vectors using GloVe, an unsupervised learning algorithm for obtaining vector representations of words.

Let’s create some word vectors and perform some interesting operations.

from numpy import dot 
from numpy.linalg import norm 
from spacy.en import English
parser = English()

#Generate word vector of the word - apple  
apple = parser.vocab[u'apple']

#Cosine similarity function 
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
others = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != unicode("apple")})

# sort by similarity score
others.sort(key=lambda w: cosine(w.vector, apple.vector)) 
others.reverse()


print "top most similar words to apple:" 
for word in others[:10]:
    print word.orth_
>> apples iphone fruit juice cherry lemon banana pie mac orange
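These vectors also power spaCy's built-in ".similarity()" method, which computes the same cosine score without the manual lambda; a minimal sketch (assuming the vectors loaded above):

# built-in similarity between two lexemes
orange = parser.vocab[u'orange']
print apple.similarity(orange)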

 

4. Machine Learning with Text using SpaCy

Integrating spaCy into a machine learning model is pretty easy and straightforward. Let's build a custom text classifier using sklearn. We will create an sklearn pipeline with the following components: cleaner, tokenizer, vectorizer, classifier. For the tokenizer and vectorizer we will build our own custom modules using spaCy.

from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS as stopwords 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

import string
punctuations = string.punctuation

from spacy.en import English
parser = English()

#Custom transformer using spaCy 
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic utility function to clean the text 
def clean_text(text):     
    return text.strip().lower()

Let's now create a custom tokenizer function using the spaCy parser and some basic cleaning. One thing to note here is that the text features can be replaced with word vectors (especially beneficial in deep learning models), as sketched after the vectorizer below.

#Create spacy tokenizer that parses a sentence and generates tokens
#these can also be replaced by word vectors 
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    tokens = [tok for tok in tokens if (tok not in stopwords and tok not in punctuations)]
    return tokens

#create vectorizer object to generate feature vectors; we will use our custom spacy tokenizer
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

We are now ready to create the pipeline, load the data (sample here), and run the classifier model.

# Create the  pipeline to clean, tokenize, vectorize, and classify 
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

# Load sample data
train = [('I love this sandwich.', 'pos'),          
         ('this is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('this is my best work.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('he is my sworn enemy!', 'neg'),          
         ('my boss is horrible.', 'neg')] 
test =   [('the beer was good.', 'pos'),     
         ('I do not enjoy my job', 'neg'),
         ("I ain't feelin dandy today.", 'neg'),
         ("I feel amazing!", 'pos'),
         ('Gary is a good friend of mine.', 'pos'),
         ("I can't believe I'm doing this.", 'neg')]

# Create model and measure accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train]) 
pred_data = pipe.predict([x[0] for x in test]) 
for (sample, pred) in zip(test, pred_data):
    print sample, pred 
print "Accuracy:", accuracy_score([x[1] for x in test], pred_data)

>>    ('the beer was good.', 'pos') pos
      ('I do not enjoy my job', 'neg') neg
      ("I ain't feelin dandy today.", 'neg') neg
      ('I feel amazing!', 'pos') pos
      ('Gary is a good friend of mine.', 'pos') pos
      ("I can't believe I'm doing this.", 'neg') neg 
      Accuracy: 1.0
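The fitted pipeline can also score any new, unseen review directly:

# classify a new piece of text with the fitted pipeline
print pipe.predict([u'the rooms were clean and the staff was friendly'])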

 

5. Comparison with other libraries

spaCy is a very powerful, industrial-strength package for almost all natural language processing tasks. If you are wondering why, let's compare spaCy with other famous tools used to implement NLP in Python: CoreNLP and NLTK.

 

Feature Availability

Feature                    spaCy   NLTK   CoreNLP
Easy installation          Y       Y      Y
Python API                 Y       Y      N
Multi-language support     N       Y      Y
Tokenization               Y       Y      Y
Part-of-speech tagging     Y       Y      Y
Sentence segmentation      Y       Y      Y
Dependency parsing         Y       N      Y
Entity recognition         Y       Y      Y
Integrated word vectors    Y       N      N
Sentiment analysis         Y       Y      Y
Coreference resolution     N       N      Y

 

Speed: Key Functionalities – Tokenizer, Tagging, Parsing

Package   Tokenizer   Tagging   Parsing
spaCy     0.2ms       1ms       19ms
CoreNLP   2ms         10ms      49ms
NLTK      4ms         443ms     n/a (no dependency parser)

 

Accuracy: Entity Extraction

Package   Precision   Recall   F-Score
spaCy     0.72        0.65     0.69
CoreNLP   0.79        0.73     0.76
NLTK      0.51        0.65     0.58

 

Projects

Now, it's time to take the plunge and actually play with some real datasets. So are you ready to take on the challenge? Accelerate your NLP journey with the following practice problems:

 

Practice Problem: Identify the Sentiments – Identify the sentiment of tweets
Practice Problem: Twitter Sentiment Analysis – Detect hate speech in tweets

 

End Notes

In this article we discussed spaCy – a complete package for implementing NLP tasks in Python. We went through various examples showcasing spaCy's usefulness, speed and accuracy. Finally, we compared the package with other well-known NLP libraries – CoreNLP and NLTK.

Once the concepts described in this article are understood, one can solve (really) challenging problems exploiting text data and natural language processing.

I hope you enjoyed reading this article. Feel free to post your doubts, questions or any thoughts in the comments section.

Learn, compete, hack and get hired!

About the Author

Shivam Bansal

Shivam Bansal is a data scientist with extensive experience in Natural Language Processing and Machine Learning across several domains. He is passionate about learning and always looks forward to solving challenging analytical problems.


32 thoughts on "Natural Language Processing Made Easy – using SpaCy (in Python)"

Pratik says: April 04, 2017 at 7:58 am
Hi Shivam, I was just looking for this information! NLTK vs spaCy. Thanks!

Sourabh Jindal says: April 04, 2017 at 8:07 am
Awesome article!

Leo Vogels says: April 04, 2017 at 8:43 am
Just implementing spaCy in Ubuntu. Works great. Thanks! Leo Vogels

Chang-Jung Wu says: April 04, 2017 at 11:36 am
Thanks! Great and useful article! I'm wondering whether TextBlob is faster than NLTK, and whether it is slower than spaCy.

Shivam Bansal says: April 04, 2017 at 1:16 pm
Hi Chang, thanks. Spacy > TextBlob > NLTK

Justin says: April 04, 2017 at 1:37 pm
Great info here, thanks!

Suryanarayana Ambatipudi says: April 04, 2017 at 2:19 pm
Very useful article... Thank you so much!!

Venkat B says: April 04, 2017 at 4:54 pm
Thanks Shivam. Do you have a similar article for R?

margaretawdy says: April 08, 2017 at 5:04 am
Thank you for this awesome article :)

Swapnil Gaikwad says: April 10, 2017 at 6:52 pm
Hi Shivam! Very nice article. It helped me to get started with spaCy. I have one question for you. I am trying to extract some entities like dates, locations and names from hundreds of resumes. Will it be possible to do so using spaCy? Will it do the task quickly, within a few minutes, and how accurate will it be? Please provide your guidance as well as your opinion. Thanks.

octavio says: April 19, 2017 at 6:14 pm
Hi, I think there is a typo in 2.4 Dependency Parsing, in the function pos_words inside the second FOR: "if character in word.string:". There should be another FOR, not an IF: "for character in word.string:". However, nice post. Really, thank you for it. It helps me a lot.

Vinay says: April 25, 2017 at 12:31 pm
Hi Shivam, I'm getting this error: NameError: name 'character' is not defined

Amit Jaiswal says: June 06, 2017 at 11:58 am
I want to train entities, please share code or help me. Example: I want to train Airtel, Reliance and Vodafone as a Biller entity. It's behaving differently; please share some error-free and useful source. Thanks. Amit Jaiswal [email protected]

tiru says: June 14, 2017 at 1:50 pm
Nice article, thank you for posting it here. I have one question for you: how can I use the dependency tree output or the POS information which I got from spaCy in a multi-class classification problem? I would be very thankful if you could give more elaborate information on this.

Ben says: July 03, 2017 at 12:14 pm
Hi, great article and great explanation! I'm trying to work one step further by saving and reloading the trained pipeline: `joblib.dump(fitted_pipe, path)` works fine as well, but apparently the pickle doesn't store the custom classes and functions, especially `predictors` and `spacy_tokenizer`. So when loading pipe.pkl it doesn't find the required classes or modules. Any hint to fix that? Thanks, Ben

Swedha B says: July 07, 2017 at 7:35 pm
Hi, it's better to use these commands for installing spaCy: sudo pip install -U spacy, then sudo python -m spacy.en.download all

Swedha B says: July 07, 2017 at 7:38 pm
Hi, thanks for this article. It's better to use the commands above to install the spaCy package: sudo pip install -U spacy, then sudo python -m spacy.en.download all. Mine got fixed using these commands.

KostasX says: August 10, 2017 at 11:11 pm
Good catch!

Ram says: September 21, 2017 at 11:44 pm
Hi all, I am unable to install spaCy with my Python 2.7 version. Anyone who has worked on this, please share your thoughts.

Vlad says: October 17, 2017 at 8:32 pm
Replace character with token.

Parvathy Sarat says: October 25, 2017 at 3:20 pm
Very informative article. Wish I had come across this the first time I searched for articles and comparisons. Also, what about intent recognition in spaCy?

Cameron says: November 11, 2017 at 12:05 pm
Getting a "name 'unicode' is not defined" error on: document = unicode(open('Tripadvisor_hotelreviews_Shivambansal.txt').read().decode('utf8')); document = nlp(document). What is the prerequisite to make that work? Running Python 3.6 in a Jupyter notebook.

macbuse says: November 22, 2017 at 3:55 am
Hi, in 2.2 Part of Speech Tagging you **really** want to define a function is_not_noise (to reduce the nested elifs here and make the code read more like prose). The evaluation cost is the same if you just use logical 'and's: def is_not_noise(token): return (token.pos_ not in noisy_pos_tags and len(token.string) <= min_token_length and not token.is_stop)

İSMAİL KAHRAMAN says: November 24, 2017 at 1:53 am
Thanks a lot for this very informative article. But I did not understand how we get the parse tree (not the dependency tree) with spaCy.

Jason B says: November 27, 2017 at 8:51 pm
For the install: sudo python -m spacy.en.download all did not work, but: python -m spacy download en did.

iamlordaubrey says: January 15, 2018 at 10:30 pm
With Python 3.6, all you need to do is: document = open('Tripadvisor_hotelreviews_Shivambansal.txt').read(); document = nlp(document). That's it.

Shirish says: January 18, 2018 at 1:06 am
Excellent article! Thank you. If you have some bandwidth... can you check if the following lines of code give you a list? apple = parser.vocab[u'apple']; cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2)); others = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != unicode("apple")}). For me, print(others) gives an empty [].

Rashad says: February 02, 2018 at 2:07 am
The line document = unicode(open('Tripadvisor_hotelreviews_Shivambansal.txt').read().decode('utf8')) is deprecated in Python 3. It should be just document = open('Tripadvisor_hotelreviews_Shivambansal.txt').read(). However, dir(document) does not produce any output! Any ideas why?

Shoaib Khan says: February 02, 2018 at 1:00 pm
A great article. However, I believe R is missing a lot of these features. I have used the tidytext package in R but I can hardly compare it to spaCy. Is there some package very close to spaCy in R?

Rambabu Dara says: March 06, 2018 at 12:26 pm
I have the below requirement; can anyone help with doing it in Python? Features that need to be extracted – Role: there are different jobs – waiters, chefs, chauffeurs – (country-wise). 1) The information to be extracted from the CV is all possible roles that the candidate can possibly work as. 2) Assign a ranking to these roles (e.g. the candidate is a better waiter than a chef). 3) Identify from the CV whether the candidate has certifications. The CV can be any docx, pdf or excel file.

Mohamed says: March 10, 2018 at 12:11 am
Amazing article. Is there a way, using the machine learning capability in spaCy, to create something that can extract bilingual chunks from texts in two different languages?

Rekha Sharma says: April 11, 2018 at 3:38 pm
Hi, I am doing a task of cleaning and preprocessing data and I don't know machine learning or NLP, but I have to do that task using NLP and spaCy. Can you please suggest how to learn this step by step, so that I will be able to write code for cleaning textual data using NLP and spaCy?
