Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)

[email protected] 12 May, 2020

Introduction

A question I kept running into while learning Natural Language Processing (NLP): can we build models for non-English languages? For quite a long time, the practical answer was no. Each language has its own grammatical patterns and linguistic nuances, and there just aren’t many datasets available in other languages.

That’s where Stanford’s latest NLP library steps in – StanfordNLP.

I could barely contain my excitement when I read the news last week. The authors claimed StanfordNLP supports 53 human languages! Yes, I had to double-check that number.

I decided to check it out myself. There’s no official tutorial for the library yet, so I got the chance to experiment and play around with it, and I found that it opens up a lot of possibilities. StanfordNLP contains pre-trained models for Asian languages like Hindi, Chinese and Japanese in their original scripts.

The ability to work with multiple languages is something all NLP enthusiasts crave. In this article, we will walk through what StanfordNLP is, why it’s so important, and then fire up Python to see it live in action. We’ll also take up a case study in Hindi to showcase how StanfordNLP works – you don’t want to miss that!

 

Table of Contents

  1. What is StanfordNLP and Why Should You Use it?
  2. Setting up StanfordNLP in Python
  3. Using StanfordNLP to Perform Basic NLP Tasks
  4. Implementing StanfordNLP on the Hindi Language
  5. Using CoreNLP’s API for Text Analytics

 

What is StanfordNLP and Why Should You Use it?

Here is StanfordNLP’s description by the authors themselves:

StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group’s official Python interface to the Stanford CoreNLP software.

That’s too much information in one go! Let’s break it down:

  • CoNLL is an annual conference on Natural Language Learning. Teams representing research institutes from all over the world try to solve an NLP based task
  • One of the tasks last year was “Multilingual Parsing from Raw Text to Universal Dependencies”. In simple terms, it means to parse unstructured text data of multiple languages into useful annotations from Universal Dependencies
  • Universal Dependencies is a framework that maintains consistency in annotations. These annotations are generated for the text irrespective of the language being parsed
  • Stanford’s submission ranked #1 in 2017. They missed out on the first position in 2018 due to a software bug (ended up in 4th place)

StanfordNLP is a collection of pre-trained state-of-the-art models. These models were used by the researchers in the CoNLL 2017 and 2018 competitions. All the models are built on PyTorch and can be trained and evaluated on your own annotated data. Awesome!

Additionally, StanfordNLP contains an official wrapper to the popular behemoth NLP library – CoreNLP. This had been somewhat limited to the Java ecosystem until now. You should check out this tutorial to learn more about CoreNLP and how it works in Python.

Below are a few more reasons why you should check out this library:

  • Native Python implementation requiring minimal effort to set up
  • Full neural network pipeline for robust text analytics, including:
    • Tokenization
    • Multi-word token (MWT) expansion
    • Lemmatization
    • Parts-of-speech (POS) and morphological feature tagging
    • Dependency Parsing
  • Pretrained neural models supporting 53 (human) languages featured in 73 treebanks
  • A stable officially maintained Python interface to CoreNLP

What more could an NLP enthusiast ask for? Now that we have a handle on what this library does, let’s take it for a spin in Python!

 

Setting up StanfordNLP in Python

There are some peculiar things about the library that had me puzzled initially. For instance, you need Python 3.6.8/3.7.2 or later to use StanfordNLP. To be safe, I set up a separate environment in Anaconda for Python 3.7.1. Here’s how you can do it:

1. Open conda prompt and type this:

conda create -n stanfordnlp python=3.7.1

2. Now activate the environment:

source activate stanfordnlp

3. Install the StanfordNLP library:

pip install stanfordnlp

4. We need to download a language-specific model before we can work with that language. Launch a Python shell and import StanfordNLP:

import stanfordnlp

Then download the language model for English (“en”):

stanfordnlp.download('en')

This can take a while depending on your internet connection. These language models are pretty huge (the English one is 1.96GB).

 

A couple of important notes

  • StanfordNLP is built on top of PyTorch 1.0.0. It might crash if you have an older version. Here’s how you can check the version installed on your machine:
pip freeze | grep torch

which should give an output like torch==1.0.0

  • I tried using the library without GPU on my Lenovo Thinkpad E470 (8GB RAM, Intel Graphics). I got a memory error in Python pretty quickly. Hence, I switched to a GPU enabled machine and would advise you to do the same as well. You can try Google Colab which comes with free GPU support
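The version check from the first note can also be done from inside Python. The sketch below is only illustrative: `parse_version`, `meets_minimum` and `check_torch` are hypothetical helper names, and the 1.0.0 floor is simply the version mentioned above. It compares the leading numeric parts of the version string, so local suffixes such as `+cu117` are ignored.

```python
import re


def parse_version(version):
    """Return the leading numeric parts of a version string as a 3-tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component (e.g. "2.1") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version, minimum="1.0.0"):
    """True if `version` is at least `minimum` (assumed floor: PyTorch 1.0.0)."""
    return parse_version(version) >= parse_version(minimum)


def check_torch():
    """Return (version_ok, gpu_available) for the installed PyTorch."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return meets_minimum(torch.__version__), torch.cuda.is_available()
```

Calling `check_torch()` on your machine tells you both whether the installed PyTorch is recent enough and whether a CUDA GPU is visible to it.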

That’s all! Let’s dive into some basic NLP processing right away.

 

Using StanfordNLP to Perform Basic NLP Tasks

StanfordNLP comes with built-in processors to perform five basic NLP tasks:

  • Tokenization
  • Multi-Word Token Expansion
  • Lemmatisation
  • Parts of Speech Tagging
  • Dependency Parsing

Let’s start by creating a text pipeline:

nlp = stanfordnlp.Pipeline(processors="tokenize,mwt,pos,lemma,depparse")
doc = nlp("""The prospects for Britain’s orderly withdrawal from the European Union on March 29 have receded further, even as MPs rallied to stop a no-deal scenario. An amendment to the draft bill on the termination of London’s membership of the bloc obliges Prime Minister Theresa May to renegotiate her withdrawal agreement with Brussels. A Tory backbencher’s proposal calls on the government to come up with alternatives to the Irish backstop, a central tenet of the deal Britain agreed with the rest of the EU.""")

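The processors argument is used to specify which tasks to run, as a comma-separated string. All five processors are used by default if no argument is passed. Here is a quick overview of the processors and what they can do:

  • tokenize: splits raw text into sentences and tokens
  • mwt: expands multi-word tokens into their underlying words
  • lemma: generates the lemma (base form) of each word
  • pos: tags each word with its part of speech and morphological features
  • depparse: determines the syntactic head of each word and the dependency relation between the two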

Let’s see each of them in action.

 

Tokenization

This process happens implicitly once the tokenize processor is run. It is actually pretty quick. You can have a look at the tokens by using print_tokens():

doc.sentences[0].print_tokens()

The token object contains the index of the token in the sentence and a list of word objects (more than one in the case of a multi-word token). Each word object contains useful information, like the index of the word, its text, its lemma, the pos (part-of-speech) tag and the feats (morphological features) tag.
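To make the token/word relationship concrete, here is a minimal sketch that flattens one sentence into a row per word. `token_rows` is a hypothetical helper name, and the attribute names (`tokens`, `words`, `index`, `text`, `lemma`, `pos`, `feats`) are the ones I understand the 0.1.x releases of StanfordNLP to expose, so treat them as assumptions if you are on a different version.

```python
def token_rows(sentence):
    """Collect one row per word, keeping the index of the token it came from."""
    rows = []
    for token in sentence.tokens:
        # A multi-word token carries several word objects; a plain token has one.
        for word in token.words:
            rows.append({
                "token_index": token.index,
                "word_index": word.index,
                "text": word.text,
                "lemma": word.lemma,
                "pos": word.pos,
                "feats": word.feats,
            })
    return rows
```

For example, `token_rows(doc.sentences[0])` gives the same information as print_tokens(), but as a list of dictionaries you can filter or load into a data frame.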

 

Lemmatization

This involves using the “lemma” property of the words generated by the lemma processor. Here’s the code to get the lemma of all the words:
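A minimal sketch of such a helper is shown below. The function names `lemma_rows` and `extract_lemma` are illustrative choices, and the code assumes each word object exposes `text` and `lemma` attributes as described earlier. The rows are collected in plain Python first, and pandas is only imported when the data frame is actually built.

```python
def lemma_rows(doc):
    """Return one {'word', 'lemma'} dict per word in the parsed document."""
    return [
        {"word": word.text, "lemma": word.lemma}
        for sentence in doc.sentences
        for word in sentence.words
    ]


def extract_lemma(doc):
    """Wrap the rows in a pandas DataFrame with 'word' and 'lemma' columns."""
    import pandas as pd  # imported lazily so lemma_rows works without pandas

    return pd.DataFrame(lemma_rows(doc), columns=["word", "lemma"])
```

With the `doc` created above, `extract_lemma(doc)` would be the call to make.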

The result is a pandas data frame with one row per word and its respective lemma.

 

Parts of Speech (PoS) Tagging

The PoS tagger is quite fast and works really well across languages. Just like lemmas, PoS tags are also easy to extract:
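Here is a minimal sketch of what such an extraction could look like. `POS_EXPLANATIONS` is a deliberately abbreviated, illustrative mapping of a few Penn Treebank tags (a full mapping would be much longer), `pos_rows` is a hypothetical helper, and the code assumes each word object exposes `text` and `pos` attributes. Tags that are not in the mapping get the explanation 'NA'.

```python
# Abbreviated, illustrative subset of Penn Treebank tags; extend as needed.
POS_EXPLANATIONS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "pronoun, personal (I, he, she)",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
}


def pos_rows(doc, explanations=POS_EXPLANATIONS):
    """Return one {'word', 'pos', 'exp'} dict per word in the parsed document."""
    return [
        {"word": word.text, "pos": word.pos, "exp": explanations.get(word.pos, "NA")}
        for sentence in doc.sentences
        for word in sentence.words
    ]


def extract_pos(doc):
    """Wrap the rows in a pandas DataFrame with 'word', 'pos' and 'exp' columns."""
    import pandas as pd  # imported lazily so pos_rows works without pandas

    return pd.DataFrame(pos_rows(doc), columns=["word", "pos", "exp"])
```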

Notice the big dictionary in the above code? It is just a mapping between PoS tags and their meaning. This helps in getting a better understanding of our document’s syntactic structure.

The output would be a data frame with three columns – word, pos and exp (explanation). The explanation column gives us the most information about the text (and is hence quite useful).

Adding the explanation column makes it much easier to evaluate how accurate our processor is. I like the fact that the tagger is on point for the majority of the words. It even picks up the tense of a word and whether it is in base or plural form.

 

Dependency Extraction

Dependency extraction is another out-of-the-box feature of StanfordNLP. You can simply call print_dependencies() on a sentence to get the dependency relations for all of its words:

doc.sentences[0].print_dependencies()
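If you want the relations as data rather than printed output, a sketch along these lines may help. It assumes the pipeline was built with the depparse processor, and that each word exposes `index`, `governor` (the index of its head, with 0 meaning the root) and `dependency_relation`, which is how I understand the 0.1.x word objects. `dependency_rows` is a hypothetical helper name.

```python
def dependency_rows(sentence):
    """Return (word, head word, relation) for every word in a parsed sentence."""
    # Word indices are 1-based; a governor of 0 marks the root of the sentence.
    by_index = {int(word.index): word for word in sentence.words}
    rows = []
    for word in sentence.words:
        head = by_index.get(int(word.governor))
        head_text = head.text if head is not None else "ROOT"
        rows.append((word.text, head_text, word.dependency_relation))
    return rows
```

`dependency_rows(doc.sentences[0])` then returns a list of tuples that is easy to filter, for instance to keep only the nsubj relations.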

The library computes all of the above during a single run of the pipeline. This should take no more than a few minutes on a GPU-enabled machine.

We have now figured out a way to perform basic text processing with StanfordNLP. It’s time to take advantage of the fact that we can do the same for 52 other languages!

 

Implementing StanfordNLP on the Hindi Language

StanfordNLP really stands out in its performance and multilingual text parsing support. Let’s dive deeper into the latter aspect.

 

Processing text in Hindi (Devanagari Script)

First, we have to download the Hindi language model (comparatively smaller!):

stanfordnlp.download('hi')

Now, take a piece of text in Hindi as our text document:

# Build a pipeline from the Hindi models instead of reusing the English one
nlp = stanfordnlp.Pipeline(lang="hi")
hindi_doc = nlp("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")

This should be enough to generate all the tags. Let’s check the tags for Hindi:

extract_pos(hindi_doc)

The PoS tagger works surprisingly well on the Hindi text as well. Look at “अपना” for example. The PoS tagger tags it as a pronoun – I, he, she – which is accurate.

 

Using CoreNLP’s API for Text Analytics

CoreNLP is a time-tested, industry-grade NLP toolkit known for its performance and accuracy. StanfordNLP has been declared the official Python interface to CoreNLP. That is a HUGE win for this library.

There have been efforts before to create Python wrapper packages for CoreNLP but nothing beats an official implementation from the authors themselves. This means that the library will see regular updates and improvements.

StanfordNLP takes three lines of code to start utilizing CoreNLP’s sophisticated API. Literally, just three lines of code to set it up!

1. Download the CoreNLP package. Open your Linux terminal and type the following command:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

2. Unzip the downloaded package:

unzip stanford-corenlp-full-2018-10-05.zip

3. Start the CoreNLP server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Note: CoreNLP requires Java 8 to run. Please make sure you have JDK and JRE 1.8.x installed.

Now, make sure that StanfordNLP knows where CoreNLP is located. For that, you have to export $CORENLP_HOME as the location of the unzipped folder. In my case, the folder was in my home directory, so the command was:

export CORENLP_HOME=stanford-corenlp-full-2018-10-05/

After the above steps have been taken, you can start up the server and make requests in Python code. Below is a comprehensive example of starting a server, making requests, and accessing data from the returned object.

a. Setting up the CoreNLPClient
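A minimal sketch of the client setup, assuming the `CoreNLPClient` class from `stanfordnlp.server` and its `annotators`, `timeout` and `memory` arguments. The helper names `corenlp_home` and `annotate_with_corenlp` are hypothetical, and the annotator list, 30-second timeout and 4G memory setting are illustrative values, not recommendations (co-reference in particular may need more memory).

```python
import os


def corenlp_home(environ=os.environ):
    """Return the CoreNLP folder from $CORENLP_HOME, failing early if it is unset."""
    path = environ.get("CORENLP_HOME")
    if not path:
        raise RuntimeError("CORENLP_HOME is not set; export it before starting the client")
    return path


def annotate_with_corenlp(
    text,
    annotators=("tokenize", "ssplit", "pos", "lemma", "ner", "depparse", "coref"),
):
    """Start a CoreNLP server through the client, annotate `text`, then shut down."""
    from stanfordnlp.server import CoreNLPClient  # lazy: needs stanfordnlp and Java

    corenlp_home()  # the client looks up CORENLP_HOME itself; this gives a clearer error
    with CoreNLPClient(annotators=list(annotators), timeout=30000, memory="4G") as client:
        return client.annotate(text)
```

Using the client as a context manager means the background Java server is stopped when the block exits.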

b. Dependency Parsing and POS
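The sketch below shows one way to read tags and dependency edges from a sentence of the annotated document returned by the client. The field names (`token`, `word`, `pos`, `ner`, `basicDependencies`, `edge`, `source`, `target`, `dep`) follow my understanding of CoreNLP's protobuf output, and the 1-based token indices on edges are an assumption worth verifying on your version. `token_tags` and `dependency_edges` are hypothetical helper names.

```python
def token_tags(sentence):
    """Return (word, POS tag, NER tag) for each token of an annotated sentence."""
    return [(token.word, token.pos, token.ner) for token in sentence.token]


def dependency_edges(sentence):
    """Return (head word, relation, dependent word) for each basic dependency edge."""
    words = [token.word for token in sentence.token]
    # Assumption: edge.source and edge.target are 1-based token positions.
    return [
        (words[edge.source - 1], edge.dep, words[edge.target - 1])
        for edge in sentence.basicDependencies.edge
    ]
```

Given an annotated document `ann`, `token_tags(ann.sentence[0])` and `dependency_edges(ann.sentence[0])` cover the first sentence.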

c. Named Entity Recognition and Co-Reference Chains
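As a rough sketch, entity mentions and co-reference chains can be pulled out of the annotated document like this. The field names (`mentions`, `entityMentionText`, `ner`, `corefChain`, `mention`, `sentenceIndex`, `beginIndex`, `endIndex`) reflect my reading of CoreNLP's protobuf output, with `beginIndex` inclusive and `endIndex` exclusive, so treat them as assumptions. `entity_mentions` and `coref_chains` are hypothetical helper names.

```python
def entity_mentions(ann):
    """Return (mention text, entity type) for every entity mention in the document."""
    return [
        (mention.entityMentionText, mention.ner)
        for sentence in ann.sentence
        for mention in sentence.mentions
    ]


def coref_chains(ann):
    """Return each co-reference chain as a list of mention strings."""
    chains = []
    for chain in ann.corefChain:
        texts = []
        for mention in chain.mention:
            # Assumption: beginIndex/endIndex are token offsets within the sentence.
            tokens = ann.sentence[mention.sentenceIndex].token
            span = tokens[mention.beginIndex:mention.endIndex]
            texts.append(" ".join(token.word for token in span))
        chains.append(texts)
    return chains
```

Both helpers need the ner and coref annotators to have been requested when the client was created.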

The above examples barely scratch the surface of what CoreNLP can do, and yet the results are already interesting. In just a few lines of Python code, we went from basic NLP tasks like part-of-speech tagging to Named Entity Recognition, co-reference chain extraction and finding who wrote what in a sentence.

What I like the most here is the ease of use and increased accessibility this brings to using CoreNLP in Python.

 

My Thoughts on using StanfordNLP – Pros and Cons

Exploring a newly launched library was certainly a challenge. There’s barely any documentation on StanfordNLP! Yet, it was quite an enjoyable learning experience.

A few things that excite me regarding the future of StanfordNLP:

  1. Its out-of-the-box support for multiple languages
  2. The fact that it is going to be an official Python interface for CoreNLP. This means it will only improve in functionality and ease of use going forward
  3. It is fairly fast (barring the huge memory footprint)
  4. Straightforward set up in Python

There are, however, a few kinks to iron out. Below are my thoughts on where StanfordNLP could improve:

  1. The language models are too large (English is 1.96 GB, Chinese ~1.8 GB)
  2. The library requires a lot of code to churn out features. Compare that to NLTK where you can quickly script a prototype – this might not be possible for StanfordNLP
  3. Visualization features are currently missing. They would be useful for functions like dependency parsing, and StanfordNLP falls short here when compared with libraries like spaCy

Make sure you check out StanfordNLP’s official documentation.

 

End Notes

There is still a feature I haven’t tried out yet. StanfordNLP allows you to train models on your own annotated data using embeddings from Word2Vec/FastText. I’d like to explore it in the future and see how effective that functionality is. I will update the article whenever the library matures a bit.

Clearly, StanfordNLP is very much in the beta stage. It will only get better from here so this is a really good time to start using it – get a head start over everyone else.

For now, I am optimistic about the future: amazing toolkits like CoreNLP are coming to the Python ecosystem, and research giants like Stanford are making an effort to open source their software.
