A common challenge I came across while learning Natural Language Processing (NLP) was: can we build models for non-English languages? For quite a long time, the answer was no. Each language has its own grammatical patterns and linguistic nuances, and there just aren’t many datasets available in other languages.
That’s where Stanford’s latest NLP library steps in – StanfordNLP.
I could barely contain my excitement when I read the news last week. The authors claimed StanfordNLP could support more than 53 human languages! Yes, I had to double-check that number.
I decided to check it out myself. There’s no official tutorial for the library yet, so I got the chance to experiment and play around with it. And I found that it opens up a world of endless possibilities. StanfordNLP contains pre-trained models for Asian languages like Hindi, Chinese and Japanese in their original scripts.
The ability to work with multiple languages is something all NLP enthusiasts crave. In this article, we will walk through what StanfordNLP is, why it’s so important, and then fire up Python to see it live in action. We’ll also take up a case study in Hindi to showcase how StanfordNLP works – you don’t want to miss that!
Here is StanfordNLP’s description by the authors themselves:
StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group’s official Python interface to the Stanford CoreNLP software.
That’s too much information in one go! Let’s break it down:
StanfordNLP is a collection of pre-trained state-of-the-art models. These models were used by the researchers in the CoNLL 2017 and 2018 competitions. All the models are built on PyTorch and can be trained and evaluated on your own annotated data. Awesome!
Additionally, StanfordNLP also contains an official wrapper to the popular behemoth NLP library – CoreNLP. This had been somewhat limited to the Java ecosystem until now. You should check out this tutorial to learn more about CoreNLP and how it works in Python.
Below are a few more reasons why you should check out this library:
What more could an NLP enthusiast ask for? Now that we have a handle on what this library does, let’s take it for a spin in Python!
There are some peculiar things about the library that had me puzzled initially. For instance, you need Python 3.6.8/3.7.2 or later to use StanfordNLP. To be safe, I set up a separate environment in Anaconda for Python 3.7.1. Here’s how you can do it:
1. Open conda prompt and type this:
conda create -n stanfordnlp python=3.7.1
2. Now activate the environment:
source activate stanfordnlp
3. Install the StanfordNLP library:
pip install stanfordnlp
4. We need to download a language’s specific model to work with it. Launch a Python shell and import StanfordNLP:
import stanfordnlp
then download the language model for English (“en”):
stanfordnlp.download('en')
This can take a while depending on your internet connection. These language models are pretty huge (the English one is 1.96 GB). StanfordNLP is built on PyTorch, which gets installed as a dependency; you can verify your version with:

pip freeze | grep torch

which should give an output like torch==1.0.0
That’s all! Let’s dive into some basic NLP processing right away.
StanfordNLP comes with built-in processors to perform five basic NLP tasks:

- Tokenization
- Multi-Word Token (MWT) Expansion
- Lemmatization
- Parts of Speech (PoS) Tagging
- Dependency Parsing
Let’s start by creating a text pipeline:
nlp = stanfordnlp.Pipeline(processors = "tokenize,mwt,lemma,pos,depparse")
doc = nlp("""The prospects for Britain’s orderly withdrawal from the European Union on March 29 have receded further, even as MPs rallied to stop a no-deal scenario. An amendment to the draft bill on the termination of London’s membership of the bloc obliges Prime Minister Theresa May to renegotiate her withdrawal agreement with Brussels. A Tory backbencher’s proposal calls on the government to come up with alternatives to the Irish backstop, a central tenet of the deal Britain agreed with the rest of the EU.""")
The processors argument is used to specify which tasks to run. All five processors are used by default if no argument is passed. Here is a quick overview of the processors and what they can do:
Let’s see each of them in action.
Tokenization happens implicitly once the tokenize processor is run, and it is actually pretty quick. You can have a look at the tokens by using print_tokens():
doc.sentences[0].print_tokens()
The token object contains the index of the token in the sentence and a list of word objects (in case of a multi-word token). Each word object contains useful information, like the index of the word, the lemma of the text, the pos (parts of speech) tag and the feat (morphological features) tag.
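To make that structure concrete, here is a small helper of my own (not part of the library) that flattens a sentence’s words into plain dictionaries using the attribute names described above:

```python
def describe_words(sentence):
    """Return one dict per word with its index, text, lemma, PoS tag and features."""
    return [
        {
            "index": word.index,
            "text": word.text,
            "lemma": word.lemma,
            "pos": word.pos,
            "feats": word.feats,
        }
        for word in sentence.words
    ]
```

Calling describe_words(doc.sentences[0]) on the pipeline output above gives you one dictionary per word, which is handy for inspecting what each processor produced.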
This involves using the “lemma” property of the words generated by the lemma processor. Here’s the code to get the lemma of all the words:
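The exact snippet from the original post isn’t reproduced here; a minimal version, assuming pandas is installed (the function name extract_lemma is my own), could look like this:

```python
import pandas as pd

def extract_lemma(doc):
    """Collect every word and its lemma from a processed document."""
    rows = [
        (word.text, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
    ]
    return pd.DataFrame(rows, columns=["word", "lemma"])
```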
This returns a pandas data frame for each word and its respective lemma:
The PoS tagger is quite fast and works really well across languages. Just like lemmas, PoS tags are also easy to extract:
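A sketch of such an extractor (extract_pos is my name for it, and the tag dictionary is trimmed to a few Penn Treebank tags for brevity; the original used a much larger mapping):

```python
import pandas as pd

# Mapping between PoS tags and their meaning (trimmed for brevity;
# the full dictionary covers every Penn Treebank tag)
pos_dict = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "JJ": "adjective",
    "PRP": "personal pronoun",
    "DT": "determiner",
}

def extract_pos(doc):
    """Return a data frame of each word, its PoS tag and an explanation."""
    rows = [
        (word.text, word.pos, pos_dict.get(word.pos, "-"))
        for sentence in doc.sentences
        for word in sentence.words
    ]
    return pd.DataFrame(rows, columns=["word", "pos", "exp"])
```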
Notice the big dictionary in the above code? It is just a mapping between PoS tags and their meaning. This helps in getting a better understanding of our document’s syntactic structure.
The output would be a data frame with three columns – word, pos and exp (explanation). The explanation column gives us the most information about the text (and is hence quite useful).
Adding the explanation column makes it much easier to evaluate how accurate our processor is. I like the fact that the tagger is on point for the majority of the words. It even picks up the tense of a word and whether it is in base or plural form.
Dependency extraction is another out-of-the-box feature of StanfordNLP. You can simply call print_dependencies() on a sentence to get the dependency relations for all of its words:
doc.sentences[0].print_dependencies()
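Beyond printing, the relations can also be collected programmatically: each word object carries a dependency_relation and the index of its governor (head word). A small sketch (extract_dependencies is my own helper name):

```python
def extract_dependencies(sentence):
    """Return (word, relation, head index) triples for one sentence."""
    return [
        (word.text, word.dependency_relation, word.governor)
        for word in sentence.words
    ]
```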
The library computes all of the above during a single run of the pipeline, which takes barely a few minutes on a GPU-enabled machine.
We have now figured out a way to perform basic text processing with StanfordNLP. It’s time to take advantage of the fact that we can do the same for 52 other languages!
StanfordNLP really stands out in its performance and multilingual text parsing support. Let’s dive deeper into the latter aspect.
First, we have to download the Hindi language model (comparatively smaller!):
stanfordnlp.download('hi')
Now, build a pipeline for Hindi (the English pipeline we created earlier won’t do) and take a piece of text in Hindi as our text document:

nlp = stanfordnlp.Pipeline(lang="hi")
hindi_doc = nlp("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")
This should be enough to generate all the tags. Let’s check the tags for Hindi:
extract_pos(hindi_doc)
The PoS tagger works surprisingly well on the Hindi text as well. Look at “अपना” for example. The PoS tagger tags it as a pronoun – I, he, she – which is accurate.
CoreNLP is a time-tested, industry-grade NLP toolkit that is known for its performance and accuracy. StanfordNLP has been declared the official Python interface to CoreNLP. That is a HUGE win for this library.
There have been efforts before to create Python wrapper packages for CoreNLP but nothing beats an official implementation from the authors themselves. This means that the library will see regular updates and improvements.
StanfordNLP takes three lines of code to start utilizing CoreNLP’s sophisticated API. Literally, just three lines of code to set it up!
1. Download the CoreNLP package. Open your Linux terminal and type the following command:
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
2. Unzip the downloaded package:
unzip stanford-corenlp-full-2018-10-05.zip
3. Start the CoreNLP server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Note: CoreNLP requires Java 8 to run. Please make sure you have JDK and JRE 1.8.x installed.
Now, make sure that StanfordNLP knows where CoreNLP is present. For that, you have to export $CORENLP_HOME as the location of your folder. In my case, this folder was in the home directory itself, so my path looks like this:
export CORENLP_HOME=stanford-corenlp-full-2018-10-05/
After the above steps have been taken, you can start up the server and make requests in Python code. Below is a comprehensive example of starting a server, making requests, and accessing data from the returned object.
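The full example isn’t reproduced here, but the core pattern, as I understand the library’s CoreNLPClient API, looks like this (the CoreNLP package must be downloaded and $CORENLP_HOME exported as above; the sample text is my own):

```python
from stanfordnlp.server import CoreNLPClient

text = "Chris Manning teaches at Stanford University. He is a great teacher."

# The context manager starts the Java server and shuts it down on exit
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)  # returns a protobuf Document object
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.ner)
```

The annotators list controls which CoreNLP annotations are computed; adding "coref" to it, for instance, populates the coreference chains on the returned document.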
The above examples barely scratch the surface of what CoreNLP can do, and yet it is very interesting: in just a few lines of Python code, we went from basic NLP tasks like parts of speech tagging to things like named entity recognition, coreference chain extraction, and finding who wrote what in a sentence.
What I like the most here is the ease of use and increased accessibility this brings when it comes to using CoreNLP in Python.
Exploring a newly launched library was certainly a challenge. There’s barely any documentation on StanfordNLP! Yet, it was quite an enjoyable learning experience.
A few things that excite me regarding the future of StanfordNLP:
There are, however, a few chinks to iron out. Below are my thoughts on where StanfordNLP could improve:
Make sure you check out StanfordNLP’s official documentation.
There is still a feature I haven’t tried out yet. StanfordNLP allows you to train models on your own annotated data using embeddings from Word2Vec/FastText. I’d like to explore it in the future and see how effective that functionality is. I will update the article whenever the library matures a bit.
Clearly, StanfordNLP is very much in the beta stage. It will only get better from here so this is a really good time to start using it – get a head start over everyone else.
For now, with amazing toolkits like CoreNLP coming to the Python ecosystem, and research giants like Stanford making an effort to open-source their software, I am optimistic about the future.