How Part-of-Speech Tagging, Dependency Parsing and Constituency Parsing Aid in Understanding Text Data

Abhishek Sharma Last Updated : 12 Nov, 2024
Overview

  • Learn about Part-of-Speech (POS) tagging
  • Understand dependency parsing and constituency parsing

I was amazed that Roger Bacon gave the above quote in the 13th century and that it still holds true, doesn’t it? I am sure that you will all agree with me.

Today, the way we understand languages has changed a lot since the 13th century. We now refer to it as linguistics and natural language processing. But its importance hasn’t diminished; instead, it has increased tremendously. You know why? Because its applications have skyrocketed, and one of them is the reason why you landed on this article.

Fundamental concepts of NLP

Each of these applications involves complex NLP techniques, and to understand them, one must have a good grasp of the basics of NLP. Therefore, before moving on to complex topics, it is important to get the fundamentals right.

That’s why I have created this article, in which I will cover some basic concepts of NLP: Part-of-Speech (POS) tagging, dependency parsing, and constituency parsing. We will understand these concepts and also implement them in Python. So let’s begin!

In this article, you will learn what POS tagging is in NLP, see a POS tagging example in code, and discover the main types of POS tags.

What is Part-of-Speech (POS) Tagging?

Part-of-Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence. This process provides insights into the syntactic structure of the text, aiding in understanding word relationships, disambiguating word meanings, and facilitating various linguistic and computational analyses of textual data.

In our school days, all of us studied the parts of speech, which include nouns, pronouns, adjectives, verbs, etc. Words belonging to various parts of speech form a sentence. Knowing the part of speech of the words in a sentence is important for understanding it.

That’s the reason for the creation of the concept of POS tagging. I’m sure that by now, you have already guessed what POS tagging is. Still, allow me to explain it to you.

Part-of-Speech (POS) tagging is the process of assigning labels, known as POS tags, to the words in a sentence. Each tag tells us the part of speech of its word.

Broadly, there are two types of POS tags:

Universal POS Tags

These tags are used in Universal Dependencies (UD) (latest version 2), a project that is developing cross-linguistically consistent treebank annotation for many languages. These tags are based on the type of word, e.g., NOUN (common noun), ADJ (adjective), ADV (adverb).

List of Universal POS Tags

[Figure: list of Universal POS tags]

You can read more about each one of them here.

Detailed POS Tags

These tags result from dividing the universal POS tags into finer categories. In English, for example, NOUN (common noun) splits into NN for singular common nouns and NNS for plural common nouns. These tags are language-specific. You can take a look at the complete list here.
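To make the relationship between the two tag sets concrete, here is a minimal sketch that collapses a few detailed English tags into their universal counterparts. The FINE_TO_UNIVERSAL table and the to_universal helper are names I made up for illustration. The table covers only a handful of Penn Treebank style tags, not the complete list, and falling back to X (the universal tag for "other") is my own assumption.

```python
# Hypothetical, partial lookup table: detailed (Penn Treebank style) tag -> universal POS tag.
FINE_TO_UNIVERSAL = {
    "NN": "NOUN",     # singular common noun
    "NNS": "NOUN",    # plural common noun
    "NNP": "PROPN",   # singular proper noun
    "NNPS": "PROPN",  # plural proper noun
    "JJ": "ADJ",      # adjective
    "JJR": "ADJ",     # comparative adjective
    "RB": "ADV",      # adverb
    "VB": "VERB",     # verb, base form
    "VBD": "VERB",    # verb, past tense
}


def to_universal(detailed_tag):
    """Return the universal tag for a detailed tag, or 'X' if it is not in the table."""
    return FINE_TO_UNIVERSAL.get(detailed_tag, "X")


for tag in ["NN", "NNS", "VBD", "FOO"]:
    print(tag, "=>", to_universal(tag))
```

Several detailed tags map to one universal tag, which is why the detailed set carries more information (such as number or tense) than the universal one.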

Now you know what POS tags are and what POS tagging is. So let’s write the code in Python for POS tagging sentences. For this purpose, I have used spaCy here, but there are other libraries, like NLTK and Stanza, which can also be used for the same task.

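import spacy

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

text = 'It took me more than two hours to translate a few pages of English.'

# Print each token with its universal POS tag (pos_) and its detailed tag (tag_)
for token in nlp(text):
    print(token.text, '=>', token.pos_, '=>', token.tag_)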
[Figure: output of the POS tagging code]

In the above code sample, I have loaded spaCy’s en_core_web_sm model and used it to get the POS tags. You can see that pos_ returns the universal POS tags, and tag_ returns the detailed POS tags for the words in the sentence.

Dependency Parsing

Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the dependencies between the words in a sentence.

In dependency parsing, various tags represent the relationship between two words in a sentence. These tags are the dependency tags. For example, in the phrase ‘rainy weather,’ the word rainy modifies the meaning of the noun weather. Therefore, a dependency exists from weather -> rainy, in which weather acts as the head and rainy acts as the dependent or child. This dependency is represented by the amod tag, which stands for adjectival modifier.

[Figure: example of a dependency tag]

Similarly, there are many dependencies among the words in a sentence, but note that a dependency involves only two words, in which one acts as the head and the other acts as the child. As of now, there are 37 universal dependency relations used in Universal Dependencies (version 2). You can take a look at all of them here. Apart from these, there also exist many language-specific tags.
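Since every dependency links exactly one child to one head, a parse can be stored as a plain list of triples. The sketch below does that for the sentence "She enjoys rainy weather". The arcs are annotated by hand with Universal Dependencies style labels (they are not parser output), and the helper names children_by_head and find_roots are my own.

```python
from collections import defaultdict

# Hand-annotated (child, relation, head) triples for "She enjoys rainy weather".
# These are illustrative and were not produced by a parser.
arcs = [
    ("She", "nsubj", "enjoys"),
    ("rainy", "amod", "weather"),
    ("weather", "obj", "enjoys"),
]


def children_by_head(arcs):
    """Group (child, relation) pairs under the head they depend on."""
    children = defaultdict(list)
    for child, relation, head in arcs:
        children[head].append((child, relation))
    return dict(children)


def find_roots(arcs):
    """Return the words that act as a head but never appear as a child."""
    heads = {head for _, _, head in arcs}
    children = {child for child, _, _ in arcs}
    return heads - children


print(children_by_head(arcs))
print(find_roots(arcs))
```

Here weather is the head of rainy through amod, exactly as in the ‘rainy weather’ example, and enjoys is the only word that is never a child.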

Check out this article: Tutorial on Natural Language Processing using spaCy

Now let’s use spaCy to find the dependencies in a sentence.

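import spacy

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

text = 'It took me more than two hours to translate a few pages of English.'

# Print each token with its dependency tag (dep_) and its head word
for token in nlp(text):
    print(token.text, '=>', token.dep_, '=>', token.head.text)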
[Figure: output of the dependency parsing code]

In the above code example, dep_ returns the dependency tag for a word, and head.text returns the respective head word. If you look at the above image, the word took has a dependency tag of ROOT. This tag is assigned to the word that acts as the head of other words in the sentence but is not a child of any other word. Generally, it is the main verb of the sentence, like ‘took’ in this case.

Now you know what dependency tags are and what the head, child, and root words are. But doesn’t parsing mean generating a parse tree?

Yes, we’re generating the tree here, but we’re not visualizing it. The tree generated by dependency parsing is known as a dependency tree. There are multiple ways of visualizing it, but for the sake of simplicity, we’ll use displaCy, spaCy’s built-in tool for visualizing the dependency parse.

from spacy import displacy

# Draw the dependency tree inline (jupyter=True assumes a notebook environment)
displacy.render(nlp(text), style='dep', jupyter=True)

In the above image, the arrows represent the dependency between two words, in which the word at the arrowhead is the child and the word at the tail of the arrow is the head. The root word can act as the head of multiple words in a sentence but is not a child of any other word. You can see above that the word ‘took’ has multiple outgoing arrows but none incoming. Therefore, it is the root word. One interesting thing about the root word is that if you trace the dependencies from child to head, you reach the root word no matter which word you start from.
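The claim that every word leads back to the root is easy to check on a small example. In this sketch, the heads dictionary is written by hand for the sentence "She enjoys rainy weather" and is not parser output. The root points to itself, which mirrors how spaCy reports token.head for the root token. The path_to_root helper is a name I chose, and keying by word text assumes that no word is repeated in the sentence.

```python
# Hand-written head for each word of "She enjoys rainy weather" (not parser output).
# The root word points to itself.
heads = {
    "She": "enjoys",
    "enjoys": "enjoys",
    "rainy": "weather",
    "weather": "enjoys",
}


def path_to_root(word, heads):
    """Follow head links from a word until the root (its own head) is reached."""
    path = [word]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


for word in heads:
    print(" -> ".join(path_to_root(word, heads)))
```

Every printed path ends in enjoys, the root of this small tree.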

Now you know about dependency parsing, so let’s learn about another type of parsing known as constituency parsing.

Constituency Parsing

Constituency parsing is the process of analyzing sentences by breaking them down into sub-phrases, also known as constituents. These sub-phrases belong to a specific grammatical category, like NP (noun phrase) and VP (verb phrase).

Let’s understand it with the help of an example. Suppose I take the same sentence that I used in the previous examples, i.e., “It took me more than two hours to translate a few pages of English.”, and perform constituency parsing on it. Then the constituency parse tree for this sentence is as follows:

[Figure: constituency parse tree]

In the above tree, the words of the sentence are written in purple, and the POS tags are written in red. Everything else is written in black and represents the constituents. You can clearly see how the whole sentence is divided into sub-phrases until only the words remain at the terminals. Also, there are different tags for denoting constituents, like

  • VP for verb phrases
  • NP for noun phrases

These are the constituent tags. You can read about different constituent tags here.

Now you know what constituency parsing is, so it’s time to code in Python. spaCy does not provide an official API for constituency parsing. Therefore, we will be using the Berkeley Neural Parser. It is a Python implementation of the parsers based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.

You can also use StanfordParser with Stanza or NLTK for this purpose, but here I have used the Berkeley Neural Parser. To use it, we first need to install it. You can do that by running the following command.

!pip install benepar

Then you have to download the benepar_en2 model.

%tensorflow_version 1.x
import benepar
benepar.download('benepar_en2')

You might have noticed that I am using TensorFlow 1.x here because, currently, benepar does not support TensorFlow 2.0. Now, it’s time to do constituency parsing.

import spacy
from benepar.spacy_plugin import BeneparComponent

# Load spaCy's English model and add the benepar model to its pipeline
# ('en' is the spaCy 2.x shortcut name that this setup relies on)
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent('benepar_en2'))

text = 'It took me more than two hours to translate a few pages of English.'

# Generate the parse tree for the first sentence as a bracketed string
list(nlp(text).sents)[0]._.parse_string
[Figure: output of the constituency parsing code]

Here, _.parse_string returns the parse tree in the form of a string.
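A parse tree in string form is usually written in bracketed notation, where each constituent is wrapped as (LABEL children...). The sketch below reads such a string into a nested structure and collects the words under every NP or VP. The tree_string is a short example I wrote by hand, not the actual benepar output for our sentence, and parse_brackets, leaves, and phrases are hypothetical helper names.

```python
def parse_brackets(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                  # skip the opening "("
        label = tokens[pos]       # constituent label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # a word at a terminal
                pos += 1
        pos += 1                  # skip the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Collect the words under a node, from left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def phrases(node, wanted):
    """Return the text of every constituent whose label equals `wanted`."""
    found = []
    if node[0] == wanted:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        if not isinstance(child, str):
            found.extend(phrases(child, wanted))
    return found


# Hand-written bracketed tree for "It took me ." (illustrative only).
tree_string = "(S (NP (PRP It)) (VP (VBD took) (NP (PRP me))) (. .))"
tree = parse_brackets(tree_string)
print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

The same idea applies to a longer parse string: the labels directly above the words are the POS tags, and every label higher up is a constituent.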

What are the use cases of POS tagging?

Here are some use cases of POS tagging:

  • Syntactic Analysis: By identifying the grammatical role of each word (e.g., noun, verb), POS tagging helps analyze the sentence structure and the relationships between words. This is achieved using hidden Markov models and other algorithms that predict the most likely sequence of POS tags for the given text.
  • Disambiguation: Words like “play” can be a noun or a verb. POS tagging helps identify the correct reading based on context, using tagsets that define the possible tags for each word.
  • Language Modeling: POS tags provide valuable information about the relationships between words, which is useful for building statistical models of language. These models can be enhanced with deep learning techniques to improve their accuracy and their handling of complex linguistic patterns.
  • Preprocessing for Other NLP Tasks: POS tagging is often a preliminary step for tasks like named entity recognition and information extraction. By identifying the part of speech of each word, we can better understand the structure of the text and extract relevant information more accurately. Prepositions, for example, help determine the relationships between entities in a sentence.
  • Lemmatization and Stemming: These techniques reduce words to their base forms (e.g., “running” to “run”). POS tags can help identify the correct base form depending on the word’s function in the sentence, for example whether it is used as a noun or as a verb.
  • Grammar Checking: POS information can be used to flag potential grammatical errors, like using a verb in the wrong tense. This is particularly useful in grammar checking software, where the POS tagger output helps identify mistakes.
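The hidden Markov model idea mentioned in the list above can be sketched in a few lines. The example below is a toy: the tag set, the start, transition, and emission probabilities are all numbers I invented for illustration (they are not estimated from a corpus), and this is not how spaCy tags text. It uses the Viterbi algorithm to pick the most likely tag sequence, which is enough to show how context decides whether “play” is a noun or a verb.

```python
TAGS = ["DET", "PRON", "NOUN", "VERB"]

# Invented probabilities, for illustration only.
START_P = {"DET": 0.4, "PRON": 0.4, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "DET":  {"DET": 0.025, "PRON": 0.025, "NOUN": 0.9, "VERB": 0.05},
    "PRON": {"DET": 0.05, "PRON": 0.05, "NOUN": 0.1, "VERB": 0.8},
    "NOUN": {"DET": 0.1, "PRON": 0.1, "NOUN": 0.3, "VERB": 0.5},
    "VERB": {"DET": 0.4, "PRON": 0.2, "NOUN": 0.3, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "PRON": {"they": 1.0},
    "NOUN": {"play": 0.5},
    "VERB": {"play": 0.5},
}


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under the toy HMM."""
    # best[i][t]: probability of the best tag sequence for words[:i + 1] ending in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Best previous tag to transition from into t
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            best[i][t] = best[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Walk the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


print(viterbi(["the", "play"], TAGS, START_P, TRANS_P, EMIT_P))
print(viterbi(["they", "play"], TAGS, START_P, TRANS_P, EMIT_P))
```

With these numbers, “play” comes out as a NOUN after “the” and as a VERB after “they”, because the transition probabilities favor a noun after a determiner and a verb after a pronoun.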

Taken together, these use cases show how POS tagging plays a critical role in various aspects of natural language processing and syntactic analysis.

Read more in this article on NLP using the NLTK library

Why is POS tagging hard?

Here are some reasons why POS tagging is challenging:

Word ambiguity: Many words in a corpus have multiple meanings and parts of speech depending on the context. For instance, “bat” can be a noun (a flying mammal) or a verb (to hit something). A part-of-speech tagger needs to consider the surrounding words to assign the correct tag.

Unseen words and complex grammar: Part-of-speech taggers are trained on large amounts of training data, but they can struggle with words they haven’t encountered before (out-of-vocabulary words) or with languages that have complex grammatical structures.
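Both problems show up even in a very simple tagger. The sketch below is a most-frequent-tag baseline that ignores context entirely: it remembers the most common tag for each word in a tiny training set and falls back to NOUN for unseen words. The three training sentences are invented by me, and the function names and the NOUN fallback are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each lowercased word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_words(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Tiny invented training set (illustrative only).
corpus = [
    [("the", "DET"), ("bat", "NOUN"), ("flew", "VERB")],
    [("a", "DET"), ("bat", "NOUN"), ("slept", "VERB")],
    [("they", "PRON"), ("bat", "VERB"), ("well", "ADV")],
]

model = train_baseline(corpus)
print(tag_words(["They", "bat", "daily"], model))
```

In “They bat daily”, this baseline tags “bat” as a NOUN because that was its most frequent tag, and it tags the unseen word “daily” as a NOUN because of the fallback. Both are wrong here, which is why real taggers look at the surrounding words and need a better strategy for out-of-vocabulary words.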

Here are some additional factors that make POS tagging tricky:

  1. Idioms and slang: Informal language constructs often don’t follow standard grammar rules, making them difficult to tag accurately.
  2. Domain dependence: A part-of-speech tagger trained on a general dataset might not perform well on very specific domains, like legal documents or medical reports.
  3. Perception: The interpretation of a text can vary depending on individual perception, which can affect how parts of speech are tagged.
  4. Cardinal numbers: Numbers can be challenging as they can function as nouns, adjectives, or even other parts of speech depending on their use in a sentence.
  5. Transformation-based methods: These methods refine initial tagging decisions using a set of learned rules, which improves accuracy but makes the tagging process more complex.

End Notes

Now you know what POS tagging, dependency parsing, and constituency parsing are and how they help you understand text data: POS tags tell you about the part of speech of the words in a sentence, dependency parsing tells you about the dependencies between the words in a sentence, and constituency parsing tells you about the sub-phrases or constituents of a sentence. You are now ready to move on to more complex parts of NLP. As your next step, you can read the following articles on information extraction.

I hope you liked the article! To recap, part-of-speech (POS) tagging labels words with grammatical categories, which helps machines understand the structure of text. It aids in disambiguation, improves the accuracy of downstream algorithms, and serves as a foundational technique for various NLP applications.

Also, read more about Natural Language Processing Using Python

Frequently Asked Questions

Q1. What is POS tagging?

POS tagging assigns grammatical categories (tags) to words in a text. It helps machines understand language better and is used in tasks like translation, sentiment analysis, and information extraction.

Q2. Why is POS tagging important?

POS tagging is crucial for NLP as it helps computers understand the grammatical structure and meaning of text. It’s used in tasks like syntactic analysis, semantic analysis, information extraction, machine translation, and text generation.

Q3. How does POS tagging work?

POS tagging is a process in NLP that assigns a grammatical category (e.g., noun, verb) to each word in a sentence. It uses various features and algorithms to achieve this, and has many applications in NLP tasks.

Q4. Can POS tagging be language-independent?

POS tagging is not language-independent. While there are some universal grammatical concepts, the specifics vary significantly across languages due to morphological differences, syntactic structures, lexical ambiguity, and tag sets. However, researchers are working on approaches like universal tag sets and transfer learning to make POS tagging more language-independent.

He is a data science aficionado who loves diving into data and generating insights from it. He is always ready to make machines learn through code and to write technical blogs. His areas of interest include Machine Learning and Natural Language Processing, and he is still open to something new and exciting.
