Tired of Reading Long Articles? Text Summarization will make your task easier!
This article was published as a part of the Data Science Blogathon.
Introduction
Millions of web pages and websites exist on the Internet today, and sifting through this vast amount of content to extract information on a certain topic is difficult. Google filters the search results and gives you the top ten, but often you still cannot find the content you need. Articles contain a lot of redundant and overlapping material, which wastes the reader's time. A better way to deal with this problem is to summarize the large amount of available text into smaller sizes.
Text Summarization
Text summarization is an NLP technique that condenses a large body of text into a shorter version while retaining the key information.
It is important because it:
- Reduces reading time
- Helps in better research work
- Increases the amount of information that can fit in an area
There are two approaches to text summarization: NLP-based techniques and deep learning techniques.
In this article, we will go through an NLP-based technique that makes use of the NLTK library.
Text Summarization steps
- Obtain Data
- Text Preprocessing
- Convert paragraphs to sentences
- Tokenizing the sentences
- Find weighted frequency of occurrence
- Replace words by weighted frequency in sentences
- Sort sentences in descending order of weights
- Summarizing the Article
Obtain Data for Summarization
If you wish to summarize a Wikipedia article, obtain the URL for the article that you wish to summarize. We will obtain the data from the URL using web scraping. To use web scraping, you will need to install the Beautiful Soup library in Python. This library will be used to extract the text enclosed within the various HTML tags on the web page.
Use the below command:
pip install beautifulsoup4
To parse the HTML tags we will also need a parser; we will use the lxml package:
pip install lxml
We will try to summarize the Reinforcement Learning page on Wikipedia. The data is obtained through web scraping, as described below.
In this script, we first import the required library for web scraping, i.e. BeautifulSoup. The urllib package is required for opening the URL. The re module provides the regular expressions used for text preprocessing. The urlopen function is used to fetch the page, and read() reads the raw response. Further on, we parse the data with the help of a BeautifulSoup object and the lxml parser.
In the Wikipedia articles, the text is present in the <p> tags. Hence we are using the find_all function to retrieve all the text which is wrapped within the <p> tags.
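The scraping step described above can be sketched as follows. This is a minimal version, assuming beautifulsoup4 and lxml are installed; the helper name extract_paragraph_text is ours, not from the original article.

```python
import urllib.request

from bs4 import BeautifulSoup


def extract_paragraph_text(html):
    """Concatenate the text of all <p> tags in an HTML document."""
    parsed = BeautifulSoup(html, 'lxml')
    return ' '.join(p.text for p in parsed.find_all('p'))


if __name__ == '__main__':
    url = 'https://en.wikipedia.org/wiki/Reinforcement_learning'
    raw_html = urllib.request.urlopen(url).read()
    article_text = extract_paragraph_text(raw_html)
    print(article_text[:300])  # preview of the scraped text
```

Separating the parsing into a small function makes it easy to test on a local HTML snippet without hitting the network.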
After scraping, we need to perform data preprocessing on the text extracted.
Text Preprocessing
The first task is to remove all the references made in the Wikipedia article. These references are all enclosed in square brackets. The below code will remove the square brackets and replace them with spaces.
# Removing square brackets and extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
The article_text object now contains the original text with the reference brackets removed. We are not removing any other words or punctuation marks, as we will use the sentences directly to create the summary.
Execute the below code to create weighted frequencies and also to clean the text:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
Here formatted_article_text contains the cleaned article. We will use this object to calculate the weighted frequencies, and we will then apply those weighted frequencies to score the sentences in the article_text object.
Convert text to sentences
The article text is broken down into sentences so that we have separate entities to score.
sentence_list = nltk.sent_tokenize(article_text)
We tokenize the article_text object because it retains the punctuation needed for sentence splitting; the formatted_article_text object has had its punctuation stripped.
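Note that sent_tokenize and the stopword list used below rely on NLTK data packages that are not bundled with the library itself. If you see a LookupError, run this one-time setup fragment first:

```python
import nltk

# One-time downloads of the models/corpora used in this article.
nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # English stopword list
```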
Finding weighted frequencies of occurrence
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
All English stopwords from the nltk library are stored in the stopwords variable. We iterate over every word in the formatted text and check whether it is a stopword. If it is not, we check for its presence in the word_frequencies dictionary: if it doesn't exist there, we insert it as a key with the value 1; if it already exists, we increase its count by 1.
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
To find the weighted frequency, divide the frequency of the word by the frequency of the most occurring word.
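As a quick illustration of this normalization, with made-up counts (the words and numbers here are hypothetical, not taken from the scraped article):

```python
# Toy counts: 'learning' appears 4 times, 'agent' 2 times, 'reward' once.
word_frequencies = {'learning': 4, 'agent': 2, 'reward': 1}

maximum_frequency = max(word_frequencies.values())  # 4

# Divide every count by the count of the most frequent word.
weighted = {word: count / maximum_frequency
            for word, count in word_frequencies.items()}
print(weighted)  # {'learning': 1.0, 'agent': 0.5, 'reward': 0.25}
```

The most frequent word always gets weight 1.0, and every other weight falls between 0 and 1.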
A glimpse of the word_frequencies dictionary:
{'Reinforcement': 0.06944444444444445, 'learning': 0.4583333333333333, 'RL': 0.013888888888888888, 'area': 0.013888888888888888, 'machine': 0.041666666666666664, 'concerned': 0.027777777777777776, 'software': 0.013888888888888888, 'agents': 0.013888888888888888, 'ought': 0.013888888888888888, 'take': 0.027777777777777776, 'actions': 0.1527777777777778, 'environment': 0.08333333333333333, 'order': 0.041666666666666664, 'maximize': 0.041666666666666664, 'notion': 0.027777777777777776, 'cumulative': 0.041666666666666664, … }
Calculate sentence scores
We have calculated the weighted frequencies. Now a score for each sentence can be calculated by adding up the weighted frequencies of the words it contains.
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
The sentence_scores dictionary stores each sentence as a key and its score as the value. We iterate over all the sentences and tokenize the words in each one. If a word exists in word_frequencies and the sentence is shorter than 30 words, we add the word's weighted frequency to the sentence's score, inserting the sentence with that frequency if it is not yet present. Sentences of 30 words or more are excluded to keep the summary from being dominated by long sentences.
A glimpse of sentence_scores dictionary:
{'Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.': 2.347222222222222, 'Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected.': 1.5555555555555551, 'Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).': 0.4305555555555556,
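The scoring logic can also be seen on a toy example. Here str.split stands in for nltk.word_tokenize, and the sentences and weights are invented for illustration:

```python
# Hypothetical weighted frequencies and sentences.
word_frequencies = {'learning': 1.0, 'agent': 0.5, 'reward': 0.25}
sentence_list = ['the agent learns', 'reward drives learning', 'hello world']

sentence_scores = {}
for sent in sentence_list:
    # str.split is a simplified stand-in for nltk.word_tokenize here.
    for word in sent.lower().split():
        if word in word_frequencies:
            if len(sent.split(' ')) < 30:
                sentence_scores[sent] = (
                    sentence_scores.get(sent, 0) + word_frequencies[word])

print(sentence_scores)
# {'the agent learns': 0.5, 'reward drives learning': 1.25}
```

Note that 'hello world' never enters the dictionary, since none of its words carry a weight; such sentences can never appear in the summary.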
Summary of the article
The sentence_scores dictionary consists of the sentences along with their scores. Now, top N sentences can be used to form the summary of the article.
Here the heapq library has been used to pick the top 7 sentences to summarize the article.
import heapq

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
Endnotes
In this article, we built a simple extractive summarizer: we scraped a Wikipedia article, cleaned the text, computed weighted word frequencies, scored each sentence by summing those frequencies, and joined the top-scoring sentences into a summary. This frequency-based approach is a good baseline before moving on to deep learning based summarization techniques.