This article was published as a part of the Data Science Blogathon.
Millions of web pages and websites exist on the Internet today. Going through such a vast amount of content to extract information on a particular topic is difficult. Google filters the search results and shows you the top ten, but often you still cannot find the content you need. Articles contain a lot of redundant and overlapping data, which wastes a lot of time. A better way to deal with this problem is to summarize the large volume of text into a shorter form.
Text summarization is an NLP technique that extracts the most important text from a large amount of data, producing a shorter version of the original.
It is important because :
There are two approaches to text summarization: NLP-based techniques and deep learning techniques.
In this article, we will go through an NLP-based technique that makes use of the NLTK library.
If you wish to summarize a Wikipedia article, obtain the URL of that article. We will fetch the data from the URL using web scraping. To do so, you need to install the Beautiful Soup library for Python, which is used to extract the data held within the various HTML tags of the web page.
Use the below command:
pip install beautifulsoup4
To parse the HTML tags we will further require a parser, that is the lxml package:
pip install lxml
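The tokenizers and the stopword list used later in this article come from NLTK, which the commands above do not install. A minimal setup sketch follows, assuming NLTK has been installed with pip install nltk. The resource names in REQUIRED_RESOURCES are an assumption about a typical setup (recent NLTK releases look for punkt_tab rather than punkt), and download_nltk_resources is a hypothetical helper name.

```python
import nltk

# Assumed resource names: "punkt" / "punkt_tab" back sent_tokenize and
# word_tokenize (which one is needed depends on the NLTK version), and
# "stopwords" backs nltk.corpus.stopwords.
REQUIRED_RESOURCES = ("punkt", "punkt_tab", "stopwords")


def download_nltk_resources(resources=REQUIRED_RESOURCES):
    """Fetch each resource over the network; returns {name: success flag}."""
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_nltk_resources() once, with network access, before running the rest of the code should be enough.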
We will try to summarize the Reinforcement Learning page on Wikipedia, obtaining the data through web scraping with Python.
The script begins by importing the libraries required for web scraping, i.e. BeautifulSoup. The urllib package is used to open the URL, and re is the regular-expression library used for text preprocessing. The urlopen function fetches the page, and read() reads the data returned from the URL. We then parse the data with a BeautifulSoup object and the lxml parser.
In Wikipedia articles, the text is held in <p> tags. Hence we use the find_all function to retrieve all the text wrapped within <p> tags.
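The script described above is not reproduced in this copy of the article. A minimal sketch consistent with that description might look like the following. The URL, the User-Agent string, and the function names fetch_html and extract_paragraph_text are assumptions, not the original code.

```python
import urllib.request

from bs4 import BeautifulSoup

# Assumed URL of the article being summarized.
ARTICLE_URL = "https://en.wikipedia.org/wiki/Reinforcement_learning"


def fetch_html(url):
    """Download the raw HTML of a page (performs a network request)."""
    # Some sites reject the default urllib User-Agent, so a placeholder
    # value is sent instead; the string itself is arbitrary.
    request = urllib.request.Request(
        url, headers={"User-Agent": "summarizer-tutorial/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_paragraph_text(html, parser="lxml"):
    """Join the text of every <p> tag in the document with single spaces."""
    soup = BeautifulSoup(html, parser)
    return " ".join(p.get_text() for p in soup.find_all("p"))
```

With network access, article_text = extract_paragraph_text(fetch_html(ARTICLE_URL)) would produce the raw text that the preprocessing steps operate on.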
After scraping, we need to perform data preprocessing on the text extracted.
The first task is to remove all the references made in the Wikipedia article. These references are numbers enclosed in square brackets. The code below replaces each bracketed reference with a space and then collapses runs of whitespace into a single space.
# Removing square-bracketed references and extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
The article_text object now contains the original text without the bracketed references. We are not removing any other words or punctuation marks, as we will use them directly to create the summaries.
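To make the effect of these two substitutions concrete, here is a small sketch on a made-up sample string. Note that the backslashes matter: without them, a pattern such as r's+' would match the letter "s" rather than whitespace.

```python
import re

# Made-up sample text containing citation markers, a newline and extra spaces.
sample = (
    "Reinforcement learning (RL)[1] is an area of\n"
    " machine learning.[23]  It studies agents."
)

# Replace numeric citation markers such as [1] or [23] with a space.
without_refs = re.sub(r"\[[0-9]*\]", " ", sample)
# Collapse runs of whitespace (spaces, tabs, newlines) into one space.
cleaned = re.sub(r"\s+", " ", without_refs)

print(cleaned)
```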
Execute the code below to clean the text before calculating the weighted frequencies:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
Here formatted_article_text contains the cleaned article. We will use this object to calculate the weighted frequencies, and then apply those frequencies to the words in the sentences of the article_text object.
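As an illustration of what this cleaning step keeps and discards, the sketch below applies the same two substitutions to a made-up sentence. Digits and punctuation disappear, and an apostrophe leaves a stray "s" behind as its own token.

```python
import re

# Made-up sample sentence with digits, punctuation and an apostrophe.
article_text = "The agent's reward rose by 12.5% after 3 episodes (see Fig. 2)."

# Replace every character that is not an ASCII letter with a space.
formatted = re.sub(r"[^a-zA-Z]", " ", article_text)
# Collapse the resulting runs of spaces.
formatted = re.sub(r"\s+", " ", formatted)

print(formatted.split())
```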
The article is broken down into sentences so that each sentence can be scored separately.
import nltk  # needed for the tokenizers and the stopword list used below
sentence_list = nltk.sent_tokenize(article_text)
We tokenize the article_text object because it still contains the punctuation that marks sentence boundaries, whereas the formatted_article_text object has had its punctuation removed.
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
All English stopwords from the nltk library are stored in the stopwords variable. We iterate over all the words in formatted_article_text and check whether each word is a stopword. If the word is not a stopword, we check for its presence in the word_frequencies dictionary. If it doesn't exist, we insert it as a key and set its value to 1. If it already exists, we increase its count by 1.
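The same counting logic can be written more compactly with collections.Counter. This is only a sketch: STOPWORDS is a tiny hypothetical stand-in for the NLTK stopword list, str.split replaces nltk.word_tokenize, and count_words is an illustrative name.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"is", "an", "of", "the", "and", "in"}


def count_words(text, stopwords=STOPWORDS):
    """Count every whitespace-separated token that is not a stopword."""
    return Counter(word for word in text.split() if word not in stopwords)


frequencies = count_words(
    "learning is an area of machine learning and the agents in learning"
)
print(dict(frequencies))
```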
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
To find the weighted frequency, divide the frequency of each word by the frequency of the most frequently occurring word.
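A small worked example of this normalization, using made-up counts and an illustrative function name (normalize): the most frequent word ends up with a weight of 1.0 and every other word with a fraction of that.

```python
def normalize(frequencies):
    """Divide each count by the largest count, so the top word scores 1.0."""
    maximum_frequency = max(frequencies.values())
    return {
        word: count / maximum_frequency
        for word, count in frequencies.items()
    }


# Made-up counts for illustration only.
weights = normalize({"learning": 4, "agents": 2, "area": 1})
print(weights)
```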
A glimpse of the word_frequencies dictionary:
{'Reinforcement': 0.06944444444444445,
'learning': 0.4583333333333333,
'RL': 0.013888888888888888,
'area': 0.013888888888888888,
'machine': 0.041666666666666664,
'concerned': 0.027777777777777776,
'software': 0.013888888888888888,
'agents': 0.013888888888888888,
'ought': 0.013888888888888888,
'take': 0.027777777777777776,
'actions': 0.1527777777777778,
'environment': 0.08333333333333333,
'order': 0.041666666666666664,
'maximize': 0.041666666666666664,
'notion': 0.027777777777777776,
'cumulative': 0.041666666666666664,
…………
We have calculated the weighted frequencies. Now scores for each sentence can be calculated by adding weighted frequencies for each word.
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
The sentence_scores dictionary stores the sentences as keys and their scores as values. We iterate over all the sentences and tokenize each one into words. If a word exists in word_frequencies, its weighted frequency is added to the sentence's score: if the sentence is not yet in sentence_scores, it is inserted with that weighted frequency as its value; otherwise the weighted frequency is added to the existing value. Longer sentences are not considered, so only sentences with fewer than 30 words are scored.
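The scoring rule can be sketched without NLTK as follows. This is an illustration under stated assumptions: re.findall stands in for nltk.word_tokenize, the weights are made up, score_sentences is a hypothetical name, and the weights are assumed to be keyed by lowercase words, since each sentence is lowercased before the lookup.

```python
import re


def score_sentences(sentences, word_weights, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split(" ")) >= max_words:
            continue  # long sentences are left out of the scores entirely
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in word_weights:
                scores[sentence] = scores.get(sentence, 0.0) + word_weights[word]
    return scores


# Made-up weights and sentences for illustration only.
example_weights = {"learning": 1.0, "agents": 0.5}
example_sentences = [
    "Learning agents keep learning.",
    "Nothing relevant here.",
    " ".join(["learning"] * 30),  # 30 words, so it is skipped
]
example_scores = score_sentences(example_sentences, example_weights)
print(example_scores)
```

A sentence with no weighted words never enters the dictionary, which is why it can never be selected for the summary.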
A glimpse of sentence_scores dictionary:
{'Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.': 2.347222222222222,
'Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected.': 1.5555555555555551,
'Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).': 0.4305555555555556,
The sentence_scores dictionary consists of the sentences along with their scores. Now, top N sentences can be used to form the summary of the article.
Here the heapq library has been used to pick the top 7 sentences to summarize the article.
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. In reinforcement learning methods, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces. Policy iteration consists of two steps: policy evaluation and policy improvement. The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning. Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Many policy search methods may get stuck in local optima (as they are based on local search).
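As the output above suggests, heapq.nlargest returns the sentences ordered by score rather than by their position in the article. The toy sketch below, with made-up sentences and scores, shows the selection and one possible way to restore the original order afterwards; the re-sorting step is an optional addition, not part of the code above.

```python
import heapq

# Made-up sentences (in document order) and scores for illustration only.
sentence_list = ["First point.", "Second point.", "Third point.", "Fourth point."]
sentence_scores = {
    "First point.": 0.4,
    "Second point.": 2.3,
    "Third point.": 1.5,
    "Fourth point.": 3.1,
}

# Keys of the dictionary with the three highest scores, best first.
top = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)

# Optional: put the selected sentences back into document order.
in_order = sorted(top, key=sentence_list.index)

print(" ".join(top))
print(" ".join(in_order))
```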
Text summarization of articles can be performed using the NLTK and BeautifulSoup libraries, which can save a lot of reading time. Deep learning techniques can be used to produce better summaries. Looking forward to people using this mechanism for summarization.
Summarization methods are either extractive or abstractive, each with its own approach to condensing text.
Text summarization techniques are versatile and can be applied to articles, documents, and online content.
Text summarization employs algorithms to analyze and condense text, identifying the essential information and producing concise summaries.