Working with text data can be fun and interesting. There are plenty of opportunities in NLP, Text Analytics, Text Mining, and related fields. But before proceeding with any of this, one must know how to work with text data in Python.
( Image: https://www.pexels.com/photo/coffee-writing-computer-blogging-34600/)
Working with text data poses a lot of challenges.
Working with Text Data
Let’s say we have an array of numbers. We can easily find the sum and the average of all the numbers. Or, let’s say we want to create a regression model from this data. Things are pretty simple for numeric data, which can be processed very easily.
Now, coming to text data. How do we compare two book reviews, or let’s say two different comments on a Facebook post? How do we determine if a tweet carries a positive sentiment or a negative sentiment?
All these challenges can be addressed with NLP, Text Analytics, Text Mining, and other text-based solutions.
Text data constitutes a large part of all data online: Wikipedia pages, tweets, Amazon product reviews, and so on. The amount of text data will keep growing and will cover more and more types of text. This data can yield many important insights, so there is an increasing need for data professionals who can tap into this potential. Python can be used to process text data, conduct various analyses, and gather metrics. However, not all of this data is going to be clean or easily processable.
But before one proceeds with these things, one must know the basic text operations in Python. Knowing how to properly use the string functions in Python makes working with and manipulating text data easy and fast.
Let us proceed with the code.
The list() function can be used to get all the individual characters from a string. This function returns all the characters and whitespaces as a list.
w = "London is a big city."
l1 = list(w)
print(l1)
['L', 'o', 'n', 'd', 'o', 'n', ' ', 'i', 's', ' ', 'a', ' ', 'b', 'i', 'g', ' ', 'c', 'i', 't', 'y', '.']
We can see that all the characters have been added to a list. Now, all the individual characters can be accessed. Let us check if the length of the list is equal to the length of the text.
t = "London is a big city"
print("Text Length:", len(t))
print("List Length:", len(list(t)))
Text Length: 20
List Length: 20
Both have the same length, which confirms that list() produces exactly one element per character, spaces included.
Suppose we want to check if a particular string is present at the beginning of a larger text. In that case, we can use the startswith() method, which checks whether a string starts with the given string.
Let us see the implementation.
s = "London"
t = "Lo"
print(s.startswith(t))
True
Let us check another input.
s = "London"
print(s.startswith("Ne"))
False
The endswith() method does the opposite, as the name implies. It checks if a particular string is present at the end of another string.
s = "London"
print(s.endswith("on"))
True
Together, these two methods check the beginning and the end of a string, which is useful when we are searching for some prefix or suffix.
The isupper() method checks whether all the letters in a string are upper case. It is simple to use and returns a True or False value.
As the name implies, islower() is the opposite of the previous function. It checks whether all the letters in a string are lower case, and it also returns a True or False value.
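Neither check is shown with code here, so the following is only a minimal sketch. It assumes the two functions described are the built-in isupper() and islower() string methods, and the sample strings are made up for illustration.

```python
# Sample strings chosen purely for illustration
shout = "LONDON"
whisper = "london"
mixed = "London"

all_upper = shout.isupper()      # every letter is upper case
all_lower = whisper.islower()    # every letter is lower case
mixed_upper = mixed.isupper()    # contains lower-case letters
mixed_lower = mixed.islower()    # contains an upper-case letter

print(all_upper, all_lower, mixed_upper, mixed_lower)
```

A mixed-case string such as "London" fails both checks, so the last two values are False.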
t in s:
The keyword “in” can be used to check if a particular substring is present in a larger string. This can be used to find some string in a larger text, or check if the word we need is present in a larger paragraph.
The implementation is simple.
s = "Berlin is the capital of Germany"
print("Berlin" in s)
True
We get the appropriate output.
The istitle() method returns whether a particular text is in title format, for example “United States”. Basically, the first letter of every word must be capital for the text to be in title format.
Let us see the implementation code.
s = "New York"
print(s.istitle())
True
As the first letters of both words are capital, True is returned.
Let us try a different example.
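The different example itself is not shown here. As a stand-in, here is a small sketch with an assumed input in which the second word starts with a lowercase letter.

```python
# Hypothetical input: the second word is not capitalised
s = "New york"
is_title = s.istitle()
print(is_title)
```

Because "york" does not start with a capital letter, the check returns False.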
The isalpha() method checks if the characters in a string are all letters of the alphabet.
So, as all characters in the above string are letters, the function returns True.
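The example string is not shown here, so the following sketch uses an assumed input and assumes the function in question is the built-in isalpha() method.

```python
# Hypothetical input made up of letters only
s = "London"
only_letters = s.isalpha()
print(only_letters)
```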
Let us try with different input.
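The different input is also missing. A possible version, again assuming isalpha() and using made-up strings, adds digits or a space to the text.

```python
# Hypothetical inputs that contain non-letter characters
with_digits = "London2021"
with_space = "New York"

digits_result = with_digits.isalpha()   # digits are not letters
space_result = with_space.isalpha()     # a space is not a letter either

print(digits_result, space_result)
```

Both checks return False, since a single non-letter character is enough to fail the test.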
The output is as expected.
A method such as isdigit() checks whether all the characters in a string are digits.
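The function name and the example are not shown here. The sketch below assumes the built-in isdigit() method is meant (isnumeric() and isdecimal() give the same results for plain ASCII digits) and uses made-up inputs.

```python
# Hypothetical inputs; isdigit() is an assumption about which method is meant
year = "2021"
price = "20.21"

year_is_digits = year.isdigit()     # digits only
price_is_digits = price.isdigit()   # the dot is not a digit

print(year_is_digits, price_is_digits)
```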
As the input is numeric, the function returns True.
The isalnum() method checks whether a string consists only of letters and digits. If special characters are present, False will be returned.
s = "jan2021"
print(s.isalnum())
True
Let us try a text with a special character.
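The text with a special character is not shown here. A minimal sketch with an assumed input that contains a hyphen:

```python
# Hypothetical input: the hyphen is neither a letter nor a digit
s = "jan-2021"
is_alnum = s.isalnum()
print(is_alnum)
```

The hyphen makes the check fail, so the result is False.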
The lower() method converts all the characters of a string to lowercase. It is used when we want uniformity in our data.
Let’s see how it works.
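The original example is not shown here, so this is only a sketch with an assumed mixed-case input, using the built-in lower() method.

```python
# Hypothetical mixed-case input
s = "LoNdOn Is A BiG CiTy"
lowered = s.lower()
print(lowered)
```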
As we can see, all the characters have been converted to lowercase.
As the name suggests, the upper() method converts all lowercase characters of a string to uppercase.
The title() method converts the first letter of every word to uppercase and the remaining letters to lowercase.
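No example is shown for this conversion. A short sketch, assuming the built-in upper() method and a made-up input:

```python
# Hypothetical lower-case input
s = "london is a big city"
uppered = s.upper()
print(uppered)
```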
s = "kolKata is a bIg city"
print(s.title())
Kolkata Is A Big City
Earlier, we saw how to split the text into characters, but what if we want to get the words?
We can use the split() function to split the text into smaller texts based on a character. That is, this character will serve as the split point.
s = "Mumbai is the financial capital of India"
print(s.split(" "))
['Mumbai', 'is', 'the', 'financial', 'capital', 'of', 'India']
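As we can see, all the words have been returned in a list. Now, we can access all the words individually.
Now, think of a situation where we have to join all these words back into a string. The join() method does this: it is called on the separator string and takes the list of words.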
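One detail worth checking when choosing the split point: splitting on a single space keeps empty strings wherever two spaces occur in a row, while calling split() with no argument splits on any run of whitespace. The sketch below illustrates this with a made-up string containing a double space.

```python
# Hypothetical input with two spaces between the first two words
s = "Mumbai  is the financial capital"

by_space = s.split(" ")   # splits at every single space
by_default = s.split()    # splits on runs of whitespace

print(by_space)
print(by_default)
```

With real-world text, where spacing is often irregular, the no-argument form usually gives cleaner word lists.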
Let us see how to implement it.
s = "Mumbai is the financial capital of India"
s_split = s.split(" ")
res = " ".join(s_split)
print(res)
Mumbai is the financial capital of India
We get the joined string.
Let us try the same with some different data.
s = ["Ram", ",", "Shyam", ",", "Ravi", ",", "Hari"]
res = "".join(s)
print(res)
Ram,Shyam,Ravi,Hari
We get the output as desired.
If we have to remove whitespace around a text, we can use the strip() method.
s = " London"
print(repr(s.strip()))
'London'
The strip() method also removes whitespace from the end of the string.
s = " London "
print(repr(s.strip()))
'London'
We get the output as expected.
The rstrip() method removes whitespace, but only from the end (the right side) of the string.
s = " London "
print(repr(s.rstrip()))
' London'
The find() method can be used to find a particular string in a larger string.
It returns the index at which the search string first occurs, or -1 if the string is not found.
s = "London is the capital of UK"
print(s.find("is"))
7
Here, “is” begins at the 8th character, and since Python counts positions from 0, the output is 8 - 1 = 7.
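As a quick check of the zero-based position, the sketch below slices the string at the returned index and also searches for a word that is not present. The missing word "Paris" is just an illustrative choice.

```python
s = "London is the capital of UK"

pos = s.find("is")             # index of the first occurrence
found = s[pos:pos + 2]         # slice back out what was found
missing = s.find("Paris")      # "Paris" does not occur in s

print(pos, found, missing)
```

A result of -1 means the search string is absent, so it is worth checking for it before using the index in a slice.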
The replace() method is used to replace one string with another. Let us see how it works.
s = "London is the capital of UK"
s = s.replace("London", "Rome")
s = s.replace("UK", "Italy")
print(s)
Rome is the capital of Italy
As we can see, the appropriate edits have been made.
Suppose we are extracting text from a web source. The splitlines() method can be used to split the text at its line breaks, which here also separates the sentences.
t = "Berlin is the capital of Germany. \n London is the capital of UK. \n Paris is the capital of France."
print(t.splitlines())
['Berlin is the capital of Germany. ', ' London is the capital of UK. ', ' Paris is the capital of France.']
The \n stands for the newline character. So, the three individual sentences are found here.
There is a lot more to learn in Python.
To check the code, see this.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.