Must-Know Text Operations in Python before you dive into NLP!

Prateek Majumder Last Updated : 30 Oct, 2024

This article was published as a part of the Data Science Blogathon

Working with text data can be fun and interesting. There are plenty of opportunities in NLP, Text Analytics, Text Mining, and so on. But before proceeding with any of this, one must know how to work with text data in Python.

( Image: https://www.pexels.com/photo/coffee-writing-computer-blogging-34600/)

Working with text data brings a lot of challenges.

Working with Text Data

Let’s say we have an array of numbers. We can easily find the sum and the average of all the numbers. Or, let’s say, we want to create a regression model from this data. Things are pretty simple for numeric data, which can be processed very easily.

Now, coming to text data. How do we compare two book reviews, or let’s say two different comments on a Facebook post? How do we determine if a tweet carries a positive sentiment or a negative sentiment?

All these challenges can be solved with NLP, Text Analytics, Text Mining, and other text-based solutions.

Text data constitutes a large part of all data online. It can be Wikipedia pages, tweets, Amazon product reviews, and so on. With time, the amount of text data is going to increase and cover more types of text. This data can yield many important insights, so there is an increasing need for data professionals who can tap into this potential. Python can be used to process text data, run various analyses, and gather metrics. But not all of the data is going to be clean or easily processable.

But before one proceeds with these things, one must know the basic text operations in Python. Knowing how to use Python's string functions properly makes working with and manipulating text data easy and fast.

Let us proceed with the code.

list()

The list() function can be used to get all the individual characters from a string. It returns every character, including spaces and punctuation, as a list.

w="London is a big city."
l1=list(w)
print(l1)

All the characters have been added to a list, so each one can now be accessed individually. Let us check if the length of the list is equal to the length of the text.

t="London is a big city"
print("Text Length:", len(t))
print("List Length:", len(list(t)))

Both lengths are 20, which shows that list() keeps every character of the text.
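Once the characters are in a list, they can be filtered and counted like any other list. The following is a minimal sketch and not part of the original code; the variable names and the use of collections.Counter are choices made here for illustration.

```python
from collections import Counter

text = "London is a big city"

# list() keeps every character, including the spaces
chars = list(text)

# Drop the spaces to keep only the letters
letters = [ch for ch in chars if ch != " "]

# Count how often each letter occurs
counts = Counter(letters)

print(len(chars))    # 20
print(len(letters))  # 16
print(counts["i"])   # 3
```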

s.startswith(t)

Suppose we want to check if a particular string is present at the beginning of a larger text. In that case, we can use the startswith() function to check if a string starts with the given string.

Let us see the implementation.

s= "London"
t="Lo"
print(s.startswith(t))

Output:

True

Let us check another input.

s= "London"
print(s.startswith("Ne"))

Output:

False

s.endswith(t)

This function does the opposite, as the name implies. It checks if a particular string is present at the end of another string.

s= "London"
print(s.endswith("on"))

Output:

True

So, the two functions can be used to check the start and the end of a string. They can be useful if we are searching for some prefix or suffix.
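Both functions also accept a tuple of strings, which helps when several prefixes or suffixes are acceptable. The sketch below assumes a made-up list of file names and city names; it only illustrates the tuple form.

```python
# Hypothetical file names, used only for illustration
files = ["report.txt", "photo.png", "notes.TXT", "data.csv"]

# endswith() accepts a tuple: True if the string ends with any of the suffixes.
# lower() makes the check ignore the case of the extension.
text_like = [name for name in files if name.lower().endswith((".txt", ".csv"))]
print(text_like)  # ['report.txt', 'notes.TXT', 'data.csv']

# startswith() works the same way with a tuple of prefixes
cities = ["London", "Lisbon", "Berlin", "Rome"]
l_or_r = [city for city in cities if city.startswith(("L", "R"))]
print(l_or_r)  # ['London', 'Lisbon', 'Rome']
```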

s.isupper()

This function checks if all the letters in a string are in upper case. Implementation is simple, and it returns True or False.

w="BERLIN"
print(w.isupper())

Output:

True

s.islower()

Just like the name implies, it is the opposite of the previous function. It checks if all the letters in a string are lower case, and returns True or False.

w="BERLIN"
print(w.islower())

Output:

False
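A few edge cases are worth knowing. Digits and spaces are ignored by both checks, and a string with no letters at all returns False for both. The sample strings below are made up for illustration.

```python
samples = ["BERLIN", "berlin", "Berlin", "BERLIN 2021", "2021"]

# Store (isupper, islower) for every sample string
results = {text: (text.isupper(), text.islower()) for text in samples}

for text, (upper, lower) in results.items():
    print(repr(text), upper, lower)

# 'BERLIN' True False
# 'berlin' False True
# 'Berlin' False False      (mixed case fails both checks)
# 'BERLIN 2021' True False  (digits and spaces are ignored)
# '2021' False False        (no letters at all)
```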

t in s:

The keyword “in” can be used to check if a particular substring is present in a larger string. This can be used to find some string in a larger text, or check if the word we need is present in a larger paragraph.

Implementation is very easy and simple.

s="Berlin is the capital of Germany"
print("Berlin" in s)

Output:

True

We get the appropriate output.
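Two details of “in” are easy to miss: the check is case-sensitive, and it matches any substring, not just whole words. A minimal sketch, with variable names chosen here for illustration:

```python
s = "Berlin is the capital of Germany"

# The check is case-sensitive
exact_match = "berlin" in s
print(exact_match)  # False

# Lowercasing the text first gives a case-insensitive check
lowered_match = "berlin" in s.lower()
print(lowered_match)  # True

# "in" matches substrings, so part of a word also counts
partial_match = "capita" in s
print(partial_match)  # True

# To match whole words only, check against the list of words
whole_word_match = "capita" in s.split()
print(whole_word_match)  # False
```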

s.istitle()

This function checks if a particular text is in title format, for example, “United States”. Basically, the first letter of every word must be a capital and the remaining letters must be lower case for the text to be in title format.

Let us see the implementation code.

s="New York"
print(s.istitle())

Output:

True

As the first letters of both words are capital, True is returned.

Let us try a different example.

print("roMe".istitle())

Output:

False

s.isalpha()

This function checks if the characters in a string are all letters of the alphabet.

s="dsnlmls"
print(s.isalpha())

Output:

True

So, we can see that as all characters in the above string are letters, the function returns True.

Let us try with different input.

s="56700#"
print(s.isalpha())

Output:

False

The output is as expected.

s.isdigit()

This function checks if all the characters in a string are digits.

s="2021"
print(s.isdigit())

Output:

True

As the input is numeric, the function returns True.
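Note that isdigit() only looks at individual characters, so a minus sign, a decimal point, or a comma makes it return False. The sketch below compares it with a conversion-based check; is_number is a hypothetical helper written for this example, not a built-in.

```python
def is_number(text):
    """Hypothetical helper: True if float() can parse the text."""
    try:
        float(text)
        return True
    except ValueError:
        return False

samples = ["2021", "-5", "3.14", "1,000", ""]

digit_checks = {text: text.isdigit() for text in samples}
number_checks = {text: is_number(text) for text in samples}

print(digit_checks)
# {'2021': True, '-5': False, '3.14': False, '1,000': False, '': False}
print(number_checks)
# {'2021': True, '-5': True, '3.14': True, '1,000': False, '': False}
```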

s.isalnum()

This function checks if a string consists only of letters and digits. If special characters are present, False will be returned.

s= "jan2021"
print(s.isalnum())

Output:

True

Let us try a text with a special character.

s="@1234"
print(s.isalnum())

Output:

False
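The three checks isalpha(), isdigit() and isalnum() can be combined to sort tokens into rough categories. This is only a sketch; classify_token and its category names are hypothetical and chosen for illustration.

```python
def classify_token(token):
    """Hypothetical helper: label a token by the kind of characters it contains."""
    if token.isalpha():
        return "word"    # letters only
    if token.isdigit():
        return "number"  # digits only
    if token.isalnum():
        return "mixed"   # letters and digits together
    return "other"       # special characters, or an empty string

tokens = ["jan", "2021", "jan2021", "@1234", ""]
labels = [classify_token(token) for token in tokens]
print(labels)  # ['word', 'number', 'mixed', 'other', 'other']
```

The order of the checks matters here: isalnum() is also True for pure words and pure numbers, so it has to come last.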

s.lower()

This function converts all the characters of the string to lowercase. This function is used when we want uniformity in our data.

Let’s see how it works.

s="KOLKATA"
print(s.lower())

Output:

kolkata

As we can see, all the characters have been converted to lowercase.
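One common use of lower() is to make word counts ignore case. A minimal sketch with a made-up sentence, using collections.Counter from the standard library:

```python
from collections import Counter

text = "The cat saw the other Cat"

# Without lowercasing, "The" and "the" are counted as different words
raw_counts = Counter(text.split())
print(raw_counts["the"])  # 1

# After lowercasing, both spellings are counted together
lower_counts = Counter(text.lower().split())
print(lower_counts["the"])  # 2
print(lower_counts["cat"])  # 2
```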

s.upper()

As the name suggests, this function converts all lowercase characters of a string to uppercase.

s='Kolkata'
print(s.upper())

Output:

KOLKATA

s.title()

This function converts the first letter of each word to uppercase and the remaining letters to lowercase.

s="kolKata is a bIg city"
print(s.title())

Output:

Kolkata Is A Big City
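One thing to watch for: title() treats an apostrophe as a word boundary, so contractions come out oddly. The standard library function string.capwords() splits on whitespace only and can be used instead. The example sentence is made up for illustration.

```python
import string

s = "it's a big city"

# title() capitalizes the letter after the apostrophe as well
titled = s.title()
print(titled)  # It'S A Big City

# capwords() only capitalizes the first letter of each space-separated word
capped = string.capwords(s)
print(capped)  # It's A Big City
```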

s.split()

Earlier, we saw how to split the text into characters, but what if we want to get the words?

We can use the split() function to split the text into smaller texts based on a character. That is, this character will serve as the split point.

s="Mumbai is the financial capital of India"
print(s.split(" "))

Output:

['Mumbai', 'is', 'the', 'financial', 'capital', 'of', 'India']

As we can see all the words have been returned in a list. Now, we can access all the words individually.
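Real text often contains repeated spaces. In that case split(" ") produces empty strings, while split() with no argument splits on any run of whitespace. The sketch below uses a made-up string with extra spaces and also shows the optional maxsplit argument.

```python
s = "Mumbai  is the   financial capital"

# Splitting on a single space leaves empty strings between repeated spaces
by_space = s.split(" ")
print(by_space)
# ['Mumbai', '', 'is', 'the', '', '', 'financial', 'capital']

# With no argument, any run of whitespace counts as one separator
by_whitespace = s.split()
print(by_whitespace)
# ['Mumbai', 'is', 'the', 'financial', 'capital']

# maxsplit limits the number of splits, here to the first "=" only
pair = "key=value=more".split("=", 1)
print(pair)  # ['key', 'value=more']
```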

s.join()

Now, think of a situation where we have to join all these to form a string.

Let us see how to implement it.

s="Mumbai is the financial capital of India"
s_split= s.split(" ")
res= " ".join(s_split)
print(res)

Output:

Mumbai is the financial capital of India

We get the joined string.

Let us try the same with some different data.

s=["Ram",",", "Shyam",",","Ravi",",","Hari"]
res= "".join(s)
print(res)

Output:

Ram,Shyam,Ravi,Hari

We get the output as desired.
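The separator does not have to be stored in the list itself; it is the string that join() is called on. join() also expects every item to be a string, so other types need to be converted first. A minimal sketch with illustrative variable names:

```python
names = ["Ram", "Shyam", "Ravi", "Hari"]

# The string before .join() is placed between the items
joined_names = ", ".join(names)
print(joined_names)  # Ram, Shyam, Ravi, Hari

# join() raises a TypeError for non-string items, so convert them first
years = [2019, 2020, 2021]
joined_years = "-".join(str(year) for year in years)
print(joined_years)  # 2019-2020-2021
```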

s.strip()

If we have to remove whitespaces around a text, we can use this function.

s= "    London"
s.strip()

Output:

'London'

This function also removes whitespaces from the end of the strings.

s= "    London  "
s.strip()

Output:

'London'

We get the output as expected.

s.rstrip()

This function removes whitespaces, but only from the end of the string.

s= " London   "
s.rstrip()

Output:

' London'
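There is also lstrip(), which removes whitespace only from the beginning. All three functions accept an optional argument listing the characters to remove instead of whitespace. A short sketch with made-up strings:

```python
s = "   London   "

left = s.lstrip()   # removes leading whitespace only
right = s.rstrip()  # removes trailing whitespace only
both = s.strip()    # removes whitespace from both ends

print(repr(left))   # 'London   '
print(repr(right))  # '   London'
print(repr(both))   # 'London'

# The argument is a set of characters to strip, not a word or prefix
cleaned = "...London!!".strip(".!")
print(cleaned)  # London
```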

s.find()

This function can be used to find a particular string in a larger string.

The function returns the index of the first occurrence of the search string.

s="London is the capital of UK"
s.find("is")

Output:

7

Here, the string “is” starts at the 8th character. As Python counts positions from 0, the output is 8-1=7.
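When the search string is not present, find() returns -1 instead of raising an error. It also takes an optional start position, which can be used to look for later occurrences. The sketch below also shows index(), which behaves like find() but raises a ValueError when nothing is found.

```python
s = "London is the capital of UK"

missing = s.find("Paris")
print(missing)  # -1, because "Paris" is not in the string

# First occurrence of "a", then search again starting just after it
first_a = s.find("a")
second_a = s.find("a", first_a + 1)
print(first_a, second_a)  # 15 19

# index() raises an error instead of returning -1
try:
    s.index("Paris")
    index_failed = False
except ValueError:
    index_failed = True
print(index_failed)  # True
```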

s.replace()

This function is used to replace one string with another. Let us see how it works.

s="London is the capital of UK"
s=s.replace("London", "Rome")
s=s.replace("UK", "Italy")
print(s)

Output:

Rome is the capital of Italy

As we can see, the appropriate edits have been made.
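A few details about replace() that the example above relies on: it returns a new string and leaves the original unchanged, it is case-sensitive, and it accepts an optional count that limits how many occurrences are replaced. A minimal sketch with illustrative variable names:

```python
s = "London is the capital of UK"

# replace() returns a new string; s itself is not modified
updated = s.replace("London", "Rome")
print(updated)  # Rome is the capital of UK
print(s)        # London is the capital of UK

# The match is case-sensitive, so nothing is replaced here
unchanged = s.replace("london", "Rome")
print(unchanged == s)  # True

# The third argument limits the number of replacements
first_only = "a-b-c-d".replace("-", "+", 1)
print(first_only)  # a+b-c-d
```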

s.splitlines()

Suppose we are extracting text from a web source. This function can be used to split the text into separate lines at each newline character.

t="Berlin is the capital of Germany. \n London is the capital of UK. \n Paris is the capital of France."
t.splitlines()

Output:

['Berlin is the capital of Germany. ',
 ' London is the capital of UK. ',
 ' Paris is the capital of France.']

The \n stands for the newline character. The text is split at each \n, so the three individual sentences are found here.
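Compared with split("\n"), splitlines() also handles Windows-style \r\n line endings and does not return an empty string for a trailing newline. The sketch below uses a made-up text with mixed line endings to show the difference.

```python
# Made-up text with mixed line endings and a trailing newline
t = "Berlin is the capital of Germany.\nLondon is the capital of UK.\r\nParis is the capital of France.\n"

lines = t.splitlines()
print(lines)
# ['Berlin is the capital of Germany.', 'London is the capital of UK.',
#  'Paris is the capital of France.']

# split("\n") leaves the "\r" in place and adds an empty last item
naive = t.split("\n")
print(len(naive))      # 4
print(repr(naive[1]))  # 'London is the capital of UK.\r'

# keepends=True keeps the line endings attached to each line
kept = t.splitlines(keepends=True)
print("".join(kept) == t)  # True
```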

There is a lot more to learn in Python.

To check the code, see this.

About me:

Prateek Majumder

Data Science and Analytics | SEO | Content Creation

Connect with me on Linkedin.

My other articles on Analytics Vidhya: Link.

Thank You.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Prateek Majumder

Prateek is a dynamic professional with a strong foundation in Artificial Intelligence and Data Science, currently pursuing his PGP at Jio Institute. He holds a Bachelor's degree in Electrical Engineering and has hands-on experience as a System Engineer at TCS Digital, where he excelled in API management and data integration. Prateek also has a background in product marketing and analytics from his time with start-ups like AppleX and Milkie Way, Inc., where he was involved in growth campaigns and technical blog management. Recognized for his structured thinking and problem-solving abilities, he has received accolades like the Dr. Sudarshan Chakraborty Award for Best Student Performance. Fluent in multiple languages and passionate about technology, Prateek continues to expand his expertise in the rapidly evolving AI and tech landscape.
