Text Preprocessing in Python

Last Updated : 26 Apr, 2025

Text preprocessing is a key part of Natural Language Processing (NLP). It cleans raw text and converts it into a format suitable for analysis and machine learning. In this article, we will learn how to perform text preprocessing using various Python libraries and techniques, focusing on the NLTK (Natural Language Toolkit) library.

1. Importing Libraries

We will import nltk, re (Python's regular expression module), string and inflect.

Python

import re
import string

import inflect
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

2. Convert to Lowercase

We lowercase the text to reduce the size of the vocabulary of our text data.

Python

def text_lowercase(text):
    return text.lower()

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
text_lowercase(input_str)

Output:

"hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!"
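To make the vocabulary claim concrete, here is a minimal sketch that counts distinct whitespace-separated tokens before and after lowercasing. The sample sentence and the helper name `vocabulary` are illustrative assumptions, not part of the article's code.

```python
def vocabulary(text):
    """Return the set of distinct whitespace-separated tokens."""
    return set(text.split())

sample = "The cat saw the Cat and THE dog"

before = vocabulary(sample)          # "The", "the" and "THE" count separately
after = vocabulary(sample.lower())   # all three collapse into "the"

print(len(before), len(after))
```

In this sample the vocabulary shrinks from 8 entries to 5, because the case variants of "the" and "cat" are merged.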

3. Removing Numbers

We can either remove numbers or convert the numbers into their textual representations. To remove the numbers we can use regular expressions.

Python

def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = "There are 3 balls in this bag, and 12 in the other one."
remove_numbers(input_str)

Output:

'There are  balls in this bag, and  in the other one.'
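Deleting the digits leaves the spaces on both sides of each number in place, so a gap of two spaces remains where a number used to be. A small variant, sketched here under the assumed name `remove_numbers_clean`, collapses those gaps in the same step.

```python
import re

def remove_numbers_clean(text):
    """Delete digit runs, then collapse the gaps they leave behind."""
    without_digits = re.sub(r'\d+', '', text)
    return " ".join(without_digits.split())

cleaned = remove_numbers_clean("There are 3 balls in this bag, and 12 in the other one.")
print(cleaned)
```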

4. Converting Numerical Values

We can also convert the numbers into words. This can be done by using the inflect library.

Python

p = inflect.engine()

def convert_number(text):
    temp_str = text.split()
    new_string = []

    for word in temp_str:
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)

        else:
            new_string.append(word)

    temp_str = ' '.join(new_string)
    return temp_str

input_str = 'There are 3 balls in this bag, and 12 in the other one.'
convert_number(input_str)

Output:

'There are three balls in this bag, and twelve in the other one.'
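The split-based function only converts tokens made up entirely of digits, so a number that touches punctuation (for example "12." at the end of a sentence) would pass through unchanged. One possible workaround is to let `re.sub` find every digit run and hand it to a converter. In the sketch below, `SMALL_NUMBERS` and `lookup_words` are hypothetical stand-ins so the block runs without third-party packages; in the setup above, `p.number_to_words` could presumably be passed as `to_words` instead.

```python
import re

# Hypothetical stand-in for inflect's converter, covering only a few values.
SMALL_NUMBERS = {"3": "three", "5": "five", "12": "twelve"}

def lookup_words(digits):
    """Return the word form if known, otherwise leave the digits unchanged."""
    return SMALL_NUMBERS.get(digits, digits)

def convert_numbers_anywhere(text, to_words=lookup_words):
    """Replace every digit run, even one touching punctuation, via to_words."""
    return re.sub(r'\d+', lambda match: to_words(match.group()), text)

converted = convert_numbers_anywhere("I have 3 pens, she has 12.")
print(converted)
```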

5. Removing Punctuation

We remove punctuation so that the same word does not appear in several forms. For example, if we keep the punctuation, then been., been, and been! will be treated as three different tokens.

Python

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)

Output:

'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
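Deleting punctuation outright also glues hyphenated words together ("well-known" becomes "wellknown"). An alternative, sketched here under the assumed name `punctuation_to_space`, maps each punctuation character to a space and then tidies the spacing.

```python
import string

def punctuation_to_space(text):
    """Map every ASCII punctuation character to a space, then tidy spacing."""
    translator = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(translator).split())

spaced = punctuation_to_space("A well-known, state-of-the-art method!")
print(spaced)
```

The trade-off is that contractions are split as well: "It's" becomes "It s". Which behaviour is preferable depends on the downstream task.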

6. Removing Whitespace

We can combine the split and join methods to strip leading and trailing whitespace and to collapse repeated whitespace into single spaces.

Python

def remove_whitespace(text):
    return " ".join(text.split())

input_str = "we don't need   the given questions"
remove_whitespace(input_str)

Output:

"we don't need the given questions"
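Because `split()` without arguments splits on any run of whitespace, the same function also handles tabs, newlines and padding at both ends. A short check with an illustrative input string:

```python
def remove_whitespace(text):
    """Collapse any run of whitespace to one space and trim both ends."""
    return " ".join(text.split())

messy = "  tabs\tand\nnewlines   too  "
tidy = remove_whitespace(messy)
print(repr(tidy))
```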

7. Removing Stopwords

Stopwords are common words such as "is", "a" and "the" that contribute little to the meaning of a sentence, so they can be removed. The NLTK library ships a list of English stopwords that we can use to filter them out of our text.

Python

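nltk.download('punkt_tab')   # tokenizer data used by word_tokenize
nltk.download('stopwords')   # stopword lists used by stopwords.words

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    # The comparison is case-sensitive, so a capitalised "This" is kept
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text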

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)

Output:

['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
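In the output above, 'This' survives because the stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. To keep it runnable without NLTK data, `STOP_WORDS` is a hypothetical mini list and a simple regular expression stands in for `word_tokenize`; both are assumptions for illustration only.

```python
import re

# Hypothetical mini stop word list; NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_ci(text, stop_words=STOP_WORDS):
    """Drop stop words regardless of case; keep the other tokens as written."""
    tokens = re.findall(r"[A-Za-z']+", text)  # simple stand-in for word_tokenize
    return [tok for tok in tokens if tok.lower() not in stop_words]

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
filtered = remove_stopwords_ci(example_text)
print(filtered)
```

With the real NLTK list, the same idea amounts to comparing `word.lower()` against `stop_words` inside the list comprehension.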

8. Applying Stemming

Stemming reduces a word to its stem by stripping affixes with heuristic rules. The stem is the part of the word to which affixes such as -ed, -ize, -s or de- are attached. The Porter Stemmer used below strips suffixes only, and because its rules are heuristic, the resulting stem is not always a dictionary word.

Example:

books ---> book
looked ---> look
denied ---> deni
flies ---> fli

Three widely used stemming algorithms are the Porter Stemmer, the Snowball Stemmer and the Lancaster Stemmer, all of which are available in NLTK. The Porter Stemmer is the most common among them.

Python

stemmer = PorterStemmer()

def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

text = 'data science uses scientific methods algorithms and many types of processes'
stem_words(text)

Output:

['data',
'scienc',
'use',
'scientif',
'method',
'algorithm',
'and',
'mani',
'type',
'of',
'process']
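To make the idea of suffix stripping concrete, here is a deliberately tiny sketch. It is not the Porter algorithm, which applies many more rules and conditions; `SUFFIX_RULES`, `toy_stem` and the `min_length` guard are assumptions chosen only to reproduce the four examples listed earlier in this section.

```python
# Toy rules: (suffix to remove, replacement). Longer suffixes are tried first.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_length=3):
    """Apply the first matching suffix rule if the result keeps min_length characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Refuse to strip when almost nothing of the word would remain.
            return candidate if len(candidate) >= min_length else word
    return word

stems = [toy_stem(w) for w in ["books", "looked", "denied", "flies", "is"]]
print(stems)
```

Very short words such as "is" are left alone by the length guard. A real stemmer handles such cases with more careful conditions on the remaining stem.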

9. Applying Lemmatization

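Lemmatization is an NLP technique that reduces a word to its dictionary form, called the lemma. Unlike a stem, a lemma is a real word. This is helpful for tasks such as text analysis and search, because it lets you compare words that are related but have different forms. Note that WordNetLemmatizer treats every word as a noun unless told otherwise, which is why "uses" becomes "us" in the output below; calling lemmatizer.lemmatize(word, pos='v') treats the word as a verb and returns "use".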

Python

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas
  
input_str = "data science uses scientific methods algorithms and many types of processes"
lemma_words(input_str)

Output:

['data',
'science',
'us',
'scientific',
'method',
'algorithm',
'and',
'many',
'type',
'of',
'process']

In this guide we learned several NLP text preprocessing techniques that can be used when building NLP-based applications and projects.
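The string-level steps from this guide can be chained into a single function. The sketch below is one possible ordering (lowercase, remove digits, remove punctuation, collapse whitespace), with `preprocess` as an assumed name. Tokenization, stopword removal, stemming and lemmatization would follow on its output.

```python
import re
import string

def preprocess(text):
    """Lowercase, drop digits and ASCII punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return " ".join(text.split())

result = preprocess("Hey, did you know that the summer break is coming? It's only 5 more days !!")
print(result)
```

Collapsing whitespace last matters here, because both digit removal and punctuation removal can leave extra spaces behind.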

jacobperalta