1. PROCESSING TEXT IN NATURAL LANGUAGE
Natural language processing (NLP) is a machine learning
technology that gives computers the ability to interpret, manipulate,
and comprehend human language.
Organizations today have large volumes of voice and text data from
various communication channels like emails, text messages, social
media newsfeeds, video, audio, and more.
They use NLP software to automatically process this data, analyze the
intent or sentiment in the message, and respond in real time to
human communication.
2. Why is NLP important?
• Natural language processing (NLP) is critical to fully and efficiently analyze
text and speech data.
• It can work through the differences in dialects, slang, and grammatical
irregularities typical in day-to-day conversations.
Companies use it for several automated tasks, such as to:
• Process, analyze, and archive large documents
• Analyze customer feedback or call center recordings
• Run chatbots for automated customer service
• Answer who-what-when-where questions
• Classify and extract text
3. Natural language processing (NLP) is a field that focuses on making natural
human language usable by computer programs.
NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
If you’re familiar with the basics of using Python and would like to get your feet
wet with some NLP, then you’ve come to the right place.
• Find text to analyze
• Preprocess your text for analysis
• Analyze your text
• Create visualizations based on your analysis
4. Pre-processing Text
• Text pre-processing is a crucial step in performing sentiment analysis, as it
helps to clean and normalize the text data, making it easier to analyze.
• The pre-processing step involves a series of techniques that help transform raw
text data into a form you can use for analysis.
• Some common text pre-processing techniques include tokenization, stop word
removal, stemming, and lemmatization.
5. Working with NLTK
• To work with NLTK, you first need to install it using pip install nltk.
• Then, import the library and download necessary data like corpora and models using
nltk.download().
• NLTK provides tools for various NLP tasks such as tokenization, stemming, part-of-speech
tagging, and more.
Here's a breakdown of how to get started:
• 1. Installation:
• Open your terminal or command prompt and run the command: pip install nltk.
• 2. Importing NLTK:
In your Python script, import the necessary module:
import nltk
6. Downloading NLTK Data:
NLTK requires downloading specific corpora and models for various tasks.
You can download them using the following command:
nltk.download()
This will open a graphical interface where you can choose which data to download.
For beginners, downloading the book collection is a good starting point.
7. Basic NLTK Operations:
Tokenization: Breaking down text into smaller units (words or sentences).
By tokenizing, you can conveniently split up text by word or by sentence.
This will allow you to work with smaller pieces of text that are still relatively
coherent and meaningful even outside of the context of the rest of the text.
It’s your first step in turning unstructured data into structured data, which is
easier to analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by
sentence. Here’s what both types of tokenization bring to the table:
Tokenizing by word: Words are like the atoms of natural language. They’re the
smallest unit of meaning that still makes sense on its own.
Tokenizing your text by word allows you to identify words that come up
particularly often.
For example, if you were analyzing a group of job ads, then you might find that the
word “Python” comes up often.
That could suggest high demand for Python knowledge, but you’d need to look
deeper to know more.
Tokenizing by sentence: When you tokenize by sentence, you can analyze how
those words relate to one another and see more context.
9. Example:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer models; only needed once

text = "This is rama and krishna. Avanthi is a college for Professionals"
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
print(sentences)
10. Stemming: Reducing words to their root form.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')  # tokenizer models; only needed once

text = "This is ram and krishna. Avanthi is a college for Professionals"
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
print(sentences)

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
11. Example of POS Tagging
Consider the sentence: "The quick brown fox jumps over the lazy dog."
After performing POS Tagging:
• "The" is tagged as determiner (DT)
• "quick" is tagged as adjective (JJ)
• "brown" is tagged as adjective (JJ)
• "fox" is tagged as noun (NN)
• "jumps" is tagged as verb (VBZ)
• "over" is tagged as preposition (IN)
• "the" is tagged as determiner (DT)
• "lazy" is tagged as adjective (JJ)
• "dog" is tagged as noun (NN)
12. Part-of-Speech (POS) Tagging: Identifying the grammatical role of each word.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # POS tagger model

text = "This is rama and krishna. Avanthi is a college for professional"
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
print(sentences)

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

tagged_words = pos_tag(words)
print(tagged_words)
13. Stop Word Removal: Removing common words that don't carry much meaning (e.g., "the", "a", "is").
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')  # the stop word lists

text = "This is ram and krishna. Avanthi is a college"
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
print(sentences)

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

tagged_words = pos_tag(words)
print(tagged_words)

stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)
14. Accessing Corpora:
NLTK provides access to various corpora (text datasets).
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')  # the Project Gutenberg sample corpus

# List the files available in the Gutenberg corpus
print(gutenberg.fileids())

# Access a specific text (e.g., Shakespeare's Hamlet)
hamlet = gutenberg.words('shakespeare-hamlet.txt')
print(hamlet[:50])  # print the first 50 tokens
15. Sentiment Analysis:
NLTK can be used for sentiment analysis via its built-in VADER analyzer.
You'll need to download the vader_lexicon resource for this.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # the VADER sentiment lexicon

analyzer = SentimentIntensityAnalyzer()
text = "This movie was fantastic! I loved it."
scores = analyzer.polarity_scores(text)
print(scores)
The result is a dictionary with neg, neu, and pos proportions plus a compound
score ranging from -1 (most negative) to +1 (most positive).