Text mining: Introduction and data preparation
Overview of text mining
What is text mining? Text mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."
Need for text mining
A practical example from the biotech industry shows why text mining is needed:
- About 80% of biological knowledge exists only in research papers (unstructured data).
- If a scientist manually reads 50 research papers a week and only 10% of them are useful, he or she gains only 5 useful papers a week.
- Online databases such as Medline, however, add more than 10,000 abstracts per month.
With text mining, relevant data can be gathered from such collections dramatically faster, which shows the need for it.
Challenges in text mining
- Information is in unstructured textual form.
- Textual databases are large: almost all publications are now also in electronic form.
- The number of possible “dimensions” is very high (but sparse): all possible word and phrase types in the language.
- Relationships between concepts in text are complex and subtle, for example “AOL merges with Time-Warner” versus “Time-Warner is bought by AOL”.
- Word ambiguity and context sensitivity:
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
- Noisy data, for example spelling mistakes.
Text mining process
- Text preprocessing
  - Syntactic/semantic text analysis
- Feature generation
  - Bag of words
- Feature selection
  - Simple counting
  - Statistics
- Text/data mining
  - Classification
  - Clustering
  - Associations
- Analyzing results
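To make the feature-generation step concrete, here is a minimal bag-of-words sketch in base R. The toy corpus `docs` and the names `vocab`, `dtm` and `sparsity` are assumptions made for illustration; the slides do not give an implementation.

```r
# Hypothetical toy corpus; not taken from the slides.
docs <- c("the cat sat on the mat",
          "the dog sat",
          "cat and dog")

# Tokenize on single spaces (assumes clean, lowercase input).
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct word in the corpus, sorted.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document.
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))

# Document-term matrix: one row per document, one column per word.
dtm <- t(counts)
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Share of zero cells: the "many dimensions, but sparse" property.
sparsity <- mean(dtm == 0)
```

Even with three short documents, close to half of the cells are zero. With a realistic vocabulary the share of zeros is far higher, which is why feature selection follows feature generation in the process.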
Applications
The potential applications are countless:
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracks
- News story classification
- Web search
Tokenization
Tokenization converts a sentence into a sequence of tokens, i.e. words.
Why do we tokenize? Because we do not want to treat a sentence as a mere sequence of characters.
Tokenizing general English sentences is relatively straightforward:
- Use spaces as the boundaries.
- Use some heuristics to handle exceptions.
Tokenization issues
- Separate possessive endings or abbreviated forms from the preceding word:
  - Mary’s -> Mary ’s
  - Mary’s -> Mary is
  - Mary’s -> Mary has
- Separate punctuation marks and quotes from words:
  - Mary. -> Mary .
  - “new” -> “ new ”
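The space-plus-heuristics approach described above can be sketched in a few lines of base R. The function name `tokenize` and the particular rules are assumptions chosen for illustration. The sketch handles only the ASCII apostrophe and double quote, and it would wrongly split abbreviations such as "Dr.".

```r
# A rule-based tokenizer sketch; the rules are illustrative assumptions.
tokenize <- function(sentence) {
  s <- sentence
  # Split off a trailing 's (possessive, or contracted "is"/"has") as its own token.
  s <- gsub("(\\w)'s\\b", "\\1 's", s, perl = TRUE)
  # Put spaces around sentence punctuation and double quotes.
  s <- gsub("([.,!?;:\"])", " \\1 ", s)
  # Split on runs of whitespace and drop any empty strings.
  tk <- strsplit(trimws(s), "\\s+")[[1]]
  tk[nzchar(tk)]
}

tokens <- tokenize("Mary's book is \"new\".")
```

The sketch only separates `'s` from the word. Deciding whether it stands for a possessive, "is" or "has" needs the surrounding context and is left out here.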
Dictionary creation
- A dictionary is used to locate the occurrences of a particular term in the documents.
- It reduces the retrieval time of an algorithm.
- The occurrences of each term are stored as a linked list.
Example
  Brutus    -> 1 2 4 11 31 45 173 174
  Caesar    -> 1 2 4 5 6 16 57 132 ...
  Calpurnia -> 2 31 54 101
In the example above, each of the terms Brutus, Caesar and Calpurnia is followed by the documents in which it occurs.
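A dictionary of this kind can be built in a few lines. The sketch below is a minimal illustration in base R, assuming a hypothetical three-document collection and a hypothetical helper `build_index`. Plain integer vectors stand in for the linked lists mentioned in the slides.

```r
# Hypothetical mini-collection; a document's ID is its position in the vector.
docs <- c("brutus killed caesar",
          "caesar married calpurnia",
          "brutus and caesar were friends")

build_index <- function(docs) {
  # Distinct terms of each document, lowercased and split on whitespace.
  terms_per_doc <- lapply(strsplit(tolower(docs), "\\s+"), unique)
  # One (term, document ID) pair per distinct term in each document.
  terms  <- unlist(terms_per_doc)
  doc_id <- rep(seq_along(terms_per_doc), lengths(terms_per_doc))
  # Group the document IDs by term; IDs stay in ascending order.
  split(doc_id, terms)
}

index <- build_index(docs)

# Look up one term without scanning any document text.
caesar_docs <- index[["caesar"]]

# Boolean AND query: documents that contain both terms.
both <- intersect(index[["brutus"]], index[["caesar"]])
```

Because each term maps directly to its list of document IDs, a query touches only the lists for its own terms, which is where the reduction in retrieval time comes from.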
Feature generation and selection
Importance of feature selection:
- Machine learning: it improves the efficiency of many machine learning algorithms.
- Overfitting problem: overfitting means training the machine so closely on the training data that, when actual data is supplied, it behaves well only to an extent and then starts to fail.
- It improves the efficiency of training.
Feature selection methods for classification
- Filter method: a score is computed for each feature in a preprocessing step, and features are then selected according to their scores.
- Wrapper method: the wrapper uses the learning algorithm as a black box to score subsets of features.
- Embedded method: feature selection is performed within the process of training the algorithm.
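As an illustration of the filter method, the sketch below scores every term independently of any learning algorithm and keeps the highest-scoring ones. The document-term counts, the class labels and the scoring rule (difference in document share between two classes) are all assumptions made for this example; the slides do not name a particular score.

```r
# Hypothetical document-term counts (rows = documents, columns = terms)
# and hypothetical class labels; none of this comes from the slides.
dtm <- matrix(c(2, 0, 1, 1,
                3, 0, 0, 1,
                0, 2, 1, 1,
                0, 1, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("refund", "great", "order", "the")))
label <- c("complaint", "complaint", "praise", "praise")

# Score: absolute difference between the two classes in the share of
# documents that contain the term (an illustrative scoring choice).
present <- dtm > 0
score <- abs(colMeans(present[label == "complaint", , drop = FALSE]) -
             colMeans(present[label == "praise", , drop = FALSE]))

# Filter step: keep the k highest-scoring terms before any training happens.
k <- 2
selected <- names(sort(score, decreasing = TRUE))[seq_len(k)]
```

Terms such as "the" and "order", which appear equally often in both classes, score zero and are dropped. A wrapper method would instead retrain a classifier on candidate subsets, which is more expensive.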
Parsing tasks
- Separate words from spaces and punctuation.
- Clean up:
  - Remove redundant words.
  - Remove words with no content.
- The cleaned-up list of words is referred to as tokens.
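The parsing tasks above can be sketched with base R string functions. The function `clean_tokens` and the short `stop_words` list are hypothetical, and reading "redundant words" as repeated words is an assumption.

```r
# Hypothetical stop-word list; real lists are much longer.
stop_words <- c("the", "a", "of", "to", "and", "is")

clean_tokens <- function(text, stop_words) {
  # Lowercase the text, then turn punctuation into spaces.
  text <- gsub("[[:punct:]]+", " ", tolower(text))
  # Separate the words on runs of whitespace.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Remove words with no content (stop words).
  words <- words[!(words %in% stop_words)]
  # Remove redundant words (assumed here to mean repeats).
  unique(words)
}

tokens <- clean_tokens("The roof, the ROOF and the wall of the garage.", stop_words)
```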
Simple algorithm for parsing
    # Initialize; Description holds the entire text, one record per element
    # (assumes single spaces between words and at most 6 words per record)
    charcount <- nchar(Description)
    # number of records of text
    Linecount <- length(Description)
    Num <- Linecount * 6
    # Array to hold location of spaces
    Position <- rep(0, Num)
    dim(Position) <- c(Linecount, 6)
    # Array for Terms
    Terms <- rep("", Num)
    dim(Terms) <- c(Linecount, 6)
    wordcount <- rep(0, Linecount)
Search for spaces
    for (i in seq_len(Linecount)) {
      n <- charcount[i]
      k <- 1
      for (j in seq_len(n)) {
        Char <- substring(Description[i], j, j)
        # a space marks a word boundary
        if (Char == " ") { Position[i, k] <- j; k <- k + 1 }
      }
      # number of words = number of spaces + 1
      wordcount[i] <- k
    }
Get words
    # parse out terms
    for (i in seq_len(Linecount)) {
      for (j in seq_len(wordcount[i])) {
        # a word starts after the previous space (the first word starts at 1)
        first <- if (j == 1) 1 else Position[i, j - 1] + 1
        # and ends before the next space (the last word ends with the record)
        last <- if (j < wordcount[i]) Position[i, j] - 1 else charcount[i]
        Terms[i, j] <- substring(Description[i], first, last)
      }
    }
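The character-by-character scan above can also be expressed with base R's `strsplit`, which finds the spaces and cuts out the words in one call. This is an alternative sketch, not the deck's method. The sample `Description` vector is hypothetical, and the width of `Terms` is taken from the longest record instead of being fixed at 6.

```r
# Hypothetical input; one record of text per element.
Description <- c("fire damage to roof", "theft", "water leak in kitchen")

# Split every record on single spaces in one call.
word_list <- strsplit(Description, " ", fixed = TRUE)
wordcount <- lengths(word_list)

# Terms: one row per record, padded with "" up to the longest record.
Terms <- matrix("", nrow = length(Description), ncol = max(wordcount))
for (i in seq_along(word_list)) {
  Terms[i, seq_along(word_list[[i]])] <- word_list[[i]]
}
```

The result has the same layout as the `Terms` array built by the loops: row `i` holds the words of record `i`, and unused cells stay empty.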
Conclusion
This presentation covered an overview of text mining, tokenization, dictionary creation, feature selection and parsing.
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
