Unstructured Data Analytics
 Text Mining
 Web Mining
 Big Data?
 Use of Software packages
Evaluation Criteria:
 Student Presentations : 30%
 Two presentations
 Midterm – in class exam: 30%
 Project: 40%
 10% for dataset
 10% for presentation
Text Mining
 Discovery of new, previously unknown information
by extracting information from different written
sources
 By computer?
What is Text-Mining?
 Finding interesting regularities in large textual
datasets…" (Usama Fayyad, adapted)
 where interesting means: non-trivial, hidden, previously
unknown and potentially useful
 Finding semantic and abstract information from
the surface form of textual data…”
What is Text Data Mining?
 The metaphor of extracting ore from rock:
 Does make sense for extracting documents of interest
from a huge pile.
 But does not reflect notions of DM in practice:
 finding patterns across large collections
 discovering heretofore unknown information
Real Text DM
 The point:
 Discovering heretofore unknown information is not what
we usually do with text.
 (If it weren’t known, it could not have been written by
someone!)
 However:
 There is a field whose goal is to learn about patterns in
text for their own sake ...
Computational Linguistics (CL)!
 Goal: automated language understanding
 this isn’t possible
 instead, go for subgoals, e.g.,
 word sense disambiguation
 phrase recognition
 semantic associations
 Common current approach:
 statistical analyses over very large text
collections
Why CL Isn’t TDM
 A linguist finds it interesting that “cloying” co-
occurs significantly with “Jar Jar Binks” ...
 … But this doesn’t really answer a
question relevant to the world outside
the text itself.
Why CL Isn’t TDM
 We need to use the text indirectly to answer
questions about the world
 Direct:
 Analyze patent text; determine which word patterns indicate
various subject categories.
 Indirect:
 Analyze patent text; find out whether private or public
funding leads to more inventions.
Why CL Isn’t TDM
 Direct:
 Cluster newswire text; determine which terms are
predominant
 Indirect:
 Analyze newswire text; gather evidence about which
countries/alliances are dominating which financial sectors
Nuggets vs. Patterns
 TDM: we want to discover new information …
 … As opposed to discovering which statistical patterns
characterize occurrence of known information.
 Example: WSD (Word Sense Disambiguation)
 not TDM: computing statistics over a corpus to determine
what patterns characterize Sense S.
 TDM: discovering the meaning of a new sense of a word.
Nuggets vs. Patterns
 Nugget: a new, heretofore unknown item
of information.
 Pattern: distributions or rules that
characterize the occurrence (or non-
occurrence) of a known item of
information.
 Application of rules can create nuggets in
some circumstances.
Text Mining
 Large amount of information
 Document databases
 Research papers
 News articles
 Books
 Email message
 Blogs
Text Mining is Difficult!
 Abstract concepts are difficult to represent
 “Countless” combinations of subtle,
abstract relationships among concepts
 Many ways to represent similar concepts
 E.g. space ship, flying saucer, UFO
 Concepts are difficult to visualize
 High dimensionality
 Tens or hundreds of thousands of features
Why Text Mining is NOT Difficult
 Highly Redundant data
 most of the data mining methods count
on this property
 Just about any simple algorithm can
get “good” results for simple tasks:
 Pull out “important” phrases
 Find “meaningfully” related words
 Create some sort of summary from
documents
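As a rough illustration of the point above, the minimal sketch below (plain Python, invented toy corpus and stop-word list) pulls out "important" words and candidate phrases purely by counting, with no linguistic machinery at all.

```python
import re
from collections import Counter

# Toy corpus; in practice each entry would be a whole document.
docs = [
    "Text mining extracts useful information from text documents.",
    "Data mining extracts patterns from structured data.",
    "Text mining and data mining both rely on redundancy in the data.",
]

STOP = {"from", "and", "the", "in", "on", "both"}

def tokens(text):
    # Lowercase, keep alphabetic tokens, drop stop words.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP]

word_counts = Counter()
phrase_counts = Counter()
for doc in docs:
    toks = tokens(doc)
    word_counts.update(toks)
    phrase_counts.update(zip(toks, toks[1:]))  # adjacent word pairs as candidate phrases

print(word_counts.most_common(5))    # "important" words by raw frequency
print(phrase_counts.most_common(3))  # "important" phrases, e.g. ('text', 'mining')
```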
Data Mining and Text Mining
 Data Mining extracts knowledge from structured
data
 Credit card, Insurance records, Call records etc.
 Text Mining works with unstructured, text
documents
 Written language is not structured
Applications
 Information Retrieval
 Search and Database operations
 Semantic Web
 Knowledge Representation and
Reasoning
 Natural Language Processing
 Computational Linguistics
 Machine Learning and Text Mining
 Data Analysis
Challenges
 Humans cannot sift through ever increasing amount of
textual documents
 Written language is easily understood by humans (?)
 My feet are killing me; My head is on fire
 This box weighs a ton;
 You have been so so incredibly helpful and thanks (for nothing)
 Extremely difficult to make a computer understand this
language
Machine Translation
 "the spirit is willing but the flesh is weak." Translated into Russian and
Translated back to English:
"The vodka is good, but the meat is rotten“
 "out of sight, out of mind" translating back as "invisible insanity”
 A Chinese translation of The Grapes of Wrath was called "The Angry Raisins."
Information Retrieval
 Get me the average Compensation of all the
employees stationed in Pune
 Get me all the documents that contain the word
“Education” and “Compensation” published in the
last 4 years
 Find the association patterns between “Education”
and “Compensation”
 The first is a database (SQL) query; the second is IR; the third is TM (a minimal sketch of the IR and association queries follows below)
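A minimal sketch of the IR and TM queries above, over a hypothetical in-memory document store (the field names and cut-off year are invented for illustration):

```python
# Hypothetical document store: each record carries a year and a body of text.
docs = [
    {"year": 2022, "text": "Education spending and compensation policy were reviewed."},
    {"year": 2021, "text": "Compensation bands were revised for all employees."},
    {"year": 2018, "text": "Education outreach programme launched in Pune."},
]

def contains(doc, word):
    return word in doc["text"].lower()

# IR-style query: documents from the last 4 years containing both keywords.
hits = [d for d in docs if d["year"] >= 2021
        and contains(d, "education") and contains(d, "compensation")]
print(len(hits), "matching documents")

# TM-style association: how often does "compensation" co-occur with "education"?
both = sum(contains(d, "education") and contains(d, "compensation") for d in docs)
edu = sum(contains(d, "education") for d in docs)
print("P(compensation | education) =", both / edu if edu else 0.0)
```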
Levels of Text Processing
 Word Level
 Sentence Level
 Document Level
 Document-Collection Level
 Linked-Document-Collection Level
 Application Level
Text Mining Tasks
 Information Extraction
 Topic Tracking
 Summarization
 Classification
 Clustering
 Association
Evolution of Text Mining
 In general, two processes are used to access
information from a series of documents
 Information Retrieval
 Retrieves a set of documents as an answer to a logical query
using a set of key words
 Information Extraction
 Aims to EXTRACT specific information from documents which
is further analyzed for trends
Approaches to Access Textual Information
 Library science
 Information science
 Natural language processing
Library Book Summarization and Classification
 Earliest library catalog by Thomas Hyde (Bodleian Library at
Oxford)
 Creation of science abstracts using IBM 701 (Luhn, 1958)
 A word frequency analysis was performed
 A number of relatively significant words were counted for each sentence
 Linear distances between significant words were calculated as a metric of
sentence significance
 The most significant sentences were extracted to form the abstract
 Doyle (1961) suggested classification based on word
frequencies and associations!
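The sketch below is a simplified reading of Luhn's procedure, not his exact 1958 algorithm: treat the most frequent words as "significant", score each sentence by how densely those words cluster, and keep the top-scoring sentences as the abstract.

```python
import re
from collections import Counter

def luhn_summary(text, n_keywords=5, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    # "Significant" words: simply the most frequent (no stop-word list, for brevity).
    significant = {w for w, _ in Counter(words).most_common(n_keywords)}

    def score(sentence):
        toks = re.findall(r"[a-z]+", sentence.lower())
        hits = [i for i, t in enumerate(toks) if t in significant]
        if not hits:
            return 0.0
        span = hits[-1] - hits[0] + 1      # window spanning the significant words
        return len(hits) ** 2 / span       # dense clusters score higher

    chosen = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in chosen)

doc = ("Text mining finds useful patterns in document collections. "
       "Frequency counts identify the significant words. "
       "Sentences dense in significant words form the abstract.")
print(luhn_summary(doc))
```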
Information Science
 Bibliometrics developed to provide numerical means
to study and measure texts and information – Ex:
Citation Index
 Groups of important articles can be used to track the
development pathways of a particular line of research
 Produced various indexing systems, document
storage and manipulation applications, and search
systems
Natural Language Processing
 Stages in NLP
 Document
 Tokenization
 Lexical analysis
 Semantic analysis
 Information extraction
 Author’s intended meaning
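These stages form a pipeline, each consuming the previous stage's output. The toy sketch below uses hand-rolled rules in plain Python purely to show the flow; a real system would use a trained tagger and parser (e.g. from NLTK or spaCy).

```python
import re

document = "Dr. Kalam moved to Delhi in 2002."

# 1. Tokenization: break the character stream into word-like tokens.
tokens = re.findall(r"[A-Za-z]+\.?|\d+", document)

# 2. Lexical analysis: a toy tagger driven by surface clues only.
def tag(tok):
    if tok.isdigit():
        return "NUMBER"
    if tok[0].isupper():
        return "PROPER"
    return "WORD"

tagged = [(tok, tag(tok)) for tok in tokens]

# 3. Crude semantic analysis / information extraction: entity-like and year-like items.
entities = [tok for tok, t in tagged if t == "PROPER"]
years = [tok for tok, t in tagged if t == "NUMBER" and len(tok) == 4]
print(entities, years)   # a very rough approximation of the author's intended meaning
```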
Role of Message Understanding Conferences
 Initiated by Naval Ocean Systems Center (DARPA)
 Initially to analyze military messages
 Defined “Named Entity Recognition”
 Formalized test metrics – “Recall and Precision”
 Identified importance of robustness in Machine
Learning models (ML Models)
 Need for minimizing over-training of ML models
 Importance of word-sense disambiguation
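Recall and precision as formalised at MUC can be computed directly from the gold and predicted entity mentions. A minimal sketch, assuming each mention is represented as a (start, end, type) span:

```python
def precision_recall(gold, predicted):
    """gold and predicted are collections of (start, end, type) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold = {(0, 11, "PERSON"), (30, 34, "LOC")}
pred = {(0, 11, "PERSON"), (15, 20, "ORG")}
print(precision_recall(gold, pred))   # (0.5, 0.5): half the predictions correct, half of gold found
```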
Some Definitions
 Document
 A unit of discrete textual data within a collection that
usually, but not necessarily, correlates with some
real-world document such as
 a business report, legal memorandum, e-mail, research
paper, manuscript, article, press release, or news
story
Some Definitions
 Document Collection
 Any grouping of text-based documents.
 The number of documents in such collections can range
from the many thousands to the tens of millions
 can be either static, (the initial complement of
documents remains unchanged), or
 dynamic (characterized by their inclusion of new or
updated documents over time)
Weakly-structured documents
 A document is actually a “Structured Object”!
 Has a rich amount of semantic and syntactical structure
 Contains typographical elements such as punctuation
marks, capitalization, numbers, and special characters such
as white spacing, carriage returns, underlining, asterisks,
tables, columns etc.
 These elements help identify important document subcomponents such as
paragraphs, titles, publication dates, author names, table
records, headers, and footnotes
Semi-Structured documents
 Documents with extensive and consistent format
elements in which field-type metadata can be more
easily inferred – such as
 some e-mail, HTML Web pages, PDF files, and word-
processing files with document templating or style-sheet
constraints
 Some journal articles/ conference proceedings
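Because the format is consistent, field-type metadata can often be read straight out of such documents. A minimal sketch using Python's standard-library HTML parser on an invented page:

```python
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Collects the <title> text and name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = ('<html><head><title>Quarterly Report</title>'
        '<meta name="author" content="A. Rao"></head><body>...</body></html>')
parser = MetadataParser()
parser.feed(html)
print(parser.title, parser.meta)   # field-type metadata inferred from document structure
```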
What is Unique to Text Data?
 Large number of zero attributes
 Each word is an attribute. Each document has only a few words.
Phenomenon called high-dimensional sparsity.
 There may also be a wide variation in the number of nonzero values across
different documents.
 Distance computation.
 For example, while it is possible, in theory, to use the Euclidean function for
measuring distances, the results are usually not very effective from a practical
perspective. This is because Euclidean distances are extremely sensitive to
the varying document lengths (the number of nonzero attributes).
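The length-sensitivity problem is easy to demonstrate: a document repeated three times is "about" exactly the same thing, yet its Euclidean distance from the original grows while the cosine similarity stays at (approximately) 1. A minimal sketch with hand-built count vectors:

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())   # sparse bag-of-words counts

def euclidean(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

short_doc = "text mining finds patterns in text"
long_doc = (short_doc + " ") * 3          # same content, three times the length

a, b = vectorize(short_doc), vectorize(long_doc)
print(euclidean(a, b))   # grows with document length
print(cosine(a, b))      # ~1.0: same direction, only the length differs
```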
What is Unique to Text Data?
 Non-negativity:
 The frequencies of words take on nonnegative values only.
Combined with high-dimensional sparsity, it enables the use
of specialized methods for document analysis.
 The presence of a word in a document is statistically more
significant than its absence. Unlike with traditional
multidimensional data, it is therefore crucial to incorporate the
global statistical characteristics of the data set into pairwise
distance computations.
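Both properties are exploited by keeping only the nonzero counts per document and by weighting each word with a global statistic of the whole collection, such as inverse document frequency (IDF). A minimal sketch over an invented three-document collection:

```python
import math
from collections import Counter

docs = [
    "education policy and compensation policy",
    "compensation bands revised",
    "education outreach in rural schools",
]

# Sparse representation: each document stores only its nonzero term counts.
sparse = [Counter(d.split()) for d in docs]

# Global statistic: in how many documents does each term occur?
N = len(docs)
df = Counter(term for vec in sparse for term in vec)
idf = {term: math.log(N / df[term]) for term in df}

# TF-IDF weights: terms that occur in many documents get a low IDF, so rarer,
# more informative terms dominate any subsequent pairwise similarity computation.
tfidf = [{term: count * idf[term] for term, count in vec.items()} for vec in sparse]
print(tfidf[0])
```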
What is Unique to Text Data?
 Side information:
 In some domains, such as the Web, additional side
information is available.
 Examples include hyperlinks or other metadata associated
with the document.
 These additional attributes can be leveraged to enhance the
text mining process further.
Document Features
 Basic idea is to transform a document from an irregular
and implicitly structured representation into an explicitly
structured representation.
 Identification of a simplified subset of document features
that can be used to represent a particular document as a
whole
 It is called the representational model of a document
and individual documents are represented by the set of
features that their representational models contain
Commonly Used Document Features
 Characters,
 Words,
 Terms,
 Concepts
Leads to
 Categories and
 Sentiments
Characters
 The individual component-level letters, numerals, special
characters and spaces are the building blocks of higher-level
semantic features such as words, terms, and concepts.
 A character-level representation can include the full set of all
characters for a document or some filtered subset.
 Character-based representations that include some level of
positional information are more useful and common.
 character-based representations can be unwieldy for some
types of text processing techniques
Words
 Specific words from a “native” document present the basic
level of semantics and are referred to as existing in the native
feature space of a document
 A single word-level feature is one linguistic token. Phrases,
multiword expressions, multiword hyphenates do not
constitute single word-level features.
 It is necessary to filter these features for items such as stop
words, symbolic characters, and meaningless numerics.
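A minimal sketch of word-level feature extraction with the filtering step just described (the stop-word list here is a tiny illustrative one):

```python
import re

STOP_WORDS = {"the", "a", "an", "from", "in", "to", "and", "of"}

def word_features(document):
    # One linguistic token per feature; no phrases are formed at this level.
    raw_tokens = re.findall(r"[A-Za-z]+|\d+", document.lower())
    return [t for t in raw_tokens
            if t not in STOP_WORDS      # drop stop words
            and not t.isdigit()]        # drop meaningless numerics

print(word_features("The 3 reports from 2021 describe compensation and education policy."))
# ['reports', 'describe', 'compensation', 'education', 'policy']
```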
Terms
 Single and multiword phrases selected from the corpus
of a native document.
 These can only be made up of specific words and
expressions found within the native document
 Hence, a term-based representation is composed of a
subset of the terms in that document.
Example: "President Kalam moved from a hut in the fishing village
to the Rashtrapati Bhavan"; candidate terms include "President Kalam",
"fishing village", and "Rashtrapati Bhavan".
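A minimal sketch of term selection: keep single words plus adjacent multiword candidates, all drawn only from the native document (a real extractor would also apply part-of-speech patterns or a curated term list).

```python
import re

def candidate_terms(document, max_len=2):
    toks = re.findall(r"[A-Za-z]+", document)
    terms = set()
    for n in range(1, max_len + 1):                 # unigrams and bigrams
        for i in range(len(toks) - n + 1):
            terms.add(" ".join(toks[i:i + n]))
    return terms

sentence = "President Kalam moved from a hut in the fishing village to the Rashtrapati Bhavan"
terms = candidate_terms(sentence)
print("President Kalam" in terms, "fishing village" in terms, "Rashtrapati Bhavan" in terms)
# All True: every term is built only from words found in the native document.
```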
Concepts
 Features generated for a document by means of manual, statistical,
rule-based, or hybrid methodologies.
 Extracted from documents using complex preprocessing routines that
identify syntactical units that are related to specific concept identifiers.
 A document collection of reviews of sports cars may not actually
include the specific word “automotive” or the specific phrase “test
drives,” but the concepts “automotive” and “test drives” might be the
set of concepts used to identify and represent the collection.
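A minimal rule-based sketch: a hand-built mapping from trigger words to concept identifiers stands in for the domain ontology, so a review can be tagged with "automotive" or "test drives" even though neither string appears in the text.

```python
# Hypothetical concept rules: concept identifier -> trigger words.
CONCEPT_RULES = {
    "automotive": {"car", "engine", "horsepower", "sedan"},
    "test drives": {"handling", "acceleration", "braking", "cornering"},
}

def concepts(document):
    words = set(document.lower().split())
    return {concept for concept, triggers in CONCEPT_RULES.items() if words & triggers}

review = "The sedan has brisk acceleration and the handling feels precise"
print(concepts(review))   # {'automotive', 'test drives'}, though neither phrase occurs verbatim
```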
Concepts
 Many methodologies involve a degree of cross-
referencing against an external knowledge source.
 For manual and rule-based categorization methods, the
cross-referencing and validation involves interaction
with a preexisting domain ontology, lexicon, or formal
concept hierarchy
 May use the mind of a human domain expert.
Efficient and Effective Representation
 Terms and concepts reflect the features with the most
condensed and expressive levels of semantic value.
 They carry roughly the same information as character- or word-
based models but with far fewer features, making them more efficient
 Term-level representations can sometimes be easily and
automatically generated from the original source text
 Concept-level representations, as a practical matter, often
involve some level of human interaction.
Words-Properties
 Homonymy: same form, but different meaning (e.g. bank: river
bank, financial institution)
 Polysemy: same form, related meanings (e.g. bank: blood bank,
financial institution; "fix") http://www.wordnet-online.com/fix.shtml
 usually regarded as distinct from homonymy, in which the
multiple meanings of a word may be unconnected or unrelated
 Synonymy: different form, same meaning (e.g. singer, vocalist)
 Hyponymy: one word denotes a subclass of another (e.g.
breakfast is a hyponym of meal)
 Hypernym: also known as a superordinate; its meaning is broader than that of a
hyponym, and a hypernym subsumes its hyponyms (e.g. meal is a hypernym of breakfast)
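These lexical relations can be explored programmatically through WordNet. A minimal sketch using NLTK's WordNet interface (assumes the nltk package is installed and the WordNet corpus has been downloaded; the exact senses returned depend on the WordNet version):

```python
import nltk
nltk.download("wordnet", quiet=True)        # one-time download of the WordNet corpus
from nltk.corpus import wordnet as wn

# Polysemy / homonymy: "bank" has several senses (synsets).
for sense in wn.synsets("bank")[:3]:
    print(sense.name(), "-", sense.definition())

# Synonymy: lemmas of one synset are (near-)synonyms, e.g. singer / vocalist.
print(wn.synsets("singer")[0].lemma_names())

# Hyponymy / hypernymy: breakfast is a kind of meal.
breakfast = wn.synset("breakfast.n.01")
print([h.name() for h in breakfast.hypernyms()])   # e.g. ['meal.n.01']
```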
Concepts
 Concept-level representations are better at handling
synonymy and polysemy. Also, best at relating a given
feature to its various hyponyms and hypernyms.
 Can be processed to support very sophisticated
concept hierarchies.
 Best for leveraging the domain knowledge afforded by
ontologies and knowledge bases.
Concepts
 Require complex heuristics during the preprocessing
operations used to extract and validate concept-type
features
 They are often domain dependent
 manually generated concepts are fixed and labor
intensive to assign
Domain Knowledge
 Concepts belong to the descriptive attributes of a document as well
as to domains.
 A domain is a specialized area of interest with dedicated
ontologies, lexicons, and taxonomies of information
 Domains can be very broad areas of subject matter (Financial
Services) or narrowly defined specialisms (securities trading,
derivatives, MFs etc.)
 Domain knowledge is more important in Text mining as compared
to data mining
Text Mining Functional Architecture [diagram]
System architecture for a generic text mining system [diagram]
 Questions?
Components and Tasks of IE engines
Tasks: Tokenization; Morphological and Lexical Analysis; Semantic Analysis; Domain Analysis
Components: Zoning; POS Tagging; Sense Disambiguation; Shallow Parsing; Deep Parsing; Anaphora Resolution; Integration
Editor's Notes
• Applications (Semantic Web): the semantic web augments the current web with formalised knowledge and data that can be processed by computers.
• Text mining architecture diagrams: Feldman and Sanger, Advanced Approaches in Analyzing Unstructured Data.
• Components and Tasks of IE engines: Feldman and Sanger, page 105.