Knowledge acquisition using automated techniques

Methods of Knowledge Extraction

Deepti Aggarwal
SIEL|SERL, IIIT-Hyderabad, India

Agenda
 Introduction to the Web as a knowledge repository
 Automated extraction techniques (input sources, extracted structures, input pre-processing, extraction methods, output generation)
 Issues with automated extraction

What is knowledge?
 Familiarity with someone or something, gained through experience
 Includes facts, information, descriptions, skills

Types of Knowledge
Explicit Knowledge
 Always present explicitly in records
 Objective facts having a definite answer
 E.g., Hyderabad is the capital of A.P.

Implicit Knowledge
 Not present explicitly for analysis
 Cultural beliefs with subjective judgments
 E.g., Hyderabad is the best city in India to live in.

How has knowledge been represented over time?
 From public library to global library

How is knowledge represented on the Web?
 Millions of documents, blogs, forums, and social networks scattered across the Web
 Diverse topics and formats, from diverse people in diverse languages, with different points of view

Benefits of knowledge
extraction over the Web
Explicit knowledge
 Question answering systems
 Search engines
 Validating knowledge
 Tracking a particular piece of information

Implicit knowledge
 Predicting markets, polls, etc.
 Community advertisements

Problems with knowledge
acquisition over web

 Abundance of data
 Relevance of information
 Personalized retrieval

Possible approaches
 Manual filtering

 Automated techniques

 Combination of both

Working of automated
extraction systems

Input sources → Extraction system → Database of all facts and relations

Steps inside the extraction system:
1. Defining output structures
2. Input pre-processing
3. Extraction methods
4. Output processing

Input sources
 web documents
 news articles
 blogs
 social network activity (user profiles, posts, comments)

Sentence level parsing required.

Defining the
structures of
output
Named Entities and their relations

Output structures
 Named Entities
 Named entities relations

1. Named Entity: Definition
 It is an atomic element in a body of
text.

 Types: person, organization, location etc.
 Different named entities when linked together,
form a relation.

1. Named Entity: An
example

Sachin Tendulkar was born in Bombay.

Sachin Tendulkar: NE of type "Person"
Bombay: NE of type "Location"

2. Named Entity
Relationship: Structure

Subject - Relation - Object

Subject: NE of any type
Relation: verb, adjective, or adverb
Object: NE of any type

2. Named Entity
Relationship: An Example

Sachin Tendulkar was born in Bombay

Subject: Sachin Tendulkar
Relation: was born in
Object: Bombay
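The Subject - Relation - Object structure can be held in a small data type. The following is a minimal sketch only; the class names `NamedEntity` and `Relation` are illustrative and do not come from any particular extraction system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    text: str      # surface form, e.g. "Sachin Tendulkar"
    ne_type: str   # entity type, e.g. "Person", "Location"


@dataclass(frozen=True)
class Relation:
    subject: NamedEntity
    relation: str  # connecting verb / adjective / adverb phrase
    obj: NamedEntity

    def as_triple(self):
        """Return the relation as a plain (subject, relation, object) tuple."""
        return (self.subject.text, self.relation, self.obj.text)


born_in = Relation(
    NamedEntity("Sachin Tendulkar", "Person"),
    "was born in",
    NamedEntity("Bombay", "Location"),
)
print(born_in.as_triple())
```

Keeping the entity type next to the text makes it possible to filter relations later, for example keeping only Person-Location pairs.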

Co-referencing

Sachin was born in Bombay. He is a ...

Sachin Tendulkar…. Mr. Tendulkar …
Master Blaster...

Input
pre-processing
Libraries

NLP libraries:
 Splitting the text into sentences and tokens (words, digits) using a Sentence Tokenizer
 Recognizing language constructs (nouns, verbs, pronouns) using a Part-of-speech Tagger
 Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
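To make the tokenizing and tagging steps concrete, here is a toy sketch using only the Python standard library. It is not how a real NLP library works: `TOY_LEXICON` and the capitalization heuristic are assumptions made for illustration, whereas real taggers are trained on annotated corpora.

```python
import re

# Hypothetical mini-lexicon (assumption); real taggers learn tags from data.
TOY_LEXICON = {"was": "VBD", "born": "VBN", "in": "IN"}


def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Words and digit runs become tokens; each punctuation mark is split off.
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", sentence)


def toy_pos_tag(tokens):
    tagged = []
    for tok in tokens:
        if tok.lower() in TOY_LEXICON:
            tag = TOY_LEXICON[tok.lower()]
        elif tok[0].isupper():
            tag = "NNP"    # crude guess: capitalized word = proper noun
        elif tok.isdigit():
            tag = "CD"
        elif not tok.isalnum():
            tag = "PUNCT"  # simplification, not a Penn Treebank tag
        else:
            tag = "NN"
        tagged.append((tok, tag))
    return tagged


tagged = toy_pos_tag(tokenize("Sachin Tendulkar was born in Bombay"))
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
```

The capitalization heuristic mislabels ordinary sentence-initial words such as "He", which is one reason real taggers use context rather than surface form alone.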

NLP libraries (contd.):
 Linking the individual constituents of a sentence with a Parser to form a parse tree
 Identifying the types of named entities using a Named Entity Recognizer
 Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
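One simple way to picture named entity recognition is a dictionary (gazetteer) lookup with longest-match preference. The sketch below assumes a tiny hand-written `GAZETTEER`; production recognizers are usually statistical and do not rely on a fixed list.

```python
# Hypothetical gazetteer (assumption): token sequences mapped to entity types.
GAZETTEER = {
    ("Sachin", "Tendulkar"): "PERSON",
    ("Bombay",): "LOCATION",
}


def recognize_entities(tokens):
    """Return (start, end, type) spans, preferring the longest match."""
    spans = []
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            ne_type = GAZETTEER.get(tuple(tokens[i:i + n]))
            if ne_type:
                spans.append((i, i + n, ne_type))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return spans


def annotate(tokens):
    """Render the sentence with entities marked as text/TYPE."""
    starts = {start: (end, ne_type) for start, end, ne_type in recognize_entities(tokens)}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, ne_type = starts[i]
            out.append(" ".join(tokens[i:end]) + "/" + ne_type)
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(annotate("Sachin Tendulkar was born in Bombay".split()))
```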

NLP libraries (contd.):
 Identifying all co-references and replacing them with the actual entity using a Co-reference Resolution tool
 Identifying the specific meaning of a word using Word Sense Disambiguation
 External vocabularies: MindNet, DBpedia, WordNet
 E.g., contextual meaning of "crane": noun (bird), verb (lift/move)
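The "crane" example can be sketched with a simplified gloss-overlap heuristic in the spirit of the Lesk algorithm: pick the sense whose definition shares the most words with the sentence. The glosses in `SENSES` are hand-written for this illustration and are not taken from WordNet or any other vocabulary.

```python
# Hand-written glosses (assumption); a real system would read them from
# an external vocabulary such as WordNet.
SENSES = {
    "crane": {
        "noun-bird": "large bird with long legs and a long neck that flies and wades",
        "verb-lift/move": "lift or move a heavy load or stretch the neck to see",
    }
}
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "with", "that", "in", "on"}


def disambiguate(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & (set(gloss.split()) - STOPWORDS))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("crane", "The crane flies over the lake on long legs"))
print(disambiguate("crane", "Workers crane a heavy load onto the ship"))
```

When the context shares no words with any gloss, the function falls back to the first listed sense. That fallback is one source of the contextual errors such tools introduce.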

Extracting relationships
among NEs: Standard
process
1. Identify named entities within a sentence.
2. Find the verb or adjective that connects the identified named entities.
3. Connect them together to form a relation.
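The three steps above can be sketched as a single function. This is a hedged illustration, not a reference implementation: it assumes the input is already chunked into `(text, kind)` pairs, where `kind` is either an entity type or a Penn Treebank style part-of-speech tag, and it only considers the first two entities in the sentence.

```python
NE_TYPES = {"PERSON", "ORGANIZATION", "LOCATION"}


def extract_relation(chunks):
    """chunks: list of (text, kind); kind is an NE type or a POS tag."""
    # Step 1: identify the named entities in the sentence.
    ne_positions = [i for i, (_, kind) in enumerate(chunks) if kind in NE_TYPES]
    if len(ne_positions) < 2:
        return None
    first, second = ne_positions[0], ne_positions[1]

    # Step 2: find the connecting words; require a verb (VB*) or adjective (JJ*).
    between = chunks[first + 1:second]
    if not any(kind.startswith(("VB", "JJ")) for _, kind in between):
        return None
    relation = " ".join(text for text, _ in between)

    # Step 3: connect the entities and the relation into one triple.
    return (chunks[first][0], relation, chunks[second][0])


sentence = [
    ("Sachin Tendulkar", "PERSON"),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    ("Bombay", "LOCATION"),
]
print(extract_relation(sentence))
```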

Extracting relationships
among NEs: Required
process
1. Identify part-of-speech constructs: noun, verb, adjective, etc.
2. Determine co-references, acronyms, and abbreviations.
3. Connect them together to form a relationship.
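Step 2 is the part that the standard process lacks. Below is a deliberately naive sketch of co-reference handling: an alias table maps alternative names to one canonical entity, and the pronoun "he" is replaced by the most recently mentioned entity. The `ALIASES` table and the second example sentence are made up for illustration; real co-reference resolvers use far richer evidence.

```python
import re

# Hypothetical alias table (assumption): alternative name -> canonical entity.
ALIASES = {
    "Sachin": "Sachin Tendulkar",
    "Mr. Tendulkar": "Sachin Tendulkar",
    "Master Blaster": "Sachin Tendulkar",
}

# Longest names first, so "Sachin Tendulkar" is matched before "Sachin".
_names = sorted(set(ALIASES) | set(ALIASES.values()), key=len, reverse=True)
NAME_PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in _names) + r")\b")
PRONOUN_PATTERN = re.compile(r"\b[Hh]e\b")


def resolve_coreferences(sentences):
    resolved, last_entity = [], None
    for sentence in sentences:
        # Replace every alias with its canonical name.
        sentence = NAME_PATTERN.sub(
            lambda m: ALIASES.get(m.group(0), m.group(0)), sentence)
        # Naive heuristic: "he" refers to the most recent entity seen so far.
        if last_entity is not None:
            entity = last_entity
            sentence = PRONOUN_PATTERN.sub(lambda m: entity, sentence)
        mentions = NAME_PATTERN.findall(sentence)
        if mentions:
            last_entity = mentions[-1]
        resolved.append(sentence)
    return resolved


print(resolve_coreferences(["Sachin was born in Bombay.", "He plays cricket."]))
```

The "most recent entity" heuristic fails as soon as two candidate people appear in the text, which is the ambiguity that makes co-reference resolution hard.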

Extraction Methods
 Natural Language Processing: rule based.
 Based on sentence structure
 E.g., for the English language, a rule can be “noun-verb-noun”
 Machine Learning: supervised and unsupervised learning.
 Features are detected from the training data
 E.g., to extract instances of some medical diseases, the system is trained over all the symptoms of each given disease.
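A "noun-verb-noun" rule can be sketched as a pattern over part-of-speech tags. The version below is one possible reading of such a rule, assuming Penn Treebank style tags: consecutive nouns are merged into one noun chunk, consecutive verbs into one verb chunk, and a triple is emitted wherever the chunks read noun, verb, noun.

```python
from itertools import groupby


def coarse(tag):
    """Collapse fine-grained tags into N (noun), V (verb), or O (other)."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("VB"):
        return "V"
    return "O"


def apply_nvn_rule(tagged):
    """tagged: list of (token, tag). Return all noun-verb-noun triples."""
    # Merge runs of tokens that share a coarse class into single chunks.
    chunks = [
        (cls, " ".join(tok for tok, _ in group))
        for cls, group in groupby(tagged, key=lambda pair: coarse(pair[1]))
    ]
    triples = []
    for i in range(len(chunks) - 2):
        (c1, t1), (c2, t2), (c3, t3) = chunks[i:i + 3]
        if (c1, c2, c3) == ("N", "V", "N"):
            triples.append((t1, t2, t3))
    return triples


print(apply_nvn_rule([("Sachin", "NNP"), ("plays", "VBZ"), ("cricket", "NN")]))
print(apply_nvn_rule([
    ("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
    ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP"),
]))
```

The second call returns an empty list because the preposition "in" breaks the strict pattern. This shows how rigid a purely structural rule is and why rule sets tend to grow.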

Extraction Methods (contd.)
 Other methods: vocabulary-based systems, context-based clustering.
 Maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when their birthplace is known.
 Hybrid:
 NLP-based libraries pre-process the input data, then a machine learning approach extracts the relations using an external vocabulary such as WordNet.
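The country-to-nationality mapping mentioned above amounts to a lookup table. The sketch below uses two small in-memory dictionaries in place of a mapping file; their contents and the `CITY_TO_COUNTRY` helper are assumptions for illustration. Inferring nationality from birthplace is itself a heuristic and will be wrong for some people.

```python
# Hypothetical vocabulary (assumption); a real system would load a full file.
COUNTRY_TO_NATIONALITY = {"India": "Indian", "Australia": "Australian"}
CITY_TO_COUNTRY = {"Bombay": "India", "Melbourne": "Australia"}


def infer_nationality(birthplace):
    """Guess a nationality from a birthplace; return None if unknown."""
    # Treat the birthplace as a city first, otherwise as a country name.
    country = CITY_TO_COUNTRY.get(birthplace, birthplace)
    return COUNTRY_TO_NATIONALITY.get(country)


print(infer_nationality("Bombay"))
print(infer_nationality("Atlantis"))
```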

Types of output systems
1. Identify all mentions of named entities and their relations.
E.g., from a given corpus, extract all named entity relations.

2. Identify missing relations of a database.
E.g., given a database, extract the missing attributes of given entities from the corpus.

3. Link various entities within a database.
E.g., given a database, link two entities together with some relation extracted from the corpus.
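The second output type, filling missing attributes, can be sketched as follows. The database layout (a dictionary of records), the attribute name `birth_place`, and the relation-to-attribute mapping are all hypothetical and chosen only to show the idea.

```python
def fill_missing(database, triples, relation_to_attribute):
    """Fill empty attributes of known entities from extracted triples.

    Existing values are never overwritten. Returns the list of fills made.
    """
    filled = []
    for subject, relation, obj in triples:
        attribute = relation_to_attribute.get(relation)
        record = database.get(subject)
        if attribute is None or record is None:
            continue  # unknown relation or entity not in the database
        if record.get(attribute) is None:
            record[attribute] = obj
            filled.append((subject, attribute, obj))
    return filled


database = {"Sachin Tendulkar": {"type": "Person", "birth_place": None}}
triples = [("Sachin Tendulkar", "was born in", "Bombay")]
mapping = {"was born in": "birth_place"}

print(fill_missing(database, triples, mapping))
print(database)
```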

Issues with
automated
extraction
Accuracy, running time, dependency

Issue 1: Challenges of
language structure
 Co-reference resolution
 Ambiguous, complex sentences
 Abbreviations
 Acronyms

See an example…

“Tom called his father last night. They talked for an hour. He said he would be home the next day.”

What is "He" referring to? Tom or his father?

“You see sir, I can talk English, I can walk English, I
can laugh English, I can run English, because
English is such a funny language.”
Amitabh in Namak Halal

Issue 2: Accuracy
 Named entity detection: 90%, relationship extraction: 50-70%.
 Noise is introduced at each step.
 E.g., disambiguating the word "crane" with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
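A rough way to see how noise compounds: if each step is correct independently with some probability, the chance that every step is correct is the product of those probabilities. Independence is an assumption made here for illustration and is not claimed by the slides. Under it, a relation between two entities that are each detected with 90% accuracy has both entities right only about 81% of the time, before the relation itself is even considered.

```python
def compound_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Two named entities, each detected with 90% accuracy (figure from the slide).
both_entities_correct = compound_accuracy([0.9, 0.9])
print(round(both_entities_correct, 2))
```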

Issue 3: Efficiency
 Feature detection steps are expensive.
 Computation can require days.

Issue 4: Dependency
 On external vocabulary sources, such as Wikipedia, WordNet, MindNet, etc.
 Maintenance and updating of vocabulary sources is manual: costly and requires expertise.
 Limited size produces context-based noise

 Domain-dependent: medical domain
 Corpus-dependent: Wikipedia, news corpus
 Relation-specific: Date- and Place-of-event

Issue 5: Problem with Implicit
knowledge extraction
 Community Knowledge is learned and shared

 No one can be an expert.

 Cultural competence and perception of workers are fed into a system as variables.

Cultural Consensus Theory provides models to include such variables in the system.

Can we do better?
Can we seek human intelligence to improve
the accuracy of automated techniques?

References
[1] I. Tuomi. Data is more than knowledge: implications of the reversed knowledge hierarchy for knowledge management and organizational memory. J. Manage. Inf. Syst., 16(3):103–117, Dec. 1999.

[2] S. Sekine. Named Entity: History and Future. 2004.

[3] S. Sarawagi. Information extraction. Found. Trends Databases, 1(3):261–377, Mar. 2008.

[4] S. C. Weller. Cultural consensus theory: Applications and frequently asked questions. Field Methods, 19(4):339–368, 2007.

References (contd.)
[5] Z. Syed, E. Viegas, and S. Parastatidis. Automatic discovery of semantic relations using MindNet. LREC, 2010.

[6] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.

[7] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., pages 40–48, 2006.

[8] E. Greengrass. Information retrieval: A survey, 2000.
