Methods of Knowledge Extraction



Deepti Aggarwal
SIEL|SERL, IIIT-Hyderabad, India
Agenda
• Introduction to the Web as a knowledge repository
• Automated extraction techniques (input sources, extracted structures, input pre-processing, extraction methods, output generation)
• Issues with automated extraction
What is knowledge?
• Familiarity with someone or something, gained through experience
• Includes facts, information, descriptions, skills
Types of Knowledge
Explicit knowledge:
• Always present explicitly in records
• Objective facts having a definite answer
• E.g., Hyderabad is the capital of A.P.
Implicit knowledge:
• Not present explicitly for analysis
• Cultural beliefs with subjective judgments
• E.g., Hyderabad is the best city to live in India.
How has knowledge been represented over time?
• From the public library to a global library
How is knowledge represented on the Web?
• Millions of documents, blogs, forums, and social networks scattered across the Web
• Diverse topics, in different formats, from diverse people, in diverse languages, with different points of view
Benefits of knowledge extraction over the Web
Explicit knowledge:
• Question answering systems
• Search engines
• Validating knowledge
• Tracking a particular piece of information
Implicit knowledge:
• Predicting markets, polls, etc.
• Community advertisements
Problems with knowledge acquisition over the Web
• Abundance of data
• Relevance of information
• Personalized retrieval
Possible approaches
• Manual filtering
• Automated techniques
• Combination of both
Automated Extraction
Working of automated extraction systems
[Figure: Input sources → Extraction system (defining output structures → input pre-processing → extraction methods → output processing) → Database of all facts and relations]
Input sources: Types
Input sources
• Web documents
• News articles
• Blogs
• Social network activities (user profiles, posts, comments)
Sentence-level parsing is required.
Defining the structures of output: Named entities and their relations
Output structures
• Named entities
• Relations between named entities
1. Named Entity: Definition
• An atomic element in a body of text.
• Types: person, organization, location, etc.
• Different named entities, when linked together, form a relation.
1. Named Entity: An example
Sachin Tendulkar was born in Bombay.
• "Sachin Tendulkar": NE of type "Person"
• "Bombay": NE of type "Location"
2. Named Entity Relationship: Structure
Subject - Relation - Object
• Subject: NE of any type
• Relation: verb, adjective, adverb
• Object: NE of any type
2. Named Entity Relationship: An Example
Sachin Tendulkar was born in Bombay
• Subject: Sachin Tendulkar
• Relation: was born in
• Object: Bombay
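
A subject-relation-object relation like the one above could be held in a small record type. This is only a minimal sketch in Python; the names `NamedEntity` and `Relation` and their fields are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    text: str      # surface form as it appears in the sentence
    ne_type: str   # e.g. "Person", "Location", "Organization"


@dataclass(frozen=True)
class Relation:
    subject: NamedEntity
    relation: str  # the connecting verb, adjective, or adverb phrase
    obj: NamedEntity

    def as_triple(self):
        """Return the plain (subject, relation, object) strings."""
        return (self.subject.text, self.relation, self.obj.text)


born_in = Relation(
    subject=NamedEntity("Sachin Tendulkar", "Person"),
    relation="was born in",
    obj=NamedEntity("Bombay", "Location"),
)
print(born_in.as_triple())
```

Keeping the entity type next to the text means later steps can filter relations by type, for example only Person-Location pairs.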
Co-referencing
Sachin was born in Bombay. He is a ...
• Sachin Tendulkar ... Mr. Tendulkar ... Master Blaster ...
Input pre-processing: Libraries
NLP libraries:
• Splitting each sentence into tokens (words, digits) using a tokenizer
• Recognizing language constructs (nouns, verbs, pronouns) using a part-of-speech tagger
• Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
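
The tokenizing and tagging steps can be imitated in a few lines to show the shape of their input and output. This is a toy sketch and not how an NLP library works internally: `TOY_LEXICON`, `tokenize`, and `tag` are hypothetical names, and the lexicon is hand-written for this one sentence, whereas a real tagger is trained on annotated text.

```python
import re

# Hypothetical hand-written lexicon; a real part-of-speech tagger is trained
# on annotated text and does not rely on a fixed word list like this one.
TOY_LEXICON = {
    "Sachin": "NNP", "Tendulkar": "NNP", "Bombay": "NNP",
    "was": "VBD", "born": "VBN", "in": "IN",
}


def tokenize(sentence):
    """Split a sentence into word and number tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+|\d+", sentence)


def tag(tokens):
    """Attach a tag to each token; unknown words get the placeholder 'UNK'."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


tagged = tag(tokenize("Sachin Tendulkar was born in Bombay."))
print(" ".join(f"{word}/{pos}" for word, pos in tagged))
```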
NLP libraries (contd.):
• Linking the individual constituents of a sentence into a parse tree using a parser
• Identifying the types of named entities using a named entity recognizer
• Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
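
As a rough stand-in for a named entity recognizer, a dictionary (gazetteer) lookup reproduces the example above. This is a sketch under stated assumptions: `GAZETTEER` and `recognize_entities` are hypothetical, and recognizers in NLP libraries are typically statistical models and not fixed lists.

```python
# Hypothetical gazetteer: known entity names mapped to their types.
GAZETTEER = {
    ("Sachin", "Tendulkar"): "PERSON",
    ("Bombay",): "LOCATION",
}


def recognize_entities(tokens):
    """Greedy longest-match lookup of token spans in the gazetteer."""
    longest = max(len(name) for name in GAZETTEER)
    entities, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + size])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return entities


tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Bombay"]
print(recognize_entities(tokens))
```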
NLP libraries (contd.):
• Identifying all co-references and replacing them with the actual entity using a co-reference resolution tool
• Identifying the specific meaning of a word using word sense disambiguation
   - External vocabularies: MindNet, DBpedia, WordNet
   - E.g., contextual meaning of "crane": noun (bird), verb (lift/move)
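
One simple way to pick a word sense is to compare the surrounding words with a short description (gloss) of each sense and take the sense with the largest overlap, in the spirit of the simplified Lesk approach. The sketch below assumes a hypothetical `SENSES` table whose glosses were written for this example; they are not taken from WordNet, MindNet, or DBpedia.

```python
# Hypothetical mini sense inventory; the glosses are written for this example
# and are not copied from WordNet, MindNet, or DBpedia.
SENSES = {
    "crane": {
        "noun-bird": "large bird with long legs and a long neck",
        "verb-lift/move": "lift or move a heavy load with a machine",
    }
}


def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("crane", "the crane spread its long wings and long legs"))
print(disambiguate("crane", "they crane the heavy load onto the ship"))
```

With so little context to compare, a wrong or missing gloss word flips the result, which is one way contextual errors enter the pipeline.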
Extraction methods
Extracting relationships among NEs: Standard process
1. Identify named entities within a sentence.
2. Find the verb or adjective that connects the identified named entities.
3. Connect them together to form a relation.
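
The three steps above can be sketched as follows. This is a simplification under stated assumptions: the entity spans are given by hand, as if produced by an earlier recognition step, and every word between the two entities is taken as the connecting phrase, where the slide picks out the verb or adjective. The function name `extract_relation` is hypothetical.

```python
def extract_relation(tokens, entities):
    """Link the first two entities by the words that lie between them.

    `entities` is a list of (start, end, type) token spans, end exclusive,
    assumed to come from an earlier named entity recognition step.
    """
    if len(entities) < 2:
        return None  # step 1 found fewer than two entities: nothing to link
    (s1, e1, _), (s2, e2, _) = sorted(entities)[:2]
    connector = tokens[e1:s2]          # step 2: words between the entities
    if not connector:
        return None
    return (" ".join(tokens[s1:e1]),   # step 3: assemble the relation
            " ".join(connector),
            " ".join(tokens[s2:e2]))


tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Bombay"]
entities = [(0, 2, "PERSON"), (5, 6, "LOCATION")]
print(extract_relation(tokens, entities))
```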
Extracting relationships among NEs: Required process
1. Identify part-of-speech constructs: noun, verb, adjective, etc.
2. Determine co-references, acronyms and abbreviations.
3. Connect them together to form a relationship.
Extraction Methods
• Natural Language Processing: rule based.
   - Based on sentence structure
   - E.g., for the English language, a rule can be "noun-verb-noun"
• Machine Learning: supervised and unsupervised learning.
   - Features are detected from the training data
   - E.g., to extract instances of some medical diseases, the system is trained over all the symptoms of each given disease.
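
A "noun-verb-noun" rule can be applied directly to part-of-speech tagged tokens. The sketch below is one possible reading of such a rule and not the method of any particular system: it groups adjacent nouns, groups adjacent verbs, and, as an added assumption, folds a preposition (IN) into the preceding verb group so that "was born in" stays together. The function name `match_noun_verb_noun` is hypothetical.

```python
def match_noun_verb_noun(tagged):
    """Return (noun, verb, noun) triples found in a POS-tagged sentence."""
    groups = []  # list of [kind, words]; kind is "N", "V", or "O" (other)
    for word, pos in tagged:
        after_verb = bool(groups) and groups[-1][0] == "V"
        if pos.startswith("N"):
            kind = "N"
        elif pos.startswith("V") or (pos == "IN" and after_verb):
            kind = "V"
        else:
            kind = "O"
        if groups and groups[-1][0] == kind:
            groups[-1][1].append(word)   # extend the current group
        else:
            groups.append([kind, [word]])
    triples = []
    for a, b, c in zip(groups, groups[1:], groups[2:]):
        if (a[0], b[0], c[0]) == ("N", "V", "N"):
            triples.append(tuple(" ".join(g[1]) for g in (a, b, c)))
    return triples


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
print(match_noun_verb_noun(tagged))
```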
Extraction Methods (contd.)
• Other methods: vocabulary-based systems, context-based clustering.
   - Maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when their birth place is known.
• Hybrid:
   - NLP-based libraries pre-process the input data, then a machine learning approach extracts the relations using an external vocabulary such as WordNet.
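
The vocabulary-based idea above amounts to a lookup table. A minimal sketch, assuming two hypothetical in-memory mappings in place of the mapping file (`NATIONALITY_BY_COUNTRY` and `COUNTRY_BY_CITY` are illustrative excerpts, not a real resource):

```python
# Hypothetical excerpt of a country-to-nationality mapping file; a real one
# would cover all countries and be loaded from disk.
NATIONALITY_BY_COUNTRY = {"India": "Indian", "Japan": "Japanese", "France": "French"}

# Hypothetical city-to-country lookup, needed because birth places are
# usually given as cities rather than countries.
COUNTRY_BY_CITY = {"Bombay": "India", "Tokyo": "Japan", "Paris": "France"}


def infer_nationality(birth_place):
    """Guess a nationality from a birth place, or return None if unknown."""
    country = COUNTRY_BY_CITY.get(birth_place, birth_place)
    return NATIONALITY_BY_COUNTRY.get(country)


print(infer_nationality("Bombay"))
```

The result is only a guess, since a birth place does not always determine nationality, and the lookup is only as complete as the vocabulary behind it.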
Output generation
Types of output systems
1. Identifying all mentions of named entities and their relations.
   • E.g., from a given corpus, extract all named entity relations.
2. Identifying missing relations of a database.
   • E.g., given a database, extract the missing attributes of given entities from the corpus.
3. Linking various entities within a database.
   • E.g., given a database, link two entities together with some relation extracted from the corpus.
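
The second output type, filling in missing attributes of a database from relations extracted from a corpus, can be sketched as below. The database layout, the `relation_to_attribute` mapping, and the function name `fill_missing` are all assumptions made for illustration.

```python
def fill_missing(database, triples, relation_to_attribute):
    """Fill attributes that are None using (subject, relation, object) triples."""
    filled = {name: dict(attrs) for name, attrs in database.items()}
    for subject, relation, obj in triples:
        attribute = relation_to_attribute.get(relation)
        if subject in filled and attribute in filled[subject]:
            if filled[subject][attribute] is None:
                filled[subject][attribute] = obj  # existing values are kept
    return filled


# Example data: one entity whose birth place is missing from the database.
database = {"Sachin Tendulkar": {"birth_place": None, "profession": "cricketer"}}
triples = [("Sachin Tendulkar", "was born in", "Bombay")]
relation_to_attribute = {"was born in": "birth_place"}
print(fill_missing(database, triples, relation_to_attribute))
```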
Working of automated extraction systems
[Figure: Input sources → Extraction system (defining output structures → input pre-processing → extraction methods → output processing) → Database of all facts and relations]
Issues with automated extraction: Accuracy, running time, dependency
Issue 1: Challenges of language structure
• Co-reference resolution
• Ambiguous, complex sentences
• Abbreviations
• Acronyms
See an example…
"Tom called his father last night. They talked for an hour. He said he would be home the next day."
What is "He" referring to? Tom or his father?
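
The ambiguity can be made concrete with a simple agreement filter: keep every earlier mention that matches the pronoun in gender and number. In this sketch the mentions are labelled by hand, and `MENTIONS` and `candidate_antecedents` are hypothetical names. Both "Tom" and "his father" pass the filter, so agreement alone cannot settle what "He" refers to.

```python
# Mentions are hand-labelled here; a real system would obtain them from the
# pre-processing steps. Each entry is (sentence index, text, gender, number).
MENTIONS = [
    (0, "Tom", "male", "singular"),
    (0, "his father", "male", "singular"),
    (1, "They", None, "plural"),
]


def candidate_antecedents(pronoun_sentence, gender, number):
    """List earlier mentions that agree with the pronoun in gender and number."""
    return [text for sent, text, g, n in MENTIONS
            if sent < pronoun_sentence and g == gender and n == number]


# "He" occurs in the third sentence (index 2).
print(candidate_antecedents(2, "male", "singular"))
```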
“You see sir, I can talk English, I can walk English, I
can laugh English, I can run English, because
English is such a funny language.”
Amitabh in Namak Halal
Issue 2: Accuracy
• Named entity detection: 90%; relationship extraction: 50-70%.
• Noise is introduced at each step.
   - E.g., disambiguating the word "crane" with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
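
A back-of-the-envelope calculation shows how noise at each step compounds. It assumes, as a simplification, that stages fail independently, so that a fact survives only if every stage handles it correctly. The per-stage figures below are hypothetical and chosen only to illustrate the arithmetic; they are not measurements.

```python
from math import prod

# Hypothetical per-stage accuracies, for illustration only.
stage_accuracy = {
    "named entity recognition": 0.90,
    "co-reference resolution": 0.80,
    "relation extraction rules": 0.85,
}

# If errors were independent, a fact survives only if every stage gets it right.
end_to_end = prod(stage_accuracy.values())
print(f"{end_to_end:.3f}")
```

Under this assumption the end-to-end figure is always lower than that of the weakest stage.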
Issue 3: Efficiency
• Feature detection steps are expensive.
• They require days of computation.
Issue 4: Dependency
• On external vocabulary sources such as Wikipedia, WordNet, MindNet, etc.
• Maintaining and updating vocabulary sources is manual: it is costly and requires expertise.
• Their limited size produces context-based noise.
• Domain-dependent: medical domain
• Corpus-dependent: Wikipedia, news corpus
• Relation-specific: Date and Place-of-event
Issue 5: Problem with implicit knowledge extraction
• Community knowledge is learned and shared.
• No one can be an expert.
• Cultural competence and perception of workers are fed into a system as variables.
Cultural Consensus Theory provides models to include such variables in the system.
Can we do better?
Can we seek human intelligence to improve
the accuracy of automated techniques?
Thank you
    Questions?
Knowledge acquisition using automated techniques


Editor's Notes

  • #4: The definition of knowledge is a matter of ongoing debate among philosophers, but for this talk I have taken this definition from Wikipedia.
  • #8: Predicting market: e.g., predicting whether people like Lux soap or not. Community advertisements: e.g., advertising a concert in Bengali to the Bengali community in Hyderabad.
  • #9: Scarcity is not the issue, abundance is! It is easy for humans to understand the meaning lying in different documents, but it becomes difficult for a user to find a document of interest.
  • #10: Manual filtering: too much labour, time-consuming, prone to bias. Automated techniques: for huge data, an intelligent way is to formulate an algorithm that performs the repetitive computation with systems instead of manual labour; it is less time-consuming, and it is what I will talk about in this presentation. Combination of both: I consider it the more appropriate approach. It combines the advantages of both systems and humans: scalability and accuracy from systems, intelligence from humans. In my thesis I have opted for this approach. I am not talking about it today; I will cover it in a later presentation.
  • #11: Systems that are built over some algorithms. Automation: the use of methods for controlling industrial processes automatically, especially by electronically controlled systems, often reducing manpower.
  • #12: Broad overview of how the system works. In my view, these are the five main components.
  • #33: Broad overview of how the system works. In my view, these are the five main components.
  • #42: The type of extraction method depends on the application. A highly sophisticated system can achieve a maximum of 70% accuracy. The accuracy of automated techniques cannot surpass human intelligence.