(Knowledge Extraction)

  Raymond Pierre de Lacaze

          (RPL)

          LispNYC July 10th, 2012

                        rpl@lispnyc.org
(John McCarthy)
September 4th,1927 – October 24th, 2011


This talk is dedicated to the memory of John McCarthy

   Inventor of the Lisp Language (1958)
   Founder of Artificial Intelligence
   Winner of the Turing award (1971)
   Designer of Elephant 2000
       Programming Language based on speech acts
       http://www-formal.stanford.edu/jmc/elephant/elephant.html

   May He Rest in Peace
BABAR: Project Goals
   Leverage Wikipedia as a Knowledge Base

   Infer Infrastructure & Extract Content
       Create Wiki Topic Taxonomies
       Generate Knowledge Hypergraphs

   Investigate Conceptual Relevance Metrics

   Generate Knowledge summaries
   Answer Knowledge base queries

   Evolve a new generation of web browsers:
    Knowledge Browsers
Overview
   Brief Overview AI
       Knowledge Representation
       Natural Language Processing

   Examine Specific Algorithms
       Semantic Nets & Hypergraphs
       Recursive Descent Parsing
       Clustering Algorithms
       Similarity Metrics

   Describe Aspects of the BABAR System
       Semantic Link Analysis
           Automatic Topic Taxonomy Generation
           Knowledge Category Assignment
       Content Extraction
           English Phrase to Clausal Form Logic
AI Technologies Discussed
   Knowledge Representation
       Clausal Form Logic
       Semantic Nets
       Hypergraphs

   Natural Language Processing
       Lexical Analysis
       Syntactic Analysis
           Recursive Descent Parsing
       Semantic Analysis

   Machine Learning Techniques
       Clustering Algorithms
       K-Means, Agglomerative and SR Clustering

   Similarity Metrics
       Jaccard Index
       Pearson Correlation
Logics used in Artificial Intelligence
   Monotonic Logic (standard)
   Non-Monotonic Logic (exceptions)
       (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly

   Sorted Logics (types)
   Fuzzy Logic (continuous truth values)
   Higher-Order Logics (meta-statements)
       Modal Logics (may, can, must)
       Intentional Logics (know, believe, think)

   Temporal Logics (temporal operators)
       Point-Based Temporal Logic (moments)
       Interval Time Logic (Allen 1986, 13 temporal operators)
           Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.

   Logics can be expressed in clausal form:
    (ancestor ?x ?y) ⇐ (parent ?x ?y)
    (ancestor ?x ?y) ⇐ (parent ?x ?z) (ancestor ?z ?y)

    Note: The variables ?x and ?y are universally quantified, whereas the variable
          ?z is existentially quantified.
Clausal Form Logic
   Propositional Calculus (PC)
       Fully grounded clauses
       No variables
            (Brother John Jill),
            (Parent Jane Jill) ⇐ (Mother Jane Jill)

   First Order Predicate Calculus (FOPC)
       Variables
            Universally quantified (for all ?x)
            Existentially quantified (there exists ?x)
            (Elephant ?x) ⇐ (Has-Tusks ?x)
        Converting 1st order logic to clausal form
           Skolem constants (there exists x for all y such that…)
           Skolem functions (for each x there exists a y such that…)

   Second Order Predicate Calculus
       Predicates and clauses can be arguments
       Meta statements
       Gödel's Incompleteness Theorem

   Horn Clauses
       Wikipedia: In computational logic, a Horn clause is a clause with at most
        one positive literal
        B ⇐ (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B
        (<LHS> <RHS>) ≡ ((B) (A1 … An))
Automated Reasoning
   Unification Algorithm
       Clausal pattern matching and variable binding
        (unify (P ?x ?y) (P A (Q ?x)))
            Returns bindings: ((?x A) (?y (Q ?x)))
            Instantiation: (P A (Q A))
            (a minimal sketch of unify follows this slide)

   Rete Algorithm
       Charles L. Forgy, CMU, 1974
       Addresses the many-many matching problem
       Matching facts to rules in rule-based systems
        Donald Knuth, The Art of Computer Programming, Volume 3.

   Automated Reasoners
       Backward Chaining Reasoners
            Work from conclusion → axioms (facts)
           Good when state space branching factor is large
       Forward Chaining Reasoners
            Work from axioms → conclusion
            Good when the depth of the state space is large
       Mixed methods
        Perform both forward & backward chaining
         GPS (Ernst & Newell, 1969)
         Island hopping
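
A minimal sketch of unification, assuming clauses are Lisp lists and variables are symbols whose names start with "?" (this is illustrative, not BABAR's implementation; the occurs check is omitted):

(defun variable-p (x)
  (and (symbolp x)
       (plusp (length (symbol-name x)))
       (char= (char (symbol-name x) 0) #\?)))

(defun unify (x y &optional (bindings nil))
  (cond ((eq bindings :fail) :fail)
        ((equal x y) bindings)
        ((variable-p x) (unify-variable x y bindings))
        ((variable-p y) (unify-variable y x bindings))
        ((and (consp x) (consp y))
         (unify (rest x) (rest y)
                (unify (first x) (first y) bindings)))
        (t :fail)))

(defun unify-variable (var value bindings)
  ;; If VAR is already bound, unify its existing value; otherwise bind it.
  (let ((binding (assoc var bindings)))
    (if binding
        (unify (cdr binding) value bindings)
        (acons var value bindings))))

;; (unify '(P ?x ?y) '(P A (Q ?x)))
;;   => ((?Y Q ?X) (?X . A))   ; i.e. ?x = A and ?y = (Q ?x)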
Semantic Nets
   Labeled, directed (or not) and weighted (or not) Graphs
   Equivalent in expressiveness to FOPC
   Graphical representation of 1st order logic.
   ISA Hierarchies
   Subsumption (Bill Woods)

   KL-ONE System: R.J. Brachman and J. Schmolze (1985)
   A whole family of KL-ONE like systems

   Concepts
       Distinguish Primitive and Defined concepts
       Only defined concepts are classifiable

   Frames
        Marvin Minsky, "A Framework for Representing Knowledge", 1974
        OO Languages (CLOS) ≡ Frame Language
        Think of class definitions as frames, where slots are attribute-value pairs
         and you use pattern matching to fill in the slots; once all the slots are
         filled, the concept becomes defined and classifiable (a CLOS sketch follows this slide).
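
As a loose illustration of the frames ≡ CLOS correspondence, a frame can be sketched as a class whose slots are the attribute-value pairs (the class and slot names below are hypothetical, not KL-ONE or BABAR definitions):

;; A hypothetical frame rendered as a CLOS class.
(defclass elephant-frame ()
  ((isa      :initarg :isa      :initform 'mammal        :accessor frame-isa)
   (habitat  :initarg :habitat  :initform nil            :accessor frame-habitat)
   (has-part :initarg :has-part :initform '(trunk tusks) :accessor frame-has-part)))

;; Filling in the remaining slot "defines" the concept, at which point it
;; becomes classifiable.
(defvar *asian-elephant*
  (make-instance 'elephant-frame :habitat 'asia))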
HyperGraphs
   A hypergraph is a graph in which edges are first-class
    objects and can be linked to other edges or vertices.
   Hypergraphs are a natural and convenient way of
    representing sentences and meta-statements.
                    [Figure: a hypergraph with vertices Jane, Jim, John and Mom,
                     and labeled edges Married (between Jane and Jim), Loves, Likes,
                     Disapproves (from John to the Married edge) and Resents
                     (from Mom to the Disapproves edge).]
   Mom resents the fact that John disapproves of Jane and
    Jim’s Marriage.
   BABAR uses an in-memory HyperGraph as its Semantic Net (a sketch of
    reified edges follows this slide).
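
A minimal sketch of the reified-edge idea, i.e. edges as first-class objects (these structure definitions are illustrative assumptions, not BABAR's graph classes):

;; An edge's endpoints may themselves be edges, which is what lets us
;; state facts about facts.
(defstruct vertex name)
(defstruct edge label from to)

(let* ((jane        (make-vertex :name "Jane"))
       (jim         (make-vertex :name "Jim"))
       (john        (make-vertex :name "John"))
       (mom         (make-vertex :name "Mom"))
       (married     (make-edge :label 'married     :from jane :to jim))
       (disapproves (make-edge :label 'disapproves :from john :to married))
       (resents     (make-edge :label 'resents     :from mom  :to disapproves)))
  ;; "Mom resents the fact that John disapproves of Jane and Jim's marriage."
  resents)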
Natural Language Processing
   Lexical Analysis
       Understanding the role and morphological nature of words.
       Morphology, Orthography, Part of Speech Tagging
       Typically use Lexicons: Dictionaries, etc…
       Programs that do this are called Scanners or Lexical Analyzers
       ScanGen and LEX on Unix systems for Programming Languages

   Syntactic Analysis
       Understanding the grammatical nature of groups of words
       Programs that do this are called Parsers.
       They take tokens produced by scanners/analyzers and apply them
        to a grammar.
       In doing so they typically produce parse trees.
       NLP parsing methodologies include:
            Top-Down Parsers (recursive descent)
           Bottom-Up Parsers
       ParseGen and YACC on Unix systems for Programming Languages

   Semantic Analysis
       Extracting phrase structure from parse trees and producing
        statements in some knowledge representation language such as
        clausal-form logic.
       KRL: "An Overview of KRL, a Knowledge Representation
        Language", D.G. Bobrow and T. Winograd, (1977).
Lexical Analysis
   Morphology
       The rules that govern word morphing
       foxes ≡ fox+<plural>

   Orthography
       The rules that govern spelling
       Plural of fox ≡ fox+’es’

   Transducers
       Define languages consisting of pairs of strings
       Loosely: Finite Automaton with 2 state transition functions.
       Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).
       FST: Finite State Transducer
       Surface level, Intermediate level, Lexical level
            E.g. foxes → fox+es → fox+N+PL
       Parsing, Generating & Translating

   Morphological Parser
           Lexicons, Morphotactics and Orthographic Rules
           Penn Treebank Parts of Speech Tags (50)

   Probabilistic Approaches
       N-Gram model
       Counting word frequency
       See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009
       Google Translate
Lexical Analysis in BABAR
   Lexicons
       Regular words Lexicon
         http://www.merriam-webster.com/
         Query the site and extract parts of speech
         About 50,000 locally cached entries.
       Irregular Words Lexicons
         Irregular nouns
         Irregular verbs
         Irregular auxiliaries


   Orthographic Rules
       reverse engineer morphed words

   (analyze-morphed-word <word>)
       Analyzes word suffixes then queries MW.
Lexical Analysis Example
KB(5): (parser::analyze-morphed-word "traditionally")

Loading #P"C:\Projects\trunk\Data\Lexicons\Parts-of-Speech.lisp"

Loading table from file English-Irregular-Nouns ...
Loading table from file English-Irregular-Verbs ...
Loading table from file English-Irregular-Auxiliary ...

Initializing reverse lexicon table...

URL: "http://www.merriam-webster.com/dictionary/tradition"

   Returns five values:
    Base Form:           "tradition"
    Actual Form:         "traditionally"
    Primary POS:         :ADVERB
    Additional           NIL
    Complete POS         (:ADVERB)

   Reverse Engineering:

     traditionally (adverb) → traditional (adjective) → tradition (noun)

   Parts-of-Speech Lexicon currently has about 50,000 entries.
    Approximately one million words in the English language.
Syntactic Analysis
   Grammars
       Productions (grammatical rules)
           LHS: A non-terminal symbol
           RHS: A disjunction of conjunctions of TS & NTS
           Can be recursive
       Non-Terminal Symbols
       Terminal Symbols (lexicon entries)
       Start Symbol

       Implicitly Define an AND-OR Tree.
       Context-Free Grammars, Attribute Grammars

   Parsers
       Traverse a grammar while consuming input tokens in an attempt to find a
        valid path through the grammar that accommodates the input tokens.

       Produce parse trees in which the internal nodes are Non-Terminal Symbols
        (NTS) and the leaves are Terminal Symbols (TS)

       Three typical ways to handle non-determinism
           Backtracking
           Look-ahead
           Parallelism
Parsing in BABAR
   Implements a Recursive Descent Parser which performs a
    top-down traversal of the grammar.

   Uses backtracking to handle non-determinism
   3 Types of objects: tokens, grammars and parse-nodes

   Scanner
        Creates seven fundamental token classes based on
         character composition:
       alphabetic, numeric, special, alpha-numeric, alpha-special,
        numeric-special and alpha-numeric-special
       Implemented using multiple-inheritance:
           alphabetic-mixin, numeric-mixin and special-mixin classes

   Parser Module (Scanner, Analyzer, Parser)
        Implements a set of classes and generic functions that make it easy to
         develop particular domain-specific parsers (a toy recursive-descent
         sketch follows this slide).
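
To make the top-down traversal with backtracking concrete, here is a toy recursive-descent parser over a hypothetical three-production grammar; it only tracks the remaining tokens and is not the BABAR parser module:

;; Grammar:  S -> NP VP    NP -> det noun | noun    VP -> verb NP | verb
(defparameter *toy-lexicon*
  '((the . det) (a . det) (elephant . noun) (trunk . noun) (has . verb)))

(defun match (category tokens)
  "Consume one token of CATEGORY, returning the remaining tokens, else :FAIL."
  (if (and tokens (eq (cdr (assoc (first tokens) *toy-lexicon*)) category))
      (rest tokens)
      :fail))

(defun parse-np (tokens)
  (let ((after-det (match 'det tokens)))
    (if (eq after-det :fail)
        (match 'noun tokens)         ; backtrack: NP -> noun
        (match 'noun after-det))))   ; NP -> det noun

(defun parse-vp (tokens)
  (let ((after-verb (match 'verb tokens)))
    (if (eq after-verb :fail)
        :fail
        (let ((after-np (parse-np after-verb)))
          (if (eq after-np :fail)
              after-verb              ; backtrack: VP -> verb
              after-np)))))           ; VP -> verb NP

(defun parse-s (tokens)
  (let ((after-np (parse-np tokens)))
    (if (eq after-np :fail) :fail (parse-vp after-np))))

;; (parse-s '(the elephant has a trunk))  =>  NIL, i.e. every token consumed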
Level 1 (simple)
Class      grammar
Macro      (define-grammar <name> <prods> <preds> &key <class>)
GF         (scan-tokens <string> <grammar> &key <delimiter>)
GF         (parse-tokens <tokens> <grammar>)

Level 2 (context)
Class      context-grammar

Macro      (define-context-grammar <name> <prods> <preds> <context>)

Macro      (with-grammar-context (<context> <grammar>) &body <body>)

GF         (analyze-tokens <tokens> <grammar>)

Level 3 (domain)
Macro      (define-lexicon <name> <fields>)
Macro      (define-word-class <word-type> &optional <slots>)
Level 4 (english)
Adds       english-grammar, scan-tokens, analyze-word-morphology
Crawling Wikipedia
   Wikipedia has approximately 4 million pages.

(initialize-wiki-graph <topic> <depth>)
     Returns a graph object

(crawl-wiki-topic <topic> <depth>)
   Returns a Hash-Table of related-topics
    For topic=elephant and depth=3:
      #<EQUALP hash-table with 2580 entries>

(generate-wiki-graph <hash-table>)
    Only create a vertex for keys (pruning)
    Non-key related topics are ignored (pruning)
    Create a ‘related-to edge for every (<key> <related-topic>) pair
     (a sketch of the pruning rule follows this slide).

   Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>
   With Pruning:    #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)
   With Pruning:    #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)

    A complete graph of n vertices has n(n-1)/2 edges ≡ O(n²)
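
A minimal sketch of the pruning rule above, using an adjacency table to stand in for the real graph object (GENERATE-PRUNED-GRAPH is a hypothetical name):

;; Only hash-table keys become vertices; a related-to edge is kept only
;; when the related topic is itself a key.
(defun generate-pruned-graph (related-topics)
  "RELATED-TOPICS: an EQUALP hash-table mapping a topic to its related topics."
  (let ((graph (make-hash-table :test #'equalp)))
    ;; One vertex (adjacency list) per key.
    (loop for topic being the hash-keys of related-topics
          do (setf (gethash topic graph) '()))
    ;; Keep only edges whose target is also a key.
    (loop for topic being the hash-keys of related-topics
            using (hash-value related)
          do (dolist (r related)
               (when (nth-value 1 (gethash r graph))
                 (push r (gethash topic graph)))))
    graph))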
Link Name Organization
   Internal, External and Intranal hyperlinks
   I chose the Elephant page as my entry page for crawling
   There are 228 internal links from the Elephant page.
   These occur throughout 103 paragraphs of text
   Goal: Organize the 228 links into a meaningful taxonomy

     Elephant
        -> Asian_Elephant
        -> African_Elephant
             -> African_Bush_Elephant
             -> African_Forest_Elephant


   Apply NLP to link names: i.e. parse the link names.
   Partition link names into subtopic, supertopic and related.
       Subtopic candidate elimination
   Partition related topics into strongly and weakly related
    based on link bi-directionality
Subtopic Taxonomy Generation Algorithm
(generate-subtopic-relations-in-graph <graph>)
1. Produce Candidates: a list of pairs of concepts, where the first concept
of each pair is a generalization of the second. This is determined by noting
concepts that, when parsed, produce a set of tokens that is a subset of the
set of tokens produced by parsing the second concept (a sketch of this test
follows this slide).
2. Eliminate False-Positives: These are eliminated by ensuring that the
subjects of the phrases of each set of parsed tokens are identical.
      E.g. Elephant_Hotel is not a subtopic of Elephant whereas
       Hotel_Elephant would be a subtopic of Elephant. This is one place
       where NLP really adds value.
3. Replace ‘related-to relations with ‘generalizes relations.
4. Eliminate direct ‘generalizes relationships between children and
non-parent ancestors.
      E.g. Elephant and North_African_Elephant.
5. Eliminate Singletons: Prune the list of subtrees by eliminating
singleton subtrees, leaving them in a yet-to-be-classified state.
 Finally, return a forest of trees, i.e. a list of root nodes.
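
A minimal sketch of the tests in steps 1 and 2, assuming link names have already been parsed into lists of word tokens; GENERALIZES-P and SAME-SUBJECT-P are hypothetical helpers, and taking the last token as the phrase subject is a simplification:

(defun generalizes-p (general specific)
  "GENERAL generalizes SPECIFIC when its tokens are a proper subset of
SPECIFIC's tokens, e.g. (\"elephant\") vs (\"asian\" \"elephant\")."
  (and (subsetp general specific :test #'string-equal)
       (< (length general) (length specific))))

(defun same-subject-p (general specific)
  "Step 2's false-positive filter: the phrase subjects must be identical."
  (string-equal (first (last general)) (first (last specific))))

;; (generalizes-p  '("elephant") '("hotel" "elephant"))  => T
;; (same-subject-p '("elephant") '("elephant" "hotel"))  => NIL  ; Elephant_Hotel
;; (same-subject-p '("elephant") '("hotel" "elephant"))  => T    ; Hotel_Elephant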
Subtopic Taxonomies
  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes
   (62%) and 986 yet to be classified nodes.

 Elephant Tree
 -> Elephant
   -> Dwarf_elephant
   -> Sri_Lankan_elephant
   -> Year_of_the_Elephant
   -> Sumatran_Elephant
   -> White_elephant
   -> War_elephant
   -> Crushing_by_elephant
   -> Babar_the_Elephant
   -> Indian_Elephant
   -> African_elephant
      -> African_Forest_Elephant
      -> North_African_Elephant
      -> African_Bush_Elephant
   -> Execution_by_elephant
   -> Borneo_pygmy_elephant
   -> Horton_the_Elephant
   -> Asian elephant
   -> Elmer_the_Patchwork_Elephant

 Elephant_Seal Tree
 -> Elephant_seal
   -> Southern_elephant_seal
   -> Northern_elephant_seal

 Intelligence Tree
 -> Intelligence
   -> Fish_intelligence
   -> Cat_intelligence
   -> Artificial_intelligence
      -> Electronic_Transactions_on_Artificial_Intelligence
   -> Swarm_intelligence
   -> Cephalopod_intelligence
   -> Dinosaur_intelligence
   -> Cetacean_intelligence
   -> Evolution_of_human_intelligence
   -> Elephant_intelligence
   -> Dog_intelligence
   -> Pigeon_intelligence
   -> Primate_intelligence
   -> Bird_intelligence
Subtopic Taxonomy Issues
 -> Lion
   -> Congolese_Spotted_Lion
   -> Asiatic_Lion
   -> Masai_lion
   -> Barbary_lion
   -> Henry_the_Lion
   -> Sri_Lanka_lion
   -> Nemean_lion
   -> Western_African_lion
   -> Transvaal_Lion
   -> West_African_lion
   -> Tsavo_lion
   -> Southwest_African_Lion
   -> European_lion
   -> Cape Lion
   -> Sea_lion
      -> Steller_sea_lion
      -> Australian_sea_lion
      -> South_American_sea_lion
      -> New_Zealand_sea_lion
      -> California_sea_lion
   -> American_lion
   -> White_lion
      -> Kimba_the_White_Lion
   -> Cowardly_Lion
   -> Tiger_versus_lion
WRT Nomenclature purity, Lion_Seal is a better name than Sea_Lion.
Clustering
   Two Fundamental Perspectives:
     Top-Down: Partitioning a set into disjoint subsets
     Bottom-Up: Grouping data points into disjoint clusters


   Goes hand-in-hand with classification

   Typically involves a metric: Euclidean or Manhattan distance

   Many, many different algorithms & books.

   Some really popular algorithms:
     K-Means Clustering (EM, PCA)
     Hierarchical Agglomerative Clustering
     K-Nearest Neighbor (classification)


   SR-Clustering: This is something I (re)invented.
       Effectively: The world’s simplest clustering algorithm.
K-Means Clustering (1)
   Given an initial set of cluster centroids, determine
    the actual centroids of each cluster via an
    iterative refinement algorithm.

   Each refinement iteration consists of two steps :
    1. Computing new data point centroid assignments
     2. Computing new centroid positions as the mean of the
     data points assigned to each centroid.

    Convergence, divergence, oscillation…

   Also known as Lloyd’s Algorithm in CS.
K-Means Clustering (2)
Wikipedia: Given a set of observations
(x1, x2, …, xn), where each observation is a d-
dimensional real vector, k-means clustering aims
to partition the n observations into k sets (k ≤ n) S
= {S1, S2, …, Sk} so as to minimize the within-
cluster sum of squares (WCSS):

     arg min_S  Σ_{i=1..k}  Σ_{x ∈ Si} ‖x − μi‖²

where μi is the mean of the points in Si.
K-Means Clustering (3)
 Assignment Step:

     Si(t) = { xp : ‖xp − μi(t)‖² ≤ ‖xp − μj(t)‖²  for all 1 ≤ j ≤ k }

   Assigns each observation xp to the cluster whose current centroid μi(t)
   it deviates least from.

 Update Step:

     μi(t+1) = (1 / |Si(t)|) Σ_{xj ∈ Si(t)} xj

   Calculate the new means to be the centroids of the observations in each
   cluster, i.e. the average along each dimension.
K-Means Clustering(4)
   K-Means is *really* a 3-step algorithm (a minimal sketch follows this slide)
     Step 1. Initialize K-Means (non-trivial)
        Problem 1: Estimate K
        Problem 2: Pick an initial centroid for each of the K clusters
     Iterative Refinement
        Step 2: Centroid Assignments
        Step 3: Centroid Update


   Many initialization approaches:
       Random, Forgy, MacQueen and Kaufman

   Performance depends on initialization and instance ordering
   Popular because of its robustness
   Related to:
       EM Algorithm and
       Principal Component Analysis (PCA)
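
A minimal K-Means sketch for points represented as lists of numbers, with the initial centroids passed in explicitly (the hard part, as noted above); NEAREST-CENTROID and K-MEANS are illustrative names, not a library API:

(defun squared-distance (p q)
  (reduce #'+ (mapcar (lambda (a b) (expt (- a b) 2)) p q)))

(defun nearest-centroid (point centroids)
  "Index of the centroid POINT deviates least from."
  (let ((best 0)
        (best-d (squared-distance point (first centroids))))
    (loop for c in (rest centroids)
          for i from 1
          for d = (squared-distance point c)
          when (< d best-d) do (setf best i best-d d))
    best))

(defun k-means (points centroids &key (iterations 10))
  "Lloyd's iteration: assignment step, then update step."
  (dotimes (pass iterations centroids)
    (let ((clusters (make-array (length centroids) :initial-element nil)))
      ;; Assignment step: group each point with its nearest centroid.
      (dolist (p points)
        (push p (aref clusters (nearest-centroid p centroids))))
      ;; Update step: each centroid becomes the mean of its cluster
      ;; (an empty cluster keeps its previous centroid).
      (setf centroids
            (loop for c in centroids
                  for j from 0
                  for members = (aref clusters j)
                  collect (if members
                              (apply #'mapcar
                                     (lambda (&rest xs)
                                       (/ (reduce #'+ xs) (length members)))
                                     members)
                              c))))))

;; (k-means '((1.0 1.0) (1.5 2.0) (8.0 8.0) (9.0 9.0)) '((0.0 0.0) (10.0 10.0)))
;;   => ((1.25 1.5) (8.5 8.5))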
Hierarchical Agglomerative Clustering
   The Algorithm
    1. Cluster each data point with its nearest neighbor(s)
    and make that a new data point (cluster).
    2. Repeat until some fixed number of clusters is reached.

   K-Nearest Neighbor is often used hand-in-hand with
    agglomerative clustering to compute the nearest
    neighbor(s).

   End up with a tree of clusters (clustering history)

   This tree is called a dendrogram

   See Chapter 6 of Duda & Hart (SRI, 1973)
    Pattern Classification & Scene Analysis
SR-Clustering (1)
 Simple Ray Clustering
     Sort of like non-hierarchical agglomerative clustering
 Basic Algorithm (a minimal sketch follows this slide)
     For each data point, place it in a compatible existing cluster
     If it doesn't belong to any cluster, create a new
      cluster consisting of that single data point
 Cluster   Membership
     Defined as being within a certain proximity
      threshold of every data point in that cluster.
 Proximity   Metric
     The Jaccard Index
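
A minimal sketch of SR-Clustering, assuming each data point is already a set (e.g. a list of related-topic names) so that the Jaccard index defined on the next slides can serve as the proximity metric; *SR-THRESHOLD* and the function names are illustrative:

(defparameter *sr-threshold* 0.20)

(defun jaccard-index (set1 set2)
  (let ((union (union set1 set2 :test #'equal)))
    (if (null union)
        0
        (/ (length (intersection set1 set2 :test #'equal))
           (length union)))))

(defun belongs-p (point cluster)
  "POINT belongs to CLUSTER when it is within the proximity threshold of
every data point already in that cluster."
  (every (lambda (member) (>= (jaccard-index point member) *sr-threshold*))
         cluster))

(defun sr-cluster (points)
  ;; Place each point in the first compatible cluster, else start a new one.
  (let ((clusters '()))
    (dolist (point points (nreverse clusters))
      (let ((home (find-if (lambda (c) (belongs-p point c)) clusters)))
        (if home
            (nconc home (list point))
            (push (list point) clusters))))))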
Recommender Systems
   Used by Netflix, Amazon, etc…
   Objects: Users, Items & Preferences

   User vs. Item based recommendations
   Former aka collaborative filtering
   Mixed method recommendations
   Based on User Similarity and/or Item Similarity

   Jaccard Index takes into account dissimilarity and
    does not require preference measurements.

   Apache Mahout (leverages Hadoop)
Jaccard Index
 Defines   a Similarity Metric between two sets

 Wikipedia:  The Jaccard coefficient measures
 similarity between sample sets, and is defined
 as the size of the intersection divided by the size
  of the union of the sample sets:

      J(A, B) = |A ∩ B| / |A ∪ B|

  Jaccard Distance:

      dJ(A, B) = 1 − J(A, B)
Another Similarity Metric
 Pearson Correlation Coefficient

 Wikipedia: Defined as the covariance of the two variables divided by the
 product of their standard deviations (a small sketch follows):

      ρ(X, Y) = cov(X, Y) / (σX σY)
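
A minimal sketch computing the Pearson correlation of two equal-length samples (the population variance is used in both numerator and denominator, so the ratio is unaffected):

(defun mean (xs)
  (/ (reduce #'+ xs) (length xs)))

(defun pearson-correlation (xs ys)
  "cov(X,Y) / (sigma-X * sigma-Y), computed over paired samples."
  (let* ((mx  (mean xs))
         (my  (mean ys))
         (cov (mean (mapcar (lambda (x y) (* (- x mx) (- y my))) xs ys)))
         (sx  (sqrt (mean (mapcar (lambda (x) (expt (- x mx) 2)) xs))))
         (sy  (sqrt (mean (mapcar (lambda (y) (expt (- y my) 2)) ys)))))
    (/ cov (* sx sy))))

;; (pearson-correlation '(1 2 3 4) '(2 4 6 8))  =>  1.0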
(compute-similarity-matrix <topics>)
   Computes the Jaccard index for pairs of topics by
    using the related topics of each topic as the sets to
    be compared.


             African   Asian    Indian   Babar    Horton     War

African      100.00     38.46    21.05     4.35     6.82     7.94

Asian         38.46    100.00    37.74     4.00     6.25    20.00

Indian        21.05     37.74   100.00     6.90     7.14    24.39

Babar           4.35     4.00     6.90   100.00    28.57     7.14

Horton          6.82     6.25     7.14    28.57   100.00     7.41

War             7.94    20.00    24.39     7.14     7.41   100.00
(cluster-subtopics <subtopics> <matrix> <threshold>)

Threshold = 20

Cluster 1: Asian_elephant(49), African_elephant(60)

Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5),
           Elmer_the_Patchwork_Elephant(4)

Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12),
           Sumatran_Elephant(11), Borneo_pygmy_elephant(3)

Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)

Cluster 5: Year_of_the_Elephant(8)

Cluster 6: Dwarf_elephant(24)

Cluster 7: White_elephant(10)
Knowledge Categories (1)
   Human schooling as a decade(s) long knowledge
    acquisition process

   Spanning Kindergarten – Post Doctoral work

   Idea is to use grade school topics as initial
    knowledge categories.

   Science, History, Geography, Literature & Art

   Goal: Assign categories to subtopic clusters

   Use Jaccard Index to determine the category

   Automatically create subtopic category names
    e.g. Babar → Literature_Elephant
(compute-cluster-categories <clusters>)

   Wiki Crawl each Knowledge Category (pre-run)
   Compute subtopics of each knowledge category

   Compute a category relevancy vector for each
    cluster member

   Combine the relevancy vectors of each cluster's members to
    compute a relevancy vector for the cluster (one plausible
    combination is sketched after this slide)

   Assign a category to the cluster
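
One plausible way to combine member relevancy vectors into a cluster vector, assuming each vector is an alist of (category . score) pairs; averaging is an assumption, since the slide does not specify the combination rule:

(defun combine-relevancy-vectors (vectors)
  "Average the member relevancy vectors per category, sorted descending."
  (let ((categories (remove-duplicates (mapcar #'car (apply #'append vectors)))))
    (sort (mapcar (lambda (category)
                    (cons category
                          (/ (reduce #'+ (mapcar (lambda (v)
                                                   (or (cdr (assoc category v)) 0))
                                                 vectors))
                             (length vectors))))
                  categories)
          #'> :key #'cdr)))

(defun cluster-category (vectors)
  "The assigned category is the highest-scoring one."
  (car (first (combine-relevancy-vectors vectors))))

;; (cluster-category '(((:science . 0.6) (:history . 0.4))
;;                     ((:science . 0.4) (:history . 0.5))))
;;   => :SCIENCE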
(compute-cluster-categories <clusters>)
(((( :SCIENCE 0.47666672) (:HISTORY 0.44666672))
  (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))

((( :SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
 (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant>
  #<Concept(18): Indian_Elephant> #<Concept(12): Sri_Lankan_elephant>
  #<Concept(11): Sumatran_Elephant>))

((( :ART 0.33333334) (:GEOGRAPHY 0.30666667))
 (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant>
  #<Concept(4): Elmer_the_Patchwork_Elephant>))

((( :HISTORY 0.6) (:GEOGRAPHY 0.46))
 (#<Concept(8): Year_of_the_Elephant>))

((( :GEOGRAPHY 0.72333336) ( :HISTORY 0.43666664))
 (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant>
  #<Concept(4): Crushing_by_elephant>))

((( :GEOGRAPHY 0.86) ( :SCIENCE 0.5))
 (#<Concept(24): Dwarf_elephant>))

(( ( :SCIENCE 0.69) (:ART 0.49))
(#<Concept(10): White_elephant>)))
Individual Subtopic Categories
The following shows the knowledge category relevancies for some of the 16
subtopics of Elephant and helps explain the results of the previous slide.

(#<Concept(7): Babar_the_Elephant>
 (( :LITERATURE 0.44) ( :ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) ( :SCIENCE 0.17)))

(#<Concept(4): Elmer_the_Patchwork_Elephant>
 (( :ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) ( :SCIENCE 0.17)))

(#<Concept(5): Horton_the_Elephant>
 (( :ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) ( :LITERATURE 0.22)))

(#<Concept(60): African_elephant>
((:ART 1.03) ( :SCIENCE 0.91) ( :HISTORY 0.85) (:GEOGRAPHY 0.77) ( :LITERATURE 0.37)))

(#<Concept(49): Asian_elephant>
 (( :HISTORY 0.7) ( :SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) ( :LITERATURE 0.19)))

(#<Concept(22): War_elephant>
 (( :HISTORY 0.93) ( :GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
Categorized Subtopic Clusters
   Elephant
       ART_Elephant
          Elmer_the_Patchwork_Elephant
          Horton_the_Elephant
          Babar_the_Elephant
       GEOGRAPHY_Elephant
          Dwarf_elephant
          Crushing_by_elephant
          War_elephant
          Execution_by_elephant
       HISTORY_Elephant
          Year_of_the_Elephant
       SCIENCE_Elephant
          African_elephant
              African_Forest_Elephant
               African_Bush_Elephant
          Asian_elephant
          White_elephant
          Sumatran_Elephant
          Sri_Lankan_elephant
          Indian_Elephant
           Borneo_pygmy_elephant
Related Topics Associations
   Associate related topics to subtopic clusters using Jaccard Index
   Use associations to create related topic clusters

(find-compatible-clusters <strongly-related-topics> <clusters>)

((#<Concept(60): African elephant>
  #<Concept(49): Asian_elephant>)
 (#<Concept(10): Elephant intelligence>
  #<Concept(103): Animal cognition>
  #<Concept(4): Elephant tusk>
  #<Concept(15): African>
  #<Concept(102): Proboscidea>
  #<Concept(96): Mammalia>
  #<Concept(876): Mammal>
  #<Concept(143): Hippopotamus>
  #<Concept(590): Lion>
  #<Concept(10): Loxodonta>))

((#<Concept(22): War_elephant>
  #<Concept(5): Execution_by_elephant>
  #<Concept(4): Crushing_by_elephant>)
 (#<Concept(55): Ivory>
  #<Concept(77): Kenya>
  #<Concept(31): Grief>
  #<Concept(8): History_of_elephants_in_Europe>))

((#<Concept(24): Dwarf elephant>)
 (#<Concept(66): Mammoth>
  #<Concept(25): Mastodon>
  #<Concept(275): Genus>
  #<Concept(62): Afrotheria>
  #<Concept(86): Gestation>
  #<Concept(749): Eutheria>
  #<Concept(8): Gomphotherium>
  #<Concept(27): Tooth>
  #<Concept(8): Tooth_development>))

((#<Concept(7): Babar_the_Elephant>
  #<Concept(5): Horton_the_Elephant>
  #<Concept(4): Elmer_the_Patchwork_Elephant>)
 (#<Concept(6): List_of_fictional_elephants>
  #<Concept(5): List_of_elephants_in_mythology_and_religion>
  #<Concept(5): Pinnawala>
  #<Concept(3): Katy_Payne>
  #<Concept(11): Infrasound>
  #<Concept(56): Incisor>
  #<Concept(14): Jeheskel_Shoshani>
  #<Concept(6): Aanayoottu>))
(sentence-to-clause <sentence>)
  English sentence string
    → Scanner → tokens
    → Analyzer → morphologically analyzed words
    → Parser → parse tree
    → Phrase extractor → phrases (flattened parse tree)
    → Semantic analyzer → frames for subject, verb, object and prepositional phrases
    → Clause generator → clause objects
Sample Extracted Clauses
   (HASA "Asian elephant species" "disjunct distributions")

   (ISA "Elephants" "herbivores")

   (HASA "African Elephants" “three nails")

   (HASA "Indian Elephants" "four nails")

   (HASA "female African Elephants" "large tusks")

   (ISA "Elephants" "large land mammals")
Things Overlooked
 Wiki   Page Contents Pane
     Provides page taxonomy
     Provides category names
     Provides related topic names

 Concept    Weights
Future Direction
 Enhance  English Parser
 Incorporate Variables into Semantic Net
 Leverage topic weights
 Work on language generation
 Produce Wiki Summary Pages
 Knowledge Queries
 Develop Client Side Browser
    Top Menu Bar Knowledge Categories
    RHS Dynamic Subtopic Tree
    LHS Wiki Page Content Pane
(references)
 Speech and Language Processing
  Jurafsky and Martin, 2009
 Artificial Intelligence: A Modern Approach
  Russell and Norvig
 Principles of Semantic Networks
  Edited by John F. Sowa, Morgan Kaufmann, 1991
 Machine Learning
  Tom Mitchell, 1997
 Pattern Classification and Scene Analysis
  Duda and Hart, 1973
 Algorithms of the Intelligent Web
  Marmanis and Babenko, 2009
(cluster-images)
(Love Elephants LispNYC)
Knowledge Extraction

More Related Content

What's hot (20)

PDF
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
ODP
Topic Modeling
Karol Grzegorczyk
 
PDF
T9. Trust and reputation in multi-agent systems
EASSS 2012
 
PPTX
Treebank annotation
Mohit Jasapara
 
PDF
Introduction to natural language processing
Minh Pham
 
PPTX
Introduction to Soft Computing
Aakash Kumar
 
PPTX
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Sehrish Asif
 
PPT
Type Checking(Compiler Design) #ShareThisIfYouLike
United International University
 
PPTX
Natural language processing
Yogendra Tamang
 
PDF
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
PPTX
FUNCTION APPROXIMATION
ankita pandey
 
PPTX
Introduction to Natural Language Processing
Mercy Rani
 
PDF
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
PPTX
NLP
guestff64339
 
PPTX
Natural language processing
Abash shah
 
PPTX
Inference in First-Order Logic
Junya Tanaka
 
PPTX
Knowledge representation
Md. Tanvir Masud
 
PDF
Daa notes 3
smruti sarangi
 
PPT
Vanishing & Exploding Gradients
Siddharth Vij
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
Topic Modeling
Karol Grzegorczyk
 
T9. Trust and reputation in multi-agent systems
EASSS 2012
 
Treebank annotation
Mohit Jasapara
 
Introduction to natural language processing
Minh Pham
 
Introduction to Soft Computing
Aakash Kumar
 
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Sehrish Asif
 
Type Checking(Compiler Design) #ShareThisIfYouLike
United International University
 
Natural language processing
Yogendra Tamang
 
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
FUNCTION APPROXIMATION
ankita pandey
 
Introduction to Natural Language Processing
Mercy Rani
 
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
Natural language processing
Abash shah
 
Inference in First-Order Logic
Junya Tanaka
 
Knowledge representation
Md. Tanvir Masud
 
Daa notes 3
smruti sarangi
 
Vanishing & Exploding Gradients
Siddharth Vij
 

Viewers also liked (14)

PPT
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Health Informatics New Zealand
 
PPTX
Turing
Rupak Chakraborty
 
PPTX
KNOWLEDGE: REPRESENTATION AND MANIPULATION
Maria Angela Leabres-Diopol
 
PPS
Lecture 4 Meta Knowledge
Simon Shurville
 
DOCX
7. knowledge acquisition, representation and organization 8. semantic network...
AhL'Dn Daliva
 
PPT
Representation of knowledge
Veera Balaji kumar veeraswamy
 
PPT
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
piero scaruffi
 
PPTX
Turing Test
Rogério Nascimento
 
PPTX
Knowledge Extraction from Social Media
Seth Grimes
 
PPTX
Turing test
Dipesh Senseii
 
PPS
Artificial Intelligence
sanjay_asati
 
PPTX
Knowledge representation and Predicate logic
Amey Kerkar
 
PPT
Knowledge Representation in Artificial intelligence
Yasir Khan
 
PPTX
Knowledge representation in AI
Vishal Singh
 
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Health Informatics New Zealand
 
KNOWLEDGE: REPRESENTATION AND MANIPULATION
Maria Angela Leabres-Diopol
 
Lecture 4 Meta Knowledge
Simon Shurville
 
7. knowledge acquisition, representation and organization 8. semantic network...
AhL'Dn Daliva
 
Representation of knowledge
Veera Balaji kumar veeraswamy
 
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
piero scaruffi
 
Turing Test
Rogério Nascimento
 
Knowledge Extraction from Social Media
Seth Grimes
 
Turing test
Dipesh Senseii
 
Artificial Intelligence
sanjay_asati
 
Knowledge representation and Predicate logic
Amey Kerkar
 
Knowledge Representation in Artificial intelligence
Yasir Khan
 
Knowledge representation in AI
Vishal Singh
 
Ad

Similar to Knowledge Extraction (20)

PDF
Constructive Description Logics 2006
Valeria de Paiva
 
PDF
A Bridge Not too Far
Valeria de Paiva
 
PDF
Logics of Context and Modal Type Theories
Valeria de Paiva
 
PDF
Meaning Extraction - IJCTE 2(1)
IT Industry
 
PPT
KNOWLEDGE Representation unit 3 for data mining
RGAYATHRI25
 
PDF
Constructive Hybrid Logics
Valeria de Paiva
 
PPT
Chapter 12 knowledge representation nd description
AfraseyabKhan1
 
PDF
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Valeria de Paiva
 
PDF
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML
 
PPTX
AI material for you computer science.pptx
kerimu1235
 
PPTX
gdhfjdhjcbdjhvjhdshbajhbvdjbklcbdsjhbvjhsdbvjjv
samjohnson7350
 
PDF
From Linked Data to Semantic Applications
Andre Freitas
 
PPTX
Building AI Applications using Knowledge Graphs
Andre Freitas
 
PDF
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
PDF
Effective Semantics for Engineering NLP Systems
Andre Freitas
 
PPTX
Unit II Natural Language Processing.pptx
sriramrpselvam
 
PPTX
Foundations of Knowledge Representation in Artificial Intelligence.pptx
kitsenthilkumarcse
 
PPTX
Knowledge Representation and Reasoning.pptx
MohanKumarP34
 
PPTX
chapter2 Know.representation.pptx
wendifrawtadesse1
 
PPT
Semantics
Mohammed Al-Meqdad
 
Constructive Description Logics 2006
Valeria de Paiva
 
A Bridge Not too Far
Valeria de Paiva
 
Logics of Context and Modal Type Theories
Valeria de Paiva
 
Meaning Extraction - IJCTE 2(1)
IT Industry
 
KNOWLEDGE Representation unit 3 for data mining
RGAYATHRI25
 
Constructive Hybrid Logics
Valeria de Paiva
 
Chapter 12 knowledge representation nd description
AfraseyabKhan1
 
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Valeria de Paiva
 
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML
 
AI material for you computer science.pptx
kerimu1235
 
gdhfjdhjcbdjhvjhdshbajhbvdjbklcbdsjhbvjhsdbvjjv
samjohnson7350
 
From Linked Data to Semantic Applications
Andre Freitas
 
Building AI Applications using Knowledge Graphs
Andre Freitas
 
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
Effective Semantics for Engineering NLP Systems
Andre Freitas
 
Unit II Natural Language Processing.pptx
sriramrpselvam
 
Foundations of Knowledge Representation in Artificial Intelligence.pptx
kitsenthilkumarcse
 
Knowledge Representation and Reasoning.pptx
MohanKumarP34
 
chapter2 Know.representation.pptx
wendifrawtadesse1
 
Ad

More from Pierre de Lacaze (7)

PDF
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
PPTX
Deep Learning
Pierre de Lacaze
 
PPTX
Reinforcement Learning and Artificial Neural Nets
Pierre de Lacaze
 
PDF
Logic Programming and ILP
Pierre de Lacaze
 
PPTX
Meta Object Protocols
Pierre de Lacaze
 
PPTX
Prolog 7-Languages
Pierre de Lacaze
 
PPTX
Clojure 7-Languages
Pierre de Lacaze
 
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
Deep Learning
Pierre de Lacaze
 
Reinforcement Learning and Artificial Neural Nets
Pierre de Lacaze
 
Logic Programming and ILP
Pierre de Lacaze
 
Meta Object Protocols
Pierre de Lacaze
 
Prolog 7-Languages
Pierre de Lacaze
 
Clojure 7-Languages
Pierre de Lacaze
 

Recently uploaded (20)

PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
PDF
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PDF
Complete Network Protection with Real-Time Security
L4RGINDIA
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PPTX
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
PDF
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
PDF
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
PDF
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
PDF
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
Complete Network Protection with Real-Time Security
L4RGINDIA
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 

Knowledge Extraction

  • 1. (Knowledge Extraction) Raymond Pierre de Lacaze (RPL) LispNYC July 10th, 2012 [email protected]
  • 2. (John McCarthy) September 4th,1927 – October 24th, 2011 This talk is dedicated to the memory of John McCarthy  Inventor of the Lisp Language (1958)  Founder of Artificial Intelligence  Winner of the Turing award (1971)  Designer of Elephant 2000  Programming Language based on speech acts  http://www-formal.stanford.edu/jmc/elephant/elephant.html  May He Rest in Peace
  • 3. BABAR: Project Goals  Leverage Wikipedia as a Knowledge Base  Infer Infrastructure & Extract Content  Create Wiki Topic Taxonomies  Generate Knowledge Hypergraphs  Investigate Conceptual Relevance Metrics  Generate Knowledge summaries  Answer Knowledge base queries  Evolve a new generation of web browsers: Knowledge Browsers
  • 4. Overview  Brief Overview AI  Knowledge Representation  Natural Language Processing  Examine Specific Algorithms  Semantic Nets & Hypergraphs  Recursive Descent Parsing  Clustering Algorithms  Similarity Metrics  Describe Aspects of the BABAR System  Semantic Link Analysis  Automatic Topic Taxonomy Generation  Knowledge Category Assignment  Content Extraction  English Phrase to Clausal Form Logic
  • 5. AI Technologies Discussed  Knowledge Representation  Clausal Form Logic  Semantic Nets  Hypergraphs  Natural Language Processing  Lexical Analysis  Syntactic Analysis  Recursive Descent Parsing  Semantic Analysis  Machine Learning Techniques  Clustering Algorithms  K-Means, Agglomerative and SR Clustering  Similarity Metrics  Jaccard Index  Pearson Correlation
  • 6. Logics used in Artificial Intelligence  Monotonic Logic (standard)  Non-Monotonic Logic (exceptions)  (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly  Sorted Logics (types)  Fuzzy Logic (continuous truth values)  Higher-Order Logics (meta-statements)  Modal Logics (may, can, must)  Intentional Logics (know, believe, think)  Temporal Logics (temporal operators)  Point-Based Temporal Logic (moments)  Interval Time Logic (Allen 1986, 13 temporal operators)  Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.  Logics can be expressed in clausal form: (ancestor ?x ?y)  (parent ?x ?y) (ancestor ?x ?y)  (parent ?x ?z)(ancestor ?z ?y) Note: The variables ?x and ?y are universally quantified, whereas the variable ?z is existentially quantified.
  • 7. Clausal Form Logic  Propositional Calculus (PC)  Fully grounded clauses  No variables  (Brother John Jill),  (Parent Jane Jill)  (Mother Jane Jill)  First Order Predicate Calculus (FOPC)  Variables  Universally qualified (for all ?x)  Existentially qualified (there exists ?x)  (Elephant ?x)  (Has-Tusks ?x)  Converting 1st order logic to FOPC  Skolem constants (there exists x for all y such that…)  Skolem functions (for each x there exists a y such that…)  Second Order Predicate Calculus  Predicates and clauses can be arguments  Meta statements  Gödel's Incompleteness Theorem  Horn Clauses  Wikipedia: In computational logic, a Horn clause is a clause with at most one positive literal  B  (A1 ^ …. ^ An) ≡ ¬A1 v … v ¬A2 v B  (<LHS> <RHS>) ≡ ((B) (A1…An))
  • 8. Automated Reasoning  Unification Algorithm  Clausal pattern matching and variable binding  (unify (P ?x ?y) (P A (Q ?x)))  Returns bindings: ((?x A) (?y (Q ?x))  Instantiation: (P A (Q A))  Rete Algorithm  Charles L. Forgy, CMU, 1974  Addresses the many-many matching problem  Matching facts to rules in rule-based systems  Donald Knuth , Volume 3.  Automated Reasoners  Backward Chaining Reasoners  Work from conclusion  axioms (facts)  Good when state space branching factor is large  Forward Chaining Reasoners  Work from axioms  conclusion  Good when the depth state space is large  Mixed methods Perform both forward & backward chaining  GPS (Ernst & Newell, 1969)  Island hopping
  • 9. Semantic Nets  Labeled, directed (or not) and weighted (or not) Graphs  Equivalent in expressiveness to FOPC  Graphical representation of 1st order logic.  ISA Hierarchies  Subsumption (Bill Woods)  KL-ONE System: R.J. Brachman and J. Schmolze (1985)  A whole family of KL-ONE like systems  Concepts  Distinguish Primitive and Defined concepts  Only defined concepts are classifiable  Frames  Marvin Minsky , "A Framework for Representing Knowledge.“, 1974  OO Languages (CLOS) ≡ Frame Language  Think of class of definitions as frames, where slots are attribute-value pairs and you use pattern matching to fill in all the slots at which point a concept becomes defined and classifiable.
  • 10. HyperGraphs  A hypergraph is graph in which edges are first class objects and can be linked to other edges or vertices.  Hypergraphs are a natural and convenient way of representing sentences and meta-statements. Married Jane Jim Disapproves Loves Likes Mom Resents John  Mom resents the fact that John disapproves of Jane and Jim’s Marriage.  BABAR uses an in memory HyperGraph  Semantic Net
  • 11. Natural Language Processing  Lexical Analysis  Understanding the role and morphological nature of words.  Morphology, Orthography, Part of Speech Tagging  Typically use Lexicons: Dictionaries, etc…  Programs that do this are called Scanners or Lexical Analyzers  ScanGen and LEX on Unix systems for Programming Languages  Syntactic Analysis  Understanding the grammatical nature of groups of words  Programs that do this are called Parsers.  They take tokens produced by scanners/analyzers and apply them to a grammar.  In doing so they typically produce parse trees.  NLP parsing methodologies include:  Top-Down Parsers(recursive descent)  Bottom-Up Parsers  ParseGen and YACC on Unix systems for Programming Languages  Semantic Analysis  Extracting phrase structure from parse trees and producing statements in some knowledge representation language such as clausal-form logic.  KRL: "An Overview of KRL, a Knowledge Representation Language", D.G. Bobrow and T. Winograd, (1977).
  • 12. Lexical Analysis  Morphology  The rules that govern word morphing  foxes ≡ fox+<plural>  Orthography  The rules that govern spelling  Plural of fox ≡ fox+’es’  Transducers  Define languages consisting of pairs of strings  Loosely: Finite Automaton with 2 state transition functions.  Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).  FST: Finite State Transducer  Surface level, Intermediate level, Lexical level  E.g. foxes  fox+es  fox+N+PL  Parsing, Generating & Translating  Morphological Parser  Lexicons, Morphotactics and Orthographic Rules  Penn Treebank Parts of Speech Tags (50)  Probabilistic Approaches  N-Gram model  Counting word frequency  See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009  Google Translate
  • 13. Lexical Analysis in BABAR  Lexicons  Regular words Lexicon  http://www.merriam-webster.com/  Query the site and extract parts of speech  About 50,000 locally cached entries.  Irregular Words Lexicons  Irregular nouns  Irregular verbs  Irregular auxiliaries  Orthographic Rules  reverse engineer morphed words  (analyze-morphed-word <word>)  Analyzes word suffixes then queries MW.
  • 14. Lexical Analysis Example KB(5): (parser::analyze-morphed-word "traditionally“ ) Loading #P"C:ProjectstrunkDataLexiconsParts-of-Speech.lisp" Loading table from file English-Irregular-Nouns ... Loading table from file English-Irregular-Verbs ... Loading table from file English-Irregular-Auxiliary ... Initializing reverse lexicon table... URL: "http://www.merriam-webster.com/dictionary/tradition"  Returns five values: Base Form: "tradition" Actual Form: "traditionally" Primary POS: :ADVERB Additional NIL Complete POS (:ADVERB)  Reverse Engineering: traditionally (adverb)  traditional (adjective)  tradition (noun)  Parts-of-Speech Lexicon currently has about 50,000 entries.  Appriximately one million words in the English language
  • 15. Syntactic Analysis  Grammars  Productions (grammatical rules)  LHS: A non-terminal symbol  RHS: A disjunction of conjunctions of TS & NTS  Can be recursive  Non-Terminal Symbols  Terminal Symbols (lexicon entries)  Start Symbol  Implicitly Define an AND-OR Tree.  Context-Free Grammars, Attribute Grammars  Parsers  Traverse a grammar while consuming input tokens in an attempt to find a valid path through the grammar that accommodates the input tokens.  Produce parse trees in which the internal nodes are Non-Terminal Symbols (NTS) and the leaves are Terminal Symbols (TS)  Three typical ways to handle non-determinism  Backtracking  Look-ahead  Parallelism
  • 16. Parsing in BABAR  Implements a Recursive Descent Parser which performs a top-down traversal of the grammar.  Uses backtracking to handle non-determinism  3 Types of objects: tokens, grammars and parse-nodes  Scanner  Creates of seven fundamental token classes based on character composition  alphabetic, numeric, special, alpha-numeric, alpha-special, numeric-special and alpha-numeric-special  Implemented using multiple-inheritance:  alphabetic-mixin, numeric-mixin and special-mixin classes  Parser Module (Scanner, Analyzer, Parser)  Implements a set of classes and generic functions geared towards being easily able to develop particular domain–specific parsers.
  • 17. Level 1 (simple) Class grammar Macro (define-grammar <name><prods><preds> &key <class>) GF (scan-tokens <string> <grammar>&key <delimiter>) GF (parse-tokens <tokens> <grammar>) Level 2 (context) Class context-grammar Macro (define-context-grammar <name> <prods> <preds> <context>) Macro (with-grammar-context (<context><grammar>) &body <body>) GF (analyze-tokens <tokens> <grammar>) Level 3 (domain) Macro (define-lexicon <name> <fields>) Macro (define-word-class <word-type> &optional <slots>) Level 4 (english) Adds english-grammar, scan-tokens, analyze-word-morphology
  • 18. Crawling Wikipedia  Wikipedia has approximately 4 million pages. (initialize-wiki-graph <topic><depth>)  Returns a graph object (crawl-wiki-topic <topic> <depth>)  Returns a Hash-Table of related-topics  For topic=elephant and depth=  #<EQUALP hash-table with 2580 entries> (generate-wiki-graph <hash-table>)  Only create a vertex for keyss (pruning)  Non-key related topics are ignored (pruning)  Create a ‘related-to edge for every (<key> <related-topic>) pair.  Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>  With Pruning: #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)  With Pruning: #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)  A complete graph of n vertices has n(n-1)/2 edges ≡ O(n2)
  • 19. Link Name Organization  Internal, External and Intranal hyperlinks  I chose the Elephant page as my entry page for crawling  There are 228 internal links from the Elephant page.  These occur throughout 103 paragraphs of text  Goal: Organize the 228 links into a meaningful taxonomy Asian_Elephant Elephant African_Bush_Elephant African_Elephant African_Forest_Elephant  Apply NLP to link names: i.e. parse the link names.  Partition link names into subtopic, supertopic and related.  Subtopic candidate elimination  Partition related topics into strongly and weakly related based on link bi-directionality
  • 20. Subtopic Taxonomy Generation Algorithm (generate-subtopic-relations-in-graph <graph>) 1. Produce Candidates: a list of pairs of concepts. Each pair of concepts is such that the first concept is a generalization of the second concept. This is determined by noting concepts that when parsed produce a set of tokens that is subset of the set tokens produced by parsing the second concept. 2. Eliminate False-Poisitives: These are eliminated by ensuring that the subjects of the phrases of each set of parsed tokens are identical.  E.g. Elephant_Hotel is not a subtopic of Elephant whereas Hotel_Elephant would a be subtopic of Elephant. This is one place where NLP really adds value. 3. Replace ‘related-to relations with ‘generalizes relations. 4. Eliminate direct ‘generalizes relationships between children and non-parent ancestors.  E.g. Elephant and North_African_Elephant. 5. Eliminate Singletons: Prune the list of sub trees by eliminating singleton sub trees thus leaving them in a state of yet to be classified  Finally return a forest of trees, i.e. a list of root nodes.
  • 21. Subtopic Taxonomies  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes (62%) and 986 yet to be classified nodes. Elephant Tree Elephant_Seal Tree -> Elephant -> Elephant_seal -> Dwarf_elephant -> Southern_elephant_seal -> Northern_elephant_seal -> Sri_Lankan_elephant -> Year_of_the_Elephant -> Sumatran_Elephant Intelligence Tree -> White_elephant -> Intelligence -> Fish_intelligence -> War_elephant -> Cat_intelligence -> Crushing_by_elephant -> Artificial_intelligence -> Babar_the_Elephant -> Electronic_Transactions_on_Artificial_Intelligence -> Indian_Elephant -> Swarm_intelligence -> Cephalopod_intelligence -> African_elephant -> Dinosaur_intelligence -> African_Forest_Elephant -> Cetacean_intelligence -> North_African_Elephant -> Evolution_of_human_intelligence -> African_Bush_Elephant -> Elephant_intelligence -> Dog_intelligence -> Execution_by_elephant -> Pigeon_intelligence -> Borneo_pygmy_elephant -> Primate_intelligence -> Horton_the_Elephant -> Bird_intelligence -> Asian elephant -> Elmer_the_Patchwork_Elephant
  • 22. Subtopic Taxonomy Issues -> Lion -> Lion (cont.) -> Congolese_Spotted_Lion -> Sea_lion -> Asiatic_Lion -> Steller_sea_lion -> Masai_lion -> Australian_sea_lion -> Barbary_lion -> South_American_sea_lion -> Henry_the_Lion -> New_Zealand_sea_lion -> Sri_Lanka_lion -> California_sea_lion -> Nemean_lion -> American_lion -> Western_African_lion -> White_lion -> Transvaal_Lion -> Kimba_the_White_Lion -> West_African_lion -> Cowardly_Lion -> Tsavo_lion -> Tiger_versus_lion -> Southwest_African_Lion -> European_lion -> Cape Lion WRT Nomenclature purity, Lion_Seal is a better name than Sea_Lion.
  • 23. Clustering  Two Fundamental Perspectives:  Top-Down: Partitioning a set into disjoint subsets  Bottom-Up: Grouping data points into disjoint clusters  Goes hand-in-hand with classification  Typically involves a metric: Euclidian or Manhattan distance  Many, many different algorithms & books.  Some really popular algorithms:  K-Means Clustering (EM, PCA)  Hierarchical Agglomerative Clustering  K-Nearest Neighbor (classification)  SR-Clustering: This is something I (re)invented.  Effectively: The world’s simplest clustering algorithm.
  • 24. K-Means Clustering (1)  Given an initial set of cluster centroids, determine the actual centroids of each cluster via an iterative refinement algorithm.  Each refinement iteration consists of two steps : 1. Computing new data point centroid assignments 2. Computing new centroid positions based of the mean deviation of the data points from the previous centroid positions.  Converge, Divergence, Oscillation….  Also known as Lloyd’s Algorithm in CS.
  • 25. K-Means Clustering (2) Wikipedia: Given a set of observations (x1, x2, …, xn), where each observation is a d- dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within- cluster sum of squares (WCSS): where μi is the mean of points in Si
  • 26. K-Means Clustering (3)  Assignment Step: Defines Si to be the set of xi that deviate least from Si  Update Step: Calculate the new means to be the centroid of the observations in the cluster. I.e. The average along each dimension
  • 27. K-Means Clustering(4)  K-Means is *really* a 3 step algorithm  Step1. Initialize K-Means (non-trivial)  Problem 1: Estimate K  Problem 2: Pick Initial Centroid for each K  Iterative Refinement  Step 2: Centroid Assignments  Step3: Centroid Update  Many initialization approaches:  Random, Forgy, MacQueen and Kaufman  Performance depends on initialization and instance ordering  Popular because of its robustness  Related to:  EM Algorithm and  Principal Component Analysis (PCA)
  • 28. Hierarchical Agglomerative Clustering  The Algorithm 1. Cluster each data point with its nearest neighbor(s) and make that a new data point (cluster). 2. Repeat until some fixed number of clusters is reached.  K-Nearest Neighbor is often used hand-in-hand with agglomerative clustering to compute the nearest neighbor(s).  End up with a tree of clusters (clustering history)  This tree is called a dendogram  See Chapter 6 of Duda & Hart (SRI, 1973) Pattern Classification & Scene Analysis
  • 29. SR-Clustering (1)  Simple Ray Clustering   Sort of like non-hierarchical agglomerative clustering  Basic Algorithm  For each data point, place it in the correct cluster  If it doesn’t belong to any cluster, create a new cluster consisting of that single data point  Cluster Membership  Defined as being within a certain proximity threshold of every data point in that cluster.  Proximity Metric  The Jaccard Index
  • 30. Recommender Systems
   Used by Netflix, Amazon, etc.
   Objects: users, items and preferences
   User-based vs. item-based recommendations; the former is also known as collaborative filtering
   Mixed-method recommendations
   Based on user similarity and/or item similarity
   The Jaccard Index takes dissimilarity into account and does not require preference measurements
   Apache Mahout (leverages Hadoop)
  • 31. Jaccard Index
   Defines a similarity metric between two sets
   Wikipedia: the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
       J(A, B) = |A ∩ B| / |A ∪ B|
   Jaccard distance: dJ(A, B) = 1 − J(A, B)
   (A sketch follows below.)
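A minimal Common Lisp sketch of both measures over lists treated as sets; illustrative only.

   (defun jaccard-index (set-a set-b &key (test #'equal))
     "Size of the intersection divided by the size of the union."
     (let ((union-size (length (union set-a set-b :test test))))
       (if (zerop union-size)
           1.0                                   ; both sets empty: treat as identical
           (/ (length (intersection set-a set-b :test test))
              union-size))))

   (defun jaccard-distance (set-a set-b)
     "Dissimilarity: 1 minus the Jaccard index."
     (- 1 (jaccard-index set-a set-b)))

   ;; Example with related-topic names:
   ;; (jaccard-index '("Mammoth" "Ivory" "Kenya") '("Mammoth" "Ivory" "Tusk")) => 1/2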
  • 32. Another Similarity Metric
   Pearson Correlation Coefficient
   Wikipedia: defined as the covariance of the two variables divided by the product of their standard deviations:
       ρ(X, Y) = cov(X, Y) / (σX σY)
   (A sketch follows below.)
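A minimal sketch of the sample Pearson correlation for two equal-length lists of numbers; the 1/n factors in the covariance and the standard deviations cancel in the ratio, so plain sums suffice.

   (defun mean (xs)
     (/ (reduce #'+ xs) (length xs)))

   (defun pearson-correlation (xs ys)
     "Sample Pearson correlation of two equal-length lists of numbers."
     (let* ((mx (mean xs))
            (my (mean ys))
            (dx (mapcar (lambda (x) (- x mx)) xs))
            (dy (mapcar (lambda (y) (- y my)) ys))
            (cov (reduce #'+ (mapcar #'* dx dy)))
            (sx (sqrt (reduce #'+ (mapcar (lambda (d) (* d d)) dx))))
            (sy (sqrt (reduce #'+ (mapcar (lambda (d) (* d d)) dy)))))
       (if (or (zerop sx) (zerop sy))
           0                                   ; undefined when either variance is 0
           (/ cov (* sx sy)))))

   ;; (pearson-correlation '(1 2 3) '(2 4 6)) => 1.0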
  • 33. (compute-similarity-matrix <topics>)
   Computes the Jaccard index for pairs of topics, using the related topics of each topic as the sets to be compared (a sketch of this computation follows below).

             African   Asian  Indian   Babar  Horton     War
   African    100.00   38.46   21.05    4.35    6.82    7.94
   Asian       38.46  100.00   37.74    4.00    6.25   20.00
   Indian      21.05   37.74  100.00    6.90    7.14   24.39
   Babar        4.35    4.00    6.90  100.00   28.57    7.14
   Horton       6.82    6.25    7.14   28.57  100.00    7.41
   War          7.94   20.00   24.39    7.14    7.41  100.00
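A minimal sketch of how such a matrix could be built from the jaccard-index above. The related-topics-fn accessor is a hypothetical stand-in for however BABAR retrieves a topic's related topics; values are scaled to percentages to match the table.

   (defun compute-similarity-matrix-sketch (topics related-topics-fn)
     "Return a 2D array of pairwise Jaccard similarities (as percentages).
   RELATED-TOPICS-FN maps a topic to the list of its related topic names."
     (let* ((n (length topics))
            (sets (mapcar related-topics-fn topics))
            (matrix (make-array (list n n))))
       (loop for a in sets for i from 0 do
         (loop for b in sets for j from 0 do
           (setf (aref matrix i j)
                 (* 100.0 (jaccard-index a b)))))
       matrix))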
  • 34. (cluster-subtopics <subtopics> <matrix> <threshold>)
   Cluster 1: Asian_elephant(49), African_elephant(60)
   Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5), Elmer_the_Patchwork_Elephant(4)
   Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12), Sumatran_Elephant(11), Borneo_pygmy_elephant(3)
   Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)
   Cluster 5: Year_of_the_Elephant(8)
   Cluster 6: Dwarf_elephant(24)
   Cluster 7: White_elephant(10)
   Threshold = 20
  • 35. Knowledge Categories (1)
   Human schooling is a decade(s)-long knowledge acquisition process, spanning kindergarten through post-doctoral work.
   The idea is to use grade-school topics as the initial knowledge categories: Science, History, Geography, Literature and Art.
   Goal: assign categories to subtopic clusters.
   Use the Jaccard Index to determine the category.
   Automatically create subtopic category names, e.g. Babar → Literature_Elephant.
  • 36. (compute-cluster-categories <clusters>)
   Wiki-crawl each knowledge category (pre-run)
   Compute the subtopics of each knowledge category
   Compute a category relevancy vector for each cluster member
   Combine the relevancy vectors of the cluster members to compute a relevancy vector for the cluster
   Assign a category to the cluster
   (A sketch of the last two steps follows below.)
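A minimal sketch of the combine-and-assign steps, assuming each member's relevancy vector is an alist of (category . score) pairs like those shown on slide 38. Averaging the member vectors and picking the best-scoring category is one plausible combination rule, not necessarily BABAR's exact one, though it does reproduce the top scores shown on slide 37 for the Babar/Horton/Elmer cluster.

   (defun combine-relevancy-vectors (vectors)
     "Average a list of (category . score) alists into one alist, sorted by score."
     (let ((sums (make-hash-table))
           (n (length vectors))
           (result '()))
       (dolist (vec vectors)
         (dolist (entry vec)
           (incf (gethash (car entry) sums 0) (cdr entry))))
       (maphash (lambda (category total) (push (cons category (/ total n)) result)) sums)
       (sort result #'> :key #'cdr)))

   (defun assign-cluster-category (member-vectors)
     "Best-scoring category for a cluster, given its members' relevancy vectors."
     (car (first (combine-relevancy-vectors member-vectors))))

   ;; Example, using the Babar / Elmer / Horton relevancies from slide 38:
   ;; (assign-cluster-category
   ;;   '(((:literature . 0.44) (:art . 0.25) (:geography . 0.23))
   ;;     ((:art . 0.25) (:geography . 0.23) (:literature . 0.22))
   ;;     ((:art . 0.5) (:geography . 0.46) (:history . 0.4))))
   ;; => :ART   (art averages to 0.333..., geography to 0.3066...)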
  • 37. (compute-cluster-categories <clusters>)
   ((((:SCIENCE 0.47666672) (:HISTORY 0.44666672))
     (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))
    (((:SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
     (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant> #<Concept(18): Indian_Elephant>
      #<Concept(12): Sri_Lankan_elephant> #<Concept(11): Sumatran_Elephant>))
    (((:ART 0.33333334) (:GEOGRAPHY 0.30666667))
     (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>))
    (((:HISTORY 0.6) (:GEOGRAPHY 0.46))
     (#<Concept(8): Year_of_the_Elephant>))
    (((:GEOGRAPHY 0.72333336) (:HISTORY 0.43666664))
     (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant> #<Concept(4): Crushing_by_elephant>))
    (((:GEOGRAPHY 0.86) (:SCIENCE 0.5))
     (#<Concept(24): Dwarf_elephant>))
    (((:SCIENCE 0.69) (:ART 0.49))
     (#<Concept(10): White_elephant>)))
  • 38. Individual Subtopic Categories
   The following shows the knowledge category relevancies for some of the 16 subtopics of Elephant and helps explain the results of the previous slide.
   (#<Concept(7): Babar_the_Elephant>
    ((:LITERATURE 0.44) (:ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) (:SCIENCE 0.17)))
   (#<Concept(4): Elmer_the_Patchwork_Elephant>
    ((:ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) (:SCIENCE 0.17)))
   (#<Concept(5): Horton_the_Elephant>
    ((:ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) (:LITERATURE 0.22)))
   (#<Concept(60): African_elephant>
    ((:ART 1.03) (:SCIENCE 0.91) (:HISTORY 0.85) (:GEOGRAPHY 0.77) (:LITERATURE 0.37)))
   (#<Concept(49): Asian_elephant>
    ((:HISTORY 0.7) (:SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) (:LITERATURE 0.19)))
   (#<Concept(22): War_elephant>
    ((:HISTORY 0.93) (:GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
  • 39. Categorized Subtopic Clusters
   Elephant
       ART_Elephant: Elmer_the_Patchwork_Elephant, Horton_the_Elephant, Babar_the_Elephant
       GEOGRAPHY_Elephant: Dwarf_elephant, Crushing_by_elephant, War_elephant, Execution_by_elephant
       HISTORY_Elephant: Year_of_the_Elephant
       SCIENCE_Elephant: African_elephant, African_Forest_Elephant, African_Bush_Elephant, Asian_elephant, White_elephant, Sumatran_Elephant, Sri_Lankan_elephant, Indian_Elephant, Borneo_pygmy_elephant
  • 40. Related Topics Associations
   Associate related topics to subtopic clusters using the Jaccard Index
   Use the associations to create related topic clusters
   (find-compatible-clusters <strongly-related-topics> <clusters>)
   ((#<Concept(60): African elephant>
     ((#<Concept(24): Dwarf elephant>) #<Concept(49): Asian_elephant>)
     (#<Concept(66): Mammoth>
      (#<Concept(10): Elephant intelligence> #<Concept(25): Mastodon> #<Concept(275): Genus>
       #<Concept(103): Animal cognition> #<Concept(62): Afrotheria> #<Concept(4): Elephant tusk>
       #<Concept(86): Gestation> #<Concept(15): African> #<Concept(749): Eutheria>
       #<Concept(102): Proboscidea> #<Concept(8): Gomphotherium> #<Concept(96): Mammalia>
       #<Concept(27): Tooth> #<Concept(876): Mammal> #<Concept(8): Tooth_development>))
     #<Concept(143): Hippopotamus> #<Concept(590): Lion> #<Concept(10): Loxodonta>)
    ((#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>)
     ((#<Concept(22): War_elephant> #<Concept(5): Execution_by_elephant> #<Concept(6): List_of_fictional_elephants> #<Concept(4): Crushing_by_elephant>)
      #<Concept(5): List_of_elephants_in_mythology_and_religion> #<Concept(5): Pinnawala>
      (#<Concept(55): Ivory> #<Concept(3): Katy_Payne> #<Concept(77): Kenya> #<Concept(11): Infrasound>
       #<Concept(31): Grief> #<Concept(56): Incisor> #<Concept(8): History_of_elephants_in_Europe>))
     #<Concept(14): Jeheskel_Shoshani> #<Concept(6): Aanayoottu>))
  • 41. (sentence-to-clause <sentence>)
   English sentence string
   → Scanner → tokens
   → Analyzer → morphologically analyzed words
   → Parser → parse tree
   → Phrase Extractor → phrases (flattened parse tree)
   → Semantic Analyzer → frames for subject, verb, object and prepositional phrases
   → Clause Generator → clause objects
  • 42. Sample Extracted Clauses
   (HASA "Asian elephant species" "disjunct distributions")
   (ISA "Elephants" "herbivores")
   (HASA "African Elephants" "three nails")
   (HASA "Indian Elephants" "four nails")
   (HASA "female African Elephants" "large tusks")
   (ISA "Elephants" "large land mammals")
  • 43. Things Overlooked
   Wiki page contents pane
       Provides page taxonomy
       Provides category names
       Provides related topic names
   Concept weights
  • 44. Future Direction
   Enhance the English parser
   Incorporate variables into the semantic net
   Leverage topic weights
   Work on language generation
   Produce Wiki summary pages
   Knowledge queries
   Develop a client-side browser
       Top menu bar: knowledge categories
       RHS: dynamic subtopic tree
       LHS: Wiki page contents pane
  • 45. (references)
   Speech and Language Processing, Jurafsky and Martin
   Artificial Intelligence: A Modern Approach, Russell and Norvig
   Principles of Semantic Networks, edited by John F. Sowa, Morgan Kaufmann, 1991
   Machine Learning, Tom Mitchell, 1997
   Pattern Classification and Scene Analysis, Duda and Hart, 1973
   Algorithms of the Intelligent Web, Marmanis and Babenko, 2009