(Knowledge Extraction)

  Raymond Pierre de Lacaze

          (RPL)

          LispNYC July 10th, 2012

                        rpl@lispnyc.org
(John McCarthy)
September 4th,1927 – October 24th, 2011


This talk is dedicated to the memory of John McCarthy

   Inventor of the Lisp Language (1958)
   Founder of Artificial Intelligence
   Winner of the Turing award (1971)
   Designer of Elephant 2000
       Programming Language based on speech acts
       http://www-formal.stanford.edu/jmc/elephant/elephant.html

   May He Rest in Peace
BABAR: Project Goals
   Leverage Wikipedia as a Knowledge Base

   Infer Infrastructure & Extract Content
       Create Wiki Topic Taxonomies
       Generate Knowledge Hypergraphs

   Investigate Conceptual Relevance Metrics

   Generate Knowledge summaries
   Answer Knowledge base queries

   Evolve a new generation of web browsers:
    Knowledge Browsers
Overview
   Brief Overview AI
       Knowledge Representation
       Natural Language Processing

   Examine Specific Algorithms
       Semantic Nets & Hypergraphs
       Recursive Descent Parsing
       Clustering Algorithms
       Similarity Metrics

   Describe Aspects of the BABAR System
       Semantic Link Analysis
           Automatic Topic Taxonomy Generation
           Knowledge Category Assignment
       Content Extraction
           English Phrase to Clausal Form Logic
AI Technologies Discussed
   Knowledge Representation
       Clausal Form Logic
       Semantic Nets
       Hypergraphs

   Natural Language Processing
       Lexical Analysis
       Syntactic Analysis
           Recursive Descent Parsing
       Semantic Analysis

   Machine Learning Techniques
       Clustering Algorithms
       K-Means, Agglomerative and SR Clustering

   Similarity Metrics
       Jaccard Index
       Pearson Correlation
Logics used in Artificial Intelligence
   Monotonic Logic (standard)
   Non-Monotonic Logic (exceptions)
       (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly

   Sorted Logics (types)
   Fuzzy Logic (continuous truth values)
   Higher-Order Logics (meta-statements)
       Modal Logics (may, can, must)
       Intentional Logics (know, believe, think)

   Temporal Logics (temporal operators)
       Point-Based Temporal Logic (moments)
       Interval Time Logic (Allen 1986, 13 temporal operators)
           Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.

   Logics can be expressed in clausal form:
    (ancestor ?x ?y) ⇐ (parent ?x ?y)
    (ancestor ?x ?y) ⇐ (parent ?x ?z) (ancestor ?z ?y)

    Note: The variables ?x and ?y are universally quantified, whereas the variable
          ?z is existentially quantified.
Clausal Form Logic
   Propositional Calculus (PC)
       Fully grounded clauses
       No variables
            (Brother John Jill),
            (Parent Jane Jill) ⇐ (Mother Jane Jill)

   First Order Predicate Calculus (FOPC)
       Variables
            Universally quantified (for all ?x)
            Existentially quantified (there exists ?x)
            (Elephant ?x) ⇐ (Has-Tusks ?x)
        Converting 1st order logic to clausal form
           Skolem constants (there exists x for all y such that…)
           Skolem functions (for each x there exists a y such that…)

   Second Order Predicate Calculus
       Predicates and clauses can be arguments
       Meta statements
       Gödel's Incompleteness Theorem

   Horn Clauses
       Wikipedia: In computational logic, a Horn clause is a clause with at most
        one positive literal
        B ⇐ (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B
        (<LHS> <RHS>) ≡ ((B) (A1 … An))
Automated Reasoning
   Unification Algorithm
       Clausal pattern matching and variable binding
        (unify (P ?x ?y) (P A (Q ?x)))
            Returns bindings: ((?x A) (?y (Q ?x)))
            Instantiation: (P A (Q A))
            (a minimal sketch of unify follows this slide)

   Rete Algorithm
       Charles L. Forgy, CMU, 1974
       Addresses the many-many matching problem
       Matching facts to rules in rule-based systems
        Donald Knuth, The Art of Computer Programming, Volume 3.

   Automated Reasoners
       Backward Chaining Reasoners
            Work from conclusion → axioms (facts)
           Good when state space branching factor is large
       Forward Chaining Reasoners
            Work from axioms → conclusion
            Good when the depth of the state space is large
       Mixed methods
        Perform both forward & backward chaining
         GPS (Ernst & Newell, 1969)
         Island hopping
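
A minimal sketch of unification, assuming clauses are Lisp lists and variables are symbols whose names start with "?" (this is illustrative, not BABAR's implementation; the occurs check is omitted):

(defun variable-p (x)
  (and (symbolp x)
       (plusp (length (symbol-name x)))
       (char= (char (symbol-name x) 0) #\?)))

(defun unify (x y &optional (bindings nil))
  (cond ((eq bindings :fail) :fail)
        ((equal x y) bindings)
        ((variable-p x) (unify-variable x y bindings))
        ((variable-p y) (unify-variable y x bindings))
        ((and (consp x) (consp y))
         (unify (rest x) (rest y)
                (unify (first x) (first y) bindings)))
        (t :fail)))

(defun unify-variable (var value bindings)
  ;; If VAR is already bound, unify its existing value; otherwise bind it.
  (let ((binding (assoc var bindings)))
    (if binding
        (unify (cdr binding) value bindings)
        (acons var value bindings))))

;; (unify '(P ?x ?y) '(P A (Q ?x)))
;;   => ((?Y Q ?X) (?X . A))   ; i.e. ?x = A and ?y = (Q ?x)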
Semantic Nets
   Labeled, directed (or not) and weighted (or not) Graphs
   Equivalent in expressiveness to FOPC
   Graphical representation of 1st order logic.
   ISA Hierarchies
   Subsumption (Bill Woods)

   KL-ONE System: R.J. Brachman and J. Schmolze (1985)
   A whole family of KL-ONE like systems

   Concepts
       Distinguish Primitive and Defined concepts
       Only defined concepts are classifiable

   Frames
        Marvin Minsky, "A Framework for Representing Knowledge", 1974
        OO Languages (CLOS) ≡ Frame Language
        Think of class definitions as frames, where slots are attribute-value pairs
         and you use pattern matching to fill in the slots; once all the slots are
         filled, the concept becomes defined and classifiable (a CLOS sketch follows this slide).
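
As a loose illustration of the frames ≡ CLOS correspondence, a frame can be sketched as a class whose slots are the attribute-value pairs (the class and slot names below are hypothetical, not KL-ONE or BABAR definitions):

;; A hypothetical frame rendered as a CLOS class.
(defclass elephant-frame ()
  ((isa      :initarg :isa      :initform 'mammal        :accessor frame-isa)
   (habitat  :initarg :habitat  :initform nil            :accessor frame-habitat)
   (has-part :initarg :has-part :initform '(trunk tusks) :accessor frame-has-part)))

;; Filling in the remaining slot "defines" the concept, at which point it
;; becomes classifiable.
(defvar *asian-elephant*
  (make-instance 'elephant-frame :habitat 'asia))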
HyperGraphs
   A hypergraph is a graph in which edges are first-class
    objects and can be linked to other edges or vertices.
   Hypergraphs are a natural and convenient way of
    representing sentences and meta-statements.
                    [Figure: a hypergraph with vertices Jane, Jim, John and Mom,
                     and labeled edges Married (between Jane and Jim), Loves, Likes,
                     Disapproves (from John to the Married edge) and Resents
                     (from Mom to the Disapproves edge).]
   Mom resents the fact that John disapproves of Jane and
    Jim’s Marriage.
   BABAR uses an in-memory HyperGraph as its Semantic Net (a sketch of
    reified edges follows this slide).
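
A minimal sketch of the reified-edge idea, i.e. edges as first-class objects (these structure definitions are illustrative assumptions, not BABAR's graph classes):

;; An edge's endpoints may themselves be edges, which is what lets us
;; state facts about facts.
(defstruct vertex name)
(defstruct edge label from to)

(let* ((jane        (make-vertex :name "Jane"))
       (jim         (make-vertex :name "Jim"))
       (john        (make-vertex :name "John"))
       (mom         (make-vertex :name "Mom"))
       (married     (make-edge :label 'married     :from jane :to jim))
       (disapproves (make-edge :label 'disapproves :from john :to married))
       (resents     (make-edge :label 'resents     :from mom  :to disapproves)))
  ;; "Mom resents the fact that John disapproves of Jane and Jim's marriage."
  resents)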
Natural Language Processing
   Lexical Analysis
       Understanding the role and morphological nature of words.
       Morphology, Orthography, Part of Speech Tagging
       Typically use Lexicons: Dictionaries, etc…
       Programs that do this are called Scanners or Lexical Analyzers
       ScanGen and LEX on Unix systems for Programming Languages

   Syntactic Analysis
       Understanding the grammatical nature of groups of words
       Programs that do this are called Parsers.
       They take tokens produced by scanners/analyzers and apply them
        to a grammar.
       In doing so they typically produce parse trees.
       NLP parsing methodologies include:
            Top-Down Parsers (recursive descent)
           Bottom-Up Parsers
       ParseGen and YACC on Unix systems for Programming Languages

   Semantic Analysis
       Extracting phrase structure from parse trees and producing
        statements in some knowledge representation language such as
        clausal-form logic.
       KRL: "An Overview of KRL, a Knowledge Representation
        Language", D.G. Bobrow and T. Winograd, (1977).
Lexical Analysis
   Morphology
       The rules that govern word morphing
       foxes ≡ fox+<plural>

   Orthography
       The rules that govern spelling
       Plural of fox ≡ fox+’es’

   Transducers
       Define languages consisting of pairs of strings
       Loosely: Finite Automaton with 2 state transition functions.
       Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).
       FST: Finite State Transducer
       Surface level, Intermediate level, Lexical level
            E.g. foxes → fox+es → fox+N+PL
       Parsing, Generating & Translating

   Morphological Parser
           Lexicons, Morphotactics and Orthographic Rules
           Penn Treebank Parts of Speech Tags (50)

   Probabilistic Approaches
       N-Gram model
       Counting word frequency
       See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009
       Google Translate
Lexical Analysis in BABAR
   Lexicons
       Regular words Lexicon
         http://www.merriam-webster.com/
         Query the site and extract parts of speech
         About 50,000 locally cached entries.
       Irregular Words Lexicons
         Irregular nouns
         Irregular verbs
         Irregular auxiliaries


   Orthographic Rules
       reverse engineer morphed words

   (analyze-morphed-word <word>)
       Analyzes word suffixes then queries MW.
Lexical Analysis Example
KB(5): (parser::analyze-morphed-word "traditionally")

Loading #P"C:\Projects\trunk\Data\Lexicons\Parts-of-Speech.lisp"

Loading table from file English-Irregular-Nouns ...
Loading table from file English-Irregular-Verbs ...
Loading table from file English-Irregular-Auxiliary ...

Initializing reverse lexicon table...

URL: "http://www.merriam-webster.com/dictionary/tradition"

   Returns five values:
    Base Form:           "tradition"
    Actual Form:         "traditionally"
    Primary POS:         :ADVERB
    Additional           NIL
    Complete POS         (:ADVERB)

   Reverse Engineering:

     traditionally (adverb) → traditional (adjective) → tradition (noun)

   Parts-of-Speech Lexicon currently has about 50,000 entries.
    Approximately one million words in the English language.
Syntactic Analysis
   Grammars
       Productions (grammatical rules)
           LHS: A non-terminal symbol
           RHS: A disjunction of conjunctions of TS & NTS
           Can be recursive
       Non-Terminal Symbols
       Terminal Symbols (lexicon entries)
       Start Symbol

       Implicitly Define an AND-OR Tree.
       Context-Free Grammars, Attribute Grammars

   Parsers
       Traverse a grammar while consuming input tokens in an attempt to find a
        valid path through the grammar that accommodates the input tokens.

       Produce parse trees in which the internal nodes are Non-Terminal Symbols
        (NTS) and the leaves are Terminal Symbols (TS)

       Three typical ways to handle non-determinism
           Backtracking
           Look-ahead
           Parallelism
Parsing in BABAR
   Implements a Recursive Descent Parser which performs a
    top-down traversal of the grammar.

   Uses backtracking to handle non-determinism
   3 Types of objects: tokens, grammars and parse-nodes

   Scanner
        Creates seven fundamental token classes based on
         character composition:
       alphabetic, numeric, special, alpha-numeric, alpha-special,
        numeric-special and alpha-numeric-special
       Implemented using multiple-inheritance:
           alphabetic-mixin, numeric-mixin and special-mixin classes

   Parser Module (Scanner, Analyzer, Parser)
        Implements a set of classes and generic functions that make it easy to
         develop particular domain-specific parsers (a toy recursive-descent
         sketch follows this slide).
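
To make the top-down traversal with backtracking concrete, here is a toy recursive-descent parser over a hypothetical three-production grammar; it only tracks the remaining tokens and is not the BABAR parser module:

;; Grammar:  S -> NP VP    NP -> det noun | noun    VP -> verb NP | verb
(defparameter *toy-lexicon*
  '((the . det) (a . det) (elephant . noun) (trunk . noun) (has . verb)))

(defun match (category tokens)
  "Consume one token of CATEGORY, returning the remaining tokens, else :FAIL."
  (if (and tokens (eq (cdr (assoc (first tokens) *toy-lexicon*)) category))
      (rest tokens)
      :fail))

(defun parse-np (tokens)
  (let ((after-det (match 'det tokens)))
    (if (eq after-det :fail)
        (match 'noun tokens)         ; backtrack: NP -> noun
        (match 'noun after-det))))   ; NP -> det noun

(defun parse-vp (tokens)
  (let ((after-verb (match 'verb tokens)))
    (if (eq after-verb :fail)
        :fail
        (let ((after-np (parse-np after-verb)))
          (if (eq after-np :fail)
              after-verb              ; backtrack: VP -> verb
              after-np)))))           ; VP -> verb NP

(defun parse-s (tokens)
  (let ((after-np (parse-np tokens)))
    (if (eq after-np :fail) :fail (parse-vp after-np))))

;; (parse-s '(the elephant has a trunk))  =>  NIL, i.e. every token consumed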
Level 1 (simple)
Class      grammar
Macro      (define-grammar <name> <prods> <preds> &key <class>)
GF         (scan-tokens <string> <grammar> &key <delimiter>)
GF         (parse-tokens <tokens> <grammar>)

Level 2 (context)
Class      context-grammar

Macro      (define-context-grammar <name> <prods> <preds> <context>)

Macro      (with-grammar-context (<context> <grammar>) &body <body>)

GF         (analyze-tokens <tokens> <grammar>)

Level 3 (domain)
Macro      (define-lexicon <name> <fields>)
Macro      (define-word-class <word-type> &optional <slots>)
Level 4 (english)
Adds       english-grammar, scan-tokens, analyze-word-morphology
Crawling Wikipedia
   Wikipedia has approximately 4 million pages.

(initialize-wiki-graph <topic> <depth>)
     Returns a graph object

(crawl-wiki-topic <topic> <depth>)
   Returns a Hash-Table of related-topics
    For topic=elephant and depth=3:
      #<EQUALP hash-table with 2580 entries>

(generate-wiki-graph <hash-table>)
    Only create a vertex for keys (pruning)
    Non-key related topics are ignored (pruning)
    Create a ‘related-to edge for every (<key> <related-topic>) pair
     (a sketch of the pruning rule follows this slide).

   Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>
   With Pruning:    #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)
   With Pruning:    #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)

    A complete graph of n vertices has n(n-1)/2 edges ≡ O(n²)
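
A minimal sketch of the pruning rule above, using an adjacency table to stand in for the real graph object (GENERATE-PRUNED-GRAPH is a hypothetical name):

;; Only hash-table keys become vertices; a related-to edge is kept only
;; when the related topic is itself a key.
(defun generate-pruned-graph (related-topics)
  "RELATED-TOPICS: an EQUALP hash-table mapping a topic to its related topics."
  (let ((graph (make-hash-table :test #'equalp)))
    ;; One vertex (adjacency list) per key.
    (loop for topic being the hash-keys of related-topics
          do (setf (gethash topic graph) '()))
    ;; Keep only edges whose target is also a key.
    (loop for topic being the hash-keys of related-topics
            using (hash-value related)
          do (dolist (r related)
               (when (nth-value 1 (gethash r graph))
                 (push r (gethash topic graph)))))
    graph))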
Link Name Organization
   Internal, External and Intranal hyperlinks
   I chose the Elephant page as my entry page for crawling
   There are 228 internal links from the Elephant page.
   These occur throughout 103 paragraphs of text
   Goal: Organize the 228 links into a meaningful taxonomy

     Elephant
        -> Asian_Elephant
        -> African_Elephant
             -> African_Bush_Elephant
             -> African_Forest_Elephant


   Apply NLP to link names: i.e. parse the link names.
   Partition link names into subtopic, supertopic and related.
       Subtopic candidate elimination
   Partition related topics into strongly and weakly related
    based on link bi-directionality
Subtopic Taxonomy Generation Algorithm
(generate-subtopic-relations-in-graph <graph>)
1. Produce Candidates: a list of pairs of concepts, where the first concept
of each pair is a generalization of the second. This is determined by noting
concepts that, when parsed, produce a set of tokens that is a subset of the
set of tokens produced by parsing the second concept (a sketch of this test
follows this slide).
2. Eliminate False-Positives: These are eliminated by ensuring that the
subjects of the phrases of each set of parsed tokens are identical.
      E.g. Elephant_Hotel is not a subtopic of Elephant whereas
       Hotel_Elephant would be a subtopic of Elephant. This is one place
       where NLP really adds value.
3. Replace ‘related-to relations with ‘generalizes relations.
4. Eliminate direct ‘generalizes relationships between children and
non-parent ancestors.
      E.g. Elephant and North_African_Elephant.
5. Eliminate Singletons: Prune the list of subtrees by eliminating
singleton subtrees, leaving them in a yet-to-be-classified state.
 Finally, return a forest of trees, i.e. a list of root nodes.
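
A minimal sketch of the tests in steps 1 and 2, assuming link names have already been parsed into lists of word tokens; GENERALIZES-P and SAME-SUBJECT-P are hypothetical helpers, and taking the last token as the phrase subject is a simplification:

(defun generalizes-p (general specific)
  "GENERAL generalizes SPECIFIC when its tokens are a proper subset of
SPECIFIC's tokens, e.g. (\"elephant\") vs (\"asian\" \"elephant\")."
  (and (subsetp general specific :test #'string-equal)
       (< (length general) (length specific))))

(defun same-subject-p (general specific)
  "Step 2's false-positive filter: the phrase subjects must be identical."
  (string-equal (first (last general)) (first (last specific))))

;; (generalizes-p  '("elephant") '("hotel" "elephant"))  => T
;; (same-subject-p '("elephant") '("elephant" "hotel"))  => NIL  ; Elephant_Hotel
;; (same-subject-p '("elephant") '("hotel" "elephant"))  => T    ; Hotel_Elephant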
Subtopic Taxonomies
  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes
   (62%) and 986 yet to be classified nodes.

 Elephant Tree
 -> Elephant
   -> Dwarf_elephant
   -> Sri_Lankan_elephant
   -> Year_of_the_Elephant
   -> Sumatran_Elephant
   -> White_elephant
   -> War_elephant
   -> Crushing_by_elephant
   -> Babar_the_Elephant
   -> Indian_Elephant
   -> African_elephant
      -> African_Forest_Elephant
      -> North_African_Elephant
      -> African_Bush_Elephant
   -> Execution_by_elephant
   -> Borneo_pygmy_elephant
   -> Horton_the_Elephant
   -> Asian elephant
   -> Elmer_the_Patchwork_Elephant

 Elephant_Seal Tree
 -> Elephant_seal
   -> Southern_elephant_seal
   -> Northern_elephant_seal

 Intelligence Tree
 -> Intelligence
   -> Fish_intelligence
   -> Cat_intelligence
   -> Artificial_intelligence
      -> Electronic_Transactions_on_Artificial_Intelligence
   -> Swarm_intelligence
   -> Cephalopod_intelligence
   -> Dinosaur_intelligence
   -> Cetacean_intelligence
   -> Evolution_of_human_intelligence
   -> Elephant_intelligence
   -> Dog_intelligence
   -> Pigeon_intelligence
   -> Primate_intelligence
   -> Bird_intelligence
Subtopic Taxonomy Issues
 -> Lion
   -> Congolese_Spotted_Lion
   -> Asiatic_Lion
   -> Masai_lion
   -> Barbary_lion
   -> Henry_the_Lion
   -> Sri_Lanka_lion
   -> Nemean_lion
   -> Western_African_lion
   -> Transvaal_Lion
   -> West_African_lion
   -> Tsavo_lion
   -> Southwest_African_Lion
   -> European_lion
   -> Cape Lion
   -> Sea_lion
      -> Steller_sea_lion
      -> Australian_sea_lion
      -> South_American_sea_lion
      -> New_Zealand_sea_lion
      -> California_sea_lion
   -> American_lion
   -> White_lion
      -> Kimba_the_White_Lion
   -> Cowardly_Lion
   -> Tiger_versus_lion
WRT Nomenclature purity, Lion_Seal is a better name than Sea_Lion.
Clustering
   Two Fundamental Perspectives:
     Top-Down: Partitioning a set into disjoint subsets
     Bottom-Up: Grouping data points into disjoint clusters


   Goes hand-in-hand with classification

   Typically involves a metric: Euclidean or Manhattan distance

   Many, many different algorithms & books.

   Some really popular algorithms:
     K-Means Clustering (EM, PCA)
     Hierarchical Agglomerative Clustering
     K-Nearest Neighbor (classification)


   SR-Clustering: This is something I (re)invented.
       Effectively: The world’s simplest clustering algorithm.
K-Means Clustering (1)
   Given an initial set of cluster centroids, determine
    the actual centroids of each cluster via an
    iterative refinement algorithm.

   Each refinement iteration consists of two steps :
    1. Computing new data point centroid assignments
     2. Computing new centroid positions as the mean of the
     data points assigned to each centroid.

    Convergence, divergence, oscillation…

   Also known as Lloyd’s Algorithm in CS.
K-Means Clustering (2)
Wikipedia: Given a set of observations
(x1, x2, …, xn), where each observation is a d-
dimensional real vector, k-means clustering aims
to partition the n observations into k sets (k ≤ n) S
= {S1, S2, …, Sk} so as to minimize the within-
cluster sum of squares (WCSS):

     arg min_S  Σ_{i=1..k}  Σ_{x ∈ Si} ‖x − μi‖²

where μi is the mean of the points in Si.
K-Means Clustering (3)
 Assignment Step:

     Si(t) = { xp : ‖xp − μi(t)‖² ≤ ‖xp − μj(t)‖²  for all 1 ≤ j ≤ k }

   Assigns each observation xp to the cluster whose current centroid μi(t)
   it deviates least from.

 Update Step:

     μi(t+1) = (1 / |Si(t)|) Σ_{xj ∈ Si(t)} xj

   Calculate the new means to be the centroids of the observations in each
   cluster, i.e. the average along each dimension.
K-Means Clustering(4)
   K-Means is *really* a 3-step algorithm (a minimal sketch follows this slide)
     Step 1. Initialize K-Means (non-trivial)
        Problem 1: Estimate K
        Problem 2: Pick an initial centroid for each of the K clusters
     Iterative Refinement
        Step 2: Centroid Assignments
        Step 3: Centroid Update


   Many initialization approaches:
       Random, Forgy, MacQueen and Kaufman

   Performance depends on initialization and instance ordering
   Popular because of its robustness
   Related to:
       EM Algorithm and
       Principal Component Analysis (PCA)
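
A minimal K-Means sketch for points represented as lists of numbers, with the initial centroids passed in explicitly (the hard part, as noted above); NEAREST-CENTROID and K-MEANS are illustrative names, not a library API:

(defun squared-distance (p q)
  (reduce #'+ (mapcar (lambda (a b) (expt (- a b) 2)) p q)))

(defun nearest-centroid (point centroids)
  "Index of the centroid POINT deviates least from."
  (let ((best 0)
        (best-d (squared-distance point (first centroids))))
    (loop for c in (rest centroids)
          for i from 1
          for d = (squared-distance point c)
          when (< d best-d) do (setf best i best-d d))
    best))

(defun k-means (points centroids &key (iterations 10))
  "Lloyd's iteration: assignment step, then update step."
  (dotimes (pass iterations centroids)
    (let ((clusters (make-array (length centroids) :initial-element nil)))
      ;; Assignment step: group each point with its nearest centroid.
      (dolist (p points)
        (push p (aref clusters (nearest-centroid p centroids))))
      ;; Update step: each centroid becomes the mean of its cluster
      ;; (an empty cluster keeps its previous centroid).
      (setf centroids
            (loop for c in centroids
                  for j from 0
                  for members = (aref clusters j)
                  collect (if members
                              (apply #'mapcar
                                     (lambda (&rest xs)
                                       (/ (reduce #'+ xs) (length members)))
                                     members)
                              c))))))

;; (k-means '((1.0 1.0) (1.5 2.0) (8.0 8.0) (9.0 9.0)) '((0.0 0.0) (10.0 10.0)))
;;   => ((1.25 1.5) (8.5 8.5))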
Hierarchical Agglomerative Clustering
   The Algorithm
    1. Cluster each data point with its nearest neighbor(s)
    and make that a new data point (cluster).
    2. Repeat until some fixed number of clusters is reached.

   K-Nearest Neighbor is often used hand-in-hand with
    agglomerative clustering to compute the nearest
    neighbor(s).

   End up with a tree of clusters (clustering history)

   This tree is called a dendrogram

   See Chapter 6 of Duda & Hart (SRI, 1973)
    Pattern Classification & Scene Analysis
SR-Clustering (1)
 Simple Ray Clustering
     Sort of like non-hierarchical agglomerative clustering
 Basic Algorithm (a minimal sketch follows this slide)
     For each data point, place it in a compatible existing cluster
     If it doesn't belong to any cluster, create a new
      cluster consisting of that single data point
 Cluster   Membership
     Defined as being within a certain proximity
      threshold of every data point in that cluster.
 Proximity   Metric
     The Jaccard Index
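
A minimal sketch of SR-Clustering, assuming each data point is already a set (e.g. a list of related-topic names) so that the Jaccard index defined on the next slides can serve as the proximity metric; *SR-THRESHOLD* and the function names are illustrative:

(defparameter *sr-threshold* 0.20)

(defun jaccard-index (set1 set2)
  (let ((union (union set1 set2 :test #'equal)))
    (if (null union)
        0
        (/ (length (intersection set1 set2 :test #'equal))
           (length union)))))

(defun belongs-p (point cluster)
  "POINT belongs to CLUSTER when it is within the proximity threshold of
every data point already in that cluster."
  (every (lambda (member) (>= (jaccard-index point member) *sr-threshold*))
         cluster))

(defun sr-cluster (points)
  ;; Place each point in the first compatible cluster, else start a new one.
  (let ((clusters '()))
    (dolist (point points (nreverse clusters))
      (let ((home (find-if (lambda (c) (belongs-p point c)) clusters)))
        (if home
            (nconc home (list point))
            (push (list point) clusters))))))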
Recommender Systems
   Used by Netflix, Amazon, etc…
   Objects: Users, Items & Preferences

   User vs. Item based recommendations
   Former aka collaborative filtering
   Mixed method recommendations
   Based on User Similarity and/or Item Similarity

   Jaccard Index takes into account dissimilarity and
    does not require preference measurements.

   Apache Mahout (leverages Hadoop)
Jaccard Index
 Defines   a Similarity Metric between two sets

 Wikipedia:  The Jaccard coefficient measures
 similarity between sample sets, and is defined
 as the size of the intersection divided by the size
  of the union of the sample sets:

      J(A, B) = |A ∩ B| / |A ∪ B|

  Jaccard Distance:

      dJ(A, B) = 1 − J(A, B)
Another Similarity Metric
 Pearson Correlation Coefficient

 Wikipedia: Defined as the covariance of the two variables divided by the
 product of their standard deviations (a small sketch follows):

      ρ(X, Y) = cov(X, Y) / (σX σY)
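
A minimal sketch computing the Pearson correlation of two equal-length samples (the population variance is used in both numerator and denominator, so the ratio is unaffected):

(defun mean (xs)
  (/ (reduce #'+ xs) (length xs)))

(defun pearson-correlation (xs ys)
  "cov(X,Y) / (sigma-X * sigma-Y), computed over paired samples."
  (let* ((mx  (mean xs))
         (my  (mean ys))
         (cov (mean (mapcar (lambda (x y) (* (- x mx) (- y my))) xs ys)))
         (sx  (sqrt (mean (mapcar (lambda (x) (expt (- x mx) 2)) xs))))
         (sy  (sqrt (mean (mapcar (lambda (y) (expt (- y my) 2)) ys)))))
    (/ cov (* sx sy))))

;; (pearson-correlation '(1 2 3 4) '(2 4 6 8))  =>  1.0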
(compute-similarity-matrix <topics>)
   Computes the Jaccard index for pairs of topics by
    using the related topics of each topic as the sets to
    be compared.


             African   Asian    Indian   Babar    Horton     War

African      100.00     38.46    21.05     4.35     6.82     7.94

Asian         38.46    100.00    37.74     4.00     6.25    20.00

Indian        21.05     37.74   100.00     6.90     7.14    24.39

Babar           4.35     4.00     6.90   100.00    28.57     7.14

Horton          6.82     6.25     7.14    28.57   100.00     7.41

War             7.94    20.00    24.39     7.14     7.41   100.00
(cluster-subtopics <subtopics> <matrix> <threshold>)

Threshold = 20

Cluster 1: Asian_elephant(49), African_elephant(60)

Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5),
           Elmer_the_Patchwork_Elephant(4)

Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12),
           Sumatran_Elephant(11), Borneo_pygmy_elephant(3)

Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)

Cluster 5: Year_of_the_Elephant(8)

Cluster 6: Dwarf_elephant(24)

Cluster 7: White_elephant(10)
Knowledge Categories (1)
   Human schooling as a decade(s) long knowledge
    acquisition process

   Spanning Kindergarten – Post Doctoral work

   Idea is to use grade school topics as initial
    knowledge categories.

   Science, History, Geography, Literature & Art

   Goal: Assign categories to subtopic clusters

   Use Jaccard Index to determine the category

   Automatically create subtopic category names
    e.g. Babar → Literature_Elephant
(compute-cluster-categories <clusters>)

   Wiki Crawl each Knowledge Category (pre-run)
   Compute subtopics of each knowledge category

   Compute a category relevancy vector for each
    cluster member

   Combine the relevancy vectors of each cluster's members to
    compute a relevancy vector for the cluster (one plausible
    combination is sketched after this slide)

   Assign a category to the cluster
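
One plausible way to combine member relevancy vectors into a cluster vector, assuming each vector is an alist of (category . score) pairs; averaging is an assumption, since the slide does not specify the combination rule:

(defun combine-relevancy-vectors (vectors)
  "Average the member relevancy vectors per category, sorted descending."
  (let ((categories (remove-duplicates (mapcar #'car (apply #'append vectors)))))
    (sort (mapcar (lambda (category)
                    (cons category
                          (/ (reduce #'+ (mapcar (lambda (v)
                                                   (or (cdr (assoc category v)) 0))
                                                 vectors))
                             (length vectors))))
                  categories)
          #'> :key #'cdr)))

(defun cluster-category (vectors)
  "The assigned category is the highest-scoring one."
  (car (first (combine-relevancy-vectors vectors))))

;; (cluster-category '(((:science . 0.6) (:history . 0.4))
;;                     ((:science . 0.4) (:history . 0.5))))
;;   => :SCIENCE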
(compute-cluster-categories <clusters>)
(((( :SCIENCE 0.47666672) (:HISTORY 0.44666672))
  (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))

((( :SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
 (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant>
  #<Concept(18): Indian_Elephant> #<Concept(12): Sri_Lankan_elephant>
  #<Concept(11): Sumatran_Elephant>))

((( :ART 0.33333334) (:GEOGRAPHY 0.30666667))
 (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant>
  #<Concept(4): Elmer_the_Patchwork_Elephant>))

((( :HISTORY 0.6) (:GEOGRAPHY 0.46))
 (#<Concept(8): Year_of_the_Elephant>))

((( :GEOGRAPHY 0.72333336) ( :HISTORY 0.43666664))
 (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant>
  #<Concept(4): Crushing_by_elephant>))

((( :GEOGRAPHY 0.86) ( :SCIENCE 0.5))
 (#<Concept(24): Dwarf_elephant>))

(( ( :SCIENCE 0.69) (:ART 0.49))
(#<Concept(10): White_elephant>)))
Individual Subtopic Categories
The following shows the knowledge category relevancies for some of the 16
subtopics of Elephant and helps explain the results of the previous slide.

(#<Concept(7): Babar_the_Elephant>
 (( :LITERATURE 0.44) ( :ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) ( :SCIENCE 0.17)))

(#<Concept(4): Elmer_the_Patchwork_Elephant>
 (( :ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) ( :SCIENCE 0.17)))

(#<Concept(5): Horton_the_Elephant>
 (( :ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) ( :LITERATURE 0.22)))

(#<Concept(60): African_elephant>
((:ART 1.03) ( :SCIENCE 0.91) ( :HISTORY 0.85) (:GEOGRAPHY 0.77) ( :LITERATURE 0.37)))

(#<Concept(49): Asian_elephant>
 (( :HISTORY 0.7) ( :SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) ( :LITERATURE 0.19)))

(#<Concept(22): War_elephant>
 (( :HISTORY 0.93) ( :GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
Categorized Subtopic Clusters
   Elephant
       ART_Elephant
          Elmer_the_Patchwork_Elephant
          Horton_the_Elephant
          Babar_the_Elephant
       GEOGRAPHY_Elephant
          Dwarf_elephant
          Crushing_by_elephant
          War_elephant
          Execution_by_elephant
       HISTORY_Elephant
          Year_of_the_Elephant
       SCIENCE_Elephant
          African_elephant
              African_Forest_Elephant
               African_Bush_Elephant
          Asian_elephant
          White_elephant
          Sumatran_Elephant
          Sri_Lankan_elephant
          Indian_Elephant
           Borneo_pygmy_elephant
Related Topics Associations
   Associate related topics to subtopic clusters using Jaccard Index
   Use associations to create related topic clusters

(find-compatible-clusters <strongly-related-topics> <clusters>)

((#<Concept(60): African elephant>
  #<Concept(49): Asian_elephant>)
 (#<Concept(10): Elephant intelligence>
  #<Concept(103): Animal cognition>
  #<Concept(4): Elephant tusk>
  #<Concept(15): African>
  #<Concept(102): Proboscidea>
  #<Concept(96): Mammalia>
  #<Concept(876): Mammal>
  #<Concept(143): Hippopotamus>
  #<Concept(590): Lion>
  #<Concept(10): Loxodonta>))

((#<Concept(22): War_elephant>
  #<Concept(5): Execution_by_elephant>
  #<Concept(4): Crushing_by_elephant>)
 (#<Concept(55): Ivory>
  #<Concept(77): Kenya>
  #<Concept(31): Grief>
  #<Concept(8): History_of_elephants_in_Europe>))

((#<Concept(24): Dwarf elephant>)
 (#<Concept(66): Mammoth>
  #<Concept(25): Mastodon>
  #<Concept(275): Genus>
  #<Concept(62): Afrotheria>
  #<Concept(86): Gestation>
  #<Concept(749): Eutheria>
  #<Concept(8): Gomphotherium>
  #<Concept(27): Tooth>
  #<Concept(8): Tooth_development>))

((#<Concept(7): Babar_the_Elephant>
  #<Concept(5): Horton_the_Elephant>
  #<Concept(4): Elmer_the_Patchwork_Elephant>)
 (#<Concept(6): List_of_fictional_elephants>
  #<Concept(5): List_of_elephants_in_mythology_and_religion>
  #<Concept(5): Pinnawala>
  #<Concept(3): Katy_Payne>
  #<Concept(11): Infrasound>
  #<Concept(56): Incisor>
  #<Concept(14): Jeheskel_Shoshani>
  #<Concept(6): Aanayoottu>))
(sentence-to-clause <sentence>)
  English sentence string
    → Scanner → tokens
    → Analyzer → morphologically analyzed words
    → Parser → parse tree
    → Phrase extractor → phrases (flattened parse tree)
    → Semantic analyzer → frames for subject, verb, object and prepositional phrases
    → Clause generator → clause objects
Sample Extracted Clauses
   (HASA "Asian elephant species" "disjunct distributions")

   (ISA "Elephants" "herbivores")

   (HASA "African Elephants" “three nails")

   (HASA "Indian Elephants" "four nails")

   (HASA "female African Elephants" "large tusks")

   (ISA "Elephants" "large land mammals")
Things Overlooked
 Wiki   Page Contents Pane
     Provides page taxonomy
     Provides category names
     Provides related topic names

 Concept    Weights
Future Direction
 Enhance  English Parser
 Incorporate Variables into Semantic Net
 Leverage topic weights
 Work on language generation
 Produce Wiki Summary Pages
 Knowledge Queries
 Develop Client Side Browser
    Top Menu Bar Knowledge Categories
    RHS Dynamic Subtopic Tree
    LHS Wiki Page Content Pane
(references)
 Speech and Language Processing
  Jurafsky and Martin, 2009
 Artificial Intelligence: A Modern Approach
  Russell and Norvig
 Principles of Semantic Networks
  Edited by John F. Sowa, Morgan Kaufmann, 1991
 Machine Learning
  Tom Mitchell, 1997
 Pattern Classification and Scene Analysis
  Duda and Hart, 1973
 Algorithms of the Intelligent Web
  Marmanis and Babenko, 2009
(cluster-images)
(Love Elephants LispNYC)
Knowledge Extraction

More Related Content

What's hot (20)

PDF
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
ODP
Topic Modeling
Karol Grzegorczyk
 
PDF
T9. Trust and reputation in multi-agent systems
EASSS 2012
 
PPTX
Treebank annotation
Mohit Jasapara
 
PDF
Introduction to natural language processing
Minh Pham
 
PPTX
Introduction to Soft Computing
Aakash Kumar
 
PPTX
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Sehrish Asif
 
PPT
Type Checking(Compiler Design) #ShareThisIfYouLike
United International University
 
PPTX
Natural language processing
Yogendra Tamang
 
PDF
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
PPTX
FUNCTION APPROXIMATION
ankita pandey
 
PPTX
Introduction to Natural Language Processing
Mercy Rani
 
PDF
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
PPTX
NLP
guestff64339
 
PPTX
Natural language processing
Abash shah
 
PPTX
Inference in First-Order Logic
Junya Tanaka
 
PPTX
Knowledge representation
Md. Tanvir Masud
 
PDF
Daa notes 3
smruti sarangi
 
PPT
Vanishing & Exploding Gradients
Siddharth Vij
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
Topic Modeling
Karol Grzegorczyk
 
T9. Trust and reputation in multi-agent systems
EASSS 2012
 
Treebank annotation
Mohit Jasapara
 
Introduction to natural language processing
Minh Pham
 
Introduction to Soft Computing
Aakash Kumar
 
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Sehrish Asif
 
Type Checking(Compiler Design) #ShareThisIfYouLike
United International University
 
Natural language processing
Yogendra Tamang
 
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
FUNCTION APPROXIMATION
ankita pandey
 
Introduction to Natural Language Processing
Mercy Rani
 
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
Natural language processing
Abash shah
 
Inference in First-Order Logic
Junya Tanaka
 
Knowledge representation
Md. Tanvir Masud
 
Daa notes 3
smruti sarangi
 
Vanishing & Exploding Gradients
Siddharth Vij
 

Viewers also liked (14)

PPT
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Health Informatics New Zealand
 
PPTX
Turing
Rupak Chakraborty
 
PPTX
KNOWLEDGE: REPRESENTATION AND MANIPULATION
Maria Angela Leabres-Diopol
 
PPS
Lecture 4 Meta Knowledge
Simon Shurville
 
DOCX
7. knowledge acquisition, representation and organization 8. semantic network...
AhL'Dn Daliva
 
PPT
Representation of knowledge
Veera Balaji kumar veeraswamy
 
PPT
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
piero scaruffi
 
PPTX
Turing Test
Rogério Nascimento
 
PPTX
Knowledge Extraction from Social Media
Seth Grimes
 
PPTX
Turing test
Dipesh Senseii
 
PPS
Artificial Intelligence
sanjay_asati
 
PPTX
Knowledge representation and Predicate logic
Amey Kerkar
 
PPT
Knowledge Representation in Artificial intelligence
Yasir Khan
 
PPTX
Knowledge representation in AI
Vishal Singh
 
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Health Informatics New Zealand
 
KNOWLEDGE: REPRESENTATION AND MANIPULATION
Maria Angela Leabres-Diopol
 
Lecture 4 Meta Knowledge
Simon Shurville
 
7. knowledge acquisition, representation and organization 8. semantic network...
AhL'Dn Daliva
 
Representation of knowledge
Veera Balaji kumar veeraswamy
 
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
piero scaruffi
 
Turing Test
Rogério Nascimento
 
Knowledge Extraction from Social Media
Seth Grimes
 
Turing test
Dipesh Senseii
 
Artificial Intelligence
sanjay_asati
 
Knowledge representation and Predicate logic
Amey Kerkar
 
Knowledge Representation in Artificial intelligence
Yasir Khan
 
Knowledge representation in AI
Vishal Singh
 
Ad

Similar to Knowledge Extraction (20)

PDF
Constructive Description Logics 2006
Valeria de Paiva
 
PDF
A Bridge Not too Far
Valeria de Paiva
 
PDF
Logics of Context and Modal Type Theories
Valeria de Paiva
 
PDF
Meaning Extraction - IJCTE 2(1)
IT Industry
 
PPT
KNOWLEDGE Representation unit 3 for data mining
RGAYATHRI25
 
PDF
Constructive Hybrid Logics
Valeria de Paiva
 
PPT
Chapter 12 knowledge representation nd description
AfraseyabKhan1
 
PDF
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Valeria de Paiva
 
PDF
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML
 
PPTX
AI material for you computer science.pptx
kerimu1235
 
PPTX
gdhfjdhjcbdjhvjhdshbajhbvdjbklcbdsjhbvjhsdbvjjv
samjohnson7350
 
PDF
From Linked Data to Semantic Applications
Andre Freitas
 
PPTX
Building AI Applications using Knowledge Graphs
Andre Freitas
 
PDF
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
PDF
Effective Semantics for Engineering NLP Systems
Andre Freitas
 
PPTX
Unit II Natural Language Processing.pptx
sriramrpselvam
 
PPTX
Foundations of Knowledge Representation in Artificial Intelligence.pptx
kitsenthilkumarcse
 
PPTX
Knowledge Representation and Reasoning.pptx
MohanKumarP34
 
PPTX
chapter2 Know.representation.pptx
wendifrawtadesse1
 
PPT
Semantics
Mohammed Al-Meqdad
 
Constructive Description Logics 2006
Valeria de Paiva
 
A Bridge Not too Far
Valeria de Paiva
 
Logics of Context and Modal Type Theories
Valeria de Paiva
 
Meaning Extraction - IJCTE 2(1)
IT Industry
 
KNOWLEDGE Representation unit 3 for data mining
RGAYATHRI25
 
Constructive Hybrid Logics
Valeria de Paiva
 
Chapter 12 knowledge representation nd description
AfraseyabKhan1
 
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Valeria de Paiva
 
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML
 
AI material for you computer science.pptx
kerimu1235
 
gdhfjdhjcbdjhvjhdshbajhbvdjbklcbdsjhbvjhsdbvjjv
samjohnson7350
 
From Linked Data to Semantic Applications
Andre Freitas
 
Building AI Applications using Knowledge Graphs
Andre Freitas
 
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
Effective Semantics for Engineering NLP Systems
Andre Freitas
 
Unit II Natural Language Processing.pptx
sriramrpselvam
 
Foundations of Knowledge Representation in Artificial Intelligence.pptx
kitsenthilkumarcse
 
Knowledge Representation and Reasoning.pptx
MohanKumarP34
 
chapter2 Know.representation.pptx
wendifrawtadesse1
 
Ad

More from Pierre de Lacaze (7)

PDF
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
PPTX
Deep Learning
Pierre de Lacaze
 
PPTX
Reinforcement Learning and Artificial Neural Nets
Pierre de Lacaze
 
PDF
Logic Programming and ILP
Pierre de Lacaze
 
PPTX
Meta Object Protocols
Pierre de Lacaze
 
PPTX
Prolog 7-Languages
Pierre de Lacaze
 
PPTX
Clojure 7-Languages
Pierre de Lacaze
 
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
Deep Learning
Pierre de Lacaze
 
Reinforcement Learning and Artificial Neural Nets
Pierre de Lacaze
 
Logic Programming and ILP
Pierre de Lacaze
 
Meta Object Protocols
Pierre de Lacaze
 
Prolog 7-Languages
Pierre de Lacaze
 
Clojure 7-Languages
Pierre de Lacaze
 

Recently uploaded (20)

PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
PDF
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PDF
Complete Network Protection with Real-Time Security
L4RGINDIA
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PPTX
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
PDF
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
PDF
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
PDF
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
PDF
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
Complete Network Protection with Real-Time Security
L4RGINDIA
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 

Knowledge Extraction

  • 1. (Knowledge Extraction) Raymond Pierre de Lacaze (RPL) LispNYC July 10th, 2012 [email protected]
  • 2. (John McCarthy) September 4th,1927 – October 24th, 2011 This talk is dedicated to the memory of John McCarthy  Inventor of the Lisp Language (1958)  Founder of Artificial Intelligence  Winner of the Turing award (1971)  Designer of Elephant 2000  Programming Language based on speech acts  http://www-formal.stanford.edu/jmc/elephant/elephant.html  May He Rest in Peace
  • 3. BABAR: Project Goals  Leverage Wikipedia as a Knowledge Base  Infer Infrastructure & Extract Content  Create Wiki Topic Taxonomies  Generate Knowledge Hypergraphs  Investigate Conceptual Relevance Metrics  Generate Knowledge summaries  Answer Knowledge base queries  Evolve a new generation of web browsers: Knowledge Browsers
  • 4. Overview  Brief Overview AI  Knowledge Representation  Natural Language Processing  Examine Specific Algorithms  Semantic Nets & Hypergraphs  Recursive Descent Parsing  Clustering Algorithms  Similarity Metrics  Describe Aspects of the BABAR System  Semantic Link Analysis  Automatic Topic Taxonomy Generation  Knowledge Category Assignment  Content Extraction  English Phrase to Clausal Form Logic
  • 5. AI Technologies Discussed  Knowledge Representation  Clausal Form Logic  Semantic Nets  Hypergraphs  Natural Language Processing  Lexical Analysis  Syntactic Analysis  Recursive Descent Parsing  Semantic Analysis  Machine Learning Techniques  Clustering Algorithms  K-Means, Agglomerative and SR Clustering  Similarity Metrics  Jaccard Index  Pearson Correlation
  • 6. Logics used in Artificial Intelligence  Monotonic Logic (standard)  Non-Monotonic Logic (exceptions)  (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly  Sorted Logics (types)  Fuzzy Logic (continuous truth values)  Higher-Order Logics (meta-statements)  Modal Logics (may, can, must)  Intentional Logics (know, believe, think)  Temporal Logics (temporal operators)  Point-Based Temporal Logic (moments)  Interval Time Logic (Allen 1986, 13 temporal operators)  Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.  Logics can be expressed in clausal form: (ancestor ?x ?y)  (parent ?x ?y) (ancestor ?x ?y)  (parent ?x ?z)(ancestor ?z ?y) Note: The variables ?x and ?y are universally quantified, whereas the variable ?z is existentially quantified.
  • 7. Clausal Form Logic  Propositional Calculus (PC)  Fully grounded clauses  No variables  (Brother John Jill),  (Parent Jane Jill)  (Mother Jane Jill)  First Order Predicate Calculus (FOPC)  Variables  Universally qualified (for all ?x)  Existentially qualified (there exists ?x)  (Elephant ?x)  (Has-Tusks ?x)  Converting 1st order logic to FOPC  Skolem constants (there exists x for all y such that…)  Skolem functions (for each x there exists a y such that…)  Second Order Predicate Calculus  Predicates and clauses can be arguments  Meta statements  Gödel's Incompleteness Theorem  Horn Clauses  Wikipedia: In computational logic, a Horn clause is a clause with at most one positive literal  B  (A1 ^ …. ^ An) ≡ ¬A1 v … v ¬A2 v B  (<LHS> <RHS>) ≡ ((B) (A1…An))
  • 8. Automated Reasoning  Unification Algorithm  Clausal pattern matching and variable binding  (unify (P ?x ?y) (P A (Q ?x)))  Returns bindings: ((?x A) (?y (Q ?x))  Instantiation: (P A (Q A))  Rete Algorithm  Charles L. Forgy, CMU, 1974  Addresses the many-many matching problem  Matching facts to rules in rule-based systems  Donald Knuth , Volume 3.  Automated Reasoners  Backward Chaining Reasoners  Work from conclusion  axioms (facts)  Good when state space branching factor is large  Forward Chaining Reasoners  Work from axioms  conclusion  Good when the depth state space is large  Mixed methods Perform both forward & backward chaining  GPS (Ernst & Newell, 1969)  Island hopping
  • 9. Semantic Nets  Labeled, directed (or not) and weighted (or not) Graphs  Equivalent in expressiveness to FOPC  Graphical representation of 1st order logic.  ISA Hierarchies  Subsumption (Bill Woods)  KL-ONE System: R.J. Brachman and J. Schmolze (1985)  A whole family of KL-ONE like systems  Concepts  Distinguish Primitive and Defined concepts  Only defined concepts are classifiable  Frames  Marvin Minsky , "A Framework for Representing Knowledge.“, 1974  OO Languages (CLOS) ≡ Frame Language  Think of class of definitions as frames, where slots are attribute-value pairs and you use pattern matching to fill in all the slots at which point a concept becomes defined and classifiable.
  • 10. HyperGraphs  A hypergraph is graph in which edges are first class objects and can be linked to other edges or vertices.  Hypergraphs are a natural and convenient way of representing sentences and meta-statements. Married Jane Jim Disapproves Loves Likes Mom Resents John  Mom resents the fact that John disapproves of Jane and Jim’s Marriage.  BABAR uses an in memory HyperGraph  Semantic Net
  • 11. Natural Language Processing  Lexical Analysis  Understanding the role and morphological nature of words.  Morphology, Orthography, Part of Speech Tagging  Typically use Lexicons: Dictionaries, etc…  Programs that do this are called Scanners or Lexical Analyzers  ScanGen and LEX on Unix systems for Programming Languages  Syntactic Analysis  Understanding the grammatical nature of groups of words  Programs that do this are called Parsers.  They take tokens produced by scanners/analyzers and apply them to a grammar.  In doing so they typically produce parse trees.  NLP parsing methodologies include:  Top-Down Parsers(recursive descent)  Bottom-Up Parsers  ParseGen and YACC on Unix systems for Programming Languages  Semantic Analysis  Extracting phrase structure from parse trees and producing statements in some knowledge representation language such as clausal-form logic.  KRL: "An Overview of KRL, a Knowledge Representation Language", D.G. Bobrow and T. Winograd, (1977).
  • 12. Lexical Analysis  Morphology  The rules that govern word morphing  foxes ≡ fox+<plural>  Orthography  The rules that govern spelling  Plural of fox ≡ fox+’es’  Transducers  Define languages consisting of pairs of strings  Loosely: Finite Automaton with 2 state transition functions.  Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).  FST: Finite State Transducer  Surface level, Intermediate level, Lexical level  E.g. foxes  fox+es  fox+N+PL  Parsing, Generating & Translating  Morphological Parser  Lexicons, Morphotactics and Orthographic Rules  Penn Treebank Parts of Speech Tags (50)  Probabilistic Approaches  N-Gram model  Counting word frequency  See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009  Google Translate
  • 13. Lexical Analysis in BABAR  Lexicons  Regular words Lexicon  http://www.merriam-webster.com/  Query the site and extract parts of speech  About 50,000 locally cached entries.  Irregular Words Lexicons  Irregular nouns  Irregular verbs  Irregular auxiliaries  Orthographic Rules  reverse engineer morphed words  (analyze-morphed-word <word>)  Analyzes word suffixes then queries MW.
  • 14. Lexical Analysis Example KB(5): (parser::analyze-morphed-word "traditionally“ ) Loading #P"C:ProjectstrunkDataLexiconsParts-of-Speech.lisp" Loading table from file English-Irregular-Nouns ... Loading table from file English-Irregular-Verbs ... Loading table from file English-Irregular-Auxiliary ... Initializing reverse lexicon table... URL: "http://www.merriam-webster.com/dictionary/tradition"  Returns five values: Base Form: "tradition" Actual Form: "traditionally" Primary POS: :ADVERB Additional NIL Complete POS (:ADVERB)  Reverse Engineering: traditionally (adverb)  traditional (adjective)  tradition (noun)  Parts-of-Speech Lexicon currently has about 50,000 entries.  Appriximately one million words in the English language
  • 15. Syntactic Analysis  Grammars  Productions (grammatical rules)  LHS: A non-terminal symbol  RHS: A disjunction of conjunctions of TS & NTS  Can be recursive  Non-Terminal Symbols  Terminal Symbols (lexicon entries)  Start Symbol  Implicitly Define an AND-OR Tree.  Context-Free Grammars, Attribute Grammars  Parsers  Traverse a grammar while consuming input tokens in an attempt to find a valid path through the grammar that accommodates the input tokens.  Produce parse trees in which the internal nodes are Non-Terminal Symbols (NTS) and the leaves are Terminal Symbols (TS)  Three typical ways to handle non-determinism  Backtracking  Look-ahead  Parallelism
  • 16. Parsing in BABAR  Implements a Recursive Descent Parser which performs a top-down traversal of the grammar.  Uses backtracking to handle non-determinism  3 Types of objects: tokens, grammars and parse-nodes  Scanner  Creates of seven fundamental token classes based on character composition  alphabetic, numeric, special, alpha-numeric, alpha-special, numeric-special and alpha-numeric-special  Implemented using multiple-inheritance:  alphabetic-mixin, numeric-mixin and special-mixin classes  Parser Module (Scanner, Analyzer, Parser)  Implements a set of classes and generic functions geared towards being easily able to develop particular domain–specific parsers.
  • 17. Level 1 (simple) Class grammar Macro (define-grammar <name><prods><preds> &key <class>) GF (scan-tokens <string> <grammar>&key <delimiter>) GF (parse-tokens <tokens> <grammar>) Level 2 (context) Class context-grammar Macro (define-context-grammar <name> <prods> <preds> <context>) Macro (with-grammar-context (<context><grammar>) &body <body>) GF (analyze-tokens <tokens> <grammar>) Level 3 (domain) Macro (define-lexicon <name> <fields>) Macro (define-word-class <word-type> &optional <slots>) Level 4 (english) Adds english-grammar, scan-tokens, analyze-word-morphology
  • 18. Crawling Wikipedia  Wikipedia has approximately 4 million pages. (initialize-wiki-graph <topic><depth>)  Returns a graph object (crawl-wiki-topic <topic> <depth>)  Returns a Hash-Table of related-topics  For topic=elephant and depth=  #<EQUALP hash-table with 2580 entries> (generate-wiki-graph <hash-table>)  Only create a vertex for keyss (pruning)  Non-key related topics are ignored (pruning)  Create a ‘related-to edge for every (<key> <related-topic>) pair.  Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>  With Pruning: #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)  With Pruning: #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)  A complete graph of n vertices has n(n-1)/2 edges ≡ O(n2)
  • 19. Link Name Organization  Internal, External and Intranal hyperlinks  I chose the Elephant page as my entry page for crawling  There are 228 internal links from the Elephant page.  These occur throughout 103 paragraphs of text  Goal: Organize the 228 links into a meaningful taxonomy Asian_Elephant Elephant African_Bush_Elephant African_Elephant African_Forest_Elephant  Apply NLP to link names: i.e. parse the link names.  Partition link names into subtopic, supertopic and related.  Subtopic candidate elimination  Partition related topics into strongly and weakly related based on link bi-directionality
  • 20. Subtopic Taxonomy Generation Algorithm (generate-subtopic-relations-in-graph <graph>) 1. Produce Candidates: a list of pairs of concepts. Each pair of concepts is such that the first concept is a generalization of the second concept. This is determined by noting concepts that when parsed produce a set of tokens that is subset of the set tokens produced by parsing the second concept. 2. Eliminate False-Poisitives: These are eliminated by ensuring that the subjects of the phrases of each set of parsed tokens are identical.  E.g. Elephant_Hotel is not a subtopic of Elephant whereas Hotel_Elephant would a be subtopic of Elephant. This is one place where NLP really adds value. 3. Replace ‘related-to relations with ‘generalizes relations. 4. Eliminate direct ‘generalizes relationships between children and non-parent ancestors.  E.g. Elephant and North_African_Elephant. 5. Eliminate Singletons: Prune the list of sub trees by eliminating singleton sub trees thus leaving them in a state of yet to be classified  Finally return a forest of trees, i.e. a list of root nodes.
  • 21. Subtopic Taxonomies  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes (62%) and 986 yet to be classified nodes. Elephant Tree Elephant_Seal Tree -> Elephant -> Elephant_seal -> Dwarf_elephant -> Southern_elephant_seal -> Northern_elephant_seal -> Sri_Lankan_elephant -> Year_of_the_Elephant -> Sumatran_Elephant Intelligence Tree -> White_elephant -> Intelligence -> Fish_intelligence -> War_elephant -> Cat_intelligence -> Crushing_by_elephant -> Artificial_intelligence -> Babar_the_Elephant -> Electronic_Transactions_on_Artificial_Intelligence -> Indian_Elephant -> Swarm_intelligence -> Cephalopod_intelligence -> African_elephant -> Dinosaur_intelligence -> African_Forest_Elephant -> Cetacean_intelligence -> North_African_Elephant -> Evolution_of_human_intelligence -> African_Bush_Elephant -> Elephant_intelligence -> Dog_intelligence -> Execution_by_elephant -> Pigeon_intelligence -> Borneo_pygmy_elephant -> Primate_intelligence -> Horton_the_Elephant -> Bird_intelligence -> Asian elephant -> Elmer_the_Patchwork_Elephant
  • 22. Subtopic Taxonomy Issues -> Lion -> Lion (cont.) -> Congolese_Spotted_Lion -> Sea_lion -> Asiatic_Lion -> Steller_sea_lion -> Masai_lion -> Australian_sea_lion -> Barbary_lion -> South_American_sea_lion -> Henry_the_Lion -> New_Zealand_sea_lion -> Sri_Lanka_lion -> California_sea_lion -> Nemean_lion -> American_lion -> Western_African_lion -> White_lion -> Transvaal_Lion -> Kimba_the_White_Lion -> West_African_lion -> Cowardly_Lion -> Tsavo_lion -> Tiger_versus_lion -> Southwest_African_Lion -> European_lion -> Cape Lion WRT Nomenclature purity, Lion_Seal is a better name than Sea_Lion.
  • 23. Clustering  Two Fundamental Perspectives:  Top-Down: Partitioning a set into disjoint subsets  Bottom-Up: Grouping data points into disjoint clusters  Goes hand-in-hand with classification  Typically involves a metric: Euclidian or Manhattan distance  Many, many different algorithms & books.  Some really popular algorithms:  K-Means Clustering (EM, PCA)  Hierarchical Agglomerative Clustering  K-Nearest Neighbor (classification)  SR-Clustering: This is something I (re)invented.  Effectively: The world’s simplest clustering algorithm.
  • 24. K-Means Clustering (1)  Given an initial set of cluster centroids, determine the actual centroids of each cluster via an iterative refinement algorithm.  Each refinement iteration consists of two steps : 1. Computing new data point centroid assignments 2. Computing new centroid positions based of the mean deviation of the data points from the previous centroid positions.  Converge, Divergence, Oscillation….  Also known as Lloyd’s Algorithm in CS.
  • 25. K-Means Clustering (2) Wikipedia: Given a set of observations (x1, x2, …, xn), where each observation is a d- dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within- cluster sum of squares (WCSS): where μi is the mean of points in Si
  • 26. K-Means Clustering (3)  Assignment Step: Defines Si to be the set of xi that deviate least from Si  Update Step: Calculate the new means to be the centroid of the observations in the cluster. I.e. The average along each dimension
  • 27. K-Means Clustering(4)  K-Means is *really* a 3 step algorithm  Step1. Initialize K-Means (non-trivial)  Problem 1: Estimate K  Problem 2: Pick Initial Centroid for each K  Iterative Refinement  Step 2: Centroid Assignments  Step3: Centroid Update  Many initialization approaches:  Random, Forgy, MacQueen and Kaufman  Performance depends on initialization and instance ordering  Popular because of its robustness  Related to:  EM Algorithm and  Principal Component Analysis (PCA)
  • 28. Hierarchical Agglomerative Clustering  The Algorithm 1. Cluster each data point with its nearest neighbor(s) and make that a new data point (cluster). 2. Repeat until some fixed number of clusters is reached.  K-Nearest Neighbor is often used hand-in-hand with agglomerative clustering to compute the nearest neighbor(s).  End up with a tree of clusters (clustering history)  This tree is called a dendogram  See Chapter 6 of Duda & Hart (SRI, 1973) Pattern Classification & Scene Analysis
  • 29. SR-Clustering (1)  Simple Ray Clustering   Sort of like non-hierarchical agglomerative clustering  Basic Algorithm  For each data point, place it in the correct cluster  If it doesn’t belong to any cluster, create a new cluster consisting of that single data point  Cluster Membership  Defined as being within a certain proximity threshold of every data point in that cluster.  Proximity Metric  The Jaccard Index
  • 30. Recommender Systems
   Used by Netflix, Amazon, etc.
   Objects: users, items and preferences
   User-based vs. item-based recommendations; the former is also known as collaborative filtering
   Mixed-method recommendations
   Based on user similarity and/or item similarity
   The Jaccard Index takes dissimilarity into account and does not require preference measurements
   Apache Mahout (leverages Hadoop)
  • 31. Jaccard Index
   Defines a similarity metric between two sets
   Wikipedia: the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
       J(A, B) = |A ∩ B| / |A ∪ B|
   Jaccard distance: dJ(A, B) = 1 − J(A, B)
   (A sketch follows below.)
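A minimal Common Lisp sketch of both measures over lists treated as sets; illustrative only.

   (defun jaccard-index (set-a set-b &key (test #'equal))
     "Size of the intersection divided by the size of the union."
     (let ((union-size (length (union set-a set-b :test test))))
       (if (zerop union-size)
           1.0                                   ; both sets empty: treat as identical
           (/ (length (intersection set-a set-b :test test))
              union-size))))

   (defun jaccard-distance (set-a set-b)
     "Dissimilarity: 1 minus the Jaccard index."
     (- 1 (jaccard-index set-a set-b)))

   ;; Example with related-topic names:
   ;; (jaccard-index '("Mammoth" "Ivory" "Kenya") '("Mammoth" "Ivory" "Tusk")) => 1/2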
  • 32. Another Similarity Metric
   Pearson Correlation Coefficient
   Wikipedia: defined as the covariance of the two variables divided by the product of their standard deviations:
       ρ(X, Y) = cov(X, Y) / (σX σY)
   (A sketch follows below.)
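A minimal sketch of the sample Pearson correlation for two equal-length lists of numbers; the 1/n factors in the covariance and the standard deviations cancel in the ratio, so plain sums suffice.

   (defun mean (xs)
     (/ (reduce #'+ xs) (length xs)))

   (defun pearson-correlation (xs ys)
     "Sample Pearson correlation of two equal-length lists of numbers."
     (let* ((mx (mean xs))
            (my (mean ys))
            (dx (mapcar (lambda (x) (- x mx)) xs))
            (dy (mapcar (lambda (y) (- y my)) ys))
            (cov (reduce #'+ (mapcar #'* dx dy)))
            (sx (sqrt (reduce #'+ (mapcar (lambda (d) (* d d)) dx))))
            (sy (sqrt (reduce #'+ (mapcar (lambda (d) (* d d)) dy)))))
       (if (or (zerop sx) (zerop sy))
           0                                   ; undefined when either variance is 0
           (/ cov (* sx sy)))))

   ;; (pearson-correlation '(1 2 3) '(2 4 6)) => 1.0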
  • 33. (compute-similarity-matrix <topics>)
   Computes the Jaccard index for pairs of topics, using the related topics of each topic as the sets to be compared (a sketch of this computation follows below).

             African   Asian  Indian   Babar  Horton     War
   African    100.00   38.46   21.05    4.35    6.82    7.94
   Asian       38.46  100.00   37.74    4.00    6.25   20.00
   Indian      21.05   37.74  100.00    6.90    7.14   24.39
   Babar        4.35    4.00    6.90  100.00   28.57    7.14
   Horton       6.82    6.25    7.14   28.57  100.00    7.41
   War          7.94   20.00   24.39    7.14    7.41  100.00
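A minimal sketch of how such a matrix could be built from the jaccard-index above. The related-topics-fn accessor is a hypothetical stand-in for however BABAR retrieves a topic's related topics; values are scaled to percentages to match the table.

   (defun compute-similarity-matrix-sketch (topics related-topics-fn)
     "Return a 2D array of pairwise Jaccard similarities (as percentages).
   RELATED-TOPICS-FN maps a topic to the list of its related topic names."
     (let* ((n (length topics))
            (sets (mapcar related-topics-fn topics))
            (matrix (make-array (list n n))))
       (loop for a in sets for i from 0 do
         (loop for b in sets for j from 0 do
           (setf (aref matrix i j)
                 (* 100.0 (jaccard-index a b)))))
       matrix))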
  • 34. (cluster-subtopics <subtopics> <matrix> <threshold>)
   Cluster 1: Asian_elephant(49), African_elephant(60)
   Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5), Elmer_the_Patchwork_Elephant(4)
   Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12), Sumatran_Elephant(11), Borneo_pygmy_elephant(3)
   Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)
   Cluster 5: Year_of_the_Elephant(8)
   Cluster 6: Dwarf_elephant(24)
   Cluster 7: White_elephant(10)
   Threshold = 20
  • 35. Knowledge Categories (1)
   Human schooling is a decade(s)-long knowledge acquisition process, spanning kindergarten through post-doctoral work.
   The idea is to use grade-school topics as the initial knowledge categories: Science, History, Geography, Literature and Art.
   Goal: assign categories to subtopic clusters.
   Use the Jaccard Index to determine the category.
   Automatically create subtopic category names, e.g. Babar → Literature_Elephant.
  • 36. (compute-cluster-categories <clusters>)
   Wiki-crawl each knowledge category (pre-run)
   Compute the subtopics of each knowledge category
   Compute a category relevancy vector for each cluster member
   Combine the relevancy vectors of the cluster members to compute a relevancy vector for the cluster
   Assign a category to the cluster
   (A sketch of the last two steps follows below.)
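A minimal sketch of the combine-and-assign steps, assuming each member's relevancy vector is an alist of (category . score) pairs like those shown on slide 38. Averaging the member vectors and picking the best-scoring category is one plausible combination rule, not necessarily BABAR's exact one, though it does reproduce the top scores shown on slide 37 for the Babar/Horton/Elmer cluster.

   (defun combine-relevancy-vectors (vectors)
     "Average a list of (category . score) alists into one alist, sorted by score."
     (let ((sums (make-hash-table))
           (n (length vectors))
           (result '()))
       (dolist (vec vectors)
         (dolist (entry vec)
           (incf (gethash (car entry) sums 0) (cdr entry))))
       (maphash (lambda (category total) (push (cons category (/ total n)) result)) sums)
       (sort result #'> :key #'cdr)))

   (defun assign-cluster-category (member-vectors)
     "Best-scoring category for a cluster, given its members' relevancy vectors."
     (car (first (combine-relevancy-vectors member-vectors))))

   ;; Example, using the Babar / Elmer / Horton relevancies from slide 38:
   ;; (assign-cluster-category
   ;;   '(((:literature . 0.44) (:art . 0.25) (:geography . 0.23))
   ;;     ((:art . 0.25) (:geography . 0.23) (:literature . 0.22))
   ;;     ((:art . 0.5) (:geography . 0.46) (:history . 0.4))))
   ;; => :ART   (art averages to 0.333..., geography to 0.3066...)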
  • 37. (compute-cluster-categories <clusters>)
   ((((:SCIENCE 0.47666672) (:HISTORY 0.44666672))
     (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))
    (((:SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
     (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant> #<Concept(18): Indian_Elephant>
      #<Concept(12): Sri_Lankan_elephant> #<Concept(11): Sumatran_Elephant>))
    (((:ART 0.33333334) (:GEOGRAPHY 0.30666667))
     (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>))
    (((:HISTORY 0.6) (:GEOGRAPHY 0.46))
     (#<Concept(8): Year_of_the_Elephant>))
    (((:GEOGRAPHY 0.72333336) (:HISTORY 0.43666664))
     (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant> #<Concept(4): Crushing_by_elephant>))
    (((:GEOGRAPHY 0.86) (:SCIENCE 0.5))
     (#<Concept(24): Dwarf_elephant>))
    (((:SCIENCE 0.69) (:ART 0.49))
     (#<Concept(10): White_elephant>)))
  • 38. Individual Subtopic Categories
   The following shows the knowledge category relevancies for some of the 16 subtopics of Elephant and helps explain the results of the previous slide.
   (#<Concept(7): Babar_the_Elephant>
    ((:LITERATURE 0.44) (:ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) (:SCIENCE 0.17)))
   (#<Concept(4): Elmer_the_Patchwork_Elephant>
    ((:ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) (:SCIENCE 0.17)))
   (#<Concept(5): Horton_the_Elephant>
    ((:ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) (:LITERATURE 0.22)))
   (#<Concept(60): African_elephant>
    ((:ART 1.03) (:SCIENCE 0.91) (:HISTORY 0.85) (:GEOGRAPHY 0.77) (:LITERATURE 0.37)))
   (#<Concept(49): Asian_elephant>
    ((:HISTORY 0.7) (:SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) (:LITERATURE 0.19)))
   (#<Concept(22): War_elephant>
    ((:HISTORY 0.93) (:GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
  • 39. Categorized Subtopic Clusters
   Elephant
       ART_Elephant: Elmer_the_Patchwork_Elephant, Horton_the_Elephant, Babar_the_Elephant
       GEOGRAPHY_Elephant: Dwarf_elephant, Crushing_by_elephant, War_elephant, Execution_by_elephant
       HISTORY_Elephant: Year_of_the_Elephant
       SCIENCE_Elephant: African_elephant, African_Forest_Elephant, African_Bush_Elephant, Asian_elephant, White_elephant, Sumatran_Elephant, Sri_Lankan_elephant, Indian_Elephant, Borneo_pygmy_elephant
  • 40. Related Topics Associations
   Associate related topics to subtopic clusters using the Jaccard Index
   Use the associations to create related topic clusters
   (find-compatible-clusters <strongly-related-topics> <clusters>)
   ((#<Concept(60): African elephant>
     ((#<Concept(24): Dwarf elephant>) #<Concept(49): Asian_elephant>)
     (#<Concept(66): Mammoth>
      (#<Concept(10): Elephant intelligence> #<Concept(25): Mastodon> #<Concept(275): Genus>
       #<Concept(103): Animal cognition> #<Concept(62): Afrotheria> #<Concept(4): Elephant tusk>
       #<Concept(86): Gestation> #<Concept(15): African> #<Concept(749): Eutheria>
       #<Concept(102): Proboscidea> #<Concept(8): Gomphotherium> #<Concept(96): Mammalia>
       #<Concept(27): Tooth> #<Concept(876): Mammal> #<Concept(8): Tooth_development>))
     #<Concept(143): Hippopotamus> #<Concept(590): Lion> #<Concept(10): Loxodonta>)
    ((#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>)
     ((#<Concept(22): War_elephant> #<Concept(5): Execution_by_elephant> #<Concept(6): List_of_fictional_elephants> #<Concept(4): Crushing_by_elephant>)
      #<Concept(5): List_of_elephants_in_mythology_and_religion> #<Concept(5): Pinnawala>
      (#<Concept(55): Ivory> #<Concept(3): Katy_Payne> #<Concept(77): Kenya> #<Concept(11): Infrasound>
       #<Concept(31): Grief> #<Concept(56): Incisor> #<Concept(8): History_of_elephants_in_Europe>))
     #<Concept(14): Jeheskel_Shoshani> #<Concept(6): Aanayoottu>))
  • 41. (sentence-to-clause <sentence>)
   English sentence string
   → Scanner → tokens
   → Analyzer → morphologically analyzed words
   → Parser → parse tree
   → Phrase Extractor → phrases (flattened parse tree)
   → Semantic Analyzer → frames for subject, verb, object and prepositional phrases
   → Clause Generator → clause objects
  • 42. Sample Extracted Clauses
   (HASA "Asian elephant species" "disjunct distributions")
   (ISA "Elephants" "herbivores")
   (HASA "African Elephants" "three nails")
   (HASA "Indian Elephants" "four nails")
   (HASA "female African Elephants" "large tusks")
   (ISA "Elephants" "large land mammals")
  • 43. Things Overlooked
   Wiki page contents pane
       Provides page taxonomy
       Provides category names
       Provides related topic names
   Concept weights
  • 44. Future Direction
   Enhance the English parser
   Incorporate variables into the semantic net
   Leverage topic weights
   Work on language generation
   Produce Wiki summary pages
   Knowledge queries
   Develop a client-side browser
       Top menu bar: knowledge categories
       RHS: dynamic subtopic tree
       LHS: Wiki page contents pane
  • 45. (references)
   Speech and Language Processing, Jurafsky and Martin
   Artificial Intelligence: A Modern Approach, Russell and Norvig
   Principles of Semantic Networks, edited by John F. Sowa, Morgan Kaufmann, 1991
   Machine Learning, Tom Mitchell, 1997
   Pattern Classification and Scene Analysis, Duda and Hart, 1973
   Algorithms of the Intelligent Web, Marmanis and Babenko, 2009