NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013

Raphaël Troncy <raphael.troncy@eurecom.fr>
Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>

What is a Named Entity Recognition task?
- A task that aims to locate and classify, in a textual document, the names of persons, organizations, locations, brands and products, as well as numeric expressions such as time, date, money and percent

NLP&DBpedia International Workshop, Sydney, 22 October 2013

Example
“I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower”

Eric Charton, “Named Entity Detection and Entity Linking in the
Context of Semantic Web: Exploring the ambiguity question”


Part of Speech
Token:  I    want  to   book  a    room  in   …    Paris
POS:    PRP  VBP   TO   VB    DT   NN    IN   …    NNP

NER: What is Paris?
NEL: Which Paris are we talking about?

Giuseppe Rizzo, “Learning with the Web: Structuring data to
ease machine understanding”


What is Paris? Type Ambiguity

dbpedia-owl:Asteroid

schema:City

schema:Movie
dbpedia-owl:Film



Named Entity Recognition (NER)
Token:  I    want  to   book  a    room  in   …    Paris
POS:    PRP  VBP   TO   VB    DT   NN    IN   …    NNP
NER:    O    O     O    O     O    O     O    …    LOC



What is Paris? Name Ambiguity

Paris, Kentucky

Paris, France

Paris, Maine

Paris, Idaho

Paris, Tennessee

Paris, Ontario


Named Entity Linking (NEL)
Token:  I    want  to   book  a    room  in   …    Paris
POS:    PRP  VBP   TO   VB    DT   NN    IN   …    NNP
NER:    O    O     O    O     O    O     O    …    LOC
NEL:    O    O     O    O     O    O     O    …    http://dbpedia.org/resource/Paris
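
The annotation layers above (token, POS tag, NER tag, link) can be held in one record per token. The Python sketch below is only an illustration: the `Token` class, its field names and the `linked_entities` helper are assumptions made for this example, not part of NERD. The tags and the DBpedia URI are the ones shown on the slide, and the tokens elided by "…" are simply left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    pos: str                    # part-of-speech tag
    ner: str = "O"              # "O" marks a token outside any entity
    link: Optional[str] = None  # URI of the linked resource, if any


# The example sentence, without the tokens elided on the slide.
sentence = [
    Token("I", "PRP"), Token("want", "VBP"), Token("to", "TO"),
    Token("book", "VB"), Token("a", "DT"), Token("room", "NN"),
    Token("in", "IN"),
    Token("Paris", "NNP", ner="LOC", link="http://dbpedia.org/resource/Paris"),
]


def linked_entities(tokens):
    """Return (text, type, URI) for every token that carries an entity tag."""
    return [(t.text, t.ner, t.link) for t in tokens if t.ner != "O"]


print(linked_entities(sentence))
```

Keeping the NER tag and the link as separate fields mirrors the split between recognition (what is Paris?) and linking (which Paris?).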



NER Tools and Web APIs
- Standalone software
  - GATE
  - Stanford CoreNLP
  - Temis
- Web APIs

http://nerd.eurecom.fr/


NERD: Named Entity Recognition and Disambiguation
- Compare the performance of NER and NEL tools
- Understand the strengths and weaknesses of different Web APIs
- Adapt NER processing to different contexts
- (Learn how to) combine NER (/ NEL) tools
- Participate in various benchmarks


What is NERD?
- ontology [1]
- REST API [2]
- UI [3]

[1] http://nerd.eurecom.fr/ontology
[2] http://nerd.eurecom.fr/api/application.wadl
[3] http://nerd.eurecom.fr


Factual comparison of 10 Web NER tools

Tool              | Language                | Granularity | Entity position | Classification schema         | Number of classes | Response format             | Quota (calls/day)
AlchemyAPI        | EN,FR,GR,IT,PT,RU,SP,SW | OEN         | N/A             | Alchemy                       | 324               | JSON, MicroFormat, XML, RDF | 30000
DBpedia Spotlight | EN,GR*,PT*,SP*          | OEN         | char offset     | DBpedia, FreeBase, Schema.org | 320               | HTML, JSON, RDF, XML        | unl
Evri              | EN,IT                   | OED         | N/A             | Evri                          | 5                 | HTML, JSON, RDF             | 3000
Extractiv         | EN                      | OEN         | word offset     | DBpedia                       | 34                | HTML, JSON, RDF, XML        | 3000
Lupedia           | EN,FR,IT                | OEN         | range of chars  | DBpedia, LinkedMDB            | 319               | HTML, JSON, RDFa, XML       | unl
OpenCalais        | EN,FR,SP                | OEN         | char offset     | OpenCalais                    | 95                | JSON, MicroFormat           | 50000
Saplo             | EN,SW                   | OED         | N/A             | N/A                           | 5                 | JSON                        | 1333
Wikimeta          | EN,FR,SP                | OEN         | POS offset      | ESTER                         | 7                 | JSON, XML                   | unl
Yahoo!            | EN                      | OEN         | range of chars  | Yahoo                         | 13                | JSON, XML                   | 5000
Zemanta           | EN                      | OED         | N/A             | FreeBase                      | 81                | XML, JSON, RDF              | 10000


NERD Ontology

Aligned the taxonomies used by the extractors
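
As a rough illustration of what such an alignment allows, the Python sketch below maps an extractor-specific type label to a NERD class through a lookup table. Everything in it is an assumption made for the example: the extractor keys, the native type labels and the `to_nerd_type` helper are hypothetical, and the real alignments are the ones published in the NERD ontology. Only the namespace `http://nerd.eurecom.fr/ontology#` and the class names (Person, Company, City) come from these slides.

```python
NERD = "http://nerd.eurecom.fr/ontology#"

# Hypothetical alignment entries: (extractor, native type) -> NERD class.
# They are illustrative only and not copied from the NERD ontology.
ALIGNMENT = {
    ("alchemyapi", "City"): NERD + "City",
    ("dbpedia-spotlight", "schema:City"): NERD + "City",
    ("opencalais", "Company"): NERD + "Company",
    ("wikimeta", "PERS"): NERD + "Person",
}


def to_nerd_type(extractor, native_type):
    """Return the aligned NERD class URI, or None when no alignment is known."""
    return ALIGNMENT.get((extractor, native_type))


print(to_nerd_type("alchemyapi", "City"))
print(to_nerd_type("opencalais", "Asteroid"))
```

Keying the table on the pair (extractor, native type) keeps two extractors that reuse the same label for different concepts from colliding.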

Building the NERD Ontology

NERD type    | Occurrence
Person       | 10
Organization | 10
Country      | 6
Company      | 6
Location     | 6
Continent    | 5
City         | 5
RadioStation | 5
Album        | 5
Product      | 5
...          | ...

NERD REST API
Resources:
/document
/user
/annotation/{extractor}
/extraction
/evaluation
...

Methods: GET, POST, PUT, DELETE

Output: RDF, JSON

JSON
{
  "entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
  }]
}
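
A client could consume such a JSON answer roughly as follows. This Python sketch is an assumption-laden illustration: it hard-codes the sample payload from the slide instead of calling the API, and the `entities_of_type` helper is a name invented for this example. Only the field names (`entity`, `nerdType`, `startChar`, `endChar`) and the ontology namespace come from the slide.

```python
import json

# Sample payload copied from the slide; no network call is made here.
response = """
{"entities": [{
  "entity": "Tim Berners-Lee",
  "type": "Person",
  "uri": "http://dbpedia.org/resource/Tim_berners_lee",
  "nerdType": "http://nerd.eurecom.fr/ontology#Person",
  "startChar": 30,
  "endChar": 45,
  "confidence": 1,
  "relevance": 0.5
}]}
"""


def entities_of_type(payload, nerd_class):
    """Return (surface form, start, end) for entities aligned to a NERD class."""
    data = json.loads(payload)
    wanted = "http://nerd.eurecom.fr/ontology#" + nerd_class
    return [(e["entity"], e["startChar"], e["endChar"])
            for e in data["entities"] if e["nerdType"] == wanted]


print(entities_of_type(response, "Person"))
```

Filtering on `nerdType` rather than on `type` makes the selection independent of the taxonomy of the extractor that produced the annotation.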

Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction
Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.


NERD meets NIF
Model documents through a set of strings dereferenceable on the Web
:offset_23107_23110 a str:String ;
    str:referenceContext :offset_0_26546 .

Map string to entity
:offset_23107_23110 sso:oen dbpedia:W3C .

Classification
dbpedia:W3C rdf:type nerd:Organization .
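
The three statements above follow a regular pattern, which the Python sketch below reproduces for one mention. It is a minimal illustration, not the NERD implementation: the `nif_triples` function and its parameters are assumptions, the prefix declarations (`str:`, `sso:`, `rdf:`, `dbpedia:`, `nerd:`) are omitted, and the output is plain text rather than a validated RDF graph.

```python
def nif_triples(start, end, doc_length, entity, nerd_class):
    """Build the three Turtle statements for one mention.

    start/end are the character offsets of the mention and doc_length is
    the length of the whole document, used for the reference context.
    """
    mention = f":offset_{start}_{end}"
    context = f":offset_0_{doc_length}"
    return [
        f"{mention} a str:String ;\n    str:referenceContext {context} .",
        f"{mention} sso:oen {entity} .",
        f"{entity} rdf:type {nerd_class} .",
    ]


for statement in nif_triples(23107, 23110, 26546,
                             "dbpedia:W3C", "nerd:Organization"):
    print(statement)
```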

Rizzo G., Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France.

NERD User Dashboard


NERD User Interface


History of NER benchmarks
- CoNLL 2003 and CoNLL 2005
  - schema (4 types): person, organization, location and miscellaneous

- ACE 2004, ACE 2005 and ACE 2007
  - schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity
  - entity recognition, co-reference, finding relationships among the extracted entities

- TAC 2009 (Knowledge Base Track)
  - schema (3 types): person, organization and location
  - create a knowledge base from the extracted named entities

- ETAPE 2012 (Named Entity Task)
  - schema: Quaero (7 main types, 32 sub-types)

- MSM 2013: tweet corpus!
  - schema (4 types): person, organization, location, miscellaneous

ETAPE 2012 challenge
genre         | train   | dev    | test   | sources
TV news       | 7h 40m  | 1h 40m | 1h 40m | BFM Story, Top Questions (LCP)
TV debates    | 10h 30m | 5h 10m | 5h 10m | Pile et Face, Ca vous regarde, Entre les lignes (LCP)
TV amusements | -       | 1h 05m | 1h 05m | La place du village (TV8)

                     | Train  | Dev     | Eval
Item length          | 26h    | 10h 55m | 10h 55m
Nb files             | 44     | 15      | 15
Nb words             | 290517 | 91656   | 115511
Nb Named Entities    | 46763  | 14398   | 13055
Nb unique categories | 33     | 33      | 33


NERD @ ETAPE (naïve combined strategy)
extraction (one set of tuples per extractor, A ... N)
(eA1, tA1, URIA1, siA1, eiA1)
(eA2, tA2, URIA2, siA2, eiA2)
(eA3, tA3, URIA3, siA3, eiA3)
...
(eN1, tN1, URIN1, siN1, eiN1)
(eN2, tN2, URIN2, siN2, eiN2)

cleaning

fusion
When at least 2 extractors classify the same entity with different types, we apply a preferred selection order (empirically defined): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
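
The fusion rule can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the code used for ETAPE: tuples are taken to be (extractor, entity, type, URI, start, end), two annotations count as "the same entity" when their character spans are identical, the cleaning step is not modeled, and the sample annotations and offsets are invented. Only the preference order comes from the slide.

```python
from collections import defaultdict

# Preferred selection order from the slide (highest priority first).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]
RANK = {name: position for position, name in enumerate(PRIORITY)}


def fuse(annotations):
    """Keep one annotation per span, chosen by extractor priority.

    annotations: iterable of (extractor, entity, type, uri, start, end).
    Returns (entity, type, uri, start, end) tuples ordered by span.
    """
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann[4], ann[5])].append(ann)
    fused = []
    for span in sorted(by_span):
        # Extractors missing from PRIORITY rank after all listed ones.
        best = min(by_span[span], key=lambda a: RANK.get(a[0], len(RANK)))
        fused.append(best[1:])
    return fused


# Invented sample input: two extractors disagree on the type of "Paris".
annotations = [
    ("alchemyapi", "Paris", "City", "http://dbpedia.org/resource/Paris", 60, 65),
    ("wikimeta", "Paris", "LOC", "http://dbpedia.org/resource/Paris", 60, 65),
    ("lupedia", "Eiffel Tower", "Building", None, 90, 102),
]
print(fuse(annotations))
```

Because Wikimeta ranks first, its type wins for the contested span, while the span seen by a single extractor passes through unchanged.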



Participation at ETAPE (combined+ strategy)
Own extractor, built from the ETAPE Train & Dev corpora: POS tagger, learned model, created static rules, apply rules
(e1, t1, URI1, si1, ei1)

Web extractors (A ... N)
(eA1, tA1, URIA1, siA1, eiA1)
(eA2, tA2, URIA2, siA2, eiA2)
...
(eN1, tN1, URIN1, siN1, eiN1)
(eN2, tN2, URIN2, siN2, eiN2)

fusion
Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia

NERD Global results

          | SLR     | Precision | Recall | F-measure | %correct
combined  | 86.85%  | 35.31%    | 17.69% | 23.44%    | 17.69%
combined+ | 188.81% | 15.13%    | 28.40% | 19.45%    | 28.40%

Combined+: the Eval corpus differs substantially from the Train & Dev corpora. The static rules do not fit the Eval corpus well and introduce classification noise.


Per-extractor results
                  | SLR     | Precision | Recall | F-measure | %correct
alchemyapi        | 37.71%  | 47.95%    | 5.45%  | 9.68%     | 5.45%
lupedia           | 39.49%  | 22.87%    | 1.56%  | 2.91%     | 1.56%
opencalais        | 37.47%  | 41.69%    | 3.53%  | 6.49%     | 3.53%
wikimeta          | 36.67%  | 19.40%    | 4.25%  | 6.95%     | 4.25%
combined (nerd)   | 86.85%  | 35.31%    | 17.69% | 23.44%    | 17.69%
combined+ (nerd+) | 188.81% | 15.13%    | 28.40% | 19.45%    | 28.40%


Learning How to Combine NER Extractors


NERD on CoNLL 2003 (NER task)


NERD on MSM 2013 (NER task)


NERD on MSM 2013 (NEL task)


Media Fragment Enricher: http://mfe.synote.org/mfe/


Linking pieces of knowledge


Named Entities for Video Classification


Workflow
Steps
1: Video URL
2: Metadata
3: metadata
4: NERDify
5: Timed Text
6: NEs with time alignment (json)
7: RDFize (ttl)
8: Generate Category
9: SPARQL query

Components
Media Fragment Enricher UI: Video and metadata preview; Video replay with subtitles and aligned NEs
Media Fragment Enricher Services: Metadata & timed-text; NERD Client; RDFizator; Categorization
Triple Store


Channel signature based on NE distribution


LinkedTV: automatic annotations ...


... and enrichment for hypervideos

CONCEPT IN PLAYER
Cubism
Expressionism
Fauvism

FACETS / PROPERTIES OF CONCEPT

CONTENT ENRICHMENT

Media Fragments and Annotations

http://data.linkedtv.eu/media/e2899e7f#t=840,900

nerd:Location Casablanca
nerd:Location Cafe Rick
nerd:Person H. Bogart
nerd:Person I. Bergman

- Media Fragment URI 1.0
  Chapters, Scenes, Shots, etc…
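
The `#t=840,900` suffix of the URI above is a temporal media fragment: it addresses the interval from second 840 to second 900 of the video. The Python sketch below extracts that interval. It is a minimal illustration that assumes plain seconds, with an optional `npt:` prefix; clock formats such as `hh:mm:ss` and the other fragment dimensions are not handled, and the `temporal_fragment` helper is a name invented for this example.

```python
from urllib.parse import urlsplit, parse_qs


def temporal_fragment(uri):
    """Return (start, end) in seconds from a '#t=start,end' fragment.

    Returns None when the URI has no temporal fragment. A missing start
    defaults to 0.0 and a missing end is reported as None.
    """
    fragment = urlsplit(uri).fragment          # e.g. "t=840,900"
    values = parse_qs(fragment).get("t")
    if not values:
        return None
    value = values[0]
    if value.startswith("npt:"):               # optional time-format prefix
        value = value[len("npt:"):]
    start, _, end = value.partition(",")
    return (float(start) if start else 0.0,
            float(end) if end else None)


print(temporal_fragment("http://data.linkedtv.eu/media/e2899e7f#t=840,900"))
```

An annotation such as `nerd:Person H. Bogart` can then be attached to the returned interval rather than to the whole video.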



Enrichment and Hypervideos

nerd:Location Casablanca
nerd:Location Cafe Rick
nerd:Person H. Bogart
nerd:Person E. Tierney
nerd:Person I. Bergman
nerd:Location China

Media Fragment + Open Annotation + NERD
Locator

MediaResource
OffsetBasedString

Annotation

MediaFragment

Entity
Type

URL (hyperlink)


Towards a Linked Media Layer
- Enriching media with media from a closed collection (e.g. BBC archive)
  - The MediaEval scenario (~ 1697 hours of archived BBC video)
    http://www.multimediaeval.org/mediaeval2013/hyper2013/

- Enriching media with content from the open web
  - LinkedTV scenarios: whitelisted web sites for each program
  - Media Collector for Social Media

Seed video enriched with web content
rbbaktuell_20120809

nerd:Location
Brandenburg
oa

Enrichments are Annotations too


Media Finder (named entities clustering)


Media Finder (zooming in a cluster)


Media Finder: http://mediafinder.eurecom.fr/
- Live Topic Generation from Event Streams
- WWW 2013 Demo Session
- http://www.youtube.com/watch?v=8iRiwz7cDYY


Credits
- Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)
- Thomas Steiner (Google Inc.)
- Marieke van Erp (Free University of Amsterdam)
- Yunjia Li (University of Southampton)
- … and many other students


http://www.slideshare.net/troncy
