Semi structure data extraction

SEMI-STRUCTURE
DATA EXTRACTION
Rajendra Akerkar
(with David Camacho, Maria D. R-Moreno,
David F. Barrero)

Bonn, June 2007

INDEX
 Introduction

 Semantic Generators

 The WebMantic architecture

 A practical example

 Some experimental issues

 Conclusions

INTRODUCTION
 Web information
 Unstructured
 Non-semantic
 Designed for humans not for crawlers

 Problems
 Representation (HTML vs XML)
 Extract, filter and reuse data
 Share information
 Volatility
 Fault tolerance

INTRODUCTION
 Information Extraction techniques
 Machine learning
 Pattern recognition
 Wrappers technologies
 Tools for automatic and semi-automatic
Web data extraction

 This work presents
 A rule-based method for data identification
 An approach to Web data extraction
 A particular implementation of the previous
method

SEMANTIC GENERATORS
  Def: A Semantic Generator (Sg) is a non-empty
set of rules (HTML2XML) that can be
used to translate HTML documents into XML
documents

  A Semantic Generator (Sg) is built from several
rules which transform a set of non-semantic
HTML tags into a set of semantic XML tags

 HTML2XML rule format

HTML2XML_i = <header> IS <body> #num
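
The rule format above can be read mechanically. The following is a minimal sketch, assuming a rule is written as plain text in the form `<header> IS <body> #num`, that the header is a dot-separated HTML tag path, and that `#num` may be omitted. The names `Html2XmlRule` and `parse_rule` are hypothetical and are not part of WebMantic.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed textual form: "<header> IS <body> #num",
# for example "<table.tr.td> IS <my-xml-tag> #3".
RULE_RE = re.compile(r"^\s*<([^<>]+)>\s+IS\s+<([^<>]+)>(?:\s+#(\d+))?\s*$")


@dataclass(frozen=True)
class Html2XmlRule:
    header: Tuple[str, ...]  # HTML tag path, e.g. ("table", "tr", "td")
    body: str                # semantic XML tag name
    num: Optional[int]       # number of cells to process, None if omitted


def parse_rule(text: str) -> Html2XmlRule:
    """Split one rule string into its header path, XML tag and #num."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    path, body, num = match.groups()
    return Html2XmlRule(tuple(path.split(".")), body, int(num) if num else None)
```

For example, `parse_rule("<table.tr.td> IS <my-xml-tag> #3")` yields a header of `("table", "tr", "td")`, a body of `my-xml-tag` and a `num` of 3.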

SEMANTIC GENERATORS

  HTML2XML: <table.tr.td> IS <my-xml-tag>

Tags such as <table> <tr> <td> <A href…> etc.
will be removed; only the data will be extracted

  #num: provides the number of cells to be processed

  <my-xml-tag> Madrid </my-xml-tag>
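
As an illustration of how a single rule could be applied, here is a small sketch using Python's standard `html.parser`. It assumes the input is well-formed, that the header is matched against the tail of the open-tag stack, and that `#num` means "keep the first num matching cells". These are assumptions made for the example, and `CellExtractor` and `apply_rule` are hypothetical names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class CellExtractor(HTMLParser):
    """Collect the text of every element whose tag path ends with `path`."""

    def __init__(self, path):
        super().__init__()
        self.path = list(path)
        self.stack = []     # currently open tags
        self.depth = None   # stack depth of the matched element, if inside one
        self.buffer = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        if self.depth is None and self.stack[-len(self.path):] == self.path:
            self.depth = len(self.stack)
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        if self.depth == len(self.stack):  # the matched element is closing
            self.cells.append("".join(self.buffer).strip())
            self.depth = None
        self.stack.pop()

    def handle_data(self, data):
        if self.depth is not None:  # inner tags are dropped, text is kept
            self.buffer.append(data)


def apply_rule(html, path, xml_tag, num=None):
    """Return the matched cell texts wrapped in the semantic XML tag."""
    parser = CellExtractor(path)
    parser.feed(html)
    parser.close()
    cells = parser.cells if num is None else parser.cells[:num]
    return [f"<{xml_tag}> {escape(text)} </{xml_tag}>" for text in cells]
```

Inner markup such as the link around a city name is discarded because only text reaching `handle_data` is stored.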

SEMANTIC GENERATORS

Semantic generator

WEBMANTIC ARCHITECTURE
  WebMantic:

  Automatically generates Sg

  Generalizes HTML2XML rules

  Guides the extraction process

  Automatically generates wrappers

  Tidy HTML parser (http://tidy.sourceforge.net). It
translates HTML documents into well-formed
HTML documents
  The HTML Tidy program (HTML parser and
pretty printer) has been integrated as the first
preprocessing module in WebMantic.

  Tree generator module. Once the HTML page is
preprocessed by the Tidy parser, a tree representation
of the structures stored in the page is built
  In this representation any table or list tag
generates a node, and the leaves of the tree are: cells
for tables (th, td, tr) or items for lists (li, lo)
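
A tree of this kind can be sketched with Python's standard `html.parser`. The example below assumes well-formed input (as produced by a Tidy pass) and tracks only table and list tags, storing text in cells and list items. `Node`, `TreeBuilder` and `outline` are hypothetical names chosen for the illustration, and the exact set of tracked tags is an assumption.

```python
from html.parser import HTMLParser

TRACKED = {"table", "tr", "th", "td", "ul", "ol", "li"}  # assumed tag set
TEXT_HOLDERS = {"th", "td", "li"}                        # cells and items


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a tree that contains only table and list structures."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        # Untracked tags never create nodes, so their end tags are ignored.
        if tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.current.tag in TEXT_HOLDERS:
            self.current.text += data


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    text = node.text.strip()
    lines = ["  " * depth + node.tag + (f": {text}" if text else "")]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Feeding a page to `TreeBuilder` and printing `"\n".join(outline(builder.root))` gives a quick view of which structures a rule could target.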

  HTML2XML rule generator module. The tree
representation obtained is used by this module
to generate a set of rules (Sg) that represent
the information to be translated
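
One way to picture rule generation is the sketch below. It assumes each accepted structure is described by its leaf tag path and its number of cells, and that the user supplies an XML tag per structure (structures without a tag are treated as rejected). The function name `generate_rules` and this input layout are hypothetical, not taken from WebMantic.

```python
def generate_rules(structures, user_tags):
    """Emit one HTML2XML rule per structure the user has tagged.

    structures: list of (leaf_path, cell_count), e.g. (("table", "tr", "td"), 3)
    user_tags:  dict mapping a structure's index to the XML tag chosen for it
    """
    rules = []
    for index, (path, cells) in enumerate(structures):
        xml_tag = user_tags.get(index)
        if xml_tag is None:
            continue  # structure rejected by the user, no rule is generated
        rules.append(f"<{'.'.join(path)}> IS <{xml_tag}> #{cells}")
    return rules
```

Because one rule is produced per structure, repeated structures lead to repeated rules, which is the redundancy a later generalization step can remove.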

HTML2XML rules

  Subsumption module. The previous module generates a
rule for each structure to be translated. However,
some of those rules can be generalized if the
XML-tag represents the same concept (e.g. the
rules in the previous example that represent the
concepts of <data-record> and <country>)
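
The slides do not spell out the subsumption criterion, so the following is only a sketch under an assumed reading: two rules are merged when they share both the HTML header and the XML tag, and the merged rule keeps the largest `#num` so that it covers every structure it replaces. The function name `subsume` is hypothetical.

```python
def subsume(rules):
    """Merge rules that share the same header and XML tag.

    rules: list of (header, body, num) tuples,
           e.g. ("table.tr.td", "country", 3)
    """
    merged = {}
    for header, body, num in rules:
        key = (header, body)
        # Assumption: the general rule keeps the largest cell count seen.
        merged[key] = max(merged.get(key, 0), num)
    # dict preserves insertion order, so the first occurrence fixes the position
    return [(header, body, num) for (header, body), num in merged.items()]
```

Rules with different tags are left untouched, so distinct concepts such as `<data-record>` and `<country>` would each keep their own generalized rule.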

  XML Parser module. This module receives both
the Semantic Generator obtained in the previous
module and the (well-formed) HTML document
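
Since the input at this stage is well-formed, the translation step can be sketched with Python's standard `xml.etree.ElementTree`. The example assumes the document parses as XML without a namespace or HTML-only entities, that the structures sit below the root element, and that rules are given as `(header, body, num)` tuples. `translate` and the `document` root tag are hypothetical names.

```python
import xml.etree.ElementTree as ET


def translate(xhtml, rules, root_tag="document"):
    """Apply every rule of a Semantic Generator and return an XML string.

    rules: list of (header, body, num); header is a dotted tag path such as
           "table.tr.td", num limits the cells taken (None means all).
    """
    source = ET.fromstring(xhtml)
    output = ET.Element(root_tag)
    for header, body, num in rules:
        # "table.tr.td" becomes the ElementTree path ".//table/tr/td"
        path = ".//" + "/".join(header.split("."))
        for cell in source.findall(path)[:num]:
            # itertext() drops inner tags such as <a> and keeps only the data
            ET.SubElement(output, body).text = "".join(cell.itertext()).strip()
    return ET.tostring(output, encoding="unicode")
```

Parsing with a strict XML parser only works because of the earlier clean-up step. Raw HTML from the Web would usually make `ET.fromstring` raise a `ParseError`.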

Semantic Generator
Yahoo! Weather

XML Parser

WEBMANTIC GUI

WebMantic’s GUI

WEBMANTIC GUI

www.citypopulation.de

WEBMANTIC GUI

First tables & lists are rejected

WEBMANTIC GUI

First data-table is rejected

WEBMANTIC GUI

data-table target

WEBMANTIC GUI

XML tag generation (user interaction)

WEBMANTIC GUI

XML tags & HTML2XML rules

WEBMANTIC HTML PROCESSING

Tree generated from HTML document

Relation between the HTML tree and the XML-tags provided by the user

WEBMANTIC HTML PROCESSING

HTML2XML rules

Semantic Generator: HTML2XML subsumed rules

EXPERIMENTAL RESULTS
 Experimental tests (Web sites used):
 Population (www.citypopulation.de)

 Yahoo Weather (weather.yahoo.com)

  Iberia Airlines (www.iberia.com)

 Several parameters have been evaluated:

1. Number of pages tested from each Web site

2. Number of accessible structures

3. Maximum nested structure

4. Average number of HTML2XML rules for each Semantic
Generator (Sg), once the subsumption process has
finished

5. Average time (seconds) to generate the Sg (Time Sg)

6. Average time (seconds) to translate from HTML to
XML for the set of training pages (transformation time)
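
Two of these parameters can be measured directly from a page. The sketch below assumes that "accessible structures" means every table or list element in the page and that "maximum nested structure" means the deepest nesting of such elements. Both readings are assumptions, and `StructureStats` and `structure_stats` are hypothetical names.

```python
from html.parser import HTMLParser

STRUCTURES = {"table", "ul", "ol"}  # assumed set of structure tags


class StructureStats(HTMLParser):
    """Count table/list structures and track their deepest nesting."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURES:
            self.count += 1
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag in STRUCTURES and self.depth > 0:
            self.depth -= 1


def structure_stats(html):
    """Return (number of structures, maximum nesting depth) for one page."""
    parser = StructureStats()
    parser.feed(html)
    parser.close()
    return parser.count, parser.max_depth
```

A page with a list nested inside a table cell, followed by a separate list, would report three structures and a maximum nesting of two.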

CONCLUSIONS AND FUTURE WORK
 Conclusions:

  We define a technique which is able to provide a
semantic representation (using XML-tags) for
semi-structured (tables and lists) Web pages through a set of
rules (encapsulated in a Semantic Generator)
  Rules are created and automatically generalized
  These rules can be used to preprocess Web pages with a
similar structure, and convert them into XML
documents with semantic tags
  These can be integrated into information agents

CONCLUSIONS AND FUTURE WORK
 In the near future:

  Other Web technologies such as DOM

  Ontologies

  Machine learning algorithms to automatically
learn new (similar) Web pages

 Statistical knowledge extraction
