SEMI-STRUCTURE
DATA EXTRACTION
Rajendra Akerkar
(with David Camacho, Maria D. R-Moreno,
David F. Barrero)

                                Bonn, June 2007
INDEX
   Introduction

   Semantic Generators

   The WebMantic architecture

   A practical example

   Some experimental issues

   Conclusions
INTRODUCTION
  Web information
    Unstructured
    Non-semantic
    Designed for humans, not for crawlers

  Problems
      Representation (HTML vs XML)
      Extract, filter and reuse data
      Share information
      Volatility
      Fault tolerance
INTRODUCTION
        Information Extraction techniques
          Machine learning
          Pattern recognition
          Wrapper technologies
          Tools for automatic and semi-automatic
           Web data extraction


        This work presents
          A rule-based method for data identification
          An approach to Web data extraction
          A particular implementation of the previous
           method
SEMANTIC GENERATORS
      Def: A Semantic Generator (Sg) is a non-empty
       set of rules (HTML2XML) that can be used to
       translate HTML documents into XML documents

      A Semantic Generator (Sg) is built from several
       rules, which transform a set of non-semantic
       HTML tags into a set of semantic XML tags

      HTML2XML rule format

         HTML2XML_i = <header> IS <body> #num
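The rule format above can be made concrete with a small parser. This is a minimal sketch in Python, not WebMantic's own code: the `Html2XmlRule` class, the `parse_rule` function, and the exact textual syntax (a dot-separated tag path as header, a single XML tag as body, an optional `#num` suffix) are assumptions inferred from the format shown on the slides.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed textual form: "<table.tr.td> IS <my-xml-tag> #3" ("#num" optional).
RULE_RE = re.compile(
    r"^\s*<\s*([\w.\-]+)\s*>\s+IS\s+<\s*([\w\-]+)\s*>(?:\s+#(\d+))?\s*$"
)


@dataclass(frozen=True)
class Html2XmlRule:
    header: Tuple[str, ...]   # HTML tag path, e.g. ("table", "tr", "td")
    body: str                 # semantic XML tag name
    num: Optional[int]        # number of cells to process, if given


def parse_rule(text: str) -> Html2XmlRule:
    """Parse one rule written as '<header> IS <body> #num'."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    path, tag, num = match.groups()
    return Html2XmlRule(tuple(path.split(".")), tag, int(num) if num else None)
```

For instance, `parse_rule("<table.tr.td> IS <my-xml-tag> #3")` yields a rule whose header is the path `table`, `tr`, `td`.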
SEMANTIC GENERATORS




       HTML2XML: <table.tr.td> IS <my-xml-tag>

    Tags (<table>, <tr>, <td>, <A href…>, etc.)
    will be removed; only the data will be extracted

       #num: provides the number of cells to be processed

       <my-xml-tag> Madrid </my-xml-tag>
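The effect of such a rule on a table can be sketched with Python's standard `html.parser`. This is an illustration only, not the WebMantic implementation: the `RuleApplier` class and `apply_rule` function are hypothetical names, the input is assumed to be well-formed (as Tidy would produce), and `#num` is read as "process at most this many cells", which is one interpretation of the slide.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

TABLE_TAGS = ("table", "tr", "td", "th")


class RuleApplier(HTMLParser):
    """Collects the text of each cell whose tag path ends with `header`."""

    def __init__(self, header, xml_tag, num=None):
        super().__init__()
        self.header = list(header)
        self.xml_tag = xml_tag
        self.num = num        # assumed meaning: maximum number of cells
        self.stack = []       # currently open table-related tags
        self.buffer = None    # text pieces of the cell being read, or None
        self.out = []         # generated XML fragments

    def handle_starttag(self, tag, attrs):
        if tag in TABLE_TAGS:
            self.stack.append(tag)
            matches = self.stack[-len(self.header):] == self.header
            if matches and (self.num is None or len(self.out) < self.num):
                self.buffer = []
        # any other tag (e.g. <a href=...>) is dropped; its text is kept

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            if self.buffer is not None and tag == self.header[-1]:
                text = " ".join("".join(self.buffer).split())
                self.out.append(f"<{self.xml_tag}>{escape(text)}</{self.xml_tag}>")
                self.buffer = None
            self.stack.pop()


def apply_rule(html, header, xml_tag, num=None):
    applier = RuleApplier(header, xml_tag, num)
    applier.feed(html)
    applier.close()
    return applier.out
```

Applied to a row whose first cell wraps "Madrid" in a link, with `num=1`, the sketch keeps only the cell text and wraps it in the semantic tag.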
SEMANTIC GENERATORS




                      Semantic generator
THE WEBMANTIC ARCHITECTURE
 WebMantic allows:

    Automatic generation of Sg

    Generalization of HTML2XML rules

    Guidance of the extraction process

    Automatic generation of wrappers
WEBMANTIC ARCHITECTURE
 Tidy HTML parser (http://tidy.sourceforge.net). It
  translates HTML documents into well-formed
  HTML documents
 The HTML Tidy program (HTML parser and
  pretty printer) has been integrated as the first
  preprocessing module in WebMantic.


 Tree generator module. Once the HTML page is
  preprocessed by the Tidy parser, a tree representation
  of the structures stored in the page is built
 In this representation any table or list tag
  generates a node, and the leaves of the tree are: cells
  for tables (th, td, tr) or items for lists (li, lo)
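The tree described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the WebMantic module: `Node`, `TreeBuilder`, and `build_tree` are hypothetical names, `table`, `ul`, and `ol` are assumed to be the structure tags, `td`, `th`, and `li` are assumed to be the leaf tags, and the input is assumed to be well-formed (already cleaned by Tidy).

```python
from html.parser import HTMLParser

STRUCTURE_TAGS = {"table", "ul", "ol"}   # assumed: each opens a tree node
LEAF_TAGS = {"td", "th", "li"}           # assumed: their text becomes a leaf


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []   # nested Node objects or leaf strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.open_nodes = [self.root]   # path from the root to the current node
        self.leaf_text = None           # text of the leaf being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURE_TAGS:
            node = Node(tag)
            self.open_nodes[-1].children.append(node)
            self.open_nodes.append(node)
        elif tag in LEAF_TAGS:
            self.leaf_text = []

    def handle_data(self, data):
        if self.leaf_text is not None:
            self.leaf_text.append(data)

    def handle_endtag(self, tag):
        if tag in STRUCTURE_TAGS:
            if len(self.open_nodes) > 1 and self.open_nodes[-1].tag == tag:
                self.open_nodes.pop()
        elif tag in LEAF_TAGS and self.leaf_text is not None:
            text = " ".join("".join(self.leaf_text).split())
            if text:
                self.open_nodes[-1].children.append(text)
            self.leaf_text = None


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A list nested inside a table cell becomes a child node of the table node, so the depth of the tree reflects how deeply the page's structures are nested.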
WEBMANTIC ARCHITECTURE
    HTML2XML rule generator module. The tree
     representation obtained is used by this module
     to generate a set of rules (Sg) that represent
     the information to be translated




                     HTML2XML rules
WEBMANTIC ARCHITECTURE
   Subsumption module. The previous module generates a
    rule for each structure to be translated. However,
    some of those rules can be generalized if the
    XML-tag represents the same concept (e.g. the
    rules in the previous example that represent the
    concepts of <data-record> and <country>)
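The slides do not spell out the subsumption algorithm, so the following is only one possible reading, sketched in Python. It assumes that a rule is a `(header, xml_tag, num)` tuple and that rules sharing both the same header and the same XML tag can be collapsed into a single rule whose `num` is the sum of the originals; the `subsume` function and this merging criterion are hypothetical.

```python
def subsume(rules):
    """Merge rules that share header and XML tag (assumed criterion).

    rules: iterable of (header, xml_tag, num) tuples.
    Returns the generalized rules in order of first appearance.
    """
    merged = {}
    for header, xml_tag, num in rules:
        key = (header, xml_tag)
        # Assumption: the cell counts of merged rules are added together.
        merged[key] = merged.get(key, 0) + num
    return [(header, xml_tag, num) for (header, xml_tag), num in merged.items()]
```

Under this reading, two single-cell rules for the same tag become one rule covering two cells, while rules for different tags are left apart.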
WEBMANTIC ARCHITECTURE
   XML Parser module. This module receives both
    the Semantic Generator obtained in the previous
    module and the (well-formed) HTML document




           Semantic Generator
           Yahoo! Weather




           XML Parser
A PRACTICAL EXAMPLE
WEBMANTIC GUI




            WebMantic’s GUI
WEBMANTIC GUI




                www.citypopulation.de
WEBMANTIC GUI




                www.citypopulation.de
WEBMANTIC GUI




           First tables & list are rejected
WEBMANTIC GUI




           First data-table is rejected
WEBMANTIC GUI




                data-table target
WEBMANTIC GUI




       XML tags generation (user interaction)
WEBMANTIC GUI




        XML tags & HTML2XML rules
WEBMANTIC HTML PROCESSING




               Tree generated from HTML document




    Relation between the HTML tree and the XML-tags provided by the user
WEBMANTIC HTML PROCESSING




                     HTML2XML rules




        Semantic Generator: HTML2XML subsumed rules
EXPERIMENTAL RESULTS
   Experimental tests (Web sites used):
     Population (www.citypopulation.de)
EXPERIMENTAL RESULTS
   Experimental tests (Web sites used):
     Yahoo Weather (weather.yahoo.com)
EXPERIMENTAL RESULTS
   Experimental tests (Web sites used):
     Iberia airlines (www.iberia.com)
EXPERIMENTAL RESULTS
   Several parameters have been evaluated:

    1.   Number of pages tested from each Web site

    2.   Number of accessible structures

    3.   Maximum nested structure

    4.   Average number of HTML2XML rules for each Semantic
         Generator (Sg), once the subsumption process has
         finished

    5.   Average time (seconds) to generate the Sg (Time Sg)

    6.   Average time (seconds) to translate from HTML to
         XML for the set of training pages (transformation time)
EXPERIMENTAL RESULTS
CONCLUSIONS
CONCLUSIONS AND FUTURE WORK
  Conclusions:


      We define a technique which is able to provide a
       semantic representation (using XML-tags) to semi-
       structured (tables and lists) Web pages through a set of
       rules (encapsulated in a Semantic Generator)
      Rules are created and automatically generalized
      These rules can be used to preprocess Web pages with a
       similar structure, and convert them into XML
       documents with semantic tags
      These can be integrated into information agents
CONCLUSIONS AND FUTURE WORK
 In the near future:

     Other Web technologies such as DOM

     Ontologies

     Machine learning algorithms to automatically
      learn new (similar) Web pages

     Statistical knowledge extraction
