Slide 1: 2-Layered HMMs for Search Interface Segmentation
Ritu Khare
(Under the supervision of Dr. Yuan An, Assistant Professor, iSchool)
Slide 2: Order of Presentation
- Background: Deep Web
- What is Search Interface Understanding?
- What is Interface Segmentation?
- Why is Segmentation Challenging?
- Our Approach for Segmentation
- Interface Representation
- HMM: The Artificial Designer
- 2-Layered Approach: Architecture
- Experimentation Parameters
- Results
- Contributions
- Future Work
- References
Slide 3: Background: Deep Web
What is the deep Web: the data that exists on the Web but is not returned by search engines through traditional crawling and indexing. The primary way to access this data is by filling out HTML forms on search interfaces.
Characteristics [6]: a large proportion of structured databases, a diversity of domains, and a growing scale.
Researchers have many goals for the deep Web:
- design intra-domain meta-search engines [22, 8, 15, 5, 21]
- increase content visibility on existing search engines [17, 12]
- derive ontologies from search interfaces [1]
A prerequisite to attaining these goals is an understanding of the search interfaces (slide 4). In this project, we propose an approach to address the segmentation portion (slide 5) of the problem of search interface understanding.
Slide 4: Background: What is Search Interface Understanding?
Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves 4 stages:
1. Representation: A suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radio button, checkbox, file input) that exists inside an HTML form.
2. Parsing: Components are parsed into a suitable structure.
3. Segmentation: The interface components are assigned semantic labels, and related components are grouped together. Questions such as "Which surrounding text is associated with which form element?" (in figure 2, "Gene ID" is associated with the textbox placed next to it) are also answered in this stage.
4. Segment-processing: Additional information about each segment component, such as domain, constraints, and data type, is extracted.
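To make the notion of an interface component concrete, the following sketch collects the texts and form elements inside an HTML form in document order. It is purely illustrative and uses only the Python standard library; it is not the parser used in this work, which builds DOM-trees [3] (see slide 14). The ComponentCollector class and the sample snippet are hypothetical.

```python
# Minimal sketch: collect text and form-element components inside <form> tags,
# in document order, using the standard-library HTML parser.
from html.parser import HTMLParser

FORM_ELEMENTS = {"input", "textarea", "select"}

class ComponentCollector(HTMLParser):
    """Records components as ('text', value) or ('element', 'tag:type') tuples."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.components = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in FORM_ELEMENTS:
            attr_map = dict(attrs)
            kind = attr_map.get("type", tag)   # 'text', 'radio', 'checkbox', 'file', ...
            self.components.append(("element", f"{tag}:{kind}"))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_form and text:
            self.components.append(("text", text))

# Hypothetical snippet resembling the "Gene ID" example from the slides.
html = '<form>Gene ID: <input type="text" name="gene_id"></form>'
parser = ComponentCollector()
parser.feed(html)
print(parser.components)   # [('text', 'Gene ID:'), ('element', 'input:text')]
```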
Slide 5: What is Interface Segmentation?
This project focuses on segmentation, the 3rd stage of this process. The figure shows a segmented interface in which related components are grouped together. The left segment has 7 components. The right segment has 4 components ("cM Position:", a selection list, a textbox, and the example text "e.g., 10.0-40.0").
Slide 6: Why is Segmentation Challenging?
From a user's (or designer's) standpoint: by looking at the visual arrangement of components, and based on past experience, the user draws a logical boundary around related components because they appear to belong to the same atomic query. A machine, on the other hand, is unable to "see" a segment for the following reasons:
- Components that are visually close to each other might be located far apart in the HTML source code.
- A machine does not implicitly have any search experience that can be leveraged to identify a segment boundary.
This project investigates whether a machine can "learn" how to understand and segment an interface. Existing works have two shortcomings:
- Some [9, 13, 17] do not group all related components together, i.e., they do not create complete segments.
- Others [23, 7] use rules and heuristics to segment a search interface; such techniques have problems handling scalability and heterogeneity [10].
Slide 7: Our Approach for Segmentation
We incorporate the first-hand, implicit knowledge with which a human designer is assumed to have designed an interface. This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9's slides for an introduction to HMMs). We view segmentation as a two-fold problem:
- identification of the boundaries of logical attributes (slide 9), and
- assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.
Slide 8: Interface Representation
In the figure, each component of the lower segment is marked with a label, which we term a semantic label. The semantic label of a component denotes the meaning of that component from a user's or designer's standpoint.
(Figure labels: Search Entity; Logical Attribute; Logical Attribute; Operand; Operator; Attribute-name)
Slide 9: Interface Representation (continued)
Attribute-name: denotes a criterion available for searching a particular entity; e.g., the entity "Genes" can be searched by "Gene ID" and by "Gene Name".
Operand: an attribute-name is usually associated with operand(s), the value(s) entered by the user that are matched against the corresponding field value(s) in the underlying database.
Operator: the user may also be given the option of specifying an operator that further qualifies an operand.
Filling out an HTML form is similar to writing SQL queries. Assuming the underlying database table is named "Gene", the SQL queries for the figure would be:
SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344';
SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';
Logical Attribute: the predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components into a composite semantic label called a logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
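The form-to-SQL analogy can be made concrete with a small sketch. The dictionary fields, the column name, and the to_sql helper below are hypothetical illustrations, not part of the described method; the sketch only shows how a decoded logical attribute could be turned into a WHERE predicate like the ones above.

```python
# Hypothetical representation of one decoded segment (logical attribute).
logical_attribute = {
    "attribute_name": "Gene_Name",   # column name assumed to be derived from the label text
    "operator": "LIKE",              # e.g., chosen via a radio-button group (exact vs. contains)
    "operand": "maggie",             # value typed by the user into the textbox
}

def to_sql(table, attr):
    """Build a parameterized query; the operand is passed separately, not interpolated."""
    predicate = f"{attr['attribute_name']} {attr['operator']} ?"
    return f"SELECT * FROM {table} WHERE {predicate}", (attr["operand"],)

query, params = to_sql("Gene", logical_attribute)
print(query, params)   # SELECT * FROM Gene WHERE Gene_Name LIKE ? ('maggie',)
```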
Slide 10: HMM: The Artificial Designer
We assume that an HMM can act like a human designer who has the ability to design an interface using acquired knowledge, and to determine (decode) the segment boundaries and semantic labels of components. The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind.
(Figure: Knowledge of Semantic Labels + Bag of Components -> Designing -> Search Interface; Search Interface -> 2-Layered HMM (Artificial Designer) -> Decoding -> Segments & Tagged Components)
Slide 11: HMM: The Artificial Designer (continued)
While the components are observable, their semantic roles are hidden from a machine. One semantic label following another is analogous to a transition between HMM states. In the figure, ovals = states (semantic labels) and rectangles = emitted symbols (components). The designing ability is provided by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for the decoding process, i.e., for explaining the design of a given search interface.
(Figure: states Attribute-Name, Operand, Operator, Attribute-Name, Operand; emitted symbols Text ("Gene ID"), Textbox, Text ("Gene Name"), Radio-Button Group, Textbox)
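The "designing" view of the HMM can be sketched directly: hidden states are semantic labels, emitted symbols are component types, and generating an interface amounts to walking the state chain and emitting one component per step. The state set, symbol set, and every probability below are hypothetical illustrations; in this work the parameters are learned from training data (slide 14).

```python
# Minimal generative sketch of the "artificial designer" HMM (illustrative numbers only).
import random

start = {"attribute_name": 1.0, "operand": 0.0, "operator": 0.0}
trans = {
    "attribute_name": {"attribute_name": 0.05, "operand": 0.60, "operator": 0.35},
    "operand":        {"attribute_name": 0.70, "operand": 0.20, "operator": 0.10},
    "operator":       {"attribute_name": 0.10, "operand": 0.85, "operator": 0.05},
}
emit = {
    "attribute_name": {"text": 0.96, "textbox": 0.01, "selection_list": 0.01,
                       "radiobutton_group": 0.01, "checkbox": 0.01},
    "operand":        {"text": 0.05, "textbox": 0.60, "selection_list": 0.25,
                       "radiobutton_group": 0.05, "checkbox": 0.05},
    "operator":       {"text": 0.05, "textbox": 0.05, "selection_list": 0.40,
                       "radiobutton_group": 0.40, "checkbox": 0.10},
}

def sample(dist):
    """Draw one key from a {key: probability} dictionary."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r <= acc:
            return key
    return key   # guard against floating-point round-off

def design(n_components):
    """'Design' an interface: walk the hidden states and emit one component per step."""
    state = sample(start)
    sequence = []
    for _ in range(n_components):
        sequence.append((state, sample(emit[state])))
        state = sample(trans[state])
    return sequence

print(design(5))   # e.g. [('attribute_name', 'text'), ('operand', 'textbox'), ...]
```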
Slide 12: 2-Layered HMM
The decoding problem we address in this paper is two-fold, involving segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with 2 layers. The first layer, the T-HMM, tags each component with the appropriate semantic label (attribute-name, operator, or operand). The second layer, the S-HMM, segments the interface into logical attributes.
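A self-contained sketch of how the two layers could be chained at decoding time is shown below, assuming, as in layered HMMs [14], that the label sequence produced by the first layer serves as the observation sequence of the second. The boundary states ("begin_attribute", "inside_attribute") and all probability tables are hypothetical placeholders for illustration; the actual state sets and trained parameters of the T-HMM and S-HMM may differ.

```python
# Minimal two-layer decoding sketch: the T-HMM maps component types to semantic labels,
# the S-HMM maps those labels to hypothetical segment-boundary states.
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state sequence for an observation sequence (log-space Viterbi)."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")
    V = [{s: logp(start.get(s, 0)) + logp(emit[s].get(obs[0], 0)) for s in states}]
    backptr = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda q: V[-1][q] + logp(trans[q].get(s, 0)))
            col[s] = V[-1][prev] + logp(trans[prev].get(s, 0)) + logp(emit[s].get(o, 0))
            ptr[s] = prev
        V.append(col)
        backptr.append(ptr)
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# Hypothetical T-HMM parameters: component types -> semantic labels.
t_states = ["attribute_name", "operand", "operator"]
t_start = {"attribute_name": 0.9, "operand": 0.05, "operator": 0.05}
t_trans = {"attribute_name": {"attribute_name": 0.1, "operand": 0.6, "operator": 0.3},
           "operand":        {"attribute_name": 0.7, "operand": 0.2, "operator": 0.1},
           "operator":       {"attribute_name": 0.1, "operand": 0.8, "operator": 0.1}}
t_emit = {"attribute_name": {"text": 0.9, "textbox": 0.05, "radiobutton_group": 0.05},
          "operand":        {"text": 0.1, "textbox": 0.8,  "radiobutton_group": 0.1},
          "operator":       {"text": 0.1, "textbox": 0.1,  "radiobutton_group": 0.8}}

# Hypothetical S-HMM parameters: semantic labels -> segment-boundary states.
s_states = ["begin_attribute", "inside_attribute"]
s_start = {"begin_attribute": 1.0, "inside_attribute": 0.0}
s_trans = {"begin_attribute":  {"begin_attribute": 0.2, "inside_attribute": 0.8},
           "inside_attribute": {"begin_attribute": 0.4, "inside_attribute": 0.6}}
s_emit = {"begin_attribute":  {"attribute_name": 0.9, "operand": 0.05, "operator": 0.05},
          "inside_attribute": {"attribute_name": 0.1, "operand": 0.5,  "operator": 0.4}}

components = ["text", "textbox", "text", "radiobutton_group", "textbox"]
labels = viterbi(components, t_states, t_start, t_trans, t_emit)      # layer 1
segments = viterbi(labels, s_states, s_start, s_trans, s_emit)        # layer 2
print(list(zip(components, labels, segments)))
```

With the placeholder numbers above, the five components decode into two logical attributes, each beginning at a "begin_attribute" state.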
Slide 13: 2-Layered Approach: Architecture
(Architecture diagram: training and test interfaces first go through DOM-tree parsing.
T-HMM training: training interfaces with manually tagged state sequences -> T-HMM specs.
T-HMM testing: test interfaces -> predicted state sequences.
S-HMM training: manually tagged state sequences -> S-HMM specs.
S-HMM testing: test interfaces -> predicted state sequences.)
Slide 14: Experimentation Parameters
Data set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/
Parsing: DOM-trees [3] of components; trees were traversed in depth-first order.
Training and testing data: the examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each with 190 training and 10 testing examples.
Training and testing algorithms: in both layers, training and testing were performed using the maximum likelihood method and the Viterbi algorithm, respectively.
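A minimal sketch of this setup is given below, assuming supervised maximum-likelihood estimation (probabilities obtained by counting over manually tagged state sequences) and a random split into 20 folds of 10 interfaces each, giving 190 training and 10 testing examples per experiment. Smoothing and other implementation details are omitted; the tagged example and interface placeholders are hypothetical.

```python
# Maximum-likelihood HMM training from tagged sequences, plus the 20-fold split.
import random
from collections import Counter, defaultdict

def normalize(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def train_mle(tagged_sequences):
    """tagged_sequences: list of [(state, symbol), ...], one list per interface."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        start[seq[0][0]] += 1
        for state, symbol in seq:
            emit[state][symbol] += 1
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

def twenty_fold_splits(examples, folds=20, seed=0):
    """Random partition into equal folds; yield one (train, test) pair per experiment."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // folds
    for i in range(folds):
        test = shuffled[i * size:(i + 1) * size]
        train = shuffled[:i * size] + shuffled[(i + 1) * size:]
        yield train, test

# Hypothetical tagged interface: [(semantic label, component type), ...]
example = [("attribute_name", "text"), ("operand", "textbox"),
           ("attribute_name", "text"), ("operator", "radiobutton_group"),
           ("operand", "textbox")]
start, trans, emit = train_mle([example])
print(start, trans, emit)

interfaces = [f"interface_{i}" for i in range(200)]   # placeholders for the 200 NAR interfaces
train, test = next(twenty_fold_splits(interfaces))
print(len(train), len(test))   # 190 10
```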
Slide 15: Results
Slide 16: Contributions
- We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups components belonging to the same logical attribute together.
- We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces.
- The interface is represented in terms of the underlying database, which helped in extracting database querying semantics.
- We tested our method on a less-explored domain (biology) and found promising results.
Slide 17: Future Work
- Recover the schema of deep Web databases by extracting finer details, such as the data types and constraints of logical attributes.
- To do justice to the balanced domain distribution of the deep Web [6], test this method on interfaces from other less-explored domains.
- To improve the degree of automation, investigate the use of the Baum-Welch training algorithm.
- To minimize zero emission probabilities, investigate the use of the Synset-HMM [20].
Slide 18: References
1. Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.
2. Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida, 31-36.
3. Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.
4. Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.
5. He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, 217-228.
6. He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.
7. He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.
8. He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-Integrator. The VLDB Journal, 13(3), 256-273.
9. Kalijuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong.
10. Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.
11. Kushmerick, N. (2003). Learning to invoke web forms. On the Move to Meaningful Internet Systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.
12. Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
Slide 19: References (continued)
13. Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, 1(1), 684-694.
14. Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.
15. Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia, 46-55.
16. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
17. Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, 129-138.
18. Russell, S. J., & Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
19. Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida, 37-42.
20. Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, Simulation and Optimization of Complex Processes (pp. 575-585). Springer Berlin Heidelberg.
21. Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.
22. Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 95-106.
23. Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 107-118.
24. Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence 2006 (WI 2006), Hong Kong, China, 709-718.
Slide 20: Thank You
Questions, Comments, Ideas?
