
Abstract— The volume of information on the web has been growing
rapidly in recent years. This voluminous data poses intricate
problems for information retrieval and knowledge management.
Because web data exists in several forms, knowledge management
on the web is a challenging task. The 'Semantic Web' concept can
be used to make web content understandable to machines, so that
intelligent services can be offered efficiently through a
meaningful knowledge representation. Data retrieval on the
traditional web is focused on 'page ranking' techniques, whereas
in the semantic web retrieval is based on 'concept based
learning'. The proposed work aims to develop a new framework for
automatic generation of ontology and RDF for real-time web data,
extracted from multiple repositories by tracing their URIs, and
for text documents. An improved inverted indexing technique is
applied for ontology generation, and Turtle notation is used for
RDF representation. A program is written to validate the data
extracted from multiple repositories by removing unwanted data
and considering only the document section of the web page.
Index Terms— Semantic Web, Resource Description
Framework, Ontology, Improved inverted indexing technique,
Knowledge management.
I. INTRODUCTION
The World Wide Web (WWW) is a global information repository in
which documents and other web resources are identified by
Uniform Resource Locators and interlinked by hypertext links.
Search engines are used to retrieve information from the web.
Data overload is the most pressing issue for existing systems.
The web has evolved through versions such as Web 1.0 and Web
2.0. In this series, Web 3.0, referred to as the semantic web
[1], has evolved to support knowledge management across the
globe. Search engines should be enriched with semantic web
capabilities that analyze webpage content and provide more
relevant results for the user query. Semantic web standards
include the Resource Description Framework (RDF), web ontology,
RDF Schema and the Rule Interchange Format (RIF) for handling
data. RDF provides a conceptual description of information for
representing web resources and can be written in syntaxes such
as Turtle and N-Triples. RDF describes data on the Web in graph
form [2]. Ontologies consist of a finite set of terms,
relationships, constraints and axioms [3]. Ontologies have
proven to be useful for effective knowledge modeling and
information retrieval. The remainder of the paper is arranged as
follows: Section II presents the related work. The proposed work
and its methodology are discussed in Sections III and IV. The
results are presented in Section V. Conclusions are given in
Section VI.
II. RELATED WORK
M.S.P. Babu et al. [4] provided an overview of some of the
semantic search engines that yield a unique search experience
for users. Wilkinson et al. [5] proposed an information
retrieval system using document structure. Amel Grissa Touzi et
al. [6] suggested the Fuzzy Ontology of Data Mining (FODM) for
automated generation of ontologies in the domain of data mining.
Amira Aloui et al. [7] implemented a plugin named "FO-FQ Tab
plug-in", which can be integrated with the Protégé editor for
building fuzzy ontologies from large databases. To overcome the
drawbacks of the existing system for accessing related science
information, M.S.P. Babu et al. [8] proposed a new framework for
automatic generation of ontology and RDF for real-time web data.
Tahani Alsubait et al. [9] developed an e-learning suite with a
set of questions designed using ontological representation.
A.H.M. Rupasingha et al. [10] suggested that the performance of
ontology generation always depends on the specificity of the
terms. Seongwook Youn et al. [11] discussed the pros and cons of
tools such as Protégé 2000, OilEd, Apollo, OntoLingua, Onto
Edit, webODE, KAON, ICOM, DEO and webOnto that are used for
ontology creation. Kgotatso Desmond Mogotlane et al. [12]
presented a comparative study of Protégé plugins such as DB2OWL
and Data Master.
An Implementation of a New Framework for Automatic Generation of
Ontology and RDF to Real Time Web and Journal Data
Sudeepthi Govathoti¹, M.S. Prasad Babu²
¹Research Scholar, Andhra University, Visakhapatnam & Assistant Professor, Anurag Group of Institutions, Hyderabad
²Professor, Department of CS&SE, Andhra University, Visakhapatnam
sudeepthicse@cvsr.ac.in¹, profmspbabu@gmail.com²
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
III. PROPOSED WORK
Semantic web capabilities such as RDF and ontology are applied
to enrich the knowledge. The proposed work is an implementation
of the framework proposed by the authors in [8]. The framework
is designed with reference to the semantic web stack. It is
carried out in two phases, namely the data extraction phase and
the data representation phase. In the data extraction phase, web
scraping is performed using an HTML parsing technique, with a
sample search query given as input to multiple repositories. DOM
parsing and HTML parsing techniques are applied to validate the
data retrieved from multiple repositories by considering only
the document section of the webpage. Extensible Markup Language
(XML) is the base for semantic web representations; the
validated information is converted into semi-structured notation
by using an XSD declaration of the DOM tree and passed as input
to the next layers of the proposed framework. The XML notation
is given as input to the data representation phase. RDF notation
is generated and represented in graphical form using the
Graphviz tool. A textual representation of the RDF graph is
provided using Turtle, the Terse RDF Triple Language. The
improved inverted indexing technique is applied for the
ontological representation of words, excluding stop words.
Figure 1: Proposed framework
IV. METHODOLOGY
Implementation of the framework proposed in Section III is
carried out in two phases, namely the data extraction and data
representation phases. The details are given below.
Phase 1: Data extraction
The data extraction phase performs web scraping from multiple
repositories and stores the scraped data in the database. The
phase is subdivided into three steps: web scraping, data
validation and XML conversion. The scraped data is validated by
removing unwanted data and considering only the document section
of the web page. After validation, the data stored in table
format in the database is converted into semi-structured (XML)
notation and passed as input to the data representation phase.
Step 1: Web Scraping
Web scraping, also referred to as screen scraping or web
harvesting, is used to fetch and extract data from a web
document using HTML parsing techniques. Here web pages are
crawled and the content of each page is extracted. The data in
the web page comprises three sections, namely the web page
statistics bar, the document section and the descriptive
section. The three items are stored as three different
attributes in a database table. The HTML parsing technique used
for scraping data from web documents is shown in Fig. 2:
Figure 2: Content of the Web page
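As an illustration of the HTML parsing idea in Step 1, the following is a minimal Python sketch and not the authors' PHP implementation. It assumes each result is an anchor element whose text is the page title; the class name `ResultLinkParser`, the function `scrape_links` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (title, url) pairs from every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.results = []   # finished (title, url) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            self.results.append((title, self._href))
            self._href = None


def scrape_links(html):
    """Return the (title, url) pairs found in an HTML string."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical result page; a real page would be fetched over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div id="stats">About 2 results</div>
  <div id="document">
    <a href="http://example.org/rdf">RDF   Primer</a>
    <a href="http://example.org/owl">OWL <b>Overview</b></a>
  </div>
</body></html>
"""

print(scrape_links(SAMPLE_PAGE))
```

The sketch parses a string held in memory so that it runs without network access; fetching the page and storing the pairs in a database table are left out.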
Step 2: Data Validation
In the data validation step, the data collected in step 1 is
validated using HTML and DOM parsing techniques. Unwanted data
is removed and the necessary portion of the URLs is retained. In
this step the web page statistics bar and descriptive sections
are removed from the database table. The validated data is
stored in a database. The document section displays the results
for the given search query in the form of page title, URL and
snippet (description).
Figure 3: Content of the Web page after Data validation
The descriptive section provides the Wikipedia information
about the input query. The data validation process is carried
out by considering only the document section of the web page,
as shown in Fig. 3.
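The validation rule in Step 2 (keep the document section, drop the statistics bar and descriptive section) can be sketched as a database filter. This is only an illustration under assumptions: the table layout with a `section` column, the table names `scraped` and `validated`, and the sample rows are hypothetical, and SQLite stands in for the MySQL database used by the authors.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory table holding scraped rows from all three sections."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scraped (section TEXT, title TEXT, url TEXT, snippet TEXT)"
    )
    conn.executemany(
        "INSERT INTO scraped VALUES (?, ?, ?, ?)",
        [
            ("stats", None, None, "About 2 results"),
            ("document", "RDF Primer", "http://example.org/rdf", "Intro to RDF"),
            ("document", "OWL Overview", "http://example.org/owl", "Intro to OWL"),
            ("descriptive", "RDF", None, "Encyclopedia summary"),
        ],
    )
    return conn


def validate(conn):
    """Copy only document-section rows that carry a URL into a new table."""
    conn.execute("CREATE TABLE validated (title TEXT, url TEXT, snippet TEXT)")
    conn.execute(
        "INSERT INTO validated "
        "SELECT title, url, snippet FROM scraped "
        "WHERE section = 'document' AND url IS NOT NULL"
    )
    return conn.execute(
        "SELECT title, url, snippet FROM validated ORDER BY url"
    ).fetchall()


for row in validate(build_sample_db()):
    print(row)
```

Writing the validated rows to a separate table keeps the raw scrape available for the wanted versus unwanted comparison reported in the results.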
Step 3: XML Conversion
In the XML conversion step, the data validated in step 2 is
converted into a DOM tree using XML Schema Definition (XSD). The
conversion is performed on the validated data stored in the
database by mapping each individual field (attribute) to the
namespace convention. The XSD declaration of the DOM tree has a
hierarchical structure with a root node representing the search
keyword and three child nodes representing Title, URL and
Description respectively. The XSD declaration of the DOM tree,
with an example, is shown in Fig. 4:
Figure 4: XSD Declaration of DOM tree
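The tree shape described in Step 3 can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `search` and `result` and the `keyword` attribute are assumptions made for this illustration; only Title, URL and Description come from the text, and validation against an XSD is not shown.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(keyword, rows):
    """Serialize validated (title, url, description) rows as an XML string."""
    # Root node stands for the search keyword (element name is hypothetical).
    root = ET.Element("search", {"keyword": keyword})
    for title, url, description in rows:
        result = ET.SubElement(root, "result")
        ET.SubElement(result, "Title").text = title
        ET.SubElement(result, "URL").text = url
        ET.SubElement(result, "Description").text = description
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(
    "semantic web",
    [("RDF Primer", "http://example.org/rdf?a=1&b=2", "Intro to RDF")],
)
print(xml_text)
```

ElementTree escapes reserved characters such as `&` in text content, so URLs with query strings still give well-formed XML.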
Phase 2: Data representation
The data produced by steps 1, 2 and 3 is maintained in XML
format and given as input to the data representation phase. In
the Semantic Web architecture, data representation relies mainly
on RDF-ization and ontology generation. Hence the data
representation phase is subdivided into two steps, RDF-ization
and ontology generation, which are explained in steps 4 and 5
respectively.
Step 4: RDF-ization
The Resource Description Framework (RDF) is the basic building
block of the semantic web, promoting conceptual modeling of web
data [13]. The RDF-ization process is carried out using Turtle
notation and the Graphviz tool. In this step the XML data stored
in the extraction phase is given as input to RDF-ization. The
RDF notation is visualized as an RDF graph using the Graphviz
tool. Decomposition of a tuple creates a new blank node
corresponding to the row, and a new set of triples is obtained.
Each tuple in a relational database is decomposed into RDF
triples: the title is taken as the subject, the URL as the
predicate and the description as the object. A node can be a URI
reference, a literal or a blank node. The graph in Fig. 5 is an
example of the RDF-ization process for a semantic net.
Figure 5: RDF Triple
The triple is represented in <subject, predicate, object>
format by exploring the relationships among the nodes [14]. The
XML produced in step 3 is represented in RDF syntax using Turtle
notation, following the convention specified in
"http://www.w3.org/1999/02/22-rdf-syntax-ns#", as shown in
Fig. 6:
Figure 6: RDF-ization
The <rdf:Description> element provides the description of the
resource identified by the rdf:about attribute. The tags
<rbss:title>, <rbss:keyword> and <rbss:URL> are properties of
the identified resource. The RDF represented in Turtle notation
is visualized in graphical form using the Graphviz tool, which
is open-source software for generating graphs.
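A small Python sketch of how validated rows might be written out in Turtle is given here. It is not the authors' generator. It assumes the result URL identifies the resource; the namespace IRI `http://example.org/rbss#` and the property `rbss:description` are hypothetical, while `rbss:title` and `rbss:keyword` follow the tags named above. URLs are assumed to be valid IRIs that need no escaping.

```python
RBSS = "http://example.org/rbss#"  # hypothetical namespace for the rbss prefix


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping reserved characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_turtle(keyword, rows):
    """Emit one Turtle subject block per (title, url, description) row."""
    lines = [f"@prefix rbss: <{RBSS}> .", ""]
    for title, url, description in rows:
        lines.append(f"<{url}>")
        lines.append(f"    rbss:title {turtle_literal(title)} ;")
        lines.append(f"    rbss:keyword {turtle_literal(keyword)} ;")
        lines.append(f"    rbss:description {turtle_literal(description)} .")
        lines.append("")
    return "\n".join(lines)


print(rows_to_turtle(
    "semantic web",
    [("RDF Primer", "http://example.org/rdf", "Intro to RDF")],
))
```

Each block groups the properties of one resource with `;` separators and closes with `.`, which is the predicate-list shorthand that makes Turtle more compact than N-Triples.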
Step 5: Ontology Generation
An ontology is defined as a formal specification of a
conceptualization of the domain of interest. In the ontology
generation step, the RDF notation obtained from step 4 is used
to create a vocabulary of words using the improved inverted
indexing algorithm. The algorithm is applied to real-time web
data collected from multiple repositories and to text
documents. The words in the description tags are extracted,
excluding stop words, and the frequency count (term frequency,
TF) of each word is maintained. The improved inverted indexing
algorithm is as follows:
Algorithm: Improved inverted indexing
Input: Database D = {T1, T2, …, Tn}, storage database
Output: Attributes {A1, A2, …, An}, where Ai, for i = 1, 2, …, n,
represent the ontology vocabulary.
Parameters: Swrdk = array of stop words
attsLq = snippet attribute
Wordsk = words stemmed from the snippet attribute
attsLf = word frequencies after stemming
attsL = ontology along with the frequency counts
1. Swrdk = {};
2. for i = 1; i ≤ n; i++ do
3. attsLq = Query Coverage(D, i);
4. Wordsk = Words Separate(attsLq, Swrdk);
5. attsLf = Words Usage frequency(D, attsLq, Wordsk);
6. attsL = attsLq ∪ attsLf
7. f = highest_freq(attsL)
8. if (f < freq(attsL)) then
9. sort(Wordsk, freq(attsL))
10. end if
11. end for
12. return (Wordsk, freq(attsL))
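The steps above can be approximated by the following Python sketch. It is a simplified illustration, not the authors' algorithm: stemming is omitted, the stop-word list is a small hypothetical sample, and the names `tokenize` and `build_index` are introduced only for this example. It keeps a term frequency and a posting list per word and returns the vocabulary sorted by descending frequency.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(snippets):
    """Return (vocabulary sorted by frequency, posting lists per word)."""
    frequency = Counter()
    postings = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for word in tokenize(snippet):
            frequency[word] += 1
            postings[word].add(doc_id)
    # Highest frequency first; ties are broken alphabetically.
    vocabulary = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return vocabulary, {w: sorted(ids) for w, ids in postings.items()}


SNIPPETS = [
    "RDF is a framework for the web",
    "Ontology and RDF on the semantic web",
    "The semantic web",
]

vocabulary, postings = build_index(SNIPPETS)
print(vocabulary)
print(postings["rdf"])
```

The first entry of `vocabulary` is the highest-frequency word, which the paper treats as a frequent search term.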
V. RESULTS AND DISCUSSION
The Semantic Web stack proposed by Tim Berners-Lee [15] is
implemented using the framework proposed by the authors in
Section III. It is implemented in PHP version 5 (an open-source
scripting language) and MySQL version 5 (an open-source
relational database management system). It is tested on a
dataset comprising sample search keywords. Response time is the
amount of time that elapses from receipt of the query until the
results are displayed to the user. Response time can be measured
on the server side or the client side, as shown in Fig. 7.
Figure 7: Response Time
Throughput is defined as the number of queries executed per
second (qps). Throughput and response time are observed for the
set of retrieval operations with respect to the page load
times. The performance of the implemented framework with
respect to throughput is shown in Fig. 8.
Figure 8: Throughput
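The two metrics defined above can be measured as in the following Python sketch. It assumes measurement in the same process as the query handler, which corresponds to server-side timing; `measure` and the stand-in `fake_search` are hypothetical names, and no figures from the paper are reproduced.

```python
import time


def measure(run_query, queries):
    """Return per-query response times (seconds) and throughput (qps)."""
    response_times = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        run_query(query)
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    # Throughput = number of queries executed / total elapsed seconds.
    throughput = len(queries) / elapsed if elapsed > 0 else float("inf")
    return response_times, throughput


def fake_search(query):
    """Stand-in for the real retrieval operation."""
    return sorted(query.split())


times, qps = measure(fake_search, ["semantic web", "rdf turtle", "ontology"])
print(f"mean response time: {sum(times) / len(times):.6f} s")
print(f"throughput: {qps:.1f} qps")
```

Client-side response time would also include network transfer and page rendering, so it is normally larger than the value measured this way.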
A sample search query is given as input, and the web scraping
results are shown in Fig. 9:
Figure 9: Web scraping results
Web scraping performance is evaluated using parameters such as
database size and the count of URLs extracted, as shown in
Fig. 10:
Figure 10: Web scraping analysis
Scraped data from multiple repositories is given as input to
the data validation step. The validated data is obtained as the
output of the data validation process by applying HTML and DOM
parsing techniques. Data validation considers only the document
section of a web page. Data validation results are shown in
Fig. 11:
Figure 11: Data validation results
The performance of data validation in separating wanted from
unwanted scraped data, measured using parameters such as
database size and count of URLs, is shown in Fig. 12:
Figure 12: Data Validation
The validated data stored in the database is converted into XML
notation by applying the XSD declaration, as shown in Fig. 13:
Figure 13: XML Conversion
The Resource Description Framework (RDF) is a recommended
standard of the World Wide Web Consortium (W3C) [16]. The RDF
representation of the data in Turtle form is shown in Fig. 14:
Figure 14: RDF-ization results in Turtle form
RDF generation is considered for a sample relation named
"testrdf", which has the attributes <name, description, freq>.
The relation name "testrdf" is treated as a class in the RDF
graph, and a set of three nodes connected to testrdf in a
depth-wise manner represents a tuple of the relation, as shown
in Fig. 15:
Figure 15: RDF graph generation
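One way to produce input for Graphviz from such a relation is sketched below in Python. The layout is an assumption made for illustration (one row node per tuple, written in blank-node style, with one value node per attribute) and may differ from the graph in Fig. 15; `dot_quote` and `relation_to_dot` are hypothetical helper names.

```python
def dot_quote(text):
    """Quote a value as a DOT identifier, escaping backslashes and quotes."""
    return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'


def relation_to_dot(relation, attributes, rows):
    """Build a DOT digraph: relation -> row node -> one node per attribute."""
    lines = ["digraph rdf {", f"  {dot_quote(relation)} [shape=ellipse];"]
    for i, row in enumerate(rows, start=1):
        node = f"_:row{i}"  # blank-node style label for the tuple
        lines.append(f"  {dot_quote(relation)} -> {dot_quote(node)};")
        for attr, value in zip(attributes, row):
            value_node = f"{node}/{attr}"
            lines.append(
                f"  {dot_quote(value_node)} [shape=box, label={dot_quote(value)}];"
            )
            lines.append(
                f"  {dot_quote(node)} -> {dot_quote(value_node)} "
                f"[label={dot_quote(attr)}];"
            )
    lines.append("}")
    return "\n".join(lines)


dot = relation_to_dot(
    "testrdf",
    ["name", "description", "freq"],
    [("web", "global information repository", 3)],
)
print(dot)
```

Saving the output to a file such as `graph.dot` and running `dot -Tpng graph.dot -o graph.png` renders the graph as an image.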
Ontology generation is performed for the data obtained from
multiple repositories as well as for the text file. The
improved inverted indexing technique is applied to extract the
words with their frequencies, discarding the stop words, in
descending order of frequency. The result of ontology
generation for real-time web data with frequencies is shown in
Fig. 16:
Figure 16: Ontology Generation for Real time web data
The highest-frequency word is considered a frequent search term
for the purpose of rule framing using description logic. Rule
mapping for efficient retrieval is left as future work. The
result of ontology generation for a text document is shown in
Fig. 17.
Figure 17: Ontology Generation for text document
VI. CONCLUSION
The web has evolved through many forms, namely Web 1.0, Web
2.0, Web 3.0 and Web 4.0, leading to high-end information
retrieval systems based on the semantic web. The existing
traditional system, which collects data from search engines,
exhibits average retrieval performance. Implementation of the
proposed framework for automatic generation of ontologies and
RDF improves the performance of traditional search engines by
incorporating semantic capabilities. It includes the
application of HTML parsing and DOM parsing techniques, Turtle
notation and the Graphviz tool. The algorithm improves
information retrieval in Semantic Web and expert systems.
Future work includes applying efficient cryptography to secure
the database and framing rules for the design of an expert
system.
REFERENCES
[1] Sareh Aghaei, Mohammad Ali Nematbakhsh and Hadi Khosravi
Farsani, "Evolution Of The World Wide Web: From Web 1.0 To
Web 4.0", International Journal of Web & Semantic Technology
(IJWesT), Vol. 3, No. 1, January 2012.
[2] Abdeslem Dennai, Sidi Mohammed Benslimane, "Semantic
Indexing of Web Documents Based on Domain Ontology", I.J.
Information Technology and Computer Science, 2015, 02, 1-11.
[3] Seema Redekar, Vishal Chekkala, Siddhapa Gouda, Swapnil
Yalgude, "Web Search Engine Using Ontology Learning",
International Journal of Innovative Research in Computer and
Communication Engineering, Vol. 5, Issue 3, March 2017.
[4] G Sudeepthi, G Anuradha, M Surendra Prasad Babu, "A survey on
semantic web search engine", International Journal of Computer
Science Issues (IJCSI), Volume 9, Issue 2, 2012/3, Pages 241-245.
[5] R. Wilkinson, "Effective retrieval of structured documents",
(S.-V. New York, Ed.), Pages 311-317, 1994.
[6] Amel Grissa Touzi, Hela Ben Massoud and Alaya Ayadi, "Automatic
Ontology Generation for Data Mining Using FCA and Clustering",
arxiv.org, no. 1311.1764.
[7] Amira Aloui, ENIT, Tunis, Tunisia, "A Fuzzy Ontology-Based
Platform for Flexible Querying", International Journal of Service
Science, Management, Engineering, and Technology, Vol. 6, Issue 3,
July-September 2015, pp 12-26.
[8] Prof. M Surendra Prasad Babu, Sudeepthi Govathoti, "A Semantic
Model for Building Integrated Ontology Databases", 7th IEEE
International conference on software engineering and service
science.
[9] Tahani Alsubait, Bijan Parsia, Ulrike Sattler, "Ontology-Based
Multiple Choice Question Generation", Kunstl Intell, Vol. 30, 2016,
pp 183-188.
[10] A. H. M. Rupasingha, "Improving Web Service Clustering
through a Novel Ontology Generation Method by Domain Specificity",
IEEE 24th International Conference on Web Services, 2017.
[11] Seongwook Youn, Anchit Arora, Preetham Chandrasekhar, Paavany
Jayanty, Ashish Mestry and Shikha Sethi, "Survey about Ontology
Development Tools for Ontology-based Knowledge Management".
[12] Kgotatso Desmond Mogotlane, Jean Vincent Fonou-Dombeu,
"Automatic Conversion of Relational Databases into Ontologies".
[13] Fateme Abiri, Mohsen Kahani, Fatane Zarinkalam, "An Entity
Based RDF Indexing Schema Using Hadoop and HBase", 2011.
[14] Faizan Shaikh, Usman A. Siddiqui, Iram Shahzadi, Syed Uami,
Zubair A. Shaik, "SWISE: Semantic Web based Intelligent Search
Engine", 2010.
[15] Gopal Pandey, "The Semantic Web: An Introduction and Issues",
International Journal of Engineering Research and Applications,
Vol. 2, Issue 1, Jan-Feb 2012, pp. 780-786.
[16] Resource Description Framework (RDF). Model and Syntax
Specification. Technical report, W3C.
Prof. M.S. Prasad Babu was born on 12-08-1956 in Andhra Pradesh,
India. He obtained his bachelor's to doctoral degrees from
Andhra University. He has 39 years of teaching and research
experience. He has guided 12 PhD and 210 PG students in their
theses. He was the president of the ICT section of ISCA in
2006-07. He has attended about 50 national and international
conferences in India and abroad and presented keynotes. He has
contributed about 250 papers to national and international
journals and conferences. He has developed 10 internationally
reputed Web portals. He won the ISCA Young Scientist Award in
1986, the State Best Teacher Award for engineering in 2015 and
the Dr. Sarvepalli Radhakrishnan Best Academician Award of
Andhra University for 2014. He was the Conference Steering Chair
for eight IEEE ICSESS conferences in Beijing, China from 2010 to
2017. Prof. Babu is presently working as Vice Principal of AU
Engg College, Chairman, Faculty of Engineering of Andhra
University and senior Professor in the Computer Science &
Systems Engineering discipline.
G. Sudeepthi was born in Rajahmundry, Andhra Pradesh, India in
1983. She received her B.Tech from National Institute of
Technology Warangal, India in 2005 and her M.Tech in Computer
Science & Engineering from KLC Vijayawada, India in 2007. She is
working as a part-time research scholar in the Dept. of CS & SE,
Andhra University and as an Assistant Professor at Anurag Group
of Institutions, Hyderabad, India. She has more than ten years
of teaching experience. She qualified FET-2011 conducted by
JNTUH. She has contributed seven research papers to
international journals and presented one research paper at an
international conference in Beijing, China and another in
Warangal, India.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
94 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

What's hot (19)

PDF
Semantic - Based Querying Using Ontology in Relational Database of Library Ma...
dannyijwest
 
PDF
XML Retrieval: A Survey
ijceronline
 
PPTX
Metadata mapping
Roldan Basilio
 
PPTX
Mods0210
Song,Yoo-hwa
 
PPTX
Linked Data for Czech Legislation
Martin Necasky
 
PPTX
Metadata Mapping & Crosswalks
Nikos Palavitsinis, PhD
 
PDF
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
IJwest
 
PDF
Unit 4 rdbms study_material
gayaramesh
 
PPTX
Linked Data Hypercubes
Dave Reynolds
 
PDF
Paper id 25201463
IJRAT
 
PPTX
Information Intermediaries
Dave Reynolds
 
PDF
Automatically converting tabular data to
IJwest
 
PDF
ESWC SS 2013 - Tuesday Tutorial 2 Maribel Acosta and Barry Norton: Interactio...
eswcsummerschool
 
PDF
Spotlight
Stefano Lariccia
 
PPT
Genre discovery in corpus management systems (2004)
Joseba Abaitua
 
PDF
Phd presentation
Fabiana Lanotte
 
PDF
Unit 3 rdbms study_materials-converted
gayaramesh
 
PPT
Metadata crosswalks
Richard.Sapon-White
 
PDF
Short Report Bridges performance gap between Relational and RDF
Akram Abbasi
 
Semantic - Based Querying Using Ontology in Relational Database of Library Ma...
dannyijwest
 
XML Retrieval: A Survey
ijceronline
 
Metadata mapping
Roldan Basilio
 
Mods0210
Song,Yoo-hwa
 
Linked Data for Czech Legislation
Martin Necasky
 
Metadata Mapping & Crosswalks
Nikos Palavitsinis, PhD
 
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
IJwest
 
Unit 4 rdbms study_material
gayaramesh
 
Linked Data Hypercubes
Dave Reynolds
 
Paper id 25201463
IJRAT
 
Information Intermediaries
Dave Reynolds
 
Automatically converting tabular data to
IJwest
 
ESWC SS 2013 - Tuesday Tutorial 2 Maribel Acosta and Barry Norton: Interactio...
eswcsummerschool
 
Spotlight
Stefano Lariccia
 
Genre discovery in corpus management systems (2004)
Joseba Abaitua
 
Phd presentation
Fabiana Lanotte
 
Unit 3 rdbms study_materials-converted
gayaramesh
 
Metadata crosswalks
Richard.Sapon-White
 
Short Report Bridges performance gap between Relational and RDF
Akram Abbasi
 

Similar to An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data (20)

PDF
A semantic based approach for knowledge discovery and acquistion from multipl...
csandit
 
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
cscpconf
 
PDF
H017554148
IOSR Journals
 
PDF
IRJET- Data Retrieval using Master Resource Description Framework
IRJET Journal
 
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
IJwest
 
PDF
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
PDF
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
Computer Science Journals
 
PDF
Semantic Knowledge Acquisition of Information for Syntactic web
dannyijwest
 
PDF
A Novel Data Extraction and Alignment Method for Web Databases
IJMER
 
PDF
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
IOSR Journals
 
PDF
L017418893
IOSR Journals
 
PDF
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
International Journal of Technical Research & Application
 
PDF
Vision Based Deep Web data Extraction on Nested Query Result Records
IJMER
 
PPT
The Social Data Web
George Thomas
 
PPT
Gt ea2009
George Thomas
 
PDF
R01765113122
IOSR Journals
 
PDF
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
iosrjce
 
PPTX
Data.dcs: Converting Legacy Data into Linked Data
Matthew Rowe
 
PDF
Annotation for query result records based on domain specific ontology
ijnlc
 
PDF
Annotating Search Results from Web Databases
Mohit Sngg
 
A semantic based approach for knowledge discovery and acquistion from multipl...
csandit
 
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
cscpconf
 
H017554148
IOSR Journals
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET Journal
 
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
IJwest
 
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
Computer Science Journals
 
Semantic Knowledge Acquisition of Information for Syntactic web
dannyijwest
 
A Novel Data Extraction and Alignment Method for Web Databases
IJMER
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
IOSR Journals
 
L017418893
IOSR Journals
 
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
International Journal of Technical Research & Application
 
Vision Based Deep Web data Extraction on Nested Query Result Records
IJMER
 
The Social Data Web
George Thomas
 
Gt ea2009
George Thomas
 
R01765113122
IOSR Journals
 
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
iosrjce
 
Data.dcs: Converting Legacy Data into Linked Data
Matthew Rowe
 
Annotation for query result records based on domain specific ontology
ijnlc
 
Annotating Search Results from Web Databases
Mohit Sngg
 
Ad

Recently uploaded (20)

PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PDF
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PDF
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PPTX
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
Ad

An Implementation of a New Framework for Automatic Generation of Ontology and RDF to Real-Time Web and Journal Data

  • 1.  Abstract— Information on the web is tremendously increasing in recent years with the faster rate. This massive or voluminous data has driven intricate problems for information retrieval and knowledge management. As the data resides in a web with several forms, the Knowledge management in the web is a challenging task. Here the novel 'Semantic Web' concept may be used for understanding the web contents by the machine to offer intelligent services in an efficient way with a meaningful knowledge representation. The data retrieval in the traditional web source is focused on 'page ranking' techniques, whereas in the semantic web the data retrieval processes are based on the ‘concept based learning'. The proposed work is aimed at the development of a new framework for automatic generation of ontology and RDF to some real time Web data, extracted from multiple repositories by tracing their URI’s and Text Documents. Improved inverted indexing technique is applied for ontology generation and turtle notation is used for RDF notation. A program is written for validating the extracted data from multiple repositories by removing unwanted data and considering only the document section of the web page. Index Terms— Semantic Web, Resource description framework, Ontology, Improved inverted indexing technique, Knowledge management. I. INTRODUCTION World Wide Web (WWW) is considered as a global information repository that identifies documents and other web resources by Uniform Resource Locators, interlinked by hypertext links. Search engines are used to retrieve the information from the web. Data overburden is the most concerning issue in these days for the existing system. Evolution of web includes the web versions of web 1.0, 2.0 etc. In this series, the web version 3.0 is referred to as semantic web [1] is evolved as a knowledge management support across the globe. 
Search engines should be enriched with semantic web capabilities that analyze webpage content and return results more relevant to the user query. Semantic web standards for handling data include the Resource Description Framework (RDF), web ontologies, RDF Schema and the Rule Interchange Format (RIF). RDF provides a conceptual description of web resources and can be written in syntaxes such as Turtle and N-Triples. RDF describes data on the Web in graph form [2]. Ontologies consist of a finite set of terms, relationships, constraints and axioms [3], and have proven useful for effective knowledge modeling and information retrieval.

The remainder of the paper is arranged as follows. Section II presents the related work. Sections III and IV discuss the proposed work and its methodology. Section V presents the results, and Section VI gives the conclusions.

II. RELATED WORK

M.S.P. Babu et al. [4] gave an overview of some semantic search engines that offer users a distinctive search experience. Wilkinson et al. [5] proposed an information retrieval system that uses document structure. Amel Grissa Touzi et al. [6] suggested the Fuzzy Ontology of Data Mining (FODM) for the automated generation of ontologies in the data mining domain. Amira Aloui et al. [7] implemented a plug-in named "FO-FQ Tab plug-in", which can be integrated with the Protégé editor to build fuzzy ontologies from large databases. To overcome the drawbacks of the existing system for accessing related science information, M.S.P. Babu et al. [8] proposed a new framework for the automatic generation of ontology and RDF for real-time web data. Tahani Alsubait et al. [9] developed an e-learning suite with a set of questions designed using an ontological representation.
A.H.M. Rupasingha et al. [10] suggested that the performance of ontology generation always depends on the specificity of the terms. Seongwook Youn et al. [11] discussed the pros and cons of ontology creation tools such as Protégé 2000, OilEd, Apollo, OntoLingua, OntoEdit, WebODE, KAON, ICOM, DEO and WebOnto. Kgotatso Desmond Mogotlane et al. [12] presented a comparative study of Protégé plug-ins such as DB2OWL and DataMaster.

Sudeepthi Govathoti (1), M.S. Prasad Babu (2)
(1) Research Scholar, Andhra University, Visakhapatnam, and Assistant Professor, Anurag Group of Institutions, Hyderabad
(2) Professor, Department of CS&SE, Andhra University, Visakhapatnam

International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018, p. 89, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
  • 2. III. PROPOSED WORK

Semantic web capabilities such as RDF and ontology are applied to enrich the knowledge. The proposed work implements the framework proposed by the authors in [8]. The framework is designed with reference to the Semantic Web stack and is carried out in two phases: a data extraction phase and a data representation phase. In the data extraction phase, web scraping is performed with an HTML parsing technique by giving a sample search query as input to multiple repositories. DOM parsing and HTML parsing techniques are applied to validate the data retrieved from the repositories by considering only the document section of the web page. Extensible Markup Language (XML) is the base for semantic web representations; the validated information is converted into semi-structured notation by using an XSD declaration of the DOM tree and is passed as input to the next layers of the framework. The XML notation is the input to the data representation phase. RDF notation is generated and represented graphically using the Graphviz tool. A textual representation of the RDF graph is provided in Turtle, the Terse RDF Triple Language. The improved inverted indexing technique is applied for the ontological representation of words, excluding stop words.

Figure 1: Proposed framework

IV. METHODOLOGY

The framework proposed in Section III is implemented in two phases, namely data extraction and data representation. The details are given below.

Phase 1: Data extraction

The data extraction phase performs web scraping from multiple repositories and stores the scraped data in a database. It is sub-divided into three steps: web scraping, data validation and XML conversion. The scraped data is then validated by removing unwanted data and keeping only the document section of the web page.
After the data validation process, the data stored in table format in the database is converted into semi-structured (XML) notation and passed as input to the data representation phase.

Step 1: Web Scraping

Web scraping, also referred to as screen scraping or web harvesting, fetches and extracts data from a web document using HTML parsing techniques. Web pages are crawled and the content of each page is extracted. The data in a web page has three sections: the web page statistics bar, the document section and the descriptive section. The three sections are stored as three different attributes in a database table. The content scraped from the web documents by the HTML parsing technique is shown in Fig. 2.

Figure 2: Content of the Web page

Step 2: Data Validation

In the data validation step, the data collected in Step 1 is validated using HTML and DOM parsing techniques. Unwanted data is removed and the necessary portion of the URLs is retained. The web page statistics bar and the descriptive section are removed from the database table, and the validated data is stored in the database. The document section displays the results for the given search query in the form of page title, URL and snippet (description).

Figure 3: Content of the Web page after Data validation
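The validation rule of Step 2 (keep the document section, drop the statistics bar and the descriptive section) can be illustrated with a small sketch. It is written in Python with the standard `html.parser` module rather than the PHP used by the authors, and the element ids and class names (`document`, `title`, `snippet`) are hypothetical, since the paper does not give the markup of the scraped pages.

```python
from html.parser import HTMLParser


class DocumentSectionParser(HTMLParser):
    """Keep only title/URL/snippet records found inside the document section."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._depth = 0     # <div> nesting depth inside the document section
        self._field = None  # record field currently receiving text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth:
                self._depth += 1
            elif attrs.get("id") == "document":  # hypothetical section id
                self._depth = 1
            return
        if not self._depth:
            return  # statistics bar and descriptive section are skipped
        if tag == "a" and attrs.get("class") == "title":
            self.records.append(
                {"title": "", "url": attrs.get("href", ""), "snippet": ""}
            )
            self._field = "title"
        elif tag == "p" and attrs.get("class") == "snippet" and self.records:
            self._field = "snippet"

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
        elif tag in ("a", "p"):
            self._field = None

    def handle_data(self, data):
        if self._depth and self._field:
            self.records[-1][self._field] += data


def validate(html):
    """Return cleaned records from the document section, dropping rows without a URL."""
    parser = DocumentSectionParser()
    parser.feed(html)
    parser.close()
    cleaned = [{k: v.strip() for k, v in r.items()} for r in parser.records]
    return [r for r in cleaned if r["url"]]


# Illustrative page with the three sections named in Step 1 (markup is assumed).
SAMPLE_PAGE = """
<div id="stats">About 2 results</div>
<div id="document">
  <div class="result">
    <a class="title" href="http://example.org/a">Semantic Web</a>
    <p class="snippet">An extension of the Web.</p>
  </div>
</div>
<div id="descriptive">
  <a class="title" href="http://example.org/wiki">Encyclopedia panel</a>
</div>
"""

if __name__ == "__main__":
    print(validate(SAMPLE_PAGE))
```

Tracking the `<div>` depth lets the parser ignore look-alike links outside the document section, which is the behaviour the validation step asks for.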
  • 3. The descriptive section provides the Wikipedia information about the input query. The data validation process is carried out by considering only the document section of the web page, as shown in Fig. 3.

Step 3: XML Conversion

In the XML conversion step, the data validated in Step 2 is converted into a DOM tree using an XML Schema Definition (XSD). The conversion is performed on the validated data stored in the database by mapping each individual field/attribute to a namespace convention. The XSD declaration of the DOM tree has a hierarchical structure with a root node, representing the search keyword, and three child nodes, representing Title, URL and Description respectively. The XSD declaration of the DOM tree, with an example, is shown in Fig. 4.

Figure 4: XSD Declaration of DOM tree

Phase 2: Data representation

The data produced by Steps 1, 2 and 3 is maintained in XML format and given as input to the data representation phase. In the Semantic Web architecture, data representation rests mainly on RDF-ization and ontology generation. Hence the data representation phase is sub-divided into two steps, RDF-ization and ontology generation, which are explained in Steps 4 and 5 respectively.

Step 4: RDF-ization

The Resource Description Framework (RDF) is the basic building block of the semantic web, promoting conceptual modeling of web data [13]. The RDF-ization process is carried out using Turtle notation and the Graphviz tool. In this step the XML notation produced in the extraction phase is given as input to RDF-ization. The RDF notation is visualized as an RDF graph using Graphviz. Decomposing a tuple creates a new blank node corresponding to the row, and a new triple set is obtained. Each tuple in the relational database is decomposed into RDF triples: the title is taken as the subject, the URL as the predicate and the description as the object. A node can be a URI reference, a literal or a blank node.
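As a rough illustration of the tuple decomposition, the Python sketch below turns each validated row into a blank node with one triple per column and prints the result in Turtle. It follows the blank-node reading of the decomposition, which is an assumption and may differ from the mapping drawn in the paper's figures. The `rbss:` prefix appears in the paper, but its namespace URI is not given, so `http://example.org/rbss#` is a placeholder, as is the `rbss:description` property name. URLs are assumed to be valid IRIs.

```python
def turtle_literal(text):
    """Quote a Python string as a Turtle string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


def rows_to_triples(rows):
    """Decompose each row into (subject, predicate, object) triples.

    Every row gets its own blank node; each column becomes one triple.
    """
    triples = []
    for n, row in enumerate(rows, start=1):
        node = f"_:row{n}"  # blank node standing for the row
        triples.append((node, "rbss:title", turtle_literal(row["title"])))
        triples.append((node, "rbss:URL", f"<{row['url']}>"))
        triples.append((node, "rbss:description", turtle_literal(row["description"])))
    return triples


def to_turtle(triples, prefix_uri="http://example.org/rbss#"):
    """Serialize triples as Turtle, one statement per line (placeholder namespace)."""
    lines = [f"@prefix rbss: <{prefix_uri}> .", ""]
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)


if __name__ == "__main__":
    sample_rows = [
        {
            "title": "Semantic Web",
            "url": "http://example.org/a",
            "description": "An extension of the Web.",
        }
    ]
    print(to_turtle(rows_to_triples(sample_rows)))
```

Writing one statement per line keeps the output close to N-Triples, which makes it easy to feed the same triples to a graph drawing tool such as Graphviz afterwards.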
The graph in Fig. 5 is an example of the RDF-ization process for a semantic net.

Figure 5: RDF Triple

A triple is represented in <subject, predicate, object> format by exploring the relationships among the nodes [14]. The XML produced in Step 3 is represented in RDF syntax using Turtle notation, following the convention specified in "http://www.w3.org/1999/02/22-rdf-syntax-ns#", as shown in Fig. 6.

Figure 6: RDF-ization

The <rdf:Description> element describes the resource identified by its rdf:about attribute. The tags <rbss:title>, <rbss:keyword> and <rbss:URL> are properties of the identified resource. The RDF expressed in Turtle notation is visualized graphically using Graphviz, an open-source tool for generating graphs.

Step 5: Ontology Generation

An ontology is defined as a formal specification of a conceptualization of the domain of interest. In the ontology generation step, the RDF notation obtained in Step 4 is used to create a vocabulary of words using the improved inverted indexing algorithm. The algorithm is applied to real-time web data collected from multiple repositories and to text documents. The words in the description tags are extracted, excluding stop words, and the frequency count (term frequency, TF) of each word is maintained. The improved inverted indexing algorithm is as follows:

Algorithm: Improved inverted indexing
Input: Database D = {T1, T2, ..., Tn}, Storage Database
Output: Attributes {A1, A2, ..., An}, where Ai, for i = 1, 2, ..., n, represent the ontology vocabulary
Parameters:
  Swrdk = array of stop words
  attsLq = snippet attribute
  Wordsk = words stemmed from the snippet attribute
  attsLf = word frequencies after stemming
  attsL = ontology along with the frequency counts
1. Swrdk = {};
2. for i = 0; i ≤ D; i++ do
3.   attsLq = QueryCoverage(D, i);
4.   Wordsk = WordsSeparate(attsLq, Swrdk);
5.   attsLf = WordsUsageFrequency(D, attsLq, Wordsk);
6.   attsL = attsLq ∪ attsLf
7.   f = highest_freq(attsL)
8.   if (f < freq(attsL)) then
9.     sort(Wordsk, freq(attsL))
10.  end if
11. end for
12. return (Wordsk, freq(attsL))

  • 4. V. RESULTS AND DISCUSSION

The Semantic Web stack proposed by Tim Berners-Lee [15] is implemented using the framework proposed by the authors in Section III. It is implemented in PHP 5 (an open-source scripting language) with MySQL 5 (an open-source relational database management system). It is tested on a dataset of sample search keywords. Response time is the time that elapses from the receipt of a query until the results are displayed to the user. It can be measured on the server side or the client side, as shown in Fig. 7.

Figure 7: Response Time

Throughput is defined as the number of queries executed per second (qps). Throughput and response time are observed for the set of retrieval operations with respect to page load times. The throughput of the implemented framework is shown in Fig. 8.

Figure 8: Throughput

A sample search query is given as input, and the web scraping results are shown in Fig. 9.

Figure 9: Web scraping results

Web scraping performance is evaluated in terms of database size and the count of URLs extracted, as shown in Fig. 10.

Figure 10: Web scraping analysis

The data scraped from multiple repositories is given as input to the data validation step. The validated data is obtained as the output of the data validation process by applying HTML and DOM parsing techniques. Data validation considers only the document section of a web page.
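For readers who want to experiment with the improved inverted indexing algorithm of Step 5, the following Python sketch builds a word-to-document index from snippet texts, drops stop words, counts term frequencies and returns the vocabulary ordered by descending frequency. It is a simplified reading of the pseudocode, not the authors' PHP implementation: stemming is omitted, the stop-word list is a tiny illustrative one, and the function name `build_index` is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "is", "in", "to", "for", "on"}
)


def build_index(snippets, stop_words=STOP_WORDS):
    """Build an inverted index and a frequency-ordered vocabulary.

    Returns (postings, vocabulary) where
      postings   maps word -> {snippet index: occurrences in that snippet}
      vocabulary is a list of (word, total frequency), highest frequency
                 first, ties broken alphabetically.
    """
    postings = defaultdict(Counter)
    for doc_id, text in enumerate(snippets):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in stop_words:
                postings[word][doc_id] += 1
    totals = {word: sum(counts.values()) for word, counts in postings.items()}
    vocabulary = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return {word: dict(counts) for word, counts in postings.items()}, vocabulary


if __name__ == "__main__":
    sample = [
        "The semantic web is a web of data",
        "Ontology for the semantic web",
    ]
    index, vocab = build_index(sample)
    for word, freq in vocab:
        print(word, freq, index[word])
```

Keeping the per-snippet counts alongside the totals means the same structure can answer both "which snippets mention this term" and "which terms are most frequent", the latter being what the ontology vocabulary is ordered by.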
Data validation results are shown in Fig. 11.

Figure 11: Data validation results

The performance of the data validation process in separating wanted from unwanted scraped data, in terms of database size and count of URLs, is shown in Fig. 12.
  • 5. Figure 12: Data Validation

The validated data stored in the database is converted into XML notation by applying the XSD declaration, as shown in Fig. 13.

Figure 13: XML Conversion

The Resource Description Framework (RDF) is a recommended standard of the World Wide Web Consortium (W3C) [16]. The RDF representation of the data in Turtle form is shown in Fig. 14.

Figure 14: RDF-ization results

Turtle-form RDF generation is considered for a sample relation named "testrdf", which has the attributes <name, description, freq>. The relation name "testrdf" is treated as a class in the RDF graph, and a set of three nodes connected to testrdf in a depth-wise manner represents a tuple of the relation, as shown in Fig. 15.

Figure 15: RDF Graph generation

Ontology generation is performed for the data obtained from multiple repositories as well as for the text file. The improved inverted indexing technique is applied to extract the words with their frequencies, discarding stop words, in order of highest precedence. The result of ontology generation for real-time web data, with frequencies, is shown in Fig. 16.

Figure 16: Ontology Generation for Real time web data

The highest-frequency word is considered a frequent search term for the purpose of rule framing using description logic. Rule mapping for efficient retrieval is left as future work. The result of ontology generation for a text document is shown in Fig. 17.

Figure 17: Ontology Generation for text document

VI. CONCLUSION

The web has evolved through many forms, namely Web 1.0, Web 2.0, Web 3.0 and Web 4.0, leading to high-end information retrieval systems that use the semantic web. The existing traditional system, which collects data from search engines, exhibits average retrieval performance.

  • 6. Implementation of the proposed framework for the automatic generation of ontologies and RDF improves the performance of traditional search engines by incorporating semantic capabilities. It includes the application of HTML parsing and DOM parsing techniques, Turtle notation and the Graphviz tool. The algorithm improves information retrieval in Semantic Web and expert systems. Future work includes applying efficient cryptography to secure the database and framing rules for the design of an expert system.

REFERENCES

[1] Sareh Aghaei, Mohammad Ali Nematbakhsh and Hadi Khosravi Farsani, "Evolution of the World Wide Web: From Web 1.0 to Web 4.0", International Journal of Web & Semantic Technology (IJWesT), Vol. 3, No. 1, January 2012.
[2] Abdeslem Dennai, Sidi Mohammed Benslimane, "Semantic Indexing of Web Documents Based on Domain Ontology", I.J. Information Technology and Computer Science, 2015, 02, 1-11.
[3] Seema Redekar, Vishal Chekkala, Siddhapa Gouda, Swapnil Yalgude, "Web Search Engine Using Ontology Learning", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 5, Issue 3, March 2017.
[4] G. Sudeepthi, G. Anuradha, M. Surendra Prasad Babu, "A survey on semantic web search engine", International Journal of Computer Science Issues (IJCSI), 2012/3, Volume 9, Issue 2, pp. 241-245.
[5] R. Wilkinson, "Effective retrieval of structured documents", (S.-V. New York, Ed.), pp. 311-317, 1994.
[6] Amel Grissa Touzi, Hela Ben Massoud and Alaya Ayadi, "Automatic Ontology Generation for Data Mining Using FCA and Clustering", arxiv.org, no. 1311.1764.
[7] Amira Aloui, ENIT, Tunis, Tunisia, "A Fuzzy Ontology-Based Platform for Flexible Querying", International Journal of Service Science, Management, Engineering, and Technology, Vol. 6, Issue 3, July-September 2015, pp. 12-26.
[8] M. Surendra Prasad Babu, Sudeepthi Govathoti, "A Semantic Model for Building Integrated Ontology Databases", 7th IEEE International Conference on Software Engineering and Service Science.
[9] Tahani Alsubait, Bijan Parsia, Ulrike Sattler, "Ontology-Based Multiple Choice Question Generation", Kunstl Intell, Vol. 30, 2016, pp. 183-188.
[10] A. H. M. Rupasingha, "Improving Web Service Clustering through a Novel Ontology Generation Method by Domain Specificity", IEEE 24th International Conference on Web Services, 2017.
[11] Seongwook Youn, Anchit Arora, Preetham Chandrasekhar, Paavany Jayanty, Ashish Mestry and Shikha Sethi, "Survey about Ontology Development Tools for Ontology-based Knowledge Management".
[12] Kgotatso Desmond Mogotlane, Jean Vincent Fonou-Dombeu, "Automatic Conversion of Relational Databases into Ontologies".
[13] Fateme Abiri, Mohsen Kahani, Fatane Zarinkalam, "An Entity Based RDF Indexing Schema Using Hadoop and HBase", 2011.
[14] Faizan Shaikh, Usman A. Siddiqui, Iram Shahzadi, Syed Uami, Zubair A. Shaik, "SWISE: Semantic Web based Intelligent Search Engine", 2010.
[15] Gopal Pandey, "The Semantic Web: An Introduction and Issues", International Journal of Engineering Research and Applications, Vol. 2, Issue 1, Jan-Feb 2012, pp. 780-786.
[16] Resource Description Framework (RDF). Model and Syntax Specification. Technical report, W3C.

Prof. M.S. Prasad Babu was born on 12-08-1956 in Andhra Pradesh, India. He obtained his bachelor's to doctoral degrees from Andhra University. He has 39 years of teaching and research experience. He has guided 12 PhD and 210 PG students in their theses. He was the president of the ICT section of ISCA in 2006-07. He has attended about 50 national and international conferences in India and abroad and presented keynotes. He has contributed about 250 papers to national and international journals and conferences. He has developed 10 internationally reputed Web portals.
He won the ISCA Young Scientist Award in 1986, the State Best Teacher Award for engineering in 2015 and the Dr. Sarvepalli Radhakrishnan Best Academician Award of Andhra University for 2014. He was the Conference Steering Chair for eight IEEE ICSESS conferences in Beijing, China from 2010 to 2017. Prof. Babu is presently working as Vice Principal of AU Engineering College, Chairman of the Faculty of Engineering of Andhra University and senior Professor in the Computer Science & Systems Engineering discipline.

G. Sudeepthi was born in Rajahmundry, Andhra Pradesh, India in 1983. She received her B.Tech from the National Institute of Technology Warangal, India in 2005 and her M.Tech in Computer Science & Engineering from KLC Vijayawada, India in 2007. She is working as a part-time research scholar in the Dept. of CS & SE, Andhra University and as an Assistant Professor at Anurag Group of Institutions, Hyderabad, India. She has more than ten years of teaching experience. She qualified in FET-2011, conducted by JNTUH. She has contributed seven research papers to international journals and presented one research paper at an international conference in Beijing, China and another in Warangal, India.