The Impact of the Web on IR

Ms. T.Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore

WWW
 The World Wide Web was developed by Tim Berners-Lee
in 1990 at CERN to organize research documents
available on the Internet.
 The World Wide Web ("WWW" or simply the "Web")
is a global information medium which users can read
and write via computers connected to the Internet.
 The term is often mistakenly used as a synonym for
the Internet itself, but the Web is a service that
operates over the Internet, just as e-mail also does.

Internet
 The internet is a series of huge computer networks that
allows many computers to connect and communicate with
each other globally.
 On top of the Internet sits a set of languages that
allow information to travel between computers. These are
known as protocols.
 For instance, some common protocols for transferring
emails are IMAP, POP3 and SMTP.
 Just as email is a layer on the internet, the World Wide
Web is another layer which uses different protocols.

The World Wide Web relies on three core standards:
 HTML (Hypertext Markup Language) - The language that
we write our web pages in.
 HTTP (Hypertext Transfer Protocol) - Although other
protocols such as FTP can be used, this is the most common
protocol. It was developed specifically for the World Wide
Web and is favored for its simplicity and speed. This protocol
requests the HTML document from the server and delivers it to
the browser.
 URL (Uniform Resource Locator) - The last piece of the puzzle
that allows the Web to work is the URL. This is the
address that indicates where any given document lives on
the Web. It can be written as <protocol>://<node>/<location>
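The <protocol>://<node>/<location> pattern can be checked with a minimal sketch using Python's standard urllib.parse module. The function name split_url and the example.com address are illustrative assumptions, not part of the original slides.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (protocol, node, location) parts of a URL."""
    parts = urlsplit(url)
    # scheme -> <protocol>, netloc -> <node>, path -> <location>
    return parts.scheme, parts.netloc, parts.path


# Hypothetical address used only for illustration.
print(split_url("http://www.example.com/docs/index.html"))
# ('http', 'www.example.com', '/docs/index.html')
```

Note that the location part keeps its leading slash, since the slash is part of the path.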

IR on Web
Information retrieval on the Web has always been a
different and more difficult task than retrieval in a
classical information retrieval system (such as a
library system).
 Hypertext: Documents on the Web differ from
general text-only documents because they contain
hyperlinks. It is estimated that there are roughly
10 hyperlinks per document.
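Because hyperlinks set Web documents apart from plain text, a Web IR system usually has to extract them. The following is a minimal sketch using Python's standard html.parser module. The class name LinkCollector, the function extract_links and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all hyperlink targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page used only for illustration.
page = '<p>See <a href="http://example.com/a">A</a> and <a href="/b">B</a>.</p>'
print(extract_links(page))  # ['http://example.com/a', '/b']
```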

 Heterogeneity of documents: The contents of a
web page are heterogeneous in nature, i.e., in
addition to text they might contain multimedia
content such as audio, video and images.
 Duplication: On the Web, over 20% of the
documents are exact or near duplicates of other
documents, and this estimate does not yet include
semantic duplicates.
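One common way to spot near duplicates is to compare sets of overlapping word sequences (shingles) with the Jaccard coefficient. The slides do not name a technique, so this is a sketch of that general idea. The shingle length k=3 and the 0.5 threshold are arbitrary assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc1, doc2, k=3, threshold=0.5):
    """Treat two texts as near duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold


# Two sentences that differ in a single word.
doc_a = "the web is a global information medium for all users"
doc_b = "the web is a global information medium for most users"
print(jaccard(shingles(doc_a), shingles(doc_b)))  # 0.6
print(is_near_duplicate(doc_a, doc_b))            # True
```

This catches exact and near duplicates only. Semantic duplicates (same meaning, different wording) share few shingles and would need a different approach.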

 Number of documents: The size of the Web has grown
exponentially over the past few years. The collection now
holds trillions of documents, far more than any collection
processed by a classical information retrieval system.
According to one estimate, the Web grows by about 10% per
month.
 Lack of stability: Web pages lack stability in the sense that
their contents are modified frequently. Moreover, anyone
using the Internet can create a Web page, whether or not it
contains authentic information.
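To see what a monthly growth rate of 10% implies, the compounding can be worked out directly. This is a small arithmetic sketch that takes the 10% figure from the slide at face value. The function names are illustrative.

```python
import math


def growth_factor(monthly_rate, months):
    """Factor by which a collection grows under compound monthly growth."""
    return (1 + monthly_rate) ** months


def doubling_time(monthly_rate):
    """Number of months needed for the collection to double in size."""
    return math.log(2) / math.log(1 + monthly_rate)


print(round(growth_factor(0.10, 12), 2))  # 3.14 (about triple in a year)
print(round(doubling_time(0.10), 1))      # 7.3 (months)
```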

 Users of the Web behave differently from users of
classical information retrieval systems. The latter are
mostly trained librarians, whereas Web users range from
laypeople to technically proficient users. Typical
user behavior shows:
 Poor queries: Most queries submitted by users are
short and lack useful keywords that would help in the
retrieval of relevant information.
 Reaction to results: Users usually do not evaluate all the
result screens; they look only at the results displayed on
the first result screen.
 Heterogeneity of users: Web users vary widely in education
and Web experience.

IR systems use two kinds of terms:
 Objective
 Non-objective
Objective terms: These are extrinsic to the semantic content
of a document.
Ex: author name, document URL, date of publication.
Non-objective terms: These are intended to reflect the
information in the document, and there is no
agreement about the choice or degree of applicability
of the terms. They are also known as content terms.
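The distinction can be sketched as a record that keeps both kinds of terms side by side. The class IndexedDocument, its field names, the weights and the sample documents are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    # Objective terms: facts about the document, independent of its meaning.
    author: str
    url: str
    published: str
    # Non-objective (content) terms: term -> degree of applicability (0..1).
    # The weights are a judgment call; two indexers may disagree on them.
    content_terms: dict = field(default_factory=dict)


def by_author(docs, author):
    """Objective lookup: an exact match on a metadata field."""
    return [d for d in docs if d.author == author]


def by_content(docs, term):
    """Content lookup: rank documents by how strongly the term applies."""
    scored = [(d.content_terms.get(term, 0.0), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0]


# Hypothetical documents used only for illustration.
docs = [
    IndexedDocument("A. Author", "http://www.example.com/ir.html",
                    "2015-03-01", {"retrieval": 0.9, "web": 0.4}),
    IndexedDocument("B. Writer", "http://www.example.com/web.html",
                    "2016-07-15", {"web": 0.8}),
]
print([d.url for d in by_author(docs, "A. Author")])
print([d.url for d in by_content(docs, "web")])
```

Matching on an objective term is a yes/no test, while matching on a content term produces a ranking that depends on the chosen weights.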

Types of IR queries:
 Keyword queries
 Boolean queries (AND, OR, NOT)
 Phrase queries
 Proximity queries
 Full document queries
 Natural Language Questions
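Several of these query types can be illustrated on a small positional inverted index. This is a minimal sketch, not how any particular search engine works. The function names, the whitespace tokenizer and the three sample documents are assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def keyword(index, term):
    """Keyword query: documents that contain the term."""
    return set(index.get(term.lower(), {}))


def boolean_and(index, t1, t2):
    return keyword(index, t1) & keyword(index, t2)


def boolean_or(index, t1, t2):
    return keyword(index, t1) | keyword(index, t2)


def boolean_not(index, all_ids, term):
    return set(all_ids) - keyword(index, term)


def proximity(index, t1, t2, max_gap):
    """Proximity query: t1 and t2 occur within max_gap positions of each other."""
    p1, p2 = index.get(t1.lower(), {}), index.get(t2.lower(), {})
    return {d for d in p1.keys() & p2.keys()
            if any(abs(a - b) <= max_gap for a in p1[d] for b in p2[d])}


def phrase(index, t1, t2):
    """Two-word phrase query: t2 occurs immediately after t1."""
    p1, p2 = index.get(t1.lower(), {}), index.get(t2.lower(), {})
    return {d for d in p1.keys() & p2.keys()
            if any(b - a == 1 for a in p1[d] for b in p2[d])}


# Hypothetical mini collection used only for illustration.
docs = {
    1: "web search engines index web pages",
    2: "information retrieval on the web",
    3: "library information systems",
}
index = build_index(docs)
print(keyword(index, "web"))                     # {1, 2}
print(boolean_and(index, "web", "information"))  # {2}
print(boolean_not(index, docs, "web"))           # {3}
print(phrase(index, "information", "retrieval")) # {2}
print(proximity(index, "information", "web", 4)) # {2}
```

Full document queries and natural language questions need more than term matching and are outside the scope of this sketch.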

 The WWW is expanding faster than any current search
engine can possibly index. Many web pages are updated
frequently or are dynamically generated, which forces
search engines to revisit them repeatedly.
 Many dynamically generated sites are not indexable
by search engines; these are known as the invisible web.
 The ordering of results is not always solely by relevance,
but is sometimes influenced by monetary contributions, so
relevance is difficult to reconcile with the business model.

 Some sites use tricks to manipulate search engines
and improve their ranking for certain keywords. This is
known as search engine spamming.
[Figure: the IR process. The user's information needs are
translated into queries; the queries are matched against the
stored information (search/select); the query results are
then evaluated: does the information found match the user's
information needs?]
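The evaluation step of the process (does the information found match the user's needs?) is commonly quantified with precision and recall. These measures are not named in the slides, so this is a sketch under that assumption. The document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(p)            # 0.5
print(round(r, 2))  # 0.67
```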

Web problems are divided into two classes:
 Problems with the data itself
 Problems regarding the user

Problems with the data itself:
 Distributed data: Documents are spread over millions of
different web servers.
 Volatile data: Many documents change or disappear
rapidly (e.g., dead links).
 Large volume: Trillions of separate documents.
 Unstructured and redundant data: HTML errors, duplicate
documents.
 Quality of data: False information, poor-quality writing.
 Heterogeneous data: Multiple media types (e.g., images, video).

Problems regarding the user:
 How to specify the query?
 How to interpret the answer provided by the system?
To get a proper result, the user must submit a good query to
the search system and obtain a manageable and relevant answer.
