Ms. T.Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
WWW
• The World Wide Web was developed by Tim Berners-Lee at CERN in 1990
to organize research documents available on the Internet.
• The World Wide Web ("WWW" or simply the "Web") is a global
information medium that users can read and write via computers
connected to the Internet.
• The term is often mistakenly used as a synonym for the Internet
itself, but the Web is a service that operates over the Internet,
just as e-mail does.
Internet
• The Internet is a global system of interconnected computer networks
that allows many computers to connect and communicate with each other.
• Information travels between computers on the Internet according to
sets of rules known as protocols.
• For instance, some common e-mail protocols are SMTP (for sending
mail) and IMAP and POP3 (for retrieving it).
• Just as e-mail is a layer on the Internet, the World Wide Web is
another layer, which uses different protocols.
The World Wide Web is built on three core technologies:
• HTML (Hypertext Markup Language) - The language in which web pages
are written. Strictly speaking, it is a markup language rather than a
protocol.
• HTTP (Hypertext Transfer Protocol) - Although other protocols such
as FTP can be used, this is the most common one. It was developed
specifically for the World Wide Web and is favored for its simplicity
and speed. The browser uses it to request an HTML document from the
server, and the server uses it to send the document back.
• URL (Uniform Resource Locator) - The last piece required for the
Web to work. A URL is the address that indicates where a given
document lives on the Web. Its general form is
<protocol>://<node>/<location>
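As a small illustration of the <protocol>://<node>/<location> form, the sketch below splits a URL into those three parts with Python's standard urllib.parse module and builds the text of a minimal HTTP request for it. The example address and the helper names split_url and build_get_request are made up for illustration, and no network connection is made.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (protocol, node, location) parts of a URL."""
    parts = urlsplit(url)
    # An empty path is treated here as the root document of the server.
    return parts.scheme, parts.netloc, parts.path or "/"


def build_get_request(url):
    """Build the text of a minimal HTTP/1.1 GET request for the URL."""
    _, node, location = split_url(url)
    return f"GET {location} HTTP/1.1\r\nHost: {node}\r\n\r\n"


# Hypothetical address used only for illustration.
protocol, node, location = split_url("http://www.example.org/docs/index.html")
print(protocol, node, location)
print(build_get_request("http://www.example.org/docs/index.html"))
```

A real browser sends many more headers than the single Host line shown here; the point is only that the node tells the client which server to contact and the location tells the server which document is wanted.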
IR on Web
Information retrieval on the Web has always been a different and
more difficult task than retrieval in a classical information
retrieval system (such as a library system).
• Hypertext: Documents on the Web differ from general text-only
documents because they contain hyperlinks. It is estimated that
there are roughly 10 hyperlinks per document.
• Heterogeneity of documents: The contents of a web page are
heterogeneous in nature, i.e., in addition to text they may include
multimedia content such as audio, video and images.
• Duplication: On the Web, over 20% of documents are estimated to be
exact or near duplicates of other documents, and this estimate does
not yet include semantic duplicates.
• Number of documents: The Web has grown exponentially over the past
few years. The collection now runs to trillions of documents, far
larger than any collection processed by a classical information
retrieval system. According to one estimate, the Web grows by 10%
per month.
• Lack of stability: Web pages lack stability in the sense that their
contents are modified frequently. Moreover, anyone with Internet
access can create a Web page, whether or not the information in it
is authentic.
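To make the duplication point concrete, here is a minimal sketch of one common way to flag near duplicates: compare the sets of overlapping word windows (shingles) of two documents with the Jaccard coefficient. The slides do not say how the 20% figure was measured, so the technique, the function names, the window size k=3 and the 0.4 threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts are represented by a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc1, doc2, k=3, threshold=0.4):
    """Flag two texts as near duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold


page_a = "the web is a service that operates over the internet"
page_b = "the web is a service that runs over the internet"
print(jaccard(shingles(page_a), shingles(page_b)))
print(is_near_duplicate(page_a, page_b))
```

Note that this only catches textual overlap. Semantic duplicates, which say the same thing in different words, would not be detected by shingle comparison.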
• Users of the Web behave differently from users of classical
information retrieval systems. The latter are mostly trained
librarians, whereas Web users range from laypeople to technically
skilled people. Typical user behavior shows:
• Poor queries: Most queries submitted by users are short and lack
keywords that would help retrieve relevant information.
• Reaction to results: Users usually do not evaluate all the result
screens; they look only at the results displayed on the first one.
• Heterogeneity of users: Web users vary widely in education and
Web experience.
IR systems use two kinds of index terms:
• Objective
• Non-objective
Objective terms: These are extrinsic to the semantic content of the
document, and there is generally no disagreement about how to assign
them.
Ex: author name, document URL, date of publication.
Non-objective terms: These are intended to reflect the information in
the document, and there is no agreement about the choice or degree of
applicability of the terms. They are also known as content terms.
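The distinction can be sketched as a record with two kinds of fields. In the hypothetical IndexedDocument below, the objective terms are stored as given, while the content terms are derived from the text by one arbitrary rule (lowercased words minus a tiny stopword list). That rule is an assumption, and a different indexer could reasonably pick different terms, which is exactly the lack of agreement described above.

```python
from dataclasses import dataclass, field

# Tiny illustrative stopword list; this choice is an assumption.
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}


def extract_content_terms(text):
    """One possible choice of content terms: distinct non-stopwords, in order."""
    terms = []
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word and word not in STOPWORDS and word not in terms:
            terms.append(word)
    return terms


@dataclass
class IndexedDocument:
    # Objective terms: facts about the document, independent of its content.
    author: str
    url: str
    published: str
    # The text from which the non-objective (content) terms are derived.
    text: str
    content_terms: list = field(default_factory=list)

    def __post_init__(self):
        if not self.content_terms:
            self.content_terms = extract_content_terms(self.text)


# Hypothetical record used only for illustration.
doc = IndexedDocument(
    author="T. Primya",
    url="http://www.example.org/impact-of-web-on-ir",
    published="unknown",
    text="The impact of the Web on IR.",
)
print(doc.content_terms)
```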
IR systems typically support the following query types:
• Keyword queries
• Boolean queries (AND, OR, NOT)
• Phrase queries
• Proximity queries
• Full-document queries
• Natural language questions
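The first four query types can be sketched over a small positional inverted index, which records for each term the documents and word positions where it occurs. This is a toy illustration, not how any particular search engine works; the three sample documents and all function names are assumptions. Full-document queries and natural language questions need more machinery and are not covered.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def keyword(index, term):
    """Keyword query: documents that contain the term."""
    return set(index.get(term, {}))


def boolean_and(index, t1, t2):
    return keyword(index, t1) & keyword(index, t2)


def boolean_or(index, t1, t2):
    return keyword(index, t1) | keyword(index, t2)


def boolean_not(index, term, docs):
    return set(docs) - keyword(index, term)


def proximity(index, t1, t2, max_gap):
    """Proximity query: t1 and t2 occur within max_gap word positions."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        p1, p2 = index[t1][doc_id], index[t2][doc_id]
        if any(abs(a - b) <= max_gap for a in p1 for b in p2):
            hits.add(doc_id)
    return hits


def phrase(index, t1, t2):
    """Two-word phrase query: t2 occurs immediately after t1."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        following = set(index[t2][doc_id])
        if any(a + 1 in following for a in index[t1][doc_id]):
            hits.add(doc_id)
    return hits


# Made-up documents used only for illustration.
DOCS = {
    1: "web search engines index web pages",
    2: "classical information retrieval in a library",
    3: "information retrieval on the web",
}
INDEX = build_index(DOCS)
print(keyword(INDEX, "web"))
print(boolean_and(INDEX, "information", "web"))
print(phrase(INDEX, "information", "retrieval"))
print(proximity(INDEX, "retrieval", "web", 3))
```

Storing positions rather than just document ids is what makes phrase and proximity queries possible; a plain inverted index would be enough for keyword and Boolean queries.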
• The WWW is expanding faster than any current search engine can
index it. Many web pages are updated frequently or are dynamically
generated, which forces search engines to revisit them repeatedly.
• Many dynamically generated sites cannot be indexed by search
engines; these make up what is known as the invisible web.
• The ordering of results is not always determined solely by
relevance; it is sometimes influenced by monetary contributions.
This difficulty is tied to the search engines' business model.
• Some sites use tricks to manipulate search engines and improve
their ranking for certain keywords, a practice known as search
engine spamming.
[Figure: the information search process. The user's information needs
are translated into queries, the queries are matched against stored
information, and the query results are evaluated: does the information
found match the user's information needs?]
Web problems fall into two classes:
• Problems with the data itself
• Problems regarding the user
Problems with the data itself:
• Distributed data - Documents are spread over millions of different
web servers.
• Volatile data - Many documents change or disappear rapidly
(e.g., dead links).
• Large volume - Trillions of separate documents.
• Unstructured and redundant data - HTML errors, duplicate documents.
• Quality of data - False information, poor-quality writing.
• Heterogeneous data - Multiple media types (e.g., images, video).
Problems regarding the user:
• How to specify the query?
• How to interpret the answer provided by the system?
To get a proper result, the user has to submit a good query to the
search system and obtain a manageable and relevant answer.
Thank you!

The Impact of the Web on IR
