Dr B T Sampath Kumar
Professor
Department of Library and Information Science
Tumkur University, Tumakuru, INDIA
www.sampathkumar.info
Web decay and Internet Archives
Internet: The 8th Wonder of the earth
•Network of networks.
•Information super highway.
•Ocean of information.
•It is a global computer network providing a
variety of information and communication
facilities.
Cont..
•The Internet is a very complex and
revolutionary invention of the 1960s.
•It has played a key role in rapid access to
information.
•The adoption of web-based technology for
publishing information has increased the
availability of electronic information over the
web.
Cont..
•Therefore, the web is popular among
researchers and is considered an
important means of locating and sharing
scientific information.
•Present-day researchers are
increasingly citing web sources (URLs) in their
scholarly publications.
Use of web sources as citations
•Use of web sources has become common in
journal articles, conference articles,
theses, dissertations, and student
projects/assignments.
•A plethora of literature is available on the
use of web sources as citations in scholarly
journal articles.
Author(s)                     Year   No. of web citations   Percentage of web citations
Mardani                       2011   4253                   22.62
Tajeddini et al.              2011   4562                   11.00
Sadat-Moosavi et al.          2012   2886                   24.00
Sampath Kumar & Manoj Kumar   2012   2890                   18.77
Sampath Kumar & Vinay Kumar   2013   1290                   18.91
Sife & Bernard                2013   1487                    9.60
Sampath Kumar & Prithviraj    2014   5698                   36.19
Sampath Kumar et al.          2015   1930                   12.69
Vinay Kumar & Sampath Kumar   2017   2133                   29.00
Sife and Lwoga                2018    574                    3.30
Vinay Kumar and Sushmitha     2019   1105                   16.46
Why do we use Web sources as
citations?
•Web sources are popular among the
academic and scientific community.
•Most articles/books are available online.
•Information is easy to find and access.
•Available 24x7 and free of cost.
Problem with Web sites
•Some websites simply disappear when their
domain names expire.
•In some other cases the web pages are
removed from the Web by the web master.
•Some websites are designed with new file
structures.
•Some websites may move to new online
locations.
Cont…
•The content may be removed from the Web
because the information is no longer relevant
or correct.
•Typographical errors are made while citing a
URL.
•The server is made offline for maintenance.
•It causes a temporary inaccessibility of the
web resource.
Cont..
•Network problem at client side or at the
server side.
•This disappearance of websites over a period
of time is referred to as web decay.
Web decay
•Decay is the state or process of rotting or
decomposition.
•Web decay refers to a process whereby the
web source disappears over a period of time.
Accessible/Active web site
•The URL of the web site is valid and it can
be accessed on the Internet using a web
browser.
•Ex: www.lisacademy.org
Cont..
Missing web site
•A missing website is one that returns an
HTTP error message when its URL is entered
in a web browser or the W3C Link Checker.
•Also called vanished, decayed, missing or rotted websites.
Author(s)                       Year   No. of web citations   % of decayed web citations
Mardani                         2011   4253                   18.00
Tajeddini et al.                2011   4562                   34.00
Sadat-Moosavi et al.            2012   2886                   36.00
Sampath Kumar and Manoj Kumar   2012   2890                   26.08
Sampath Kumar and Vinay Kumar   2013   1290                   39.84
Sife and Bernard                2013   1487                   58.00
Sampath Kumar and Prithviraj    2014   5698                   50.09
Sampath Kumar et al.            2015   1930                   30.98
Vinay Kumar and Sampath Kumar   2017   2133                   38.58
Zhao et al.                     2017   7058                   37.09
Sife and Lwoga                  2018    574                   44.10
Vinay Kumar and Sushmitha       2019   1105                   43.44
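The decay percentages in this table can be turned back into approximate counts. The sketch below assumes each percentage is taken over the number of web citations in the same row, which the slides do not state explicitly; the function name is illustrative only.

```python
def decayed_count(web_citations: int, decayed_percent: float) -> int:
    """Approximate number of decayed citations, rounded to a whole citation.

    Assumption: decayed_percent is a share of web_citations.
    """
    return round(web_citations * decayed_percent / 100)


# Mardani (2011) row: 4253 web citations, 18.00% decayed.
print(decayed_count(4253, 18.00))
```

Under that assumption, roughly one in five to one in two cited URLs in these studies could no longer be reached.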
Implications of Web decay
•Users will not trust web sources.
•Users will not get the original
information.
•Web sources will not be used as references in
scholarly communication.
•Open access content is affected more.
How to check the availability of web
site?
• Directly enter the URL of the source in a web
browser.
• Use of link checker.
URL link checkers
These tools are used to check for broken URLs
on a website.
Some of the URL link checkers are:
•W3C Link Checker
•Online Website Link Checker
•Dead link checker
•Web accessibility checker
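What a link checker does for a single URL can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how any of the tools listed above is implemented; the function names `fetch_status` and `is_accessible` are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response was obtained."""
    try:
        # HEAD asks for headers only; some servers reject it, so a real
        # checker may need to retry with GET.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status  # status after any redirects were followed
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error code such as 404
    except (OSError, ValueError):
        return None  # network problem, timeout, or malformed URL


def is_accessible(status: Optional[int]) -> bool:
    """Treat 2xx and 3xx answers as accessible, everything else as missing."""
    return status is not None and 200 <= status < 400
```

Usage: `is_accessible(fetch_status("http://www.lisacademy.org"))` reports whether the URL answered at the time of the check. A single failed check may reflect a temporary outage, so repeating the check later is advisable.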
W3C link checker
HTTP error codes
•HTTP-300 (Multiple Choices): This code informs
the client that the requested resource has
several possible representations and the server
cannot resolve them into a single choice.
•HTTP-301 (Moved Permanently): This code
indicates that the requested resource has
been moved permanently to a new URL.
•The server redirects clients to the new
location of the requested resource.
HTTP error codes
•HTTP-302 (Found, formerly Moved Temporarily):
This code indicates that a temporary new
location has been assigned to a resource.
•The server redirects the requester to the new
location of the requested material.
•HTTP-400 (Bad Request): The request
contains bad syntax, so the server could
not understand it.
HTTP error codes
•HTTP-401 (Unauthorized): Some web
resources require a username and password;
this code is returned when valid credentials
are not supplied.
•HTTP-403 (Forbidden): This code indicates
that the server understood the request but
refuses to fulfil it, for example because the
client is not permitted to read the file or
directory.
•HTTP-404 (Not Found): This code
indicates that the server has not found
anything matching the requested URL.
HTTP error codes
•HTTP-410 (Gone): This code indicates that
the requested resource is no longer available
on the server and will not be available again.
•HTTP-504 (Gateway Timeout): This code
indicates that a server acting as a gateway
or proxy did not receive a timely
response from the upstream server.
HTTP error codes
•HTTP-415 (Unsupported Media Type): The
server refuses the request because the media
type (format) of the request content is not
supported.
•HTTP-503 (Service Unavailable): This code
indicates that the server is temporarily
unable to handle the request made by the
client.
•Server overload is a common cause of
this response.
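For a decay study, it helps to separate codes that suggest a temporary problem from codes that suggest the source is gone. The sketch below uses Python's standard `http.HTTPStatus` enum for the official reason phrases; the grouping into "temporarily unavailable" and "missing" is an assumption made for illustration, not a classification taken from the slides.

```python
from http import HTTPStatus

# Assumed grouping: these usually signal a passing problem, so re-check later.
TEMPORARY = {HTTPStatus.SERVICE_UNAVAILABLE, HTTPStatus.GATEWAY_TIMEOUT}
# Assumed grouping: these say the content is not at the cited address.
MISSING = {HTTPStatus.NOT_FOUND, HTTPStatus.GONE}


def describe(code: int) -> str:
    """Return 'code phrase: verdict' for a standard HTTP status code.

    Raises ValueError for codes that HTTPStatus does not define.
    """
    status = HTTPStatus(code)
    if status in TEMPORARY:
        verdict = "temporarily unavailable"
    elif status in MISSING:
        verdict = "missing"
    elif 300 <= code < 400:
        verdict = "redirected"
    elif code >= 400:
        verdict = "other error"
    else:
        verdict = "accessible"
    return f"{code} {status.phrase}: {verdict}"


print(describe(404))  # 404 Not Found: missing
```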
Internet Archive
•Internet Archive is a non-profit library of
millions of free books, movies, software,
music, websites, and more
Web archiving initiatives
Web Archive                                 Country               Year   Main scope of archive content
Australia's Web Archive                     Australia             1996   National
Government of Canada Web Archive (GCWA)     Canada                2005   National governmental
Internet Archive (Wayback Machine)          USA                   1996   International & service provider
Internet Memory Foundation                  France, Netherlands   2004   International & service provider
Japan Web Archiving Project                 Japan                 2004   National
Wayback machine: An ideal tool to
recover the decayed Web sources
•It is a free online resource that was created in
1996
•It helps to build a digital library of webpages
•It offers permanent and free access to
researchers, historians, scholars, and the
general public
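Looking up an archived copy can also be done programmatically. The sketch below targets the Wayback Machine availability endpoint at `archive.org/wayback/available`; the response shape it parses (`archived_snapshots` → `closest` → `url`) reflects that service as commonly documented and should be treated as an assumption, as should the helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str) -> str:
    """Build the availability query for one cited URL."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pick the archived copy's URL out of a parsed response, if there is one."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the service for the closest snapshot of url (needs network access)."""
    with urllib.request.urlopen(build_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Usage: `lookup("www.lisacademy.org")` returns the address of an archived copy, or `None` when the page was never captured.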
Wayback machine
This archive contains:
•330 billion web pages
•20 million books and texts
•4.5 million audio recordings
•4 million videos
•3 million images
•200,000 software programs
Percentage of recovered web sites in
Wayback machine
Author(s)                       Year   % of web citations recovered
Tajeddini et al.                2011   11.00  12.00
Sadat-Moosavi et al.            2012   17.00  12.00
Sampath Kumar and Vinay Kumar   2013   44.55
Sampath Kumar and Prithviraj    2014   58.23
Sampath Kumar et al.            2015   48.33
Vinay Kumar and Sampath Kumar   2017   58.81
Sife and Lwoga                  2018    6.30
Pandora
•It is a national web archive for the
preservation of Australia's online
publications.
•Established by the National Library of
Australia in 1996.
•It has been built in collaboration with
Australian state libraries and cultural
collecting organisations
While citing websites
•Citations to web content should include
full bibliographic information.
•Authors should test web sources for their
availability.
•Webmasters should not use lengthy URLs
while creating web sources.
Cont..
•Authors need to cite scholarly
information from authentic web
documents.
•They need to check the accessibility status of
URLs before citing them in their scholarly
works.
•If a cited URL is not accessible, the
author may use an Internet archive to
recover the web source.
Cont..
•It is suggested that editors or editorial staff of
journals check the URLs cited in the
articles submitted to them.
•The check should confirm that each URL is
accessible, and this should be one of the
criteria for acceptance of the article.
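An editorial check of this kind starts by collecting the URLs from a submitted reference list. The sketch below is one simple way to do that with a regular expression; the pattern is deliberately loose and the function name is hypothetical, so unusual URLs may need manual review.

```python
import re
from typing import List

# Matches http/https URLs up to whitespace or a closing quote/bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(reference_text: str) -> List[str]:
    """Return the distinct URLs found in reference_text, in order of appearance."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(reference_text):
        cleaned = match.rstrip(".,;")  # drop sentence punctuation stuck to the URL
        if cleaned not in urls:
            urls.append(cleaned)
    return urls


references = "See http://example.org/a.html. Also (https://example.com/b)."
print(extract_urls(references))
```

Each extracted URL can then be passed to a link checker before the article is accepted.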
Role of LIS professionals
•Need to conduct orientation programmes for
faculty, researchers and students on:
•Citing web sources in scholarly literature.
•Decay of web sources and its implications
on scholarly content
•Link checkers
•Internet Archives
•Recovery of decayed web sources
•Archiving the web sources in Web archives
Conclusion
•The Internet Archive plays a significant role in
archiving websites.
•However, it covers only a portion of the web.
•It does not frequently monitor changes in
the content of a web page.
•We need to save web pages to the archive
manually.
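Manual saving can be scripted for a list of cited pages. The sketch below assumes the Wayback Machine's "Save Page Now" address form, `https://web.archive.org/save/` followed by the page URL; that form, and the helper names, are assumptions to verify against the service before relying on them.

```python
import urllib.request

# Assumed "Save Page Now" prefix; the page URL is appended as-is.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_url_for(page_url: str) -> str:
    """Build the address that asks the archive to capture page_url."""
    return SAVE_PREFIX + page_url


def request_capture(page_url: str, timeout: float = 60.0) -> int:
    """Send the capture request and return the HTTP status (needs network access)."""
    with urllib.request.urlopen(save_url_for(page_url), timeout=timeout) as response:
        return response.status


print(save_url_for("http://www.lisacademy.org"))
```

Capturing a page at the time it is cited means a snapshot exists even if the original later decays.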
Feedback to
94483 20187
sampathkumar.info
