How to Find a Needle in the Haystack
Adrian Stevenson, Learning Technology Services, University of Manchester
Institutional Web Management Workshop 2005
Parallel Session, 4pm - 5.30pm, Wednesday 6th July 2005
Overview
- Introduction to cross searching / metasearch
- The Problem: why metasearch?
- JISC Information Environment
- Quick introduction to XML and Web Services
- Metasearch technologies: Z39.50, SRU/SRW, OAI
- Metasearch issues
- NISO Metasearch Initiative
Cross Searching
Cross searching has many names:
- Metasearch
- Distributed search
- Parallel search
- Federated search
- Broadcast search
- Cross-database search
The common theme is allowing search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at once.
The Problem
- Web users such as researchers or tutors frequently require information from a variety of different sources.
- Users must search many different service interfaces, each with a different look and feel, metadata and subject classifications.
- The results are almost always supplied in HTML, which makes them difficult to merge.
- Users search many services and portals, such as the RDN, zetoc and COPAC, as well as image resources, e-prints, learning objects, and external and internal resources.
- If a user wants to obtain a local copy of the range of search results, they often have to merge the results themselves, for example by creating a text file.
JISC Information Environment
- Cross searching is at the core of the JISC IE.
- JISC notes that considerable investment has been made to provide high-quality digital information resources.
- But students, lecturers and researchers are faced with a vast and sometimes bewildering range of sources of electronic information. Each source has its own name, interface, features and search facilities.
- Users may remain unaware of these sources or fail to discover their value for their own learning, teaching or research.
- A key challenge is therefore to achieve a managed, coherent and shared information environment that will overcome these obstacles.
JISC: Helping Users find digital information
- Being able to cross-search will considerably simplify users’ interactions with online resources. This should encourage take-up and greatly improve means of accessing these resources.
- Institutions will be able to incorporate these services within their own institutional online environments, presenting local content alongside nationally provided resources.
- A second aspect relates to making the Information Environment actually work. This requires the implementation of a range of commonly-agreed technical standards and protocols.
JISC IE Technical Architecture
“The JISC Information Environment technical architecture specifies a set of standards and protocols that support the delivery of integrated networked services that allow the end-user to discover, access, use and publish digital and physical resources”
Metasearch Technologies
Two main approaches:
- Real-time cross searching
  - Z39.50
  - Search and Retrieve URL / Web Service (SRU/SRW)
- Harvesting
  - Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
Metasearch Technologies
Other approaches:
- Hybrid
  - Combination of Z39.50, SRU/W, and OAI and ...
- Screen scraping
  - Parsing the HTML to find patterns or parts of content.
  - Screen scraping is an ad-hoc technique that depends on a consistent format for the data being scraped.
  - Regular expressions are used for screen scraping. Perl has strong support for regular expressions (grep).
  - Difficult, unreliable and laborious.
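The screen-scraping point can be sketched as follows, in Python rather than Perl. The HTML fragment, the class name and the pattern are all invented for illustration; the last line hints at why the technique is fragile, since a small markup change leaves the pattern matching nothing.

```python
import re

# Hypothetical results-page fragment; every real service uses different markup.
SAMPLE_HTML = """
<ul>
  <li class="hit"><a href="/record/1">Dinosaurs of the Jurassic</a> (1998)</li>
  <li class="hit"><a href="/record/2">A Field Guide to Fossils</a> (2003)</li>
</ul>
"""

# The pattern is tied to the exact markup above (assumed, not from any real site).
HIT_PATTERN = re.compile(
    r'<li class="hit"><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a> '
    r'\((?P<year>\d{4})\)</li>'
)


def scrape_hits(html):
    """Return one dict per result row that matches the pattern."""
    return [match.groupdict() for match in HIT_PATTERN.finditer(html)]


hits = scrape_hits(SAMPLE_HTML)
for hit in hits:
    print(hit["year"], hit["title"], hit["url"])

# A cosmetic change to the page silently breaks the scraper.
broken = scrape_hits(SAMPLE_HTML.replace('class="hit"', 'class="result"'))
print(len(broken))
```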
Z39.50
- ANSI/NISO Z39.50-2003, Information Retrieval: Application Service Definition & Protocol Specification
- The National Information Standards Organization (NISO) is an American National Standards Institute (ANSI) accredited standards developer that serves the library, information, and publishing communities.
Z39.50
- Z39.50 is designed to enable communication between computers, typically those used to manage library catalogues.
- A portal can send a real-time query to a number of Z39.50-enabled content providers, and a result set is returned to the user.
- The AHDS Gateway, physically based in London, uses Z39.50 to query five different databases containing information on archaeology (York), history (Colchester), the performing arts (Glasgow), the visual arts (Newcastle), and textual studies (Oxford).
- These databases are driven by different database management software and run on a variety of hardware platforms. Z39.50 enables searches across the five sites.
- Library OPACs and desktop applications such as EndNote can also be used to search Z39.50 targets.
AHDS
Z39.50
- Z39.50 employs a client/server model.
- One computer, the client or, in Z39.50 terms, the ‘Origin’, submits a request to another computer, the server or ‘Target’, which then services the request and returns an answer.
- Queries can be sent to multiple databases simultaneously to cross search.
- Records can be returned in a number of formats or ‘syntaxes’ as requested by the client. These typically include:
  - MARC (Machine Readable Cataloging)
  - SUTRS (Simple Unstructured Text Record Syntax)
  - Raw ASCII text file
  - XML (eXtensible Markup Language)
What is XML?
Some possible definitions?
- a technology for the management, display and organisation of data
- a programming language
- a markup language
- a markup language used to describe the structure of data
- not really a language
- a standard for creating languages that meet the XML criteria
XML: elements
<language>English</language>
Here <language> and </language> are the tags, and English is the content.
XML must be well formed
- A root element is required: <ead> ... all your tags and content ... </ead>
- Closing tags are required
- Tags must be properly nested
- Case matters
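As a rough illustration of these rules, the sketch below feeds a few tiny documents to the parser in Python's standard library, which rejects anything that is not well formed. The sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative snippets, one per rule on the slide.
examples = {
    "well formed": "<ead><title>Papers</title></ead>",
    "missing closing tag": "<ead><title>Papers</ead>",
    "badly nested": "<ead><a><b></a></b></ead>",
    "case mismatch": "<ead><Title>Papers</title></ead>",
    "two root elements": "<ead></ead><ead></ead>",
}

results = {label: is_well_formed(text) for label, text in examples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first snippet passes; each of the others breaks one of the rules listed above.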
Valid XML
- Valid XML provides consistency and facilitates the exchange of data.
- XML must conform to a Document Type Definition (DTD) or Schema to be valid.
- Schemas and DTDs specify the elements and attributes and define how they can be used:
  - Sequence of elements
  - Maximum and minimum values
- People can agree to use a common Schema for interchanging data.
  - e-learning: IEEE Learning Object Metadata Schema (LOM)
Some Valid XML - EAD (Encoded Archival Description)
    <archdesc level="fonds">
      <did>
        <repository>John Rylands University Library of Manchester</repository>
        <unitid countrycode="GB" repositorycode="0133">GB 0133 NCN</unitid>
        <unittitle>Papers of Norman Nicholson</unittitle>
        <unitdate normal="1899-1987">1899-1987</unitdate>
        <physdesc>
          <extent>0.44 cu.m; 1,201 items</extent>
        </physdesc>
        <langmaterial>
          <language langcode="eng">English</language>
        </langmaterial>
        <origination>Nicholson, Norman Cornthwaite, 1914-1987</origination>
        <note>Created by the John Rylands Library archivist</note>
      </did>
      …
    </archdesc>
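To show how a program might pull fields out of a record like this, here is a minimal sketch using Python's standard library. The fragment is a shortened copy of the EAD example with the elided part left out, and the keys of the resulting dictionary are names chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the EAD example; the elided section is simply omitted.
EAD_FRAGMENT = """
<archdesc level="fonds">
  <did>
    <repository>John Rylands University Library of Manchester</repository>
    <unitid countrycode="GB" repositorycode="0133">GB 0133 NCN</unitid>
    <unittitle>Papers of Norman Nicholson</unittitle>
    <unitdate normal="1899-1987">1899-1987</unitdate>
  </did>
</archdesc>
"""

root = ET.fromstring(EAD_FRAGMENT)
did = root.find("did")

# Element text comes from findtext(); attribute values come from get().
record = {
    "level": root.get("level"),
    "title": did.findtext("unittitle"),
    "reference": did.findtext("unitid"),
    "country": did.find("unitid").get("countrycode"),
    "dates": did.find("unitdate").get("normal"),
}
print(record)
```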
Something to remember about XML
- XML does not do anything itself. It is pure information wrapped in XML tags.
- You must use other means to send, receive or display the data.
- XML is used by XML technologies to:
  - display here like this
  - display there like that
  - extract this data for this purpose
  - extract that data for that purpose
Why Use XML?
- Because everyone else is!
- International standard, supported by the W3C
- XML is open, licence free and platform neutral
- XML is human and machine readable
- XML documents are text documents
Why Use XML?
- Separation of content and presentation
- With proprietary systems, content is inextricably bound up with format.
- XML does not determine the presentation of the data.
- You can use CSS (stylesheets) or XSLT (Extensible Stylesheet Language Transformations) to present XML data.
- The flexibility of XML enables the presentation of merged search results to the user.
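The separation of content and presentation can be sketched without XSLT: the same XML is loaded once and then rendered twice. This is a stand-in in plain Python, not an XSLT example; the element names and the second record are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented result set; only the first title and date come from the EAD example.
RECORDS_XML = """
<results>
  <record><title>Papers of Norman Nicholson</title><date>1899-1987</date></record>
  <record><title>Example second record</title><date>2005</date></record>
</results>
"""


def load_records(xml_text):
    """Extract (title, date) pairs; no presentation decisions are made here."""
    root = ET.fromstring(xml_text)
    return [(r.findtext("title"), r.findtext("date")) for r in root.findall("record")]


def as_text(records):
    """Render the records as plain text, one per line."""
    return "\n".join(f"{title} ({date})" for title, date in records)


def as_html(records):
    """Render the same records as an HTML list."""
    items = "".join(
        f"<li><em>{escape(title)}</em>, {escape(date)}</li>" for title, date in records
    )
    return f"<ul>{items}</ul>"


records = load_records(RECORDS_XML)
print(as_text(records))
print(as_html(records))
```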
Web Services
- A Web Service is an online application that can be accessed by other applications in machine-to-machine (m2m) interactions.
- Web services use XML to achieve this interoperability:
  - SOAP
  - WSDL: Web Services Description Language
  - UDDI: Universal Description, Discovery and Integration
What is a Web Service?
A Web Service is a process of some kind, some functionality, for example:
- A search and retrieve procedure
- A conversion process
  - Fahrenheit to Centigrade
  - MARC record to Dublin Core record
  - LCSH subject headings to Dewey Decimal Classification numbers
Publicly available Web Services
- Google’s ‘similar pages’
- Amazon’s book connections: ‘customers who bought this also bought this’
- These services can be used in other applications.
- The Xmethods website has a list of some experimental services: http://www.xmethods.net
Creating a Web Service
- Web services can be built for existing applications, or created from scratch.
- A key element of a Web Service is an XML file with details of how to interact with the service: the WSDL (Web Services Description Language) file.
Zetoc WSDL extract
http://zetoc.mimas.ac.uk/soap/zetocsoap.wsdl
    …
    <complexType name="JournalRequest">
      <sequence>
        <element ref="srw:startRecord" minOccurs="1" maxOccurs="1"/>
        <element ref="bath:any" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="dc:title" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="dc:creator" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="oujnl:jtitle" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="oujnl:issn" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="oujnl:volume" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="oujnl:issue" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="oujnl:spage" minOccurs="1" maxOccurs="1" nillable="true"/>
        <element ref="dcterms:issued" minOccurs="1" maxOccurs="1" nillable="true"/>
      </sequence>
    </complexType>
Interacting with a Web Service
- Once the client application knows how to interact with the service, the client and service communicate using messages encoded in XML.
- These messages are frequently expressed in SOAP.
- These messages are generally passed over HTTP (but they don’t have to be).
SOAP
- A way of packaging XML information and passing it from one system to another
- Allows one system to make requests of another and to process the reply
- Systems can be completely different, running on different software and hardware
SOAP request
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <zetoc:JournalRequest>
          <dc:creator>apps</dc:creator>
          <oujnl:title>materialia</oujnl:title>
          <oujnl:issn>1359-6462</oujnl:issn>
          <oujnl:volume>48</oujnl:volume>
          …
        </zetoc:JournalRequest>
      </soap:Body>
    </soap:Envelope>
SOAP response
    HTTP/1.1 200 OK
    Content-Type: text/xml

    <soap:Envelope>
      <soap:Body>
        <zetoc:IdentifierSearchResponse>
          <srw:numberOfRecords>1</srw:numberOfRecords>
          <dc:identifier>RN125218404</dc:identifier>
          <zetoc:type>J</zetoc:type>
          <dc:title>Phase compositions in magnesium-rare earth alloys containing yttrium, gadolinium or dysprosium</dc:title>
          …
        </zetoc:IdentifierSearchResponse>
      </soap:Body>
    </soap:Envelope>
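A SOAP message is ordinary XML, so it can be built and read with any XML library. The sketch below wraps a few request fields in an envelope and then unpacks them again. The SOAP and Dublin Core namespace URIs are the standard ones; the zetoc and oujnl URIs are placeholders, because the slides do not give the real ones, and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Placeholder namespace URIs (assumptions): the real ones are not shown in the slides.
ZETOC_NS = "http://example.org/zetoc"
OUJNL_NS = "http://example.org/oujnl"

ET.register_namespace("soap", SOAP_NS)
ET.register_namespace("dc", DC_NS)


def build_journal_request(creator, issn, volume):
    """Wrap a few search fields in a SOAP envelope and return it as a string."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{ZETOC_NS}}}JournalRequest")
    ET.SubElement(request, f"{{{DC_NS}}}creator").text = creator
    ET.SubElement(request, f"{{{OUJNL_NS}}}issn").text = issn
    ET.SubElement(request, f"{{{OUJNL_NS}}}volume").text = volume
    return ET.tostring(envelope, encoding="unicode")


def read_body_fields(envelope_xml):
    """Return the child elements of the first payload in the SOAP body as a dict."""
    root = ET.fromstring(envelope_xml)
    payload = root.find(f"{{{SOAP_NS}}}Body")[0]
    # Tags look like "{namespace}local"; keep only the local name.
    return {child.tag.split("}")[1]: child.text for child in payload}


message = build_journal_request("apps", "1359-6462", "48")
fields = read_body_fields(message)
print(message)
print(fields)
```

In practice a SOAP toolkit would generate this from the WSDL file; building it by hand is only meant to show what travels over the wire.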
To recap …
- SOAP is a standard used for wrapping XML messages.
- The XML that is sent and returned within the SOAP wrapper is determined by the WSDL file for any particular Web Service.
- This is all done on a machine-to-machine level: you should never have to see a SOAP message.
- However, we can demonstrate with the XML SPY editor so we can see the SOAP messages [demo].
Search Retrieve URL / Web Service (SRU/SRW)
- Takes the core of Z39.50 and re-implements it as a Web Service.
- SRU and SRW are XML-based protocols designed to be low-barrier-to-entry solutions for performing searches and information retrieval operations across the internet.
- The protocol can be carried in two ways:
  - via SOAP: SRW (Search/Retrieve Web Service)
  - as parameters in a URL: SRU (Search/Retrieve by URL)
- The primary function of SRU/SRW is to allow a user to search a remote database of records.
- This is done via the searchRetrieve operation: the client sends a searchRetrieveRequest and the server responds with a searchRetrieveResponse.
Example SRW request
- The most important part is the ‘query’. It contains a Common Query Language (CQL) string.
- The request contains other parameters; all of these are optional except for ‘version’.
Example SRW response Response must contain ‘version’ and ‘number of records’
Some SRU Requests
- SRU requests are URLs with a query string.
- ‘Explain’ request, which describes the database/index and functionality: http://z3950.loc.gov:7090/voyager
- A simple search for the term "dinosaur": http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur
- And the first of these records: http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur&maximumRecords=1
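Because an SRU request is just a URL, building one is a matter of encoding the parameters. The helper below is a minimal sketch: the function name is invented, and the second query assumes the target supports a `dc.title` index, which the slides do not confirm. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url, cql_query, maximum_records=None, version="1.1"):
    """Build a searchRetrieve URL; 'version', 'operation' and 'query' are always sent."""
    params = {"version": version, "operation": "searchRetrieve", "query": cql_query}
    if maximum_records is not None:
        params["maximumRecords"] = str(maximum_records)
    return f"{base_url}?{urlencode(params)}"


BASE = "http://z3950.loc.gov:7090/voyager"

simple_url = sru_search_url(BASE, "dinosaur", maximum_records=1)
print(simple_url)

# A CQL query with spaces and quotes must be percent-encoded in the URL.
fielded_url = sru_search_url(BASE, 'dc.title = "dinosaur"')
decoded = parse_qs(urlsplit(fielded_url).query)
print(fielded_url)
print(decoded["query"][0])
```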
Open Archives Initiative (OAI)
- The Open Archives Initiative (OAI) provides a mechanism for sharing metadata records based on HTTP and XML.
- Enables metadata records about resources to be ‘harvested’ from multiple distributed services, typically into a central database (which itself may be a Z39.50 target).
- Records are harvested periodically, e.g. once a day or once an hour.
- Generally considered to be an elegant, simple and efficient protocol.
- 6 request types or ‘verbs’: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords and ListSets.
- JORUM Learning Object Repository Service OAI interface at: http://repository.jorum.ac.uk/intralibrary/IntraLibrary-OAI?verb=identify
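An OAI-PMH request is a base URL plus a `verb` parameter and any arguments that verb takes. The sketch below builds such URLs and rejects anything outside the six verbs; the function name and the `example.org` base URL are hypothetical, and nothing is fetched.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs; the protocol treats them as case-sensitive.
OAI_VERBS = {
    "GetRecord",
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL, refusing verbs the protocol does not define."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb!r}")
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


BASE = "http://example.org/oai"  # hypothetical repository address

identify_url = oai_request_url(BASE, "Identify")
harvest_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(harvest_url)
```

A harvester would typically call `ListRecords` on a schedule and store the returned Dublin Core records in its own database.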
Real-Time Cross Searching vs. Harvesting
- Delays occur with real-time cross searching.
- The response time for searches sent to multiple search targets tends to be limited by the worst performing target or intervening network delays.
- It is very difficult to build flexible browse interfaces based on a distributed set of gateway databases.
- OAI harvesting is periodic, so search results may not be accurate and up to date.
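The effect of a slow target on real-time cross searching can be simulated with threads. In this sketch the targets, delays and records are all invented, and `time.sleep` stands in for network and server time; a timeout lets the portal return what it has rather than wait for the slowest target.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Simulated targets: name -> (delay in seconds, canned results). All invented.
TARGETS = {
    "fast": (0.01, ["record A"]),
    "medium": (0.05, ["record B", "record C"]),
    "slow": (1.0, ["record D"]),
}


def search_target(name):
    """Pretend to query one target; the sleep stands in for its response time."""
    delay, records = TARGETS[name]
    time.sleep(delay)
    return records


def cross_search(names, timeout):
    """Query all targets in parallel and drop any that miss the timeout."""
    pool = ThreadPoolExecutor(max_workers=len(names))
    futures = {pool.submit(search_target, name): name for name in names}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on the stragglers
    merged = sorted(record for future in done for record in future.result())
    skipped = sorted(futures[future] for future in not_done)
    return merged, skipped


merged, skipped = cross_search(list(TARGETS), timeout=0.3)
print("merged:", merged)
print("skipped:", skipped)
```

Without the timeout, the whole search would take as long as the slowest target, which is the trade-off the slide describes.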
OAI - Connect Portal
- ‘Connect’ Learning & Teaching Portal: http://www.connect.ac.uk
- Connect is an HE Academy project (the HE Academy used to be the LTSN, the Learning and Teaching Support Network).
- Connect harvests records from HE Academy subject centres around the UK.
- Records are harvested by a server at Rutherford Appleton Labs.
Connect Portal
Metasearch issues: Metadata
- Format
  - As users are searching across domains, it makes sense to use a cross-domain metadata schema. Dublin Core is a good contender for this and is required for use of OAI-PMH.
  - However, domains will use their own metadata schemas, such as the IEEE LOM for learning objects.
  - Mappings are required to enable cross searching, but some of the semantic richness of the original resource may be lost.
- Common Meaning: semantic issues
  - There needs to be agreement amongst content providers about the meaning of terms such as ‘title’, ‘article’, ‘research paper’, ‘learning object’.
  - There will inevitably be difficulties in reaching agreement about the meaning of metadata elements, as they are used differently in different contexts.
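The loss of richness in a mapping can be made concrete with a toy crosswalk. The keys below are simplified, LOM-style names chosen for illustration, not the official IEEE LOM to Dublin Core mapping, and the sample record is invented. Anything without a Dublin Core counterpart in the table is dropped.

```python
# Illustrative crosswalk from simplified LOM-style keys to simple Dublin Core.
# This is an assumption for the example, not the official mapping.
LOM_TO_DC = {
    "general.title": "title",
    "general.description": "description",
    "general.language": "language",
    "lifeCycle.contribute.author": "creator",
}


def lom_to_dc(lom_record):
    """Map a flat LOM-style record to Dublin Core and report what was lost."""
    dc_record, dropped = {}, []
    for key, value in lom_record.items():
        if key in LOM_TO_DC:
            dc_record[LOM_TO_DC[key]] = value
        else:
            dropped.append(key)
    return dc_record, sorted(dropped)


# Invented learning-object record.
learning_object = {
    "general.title": "Introduction to XML",
    "general.language": "en",
    "lifeCycle.contribute.author": "A. Tutor",
    "educational.typicalAgeRange": "18-",
    "educational.interactivityLevel": "high",
}

dc_record, dropped = lom_to_dc(learning_object)
print(dc_record)
print("lost in mapping:", dropped)
```

The educational fields have no home in the target schema here, so a cross search over the mapped records could not filter on them.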
Metasearch issues: Metadata
- Political
  - The decision to make resources more widely available has implications for the organisations concerned:
    - It may be seen as a loss of control or ownership.
    - Staff may not possess the skills required to support more complex systems.
- Legal
  - The legal requirements of Freedom of Information legislation in several countries are a significant factor in the dissemination of public sector resources.
  - The Intellectual Property Rights (IPR) of those providing sources may need to be protected.
Why not just use Google?
- Its content is limited to the visible Web.
- Limited search functionality
  - Can’t search by specific criteria (metadata) such as ‘publication date’, ‘author’, ‘educational level’
- Little quality control
- Google Scholar?
  - Still a web crawl
  - Evidence that it gives unreliable results
NISO Metasearch Initiative
- The NISO MetaSearch Initiative is trying to bring the area of metasearching together around a NISO standard.
- “Best Practices for Metasearch” document due out June 15th 2005
- http://www.niso.org/committees/MetaSearch-info.html
Contact
Adrian Stevenson
Learning Technology Services, Internet Services, University of Manchester
adrian.stevenson [at] manchester.ac.uk
Tel: +44 (0) 161 306 3109

How to Find a Needle in the Haystack

  • 1. How to Find a Needle in the Haystack Adrian Stevenson Learning Technology Services University of Manchester Institutional Web Management Workshop 2005 Parallel Session 4pm - 5.30pm, Wednesday 6 th July 2005
  • 2. Overview Introduction to Cross searching / metasearch The Problem – why metasearch? JISC Information Environment Quick introduction to XML and Web Services Metasearch Technologies Z39.50, SRU/SRW, OAI Metasearch issues NISO Metasearch Initiative
  • 3. Cross Searching Cross searching has many names: Metasearch Distributed search Parallel search Federated search Broadcast search Cross-database search Common theme of allowing search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at once
  • 4. The Problem Web users such as researchers or tutors frequently require information from a variety of different sources User required to search many different service interfaces, each with a different look and feel, metadata and subject classifications. The results are almost always supplied in HTML, which makes them difficult to merge. Users search many services and portals such as the RDN, zetoc and COPAC, image resources, e-prints, learning objects, external and internal resources. If a user wants to obtain a local copy of the range of search results, they often have to merge the results themselves, for example by creating a text file.
  • 5. JISC Information Environment Cross searching is at the core of the JISC IE JISC notes that considerable investment has been made to provide high-quality digital information resources But students, lecturers and researchers are faced with a vast and sometimes bewildering range of sources of electronic information. Each source has its own name, interface, features and search facilities. Users remain unaware of their existence or fail to discover their value for their own learning, teaching or research. A key challenge is therefore to achieve a managed, coherent and shared information environment that will overcome these obstacles
  • 6. JISC: Helping Users find digital information Being able to cross-search will considerably simplify users’ interactions with online resources. This should encourage take-up and greatly improve means of accessing these resources. Institutions will be able to incorporate these services within their own institutional online environments, presenting local content alongside nationally provided resources. A second aspect relates to making the Information Environment actually work. Making the Information Environment work requires the implementation of a range of commonly-agreed technical standards and protocols
  • 7. JISC IE Technical Architecture “The JISC Information Environment technical architecture specifies a set of standards and protocols that support the delivery of integrated networked services that allow the end-user to discover , access , use and publish digital and physical resources”
  • 8. Metasearch Technologies Two main approaches: Real-time cross searching Z39.50 Search and Retrieve URL / Web Service - SRU/SRW Harvesting Open Archives Initiative Protocol for Metadata Harvesting – OAI-PMH
  • 9. Metasearch Technologies Other approaches: Hybrid Combination of Z39.50, SRU/W, and OAI and .. Screen scraping parsing the HTML to find patterns or parts of content. Screen scraping is an ad-hoc technique that is dependent on a consistent format for the data being scraped Regular expressions used for screen scraping. Perl has strong support for regular expressions – grep Difficult, unreliable and laborious
  • 10. Z39.50 ANSI/NISO Z39.50 - 2003 Information Retrieval : Application Service Definition & Protocol Specification The National Information Standards Organization (NISO) is an American National Standards Institute (ANSI) accredited standards developer that serves the library, information, and publishing communities
  • 11. Z39.50 Z39.50 is designed to enable communication between computers, typically those used to manage library catalogues A portal can send a real-time query to a number of Z39.50 enabled content providers and a results set is returned to the user The AHDS Gateway , physically based in London, uses Z39.50 to query five different databases containing information on archaeology (York), history (Colchester), the performing arts (Glasgow), the visual arts (Newcastle), and textual studies (Oxford) They are driven by different database management software and run on a variety of hardware platforms. Z39.50 enables searches across the five sites. Library OPAC and desktop applications such as EndNote can also be used to search Z targets
  • 12. AHDS
  • 13. Z39.50 Z39.50 employs a client/server model One computer, the client or, in Z39.50 terms, the ‘Origin’, submits a request to another computer, the server or ‘Target’, which then services the request and returns an answer Queries can be sent to multiple databases simultaneously to cross search Records can be returned in a number of formats or ‘syntaxes’ as requested by the client. These typically include: MARC (Machine Readable Cataloging) SUTRS (Simple Unstructured Text Record Syntax) Raw ASCII text file XML (eXtensible Markup Language)
  • 14. What is XML? Some possible definitions: a technology for the management, display and organisation of data; a programming language; a markup language; a markup language used to describe the structure of data; not really a language; a standard for creating languages that meet the XML criteria
  • 15. XML: elements <language>English</language> [Diagram labels: <tag>, content, </tag>]
  • 17. XML must be well formed a root element is required <ead> … ..all your tags and content… </ead> closing tags are required Tags must be properly nested Case matters
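
The well-formedness rules above can be checked mechanically. The following is a small sketch using Python's standard `xml.etree.ElementTree` parser; the sample strings and the `is_well_formed` helper are made up for illustration. Note that this checks well-formedness only, not validity against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule on the slide; only the first should be accepted.
samples = {
    "well formed": "<ead><language>English</language></ead>",
    "missing closing tag": "<ead><language>English</ead>",
    "badly nested tags": "<ead><a><b></a></b></ead>",
    "case mismatch": "<ead><Language>English</language></ead>",
    "no single root element": "<language>English</language><language>French</language>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```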
  • 18. Valid XML Valid XML provides consistency and facilitates the exchange of data XML must conform to a Document Type Definition (DTD) or Schema to be valid Schemas and DTDs specify the elements and attributes and define how they can be used: Sequence of elements Maximum and minimum values People can agree to use a common Schema for interchanging data e-learning: IEEE Learning Object Metadata Schema (LOM)
  • 19. Some Valid XML - EAD (Encoded Archival Description) <archdesc level="fonds"> <did> <repository>John Rylands University Library of Manchester</repository> <unitid countrycode="GB" repositorycode="0133">GB 0133 NCN</unitid> <unittitle>Papers of Norman Nicholson</unittitle> <unitdate normal="1899-1987">1899-1987</unitdate> <physdesc> <extent>0.44 cu.m; 1,201 items</extent> </physdesc> <langmaterial> <language langcode="eng">English</language> </langmaterial> <origination>Nicholson, Norman Cornthwaite, 1914-1987</origination> <note>Created by the John Rylands Library archivist</note> </did> … ..</archdesc>
  • 20. Something to remember about XML XML does not do anything itself. It is pure information wrapped in XML tags. You must use other means to send, receive or display the data [Diagram: XML is used by XML technologies to: display here like this; display there like that; extract this data for this purpose; extract that data for that purpose]
  • 21. Why Use XML? Because everyone else is! International standard, supported by the W3C XML is open, licence free and platform neutral XML is human and machine readable XML documents are text documents
  • 22. Why Use XML? Separation of content and presentation With proprietary systems content is inextricably bound up with format XML does not determine the presentation of the data You can use CSS (stylesheets) or XSLT (Extensible Stylesheet Language Transformations) to present XML data The flexibility of XML enables the presentation of merged search results to the user.
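
The separation of content and presentation can be sketched without CSS or XSLT. The example below is an illustration only: it takes a few elements from the EAD record shown earlier and renders the same data in two ways using Python's standard library. The two rendering functions are hypothetical, not part of any EAD toolchain.

```python
import xml.etree.ElementTree as ET
from html import escape

# A cut-down fragment of the EAD example from the earlier slide.
EAD_FRAGMENT = """<did>
  <repository>John Rylands University Library of Manchester</repository>
  <unittitle>Papers of Norman Nicholson</unittitle>
  <unitdate normal="1899-1987">1899-1987</unitdate>
</did>"""

# Extract the content once, independent of how it will be shown.
did = ET.fromstring(EAD_FRAGMENT)
record = {child.tag: child.text for child in did}


def as_plain_text(rec):
    """One presentation: a single citation-style line."""
    return f"{rec['unittitle']} ({rec['unitdate']}), {rec['repository']}"


def as_html(rec):
    """Another presentation of the same data: an HTML paragraph."""
    return f"<p><em>{escape(rec['unittitle'])}</em>, {escape(rec['unitdate'])}</p>"


print(as_plain_text(record))
print(as_html(record))
```

Because the XML carries no layout, records from several sources can be merged first and styled once afterwards.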
  • 23. Web Services A Web Service is an online application that can be accessed by other applications in machine to machine (m2m) interactions Web services use XML to achieve this interoperability SOAP WSDL: Web Services Description Language UDDI: Universal Description, Discovery and Integration
  • 24. What is a Web Service A Web Service is a process of some kind, some functionality, for example: A search and retrieve procedure A conversion process Fahrenheit to Centigrade MARC record to Dublin Core record LCSH subject headings to Dewey Decimal Classification numbers
  • 25. Publicly available Web Services Google’s ‘similar pages’ Amazon’s book connections: ‘customers who bought this also bought this’ These services can be used in other applications Xmethods website has a list of some experimental services – http://www.xmethods.net
  • 26. Creating a Web Service Web services can be built for existing applications, or created from scratch A key element of a Web Service is an XML file with details of how to interact with the service – the WSDL (Web Services Description Language) file
  • 27. Zetoc WSDL extract http://zetoc.mimas.ac.uk/soap/zetocsoap.wsdl … <complexType name="JournalRequest"> <sequence> <element ref="srw:startRecord" minOccurs="1" maxOccurs="1"/> <element ref="bath:any" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="dc:title" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="dc:creator" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="oujnl:jtitle" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="oujnl:issn" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="oujnl:volume" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="oujnl:issue" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="oujnl:spage" minOccurs="1" maxOccurs="1" nillable="true"/> <element ref="dcterms:issued" minOccurs="1" maxOccurs="1" nillable="true"/> </sequence> </complexType>
  • 28. Interacting with a Web Service Once the client application knows how to interact with the service, the client and service communicate using messages encoded in XML These messages are frequently expressed in SOAP These messages are generally passed over HTTP (but they don’t have to be)
  • 29. SOAP A way of packaging XML information and passing it from one system to another Allows one system to make requests of another and to process the reply Systems can be completely different, running on different software, hardware
  • 30. SOAP request <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"> <soap:Body> <zetoc:JournalRequest> <dc:creator>apps</dc:creator> <oujnl:title>materialia</oujnl:title> <oujnl:issn>1359-6462</oujnl:issn> <oujnl:volume>48</oujnl:volume> … </zetoc:JournalRequest> </soap:Body> </soap:Envelope>
  • 31. SOAP response HTTP/1.1 200 OK Content-Type: text/xml <soap:Envelope> <soap:Body> <zetoc:IdentifierSearchResponse> <srw:numberOfRecords>1</srw:numberOfRecords> <dc:identifier>RN125218404</dc:identifier> <zetoc:type>J</zetoc:type> <dc:title>Phase compositions in magnesium-rare earth alloys containing yttrium, gadolinium or dysprosium</dc:title> … </zetoc:IdentifierSearchResponse> </soap:Body> </soap:Envelope>
  • 32. To recap … SOAP is a standard used for wrapping XML messages The XML that is sent and returned within the SOAP wrapper is determined by the WSDL file for any particular Web Service This is all done on a machine-to-machine level – you should never have to see a SOAP message However we can demonstrate with XML SPY editor so we can see the SOAP messages [demo]
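
As a rough sketch of what "wrapping XML in SOAP" means, the Python below builds an envelope shaped like the request slide. The SOAP and Dublin Core namespace URIs are the standard ones; the `zetoc` and `oujnl` namespace URIs are placeholders, since the slides do not give them, and the `build_journal_request` helper is hypothetical. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Placeholder URIs: the real zetoc and oujnl namespaces are not shown on the slides.
ZETOC_NS = "urn:example:zetoc"
OUJNL_NS = "urn:example:oujnl"

# Register readable prefixes so the serialised output resembles the slide.
for prefix, uri in [("soap", SOAP_NS), ("dc", DC_NS),
                    ("zetoc", ZETOC_NS), ("oujnl", OUJNL_NS)]:
    ET.register_namespace(prefix, uri)


def build_journal_request(creator, issn, volume):
    """Wrap a small journal query in a SOAP Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{ZETOC_NS}}}JournalRequest")
    ET.SubElement(request, f"{{{DC_NS}}}creator").text = creator
    ET.SubElement(request, f"{{{OUJNL_NS}}}issn").text = issn
    ET.SubElement(request, f"{{{OUJNL_NS}}}volume").text = volume
    return envelope


envelope = build_journal_request("apps", "1359-6462", "48")
xml_text = ET.tostring(envelope, encoding="unicode")
print(xml_text)
```

In practice a SOAP toolkit would generate this from the WSDL file, which is why a developer rarely needs to write or read the envelope by hand.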
  • 33. Search Retrieve URL / Web Service (SRU/SRW) Takes the core of Z39.50 and re-implements it as a Web Service SRU and SRW are XML-based protocols designed to be low-barrier-to-entry solutions for performing searches and information retrieval operations across the internet. The protocol can be carried in two ways: via SOAP (SRW – Search/Retrieve Web Service) or as parameters in a URL (SRU – Search/Retrieve by URL) The primary function of SRU/SRW is to allow a user to search a remote database of records. This is done via the searchRetrieve operation: the client sends a searchRetrieveRequest and the server responds with a searchRetrieveResponse
  • 34. Example SRW request The most important part is the ‘query’, which contains a Common Query Language (CQL) string. The request contains other parameters; all of these are optional except for ‘version’
  • 35. Example SRW response Response must contain ‘version’ and ‘number of records’
  • 36. Some SRU Requests SRU requests are URLs with a query string ‘Explain’ request: http://z3950.loc.gov:7090/voyager Describes the database/index and functionality A simple search for the term "dinosaur": http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur And the first of these records: http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur&maximumRecords=1
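
Since an SRU request is just a URL, it can be assembled with ordinary URL tooling. The sketch below uses Python's `urllib.parse` to rebuild the searchRetrieve URLs from the slide; the `sru_search_url` helper is hypothetical, and no request is actually sent, so whether the Library of Congress endpoint still answers is not assumed.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL taken from the slide above.
BASE_URL = "http://z3950.loc.gov:7090/voyager"


def sru_search_url(base_url, query, maximum_records=None, version="1.1"):
    """Build an SRU searchRetrieve URL; the CQL query is URL-encoded."""
    params = {"version": version, "operation": "searchRetrieve", "query": query}
    if maximum_records is not None:
        params["maximumRecords"] = str(maximum_records)
    return f"{base_url}?{urlencode(params)}"


simple = sru_search_url(BASE_URL, "dinosaur")
first_only = sru_search_url(BASE_URL, "dinosaur", maximum_records=1)
print(simple)
print(first_only)

# A CQL string with spaces and quotes must be escaped before it goes in a URL.
phrase = sru_search_url(BASE_URL, '"dinosaur bones"')
print(phrase)
```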
  • 37. Open Archives Initiative (OAI) The Open Archives Initiative (OAI) provides a mechanism for sharing metadata records based on HTTP and XML Enables metadata records about resources to be ‘harvested’ from multiple distributed services, typically into a central database (which itself may be a Z39.50 target) Records are harvested periodically, e.g. once a day or once an hour Generally considered to be an elegant, simple and efficient protocol 6 request types or ‘verbs’ - GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords and ListSets. JORUM Learning Object Repository Service OAI interface at: http://repository.jorum.ac.uk/intralibrary/IntraLibrary-OAI?verb=identify
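
A harvester's two basic steps, forming a verb request and reading the returned records, can be sketched as follows. The namespace URIs are the standard OAI-PMH 2.0 and Dublin Core ones; the base URL, the sample response and both helper functions are invented for illustration, and a real harvester would also need to handle resumption tokens and errors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for one of the six verbs."""
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


# Hypothetical ListRecords response with a single Dublin Core record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example learning object</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Pull the identifier and title out of each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
        })
    return records


url = oai_request_url("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc")
harvested = parse_records(SAMPLE_RESPONSE)
print(url)
print(harvested)
```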
  • 38. Real Time Cross Searching VS. Harvesting Delays occur with real-time cross-searching The response time for searches sent to multiple search targets tends to be limited by the worst-performing target or intervening network delays. Very difficult to build flexible browse interfaces based on a distributed set of gateway databases. OAI harvesting is periodic, so search results may not be accurate or up to date
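
The point about the worst-performing target can be demonstrated with a toy simulation. In the sketch below the three targets and their delays are invented, and `time.sleep` stands in for real Z39.50 or SRU calls; even with the searches running in parallel, the merged result is only ready when the slowest target has replied.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical targets with simulated response times in seconds.
TARGET_DELAYS = {"catalogue-a": 0.05, "catalogue-b": 0.10, "catalogue-c": 0.30}


def search_target(name, query):
    """Stand-in for a remote search: wait, then return one fake record."""
    time.sleep(TARGET_DELAYS[name])
    return name, [f"{query} record from {name}"]


def cross_search(query):
    """Query every target in parallel and collect results keyed by target."""
    with ThreadPoolExecutor(max_workers=len(TARGET_DELAYS)) as pool:
        futures = [pool.submit(search_target, name, query) for name in TARGET_DELAYS]
        return dict(future.result() for future in futures)


start = time.perf_counter()
results = cross_search("dinosaur")
elapsed = time.perf_counter() - start

# Elapsed time tracks the slowest target, not the fastest or the average.
print(f"{len(results)} targets answered in {elapsed:.2f}s")
```

A production portal would normally add a per-target timeout so that one slow service cannot hold up the whole results page.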
  • 39. OAI - Connect Portal ‘Connect’ Learning & Teaching Portal http://www.connect.ac.uk Connect is a HE Academy project (used to be the LTSN – Learning and Teaching Support Network) Connect harvests records from HE Academy subject centres around the UK Records are harvested by a server at Rutherford Appleton Labs
  • 45. Metasearch issues: Metadata Format As users are searching cross-domain, it makes sense to use a cross-domain metadata schema. Dublin Core is a good contender for this and is required for use of OAI-PMH. However, domains will use their own metadata schemas, such as the IEEE-LOM for learning objects. Mappings are required to enable cross searching, but some of the semantic richness of the original resource may be lost. Common Meaning – Semantic issues There needs to be agreement amongst content providers about the meaning of terms such as ‘title’, ‘article’, ‘research paper’, ‘learning object’ There will inevitably be difficulties in reaching agreement about the meaning of metadata elements, as they are used differently in different contexts.
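
The loss of semantic richness in a mapping can be made concrete with a toy crosswalk. The field names below are simplified, flattened stand-ins for IEEE LOM elements, and the mapping table is illustrative rather than any official LOM to Dublin Core crosswalk.

```python
# Illustrative, simplified crosswalk; not an official LOM-to-Dublin-Core mapping.
LOM_TO_DC = {
    "general.title": "title",
    "general.description": "description",
    "general.language": "language",
    "lifecycle.contribute.author": "creator",
}


def lom_to_dc(lom_record):
    """Map a flat LOM-like record to Dublin Core, reporting unmapped fields."""
    dc_record, dropped = {}, []
    for field, value in lom_record.items():
        if field in LOM_TO_DC:
            dc_record[LOM_TO_DC[field]] = value
        else:
            dropped.append(field)
    return dc_record, dropped


# Hypothetical learning object record with two education-specific fields.
lom_record = {
    "general.title": "Introduction to XML",
    "lifecycle.contribute.author": "A. Author",
    "educational.typicalAgeRange": "16-18",
    "educational.difficulty": "easy",
}

dc_record, dropped = lom_to_dc(lom_record)
print(dc_record)
print("Not carried over:", dropped)
```

The educational fields have no counterpart in the simple target schema, so a cross search over the mapped records cannot filter on them.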
  • 46. Metasearch issues: Metadata Political The decision to make resources more widely available has implications for the organisations concerned: It may be seen as a loss of control or ownership Staff may not possess the skills required to support more complex systems Legal The legal requirements of Freedom of Information legislation in several countries are a significant factor in the dissemination of public sector resources. The Intellectual Property Rights (IPR) of those providing sources may need to be protected.
  • 47. Why not just use Google? Its content is limited to the visible Web Limited search functionality Can’t search by specific criteria (metadata) such as ‘publication date’, ‘author’, ‘educational level’ Little quality control Google Scholar? Still a web crawl Evidence that it gives unreliable results
  • 48. NISO Metasearch Initiative The NISO MetaSearch initiative is trying to bring the area of metasearching together around a NISO standard. “Best Practices for Metasearch” document due out June 15th 2005 http://www.niso.org/committees/MetaSearch-info.html
  • 49. Overview Introduction to Cross searching / metasearch The Problem – why metasearch? JISC Information Environment Quick introduction to XML and Web Services Metasearch Technologies Z39.50, SRU/SRW, OAI Metasearch issues NISO Metasearch Initiative
  • 50. Contact Adrian Stevenson Learning Technology Services Internet Services University of Manchester adrian.stevenson [at] manchester.ac.uk Tel: +44 (0) 161 306 3109