SubSift web services and workflows for profiling
and comparing scientists and their published works
Simon Price, Peter Flach, Sebastian Spiegler,
Christopher Bailey and Nikki Rogers
Outline of this paper
1. SubSift – submission sifting
2. Background Theory: Vector Space Model
3. SubSift REST API
4. Demonstration Workflows
5. Conclusions
1. SubSift – submission sifting
SubSift
SubSift is a prototype
application to support
academic peer review.
SubSift matches submitted
conference/journal papers to
potential peer reviewers
based on similarity to
published works.
Website:
http://subsift.ilrt.bris.ac.uk
5
SubSift has been used for...
15
Contribution of this work
SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol
Re-usable workflows for profiling and comparing scientists
and their published works.
Tool for constructing, manipulating and publishing
document-centric datasets.
Related Work
• SubSift uses techniques more normally associated with
Information Retrieval
• Full text search tools support text matching on large-scale
document collections
e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch
Designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
• Exposes detailed metadata throughout.
• Partly a research tool: new algorithms need to be plugged in and instrumented.
• Fewer licensing restrictions and dependencies for open source.
2. Background Theory: Vector Space Model
Vector Space Model (from Information Retrieval)
Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting
For a query (q), rank the documents (d_j) in collection (D) by
descending similarity to the query.
Vector Space Model: bag-of-words representation
Number of terms in each abstract
Number of terms in the DBLP author page of each PC member
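As a minimal sketch of the bag-of-words idea, the following turns a text into term counts, discarding word order. The tokenizer (lowercase alphabetic runs) and the sample abstract are assumptions for illustration and are not taken from SubSift.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a term -> count mapping, ignoring word order and case."""
    # Hypothetical tokenizer: keep lowercase alphabetic runs only.
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)

# Made-up abstract text, used only to show the shape of the result.
abstract = "Learning to rank documents: rank learning for retrieval"
bag = bag_of_words(abstract)

# Number of distinct terms and total number of term occurrences.
print(len(bag), sum(bag.values()))
```

The number of distinct terms per document is the kind of quantity the slide reports for abstracts and author pages.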
Vector Space Model: cosine similarity
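The standard Information Retrieval definition is cos(q, d_j) = (q · d_j) / (|q| |d_j|), i.e. the dot product of the two term vectors divided by the product of their lengths. A minimal sketch under that definition follows; the sparse dict representation, the function names and the toy vectors are assumptions for illustration, not SubSift's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty vector matches nothing
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return document ids sorted by descending similarity to the query."""
    return sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)

# Toy term-weight vectors (hypothetical data).
query = {"kernel": 1.0, "svm": 1.0}
docs = {
    "d1": {"kernel": 2.0, "svm": 2.0},
    "d2": {"kernel": 1.0, "graph": 1.0},
    "d3": {"graph": 3.0},
}
order = rank(query, docs)
print(order)
```

Because the measure depends only on the angle between vectors, a long document and a short one with the same term proportions score identically against a query.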
Vector Space Model: tf-idf weighting
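tf-idf has several variants. The sketch below assumes one common form, raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Whether SubSift uses exactly this variant is not stated here, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: dict of id -> list of terms. Returns id -> {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw term frequency within this document
        weights[doc_id] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return weights

# Toy collection (hypothetical data).
docs = {
    "a": ["logic", "logic", "learning"],
    "b": ["learning", "kernel"],
}
w = tfidf(docs)
print(w["a"])
```

A term that occurs in every document ("learning" above) gets weight zero, so it no longer contributes to similarity scores, while terms that distinguish a document are weighted up.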
Representational State Transfer (REST)
REST is a design pattern for web services based on HTTP, using its
familiar URIs, requests, responses, authentication, etc.
“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to the usual
Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV
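The CRUD-to-HTTP correspondence can be sketched with Python's standard library. The base URI, the resource paths and the helper name below are hypothetical and do not describe the actual SubSift API; the requests are only constructed, never sent.

```python
from urllib.request import Request

BASE = "http://example.org/api"  # hypothetical base URI, not a SubSift endpoint

# The usual mapping from CRUD operations to HTTP methods.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}

def build_request(operation, resource, body=None):
    """Build (but do not send) an HTTP request for a CRUD operation on a resource URI."""
    data = body.encode("utf-8") if body is not None else None
    return Request(
        f"{BASE}/{resource}",
        data=data,
        method=CRUD_TO_HTTP[operation],
        headers={"Accept": "application/json"},  # ask for a JSON response
    )

# Hypothetical resource path, for illustration only.
req = build_request("update", "profiles/pc", body="description=PC+members")
print(req.get_method(), req.full_url)
```

The resource is identified entirely by its URI, and the operation by the HTTP method, so the same URI can be read, updated or deleted without any operation name appearing in the path.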
3. SubSift REST API
SubSift System Architecture
[Architecture diagram. Labelled components: SubSift REST API, SubSift Harvester, Filestore, Web, XSLT, Client. Labelled formats: XML, CSV, Terms, JSON, YAML, RDF.]
SubSift REST API
Profiles
Matches
SubSift – canonical workflow
4. Demonstration Workflows
Workflow 1 – Submission Sifting
Workflow 1 – Web 2.0 Client Implementation
Workflow 1 – Papers is just a list of URLs (e.g. Yahoo! Pipes)
Workflow 2 – Finding an Expert
Finding an expert
Workflow 3 – Visualising Similarity
Clustering staff based on homepage similarity
Dendrogram produced in MATLAB from a SubSift-generated similarity matrix
Precision-recall at different thresholds
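One way to read such a curve: treat every pair whose similarity is at or above the threshold as a predicted match, then compare the predictions with a set of known matches. The sketch below assumes that setup; the scores, the ground-truth pairs and the function name are hypothetical, and the slide's own evaluation data is not reproduced here.

```python
def precision_recall(scores, relevant, threshold):
    """Precision and recall when pairs scoring >= threshold are predicted as matches."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & relevant)
    # Convention chosen here: precision is 1.0 when nothing is predicted.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall

# Toy similarity scores between pairs of people (hypothetical data).
scores = {("ann", "bob"): 0.9, ("ann", "cy"): 0.4, ("bob", "cy"): 0.2}
# Hypothetical ground truth: pairs known to be related.
relevant = {("ann", "bob"), ("bob", "cy")}

for threshold in (0.5, 0.3, 0.1):
    print(threshold, precision_recall(scores, relevant, threshold))
```

Lowering the threshold admits more pairs, which can only keep or raise recall, while precision may fall as weaker matches are included.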
Similarity networks
Diagram created by Graphviz from a SubSift-generated dot file
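A dot file for a similarity network can be produced from a similarity matrix by emitting one undirected edge per pair above a threshold. The following is a sketch under that assumption; the names, matrix values, threshold and function name are hypothetical, and SubSift's own dot output may differ.

```python
def similarity_to_dot(names, matrix, threshold):
    """Return Graphviz DOT text with an edge for each pair scoring >= threshold."""
    lines = ["graph similarity {"]
    for name in names:
        lines.append(f'  "{name}";')  # declare every node, even isolated ones
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle: each pair once
            if matrix[i][j] >= threshold:
                lines.append(
                    f'  "{names[i]}" -- "{names[j]}" [label="{matrix[i][j]:.2f}"];'
                )
    lines.append("}")
    return "\n".join(lines)

# Toy symmetric similarity matrix (hypothetical data).
names = ["ann", "bob", "cy"]
matrix = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]
dot_text = similarity_to_dot(names, matrix, threshold=0.5)
print(dot_text)
```

Saved to a file such as `similarity.dot`, the text can be rendered with the Graphviz command line, for example `dot -Tpng similarity.dot -o similarity.png`.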
Connectivity
Diagram created by Graphviz from a SubSift-generated dot file
Workflow 4 – Profiling Reading Lists
Profiling a research group by its publications
Diagram produced in Wordle using SubSift profile data
Workflow 5 – Ranking News Stories
And finally...
Future Work
• Scaling-up
• Currently a small-scale web application running on modest hardware.
• Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
• Mining and mapping the University of Bristol’s research landscape.
• Crawling the University’s web pages to profile and visualise research interests
of and similarities between faculty, departments, research groups and
researchers.
• Plans to apply to websites of other Universities.
5. Conclusions
Conclusion
• SubSift services are useful outside the peer review domain
• Workflows for profiling/comparing scientists
  • Promising e-Science and e-Research use cases for profiling and comparing
  scientists and their published works.
• Tool for constructing, manipulating and publishing
document-centric datasets
  • E.g. information retrieval, data mining, pattern analysis research.
  • Publication of datasets in this way supports reproducibility of science.
  • Connects data through Linked Data and the Semantic Web.
