www.atmire.com
Metadata based
usage statistics
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
ITEM METADATA
Relate usage event to information stored in
your repository.
Allows statistics queries based on item
metadata.
→ Not possible with a statistics solution that
is not tied to the repository.
GENERATING METADATA BASED STATISTICS
How many downloads did
author “Barnes, Douglas F.”
get in the last year, grouped
by month?
Metadata based statistics for DSpace
LINKING METADATA TO USAGE EVENTS
Solr Query
http://localhost:8080/solr/statistics/select?
facet=true&facet.offset=0&facet.mincount=1&facet.sort=false
&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum
&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view
&fq=-isBot:true&fq=-isInternal:true
&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]
&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include statistics that are by
author Barnes, Douglas F.
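The parameters explained above can also be assembled in code. The following is a minimal sketch in Python, not the deck's own implementation: the function name `build_monthly_downloads_query` is hypothetical, and the author value is quoted as a phrase, which is an assumption about how the field should be matched rather than what the slide shows.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/statistics/select"  # URL from the slide


def build_monthly_downloads_query(author, start, end):
    """Return a Solr URL that counts downloads per month for one author."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                     # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),  # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                  # bitstream downloads
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        # Assumption: quote the author so the full name is matched as one phrase.
        ("fq", f'author_mtdt:"{author}"'),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_monthly_downloads_query(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs, rather than a dict, keeps the repeated `fq` entries, each of which Solr applies as a separate filter.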
<response>
  <lst name="responseHeader">
    ...
  </lst>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
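A response like the one above can be turned into a month-to-count mapping with a few lines of code. This is a sketch that assumes the response was requested as XML (for example with `wt=xml`) rather than in the `javabin` format used in the query; the helper names `monthly_counts` and `num_found` are hypothetical, and the elided response header is left out of the sample.

```python
import xml.etree.ElementTree as ET

RESPONSE_XML = """<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>"""


def monthly_counts(xml_text, field="dateYearMonth"):
    """Map each facet value (a month) to its number of usage events."""
    root = ET.fromstring(xml_text)
    facet = root.find(
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        f"/lst[@name='{field}']"
    )
    return {node.get("name"): int(node.text) for node in facet.findall("int")}


def num_found(xml_text):
    """Read the total number of matching usage events."""
    root = ET.fromstring(xml_text)
    return int(root.find("./result[@name='response']").get("numFound"))


counts = monthly_counts(RESPONSE_XML)
print(counts["2015-04"], sum(counts.values()), num_found(RESPONSE_XML))
```

The monthly counts add up to the `numFound` total, which is a quick sanity check that no month was dropped by `facet.limit`.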
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no
metadata
• The metadata is stored in the database
PROPOSED SOLUTION
1. Query the database for bitstream IDs
based on the author metadata
2. Use those IDs to query Solr for statistics
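To make the second step concrete, here is a rough sketch of how the Solr filter could be built from the IDs returned by the database. The function name `build_id_filter` and the sample IDs are hypothetical; the `type` and `id` fields are the ones visible in the usage event records.

```python
def build_id_filter(bitstream_ids):
    """Build a Solr filter matching downloads of any of the given bitstreams."""
    joined = " OR ".join(str(bitstream_id) for bitstream_id in bitstream_ids)
    return f"type:0 AND id:({joined})"


# Hypothetical result of step 1 (the database lookup by author metadata).
ids_for_author = [54, 78, 112]
print(build_id_filter(ids_for_author))

# The filter grows with every bitstream that belongs to the author.
long_filter = build_id_filter(range(1, 5001))
print(len(long_filter))
```

The length of the filter is proportional to the number of bitstreams, so a prolific author quickly produces a very large query.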
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and
inefficient to execute
• Inefficient but still possible
PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query Solr to get the number of usage events
for each author
• sort those counts, and return the 10 highest
PROPOSED SOLUTION: DOWNSIDES
Very inefficient!
• do a lot of queries
• throw away most of the results: we only
need top 10
SOLR FACETS
To do a facet query:
• specify “facet.field” along with the
regular query (faceting itself is switched on with facet=true)
• results will be grouped by the values they have
for that field
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the number of records in each group
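What a facet does to the result set can be imitated in plain Python. The sketch below is only an illustration of the grouping and counting, not of how Solr implements it; the sample events are made up, and `owningItem` is simplified to a single value although the real records store it as an array.

```python
from collections import Counter

# Hypothetical usage events, reduced to the two fields used on the slide.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},  # not a bitstream download, so q=type:0 skips it
    {"type": 0, "owningItem": 1700},
]


def facet_counts(records, query_type, facet_field):
    """Filter like q=type:<n>, then count the records per facet value."""
    matching = (record for record in records if record["type"] == query_type)
    return Counter(record[facet_field] for record in matching)


counts = facet_counts(events, 0, "owningItem")
print(counts.most_common())
```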
OUR SOLUTION
• Add item metadata to Solr.
• Use built-in filtering and grouping
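With the author names stored in the usage events, the "top 10 authors" question from earlier becomes a single facet query. The following sketch shows one plausible set of parameters; `top_authors_params` is a hypothetical helper, and the choice of filters is an assumption modelled on the query shown earlier in the deck.

```python
from urllib.parse import urlencode


def top_authors_params(limit=10):
    """Parameters for one Solr request that ranks authors by downloads."""
    return [
        ("q", "type:0"),                 # bitstream downloads only
        ("rows", "0"),                   # no documents, only facet counts
        ("facet", "true"),
        ("facet.field", "author_mtdt"),  # group by author
        ("facet.limit", str(limit)),     # Solr returns only the top entries
        ("facet.sort", "count"),         # highest counts first
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
    ]


query_string = urlencode(top_authors_params())
print(query_string)
```

Sorting and truncation happen inside Solr, so no per-author queries are issued and no results are thrown away on the client side.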
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges:
Metadata is duplicated in every statistical record
• that takes up a lot of space
• and it needs to be kept in sync
SIZE OF SINGLE USAGE EVENT
<doc>
<str name="ip">177.21.194.80</str>
<arr name="ip_search"><str>177.21.194.80</str></arr>
<arr name="ip_ngram"><str>177.21.194.80</str></arr>
<int name="type">0</int>
<int name="id">54</int>
<date name="time">2015-05-11T04:33:49.077Z</date>
<str name="dateYearMonth">2015-05</str>
<str name="dateYear">2015</str>
<str name="continent">SA</str>
<str name="countryCode">BR</str>
<float name="latitude">-10.0</float>
<float name="longitude">-55.0</float>
<arr name="bundleName"><str>ORIGINAL</str></arr>
<arr name="containerBitstream"><int>54</int></arr>
<arr name="owningItem"><int>1652</int></arr>
<arr name="containerItem"><int>1652</int></arr>
<arr name="owningColl"><int>14</int></arr>
<arr name="containerCollection"><int>14</int></arr>
<arr name="owningComm"><int>1</int></arr>
<arr name="containerCommunity"><int>1</int></arr>
<str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
<bool name="isBot">false</bool>
<bool name="isInternal">false</bool>
<str name="statistics_type">view</str>
<long name="_version_">1501767933804675072</long>
</doc>
25 elements
SIZE OF SINGLE USAGE EVENT WITH METADATA
<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
  <str>Khandker, Shahidur R.</str>
  <str>Barnes, Douglas F.</str>
  <str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
  <str>ACCESS TO LIGHTING</str>
  <str>ACCESS TO MODERN ENERGY</str>
  <str>AGRICULTURAL LAND</str>
  <str>AGRICULTURAL RESIDUE</str>
  <str>AIR CONDITIONERS</str>
  <str>AIR POLLUTION</str>
  <str>ALTERNATIVE ENERGY</str>
  <str>ALTERNATIVE SOURCES OF ENERGY</str>
  <str>APPROACH</str>
  <str>ATMOSPHERE</str>
  <str>AVAILABILITY</str>
  <str>BASIC ENERGY</str>
  <str>BIOMASS</str>
  <str>BIOMASS BURNING</str>
  <str>BIOMASS COLLECTION</str>
  <str>BIOMASS CONSUMPTION</str>
  <str>BIOMASS ENERGY</str>
  ...
  <str>WORLD ENERGY</str>
  <str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
3 authors
140 subjects
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well
KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports
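One way to picture the update workload is to split it into batches. The sketch below assumes that Solr atomic updates (the `{"set": ...}` modifier) can be used on the statistics core and that `uid` is its unique key; the deck does not say how the update is actually performed, and the function name, batch size, and generated `uid` values are hypothetical.

```python
import json


def atomic_update_batches(uids, new_authors, batch_size=1000):
    """Yield JSON payloads that overwrite author_mtdt on each usage event."""
    batch = []
    for uid in uids:
        batch.append({"uid": uid, "author_mtdt": {"set": new_authors}})
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:
        yield json.dumps(batch)


# 7,000 page visits + 5,000 downloads = 12,000 usage events to rewrite.
uids = [f"event-{n}" for n in range(12000)]
batches = list(atomic_update_batches(uids, ["Barnes, Douglas F."]))
print(len(batches))
```

Working in batches keeps each request small and gives natural points at which the operation can be paused.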
PERFORMANCE
Size of single usage event
Metadata updates
Number of events
Live search queries
PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by syncing
metadata in the statistics as low as possible:
→ only sync while Solr is idle
→ interrupt the operation when a search request
can’t be handled in time
→ interrupt the operation when Solr’s memory
usage nears its max
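The interruption rules above can be sketched as a loop that checks two conditions before every batch. How idleness and memory pressure are measured is not described in the deck, so `solr_is_idle` and `memory_near_max` are hypothetical callbacks supplied by the caller.

```python
def sync_in_batches(batches, apply_batch, solr_is_idle, memory_near_max):
    """Apply metadata updates batch by batch, stopping as soon as Solr is busy.

    Returns the index of the first batch that was NOT applied, so the caller
    can resume from there later.
    """
    for index, batch in enumerate(batches):
        if not solr_is_idle() or memory_near_max():
            return index
        apply_batch(batch)
    return len(batches)


# Usage with stand-in callbacks: Solr is idle for two checks, then gets busy.
applied = []
idle_answers = iter([True, True, False])
resume_at = sync_in_batches(
    batches=["b0", "b1", "b2", "b3"],
    apply_batch=applied.append,
    solr_is_idle=lambda: next(idle_answers),
    memory_near_max=lambda: False,
)
print(resume_at, applied)
```

Returning the resume position instead of raising an error makes it cheap to stop early and pick up again when Solr is idle.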
PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time
(e.g. 24 hours)
PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report whose cached copy has expired
→ show the outdated version
In the meantime
→ generate a new version
Automatically show the new report when it’s done
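The behaviour described on this slide resembles a "serve stale, then refresh" cache. The sketch below keeps everything in memory for brevity, whereas the deck stores reports in a separate Solr core; the class name, the `pending` queue, and the choice to generate synchronously on a cache miss are all assumptions.

```python
import time


class ReportCache:
    """Serve cached reports, including expired ones, and refresh them later."""

    def __init__(self, generate, max_age_seconds=24 * 60 * 60, clock=time.time):
        self._generate = generate
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (report, created_at)
        self.pending = []   # keys waiting for regeneration

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            # Cache miss: nothing to show yet, so generate right away.
            return self._store(key)
        report, created_at = entry
        expired = self._clock() - created_at > self._max_age
        if expired and key not in self.pending:
            self.pending.append(key)  # keep serving it, refresh later
        return report

    def refresh_pending(self):
        """Regenerate every expired report that was requested."""
        while self.pending:
            self._store(self.pending.pop(0))

    def _store(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        return report


# Usage with a fake clock and a generator that numbers its versions.
now = [0]
versions = {"n": 0}


def generate(key):
    versions["n"] += 1
    return f"{key} v{versions['n']}"


cache = ReportCache(generate, max_age_seconds=10, clock=lambda: now[0])
first = cache.get("top-authors")  # miss: generated immediately
now[0] = 11                       # the cached report has now expired
stale = cache.get("top-authors")  # the outdated version is still served
cache.refresh_pending()           # stands in for background regeneration
fresh = cache.get("top-authors")
print(first, stale, fresh)
```

In a real deployment `refresh_pending` would run in the background, and the page would swap in the new report once it is ready.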
EXAMPLE: CACHE MISS
PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author
NAME VARIANTS USE CASE
“Who are the Most Popular Authors in terms of downloads?”
https://openknowledge.worldbank.org/most-popular/author
3 name variants:
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco
SOLUTION FOR NAME VARIANTS
include all name variants in Solr query:
author_mtdt:
(Ferreira, Francisco H. G.) OR
(Ferreira, Francisco H.G.) OR
(Ferreira, Francisco)
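Combining the variants into one filter can be automated. In the sketch below each variant is wrapped in double quotes so that it is matched as a whole phrase; that quoting is an assumption about the field's configuration, not something shown on the slide, and `author_variants_filter` is a hypothetical helper.

```python
def author_variants_filter(variants, field="author_mtdt"):
    """Build one Solr filter that matches any of the given name variants."""
    quoted = [
        '"' + variant.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for variant in variants
    ]
    return f"{field}:(" + " OR ".join(quoted) + ")"


variants = [
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]
print(author_variants_filter(variants))
```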
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID)
index them and search on them instead
www.atmire.com
Thank you!
Questions?