Metadata based statistics for DSpace

www.atmire.com
Metadata based
usage statistics

OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?

1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”

USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event

USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...

ITEM METADATA
Relate usage event to information stored in
your repository.
Allows statistics queries based on item
metadata.
→ Not possible with a statistics solution that
is not tied to the repository.

GENERATING METADATA BASED STATISTICS
How many downloads did author “Barnes, Douglas F.”
get in the last year, grouped by month?

LINKING METADATA TO USAGE EVENTS
Solr Query
http://localhost:8080/solr/statistics/select?
facet=true&facet.offset=0&facet.mincount=1&facet.sort=false
&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum
&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view
&fq=-isBot:true&fq=-isInternal:true
&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]
&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0

facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include statistics that are by
author Barnes, Douglas F.
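
The parameters above can be assembled programmatically. The following is a minimal sketch, not the presenters' code: the function name is hypothetical, the endpoint is the localhost URL from the slide, `wt=json` is swapped in for `javabin`, and the author name is quoted as a phrase on the assumption that `author_mtdt` should match the full name as one value.

```python
from urllib.parse import urlencode

# Endpoint taken from the slide; adjust host, port and core to your setup.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"

def downloads_per_month_url(author, start, end):
    """Build a facet query that counts downloads per month for one author."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                   # bitstreams only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f'author_mtdt:"{author}"'),  # assumption: quoted phrase match
        ("wt", "json"),                     # assumption: JSON instead of javabin
    ]
    # urlencode percent-encodes each value and repeats the fq key as needed.
    return SOLR_SELECT + "?" + urlencode(params)

url = downloads_per_month_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs keeps the repeated `fq` filters separate, so Solr can apply each one as its own filter.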

In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no
metadata
• The metadata is stored in the database

PROPOSED SOLUTION
1. Query the database for bitstream IDs
based on the author metadata
2. Use those IDs to query Solr for statistics
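
The two steps can be sketched as follows. This is an illustration under stated assumptions: the dictionary is a made-up stand-in for the DSpace database, both function names are hypothetical, and the filter relies on the `type` and `id` fields of the usage event records.

```python
# Hypothetical stand-in for the DSpace database: bitstream ID -> author names.
BITSTREAM_AUTHORS = {
    54: ["Khandker, Shahidur R.", "Barnes, Douglas F.", "Samad, Hussain A."],
    55: ["Barnes, Douglas F."],
    70: ["Samad, Hussain A."],
}

def bitstream_ids_for_author(author):
    """Step 1: find the bitstreams of an author (a SQL query in practice)."""
    return sorted(b for b, authors in BITSTREAM_AUTHORS.items() if author in authors)

def solr_filter_for_ids(ids):
    """Step 2: turn the IDs into a Solr filter on bitstream usage events."""
    return "type:0 AND id:(" + " OR ".join(str(i) for i in ids) + ")"

ids = bitstream_ids_for_author("Barnes, Douglas F.")
fq = solr_filter_for_ids(ids)
print(fq)  # type:0 AND id:(54 OR 55)
```

The filter gains one clause per bitstream, so it grows with the number of files attached to the author's items.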

PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and
inefficient to execute
• Inefficient but still possible

What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query Solr to get the number of usage events
for each author
• sort those counts, and return the 10 highest

Very inefficient!
• do a lot of queries
• throw away most of the results: we only
need top 10
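
A small sketch of this naive approach makes the cost visible. The author names and counts are made up, and `count_downloads` is a hypothetical stand-in for one Solr query per author.

```python
import heapq

# Made-up download counts; each lookup stands for a separate Solr query.
DOWNLOADS = {"Author A": 120, "Author B": 45, "Author C": 300, "Author D": 7}
stats = {"queries": 0}

def count_downloads(author):
    """Stand-in for one Solr query that counts an author's downloads."""
    stats["queries"] += 1
    return DOWNLOADS[author]

def top_authors(authors, n):
    """Query every author, then keep only the n highest counts."""
    counts = [(count_downloads(a), a) for a in authors]
    return [a for _, a in heapq.nlargest(n, counts)]

top = top_authors(list(DOWNLOADS), 2)
print(top, stats["queries"])  # ['Author C', 'Author A'] 4
```

Four queries were needed to return two authors. The number of queries grows with the number of authors in the repository, not with the size of the requested top list.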

SOLR FACETS
To do a facet query:
• specify “facet.field” along with the
regular query
• results will be grouped by the values they have
for that field

SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the # records in each group
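
What the facet does can be imitated in memory. This sketch only illustrates the grouping and counting, it is not how Solr implements facets. The events are made up and reduced to two fields, and `owningItem` is simplified to a single value although it is multi-valued in the Solr records.

```python
from collections import Counter

# A few made-up usage events, reduced to the fields used in the example.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": None},   # some other event type, not a download
    {"type": 0, "owningItem": 1700},
]

def facet(events, query, field):
    """Count the matching events per value of `field`."""
    matches = (e for e in events if query(e))
    return Counter(e[field] for e in matches)

# Equivalent in spirit to q=type:0&facet.field=owningItem
counts = facet(events, lambda e: e["type"] == 0, "owningItem")
print(counts.most_common())  # [(1652, 2), (1700, 1)]
```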

OUR SOLUTION
• Add item metadata to the usage events in Solr
• Use Solr’s built-in filtering and grouping

CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges:
Metadata is duplicated in every statistical record
• that takes up a lot of space
• and it needs to be kept in sync

SIZE OF SINGLE USAGE EVENT
<doc>
<str name="ip">177.21.194.80</str>
<arr name="ip_search"><str>177.21.194.80</str></arr>
<arr name="ip_ngram"><str>177.21.194.80</str></arr>
<int name="type">0</int>
<int name="id">54</int>
<date name="time">2015-05-11T04:33:49.077Z</date>
<str name="dateYearMonth">2015-05</str>
<str name="dateYear">2015</str>
<str name="continent">SA</str>
<str name="countryCode">BR</str>
<float name="latitude">-10.0</float>
<float name="longitude">-55.0</float>
<arr name="bundleName"><str>ORIGINAL</str></arr>
<arr name="containerBitstream"><int>54</int></arr>
<arr name="owningItem"><int>1652</int></arr>
<arr name="containerItem"><int>1652</int></arr>
<arr name="owningColl"><int>14</int></arr>
<arr name="containerCollection"><int>14</int></arr>
<arr name="owningComm"><int>1</int></arr>
<arr name="containerCommunity"><int>1</int></arr>
<str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
<bool name="isBot">false</bool>
<bool name="isInternal">false</bool>
<str name="statistics_type">view</str>
<long name="_version_">1501767933804675072</long>
</doc>
25 elements

<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
<str>Khandker, Shahidur R.</str>
<str>Barnes, Douglas F.</str>
<str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
<str>ACCESS TO LIGHTING</str>
<str>ACCESS TO MODERN ENERGY</str>
<str>AGRICULTURAL LAND</str>
<str>AGRICULTURAL RESIDUE</str>
<str>AIR CONDITIONERS</str>
<str>AIR POLLUTION</str>
<str>ALTERNATIVE ENERGY</str>
<str>ALTERNATIVE SOURCES OF ENERGY</str>
<str>APPROACH</str>
<str>ATMOSPHERE</str>
<str>AVAILABILITY</str>
<str>BASIC ENERGY</str>
<str>BIOMASS</str>
<str>BIOMASS BURNING</str>
<str>BIOMASS COLLECTION</str>
<str>BIOMASS CONSUMPTION</str>
<str>BIOMASS ENERGY</str>
...
<str>WORLD ENERGY</str>
<str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
SIZE OF SINGLE USAGE EVENT WITH METADATA
3 authors
140 subjects

KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well

KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports

PERFORMANCE
Size of single usage event
Metadata updates
Amount of events
Live search queries

PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by syncing
metadata in the statistics as low as possible:
→ only sync while Solr is idle
• interrupt the operation when a search request
can’t be handled in time
• interrupt the operation when Solr’s memory
usage nears its maximum
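
One way to picture such an interruptible sync is a batch loop that checks a load signal before every batch. This is a minimal sketch: the function and its callbacks `update_batch` and `solr_is_busy` are hypothetical, and how slow searches or memory pressure are detected is left to the caller.

```python
def sync_item_metadata(event_ids, update_batch, solr_is_busy, batch_size=500):
    """Update usage events in batches, stopping as soon as Solr is busy.

    Returns the IDs that still need syncing so the job can resume later.
    """
    pending = list(event_ids)
    while pending:
        if solr_is_busy():      # e.g. slow searches or high memory usage
            break               # yield to live queries, resume later
        batch, pending = pending[:batch_size], pending[batch_size:]
        update_batch(batch)
    return pending

# Usage with stand-in callbacks: Solr becomes busy after two batches.
updated = []
checks = iter([False, False, True])
remaining = sync_item_metadata(
    range(12000), updated.extend, lambda: next(checks), batch_size=5000
)
print(len(updated), len(remaining))  # 10000 2000
```

Returning the unprocessed IDs lets a scheduler pick up the remaining events once Solr is idle again.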

PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time
(e.g. 24 hours)

PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report whose cached copy has expired
→ show the outdated version
In the meantime
→ generate a new version
Automatically show the new report when it’s done
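
The behaviour described on this slide can be sketched with an in-memory cache. The slides store reports in a separate Solr core, so the `ReportCache` class, its method names and the refresh queue are assumptions made for illustration only.

```python
import time

class ReportCache:
    """Serve cached reports, refreshing expired ones instead of deleting them."""

    def __init__(self, generate, max_age_seconds=24 * 3600, clock=time.time):
        self.generate = generate      # function: report key -> report
        self.max_age = max_age_seconds
        self.clock = clock
        self.entries = {}             # key -> (report, created_at)
        self.refresh_queue = []       # keys waiting for regeneration

    def get(self, key):
        if key not in self.entries:
            self.entries[key] = (self.generate(key), self.clock())
        report, created = self.entries[key]
        expired = self.clock() - created > self.max_age
        if expired and key not in self.refresh_queue:
            self.refresh_queue.append(key)   # regenerate off the request path
        return report                        # possibly outdated, but immediate

    def refresh_pending(self):
        """Regenerate queued reports; a background worker would call this."""
        while self.refresh_queue:
            key = self.refresh_queue.pop(0)
            self.entries[key] = (self.generate(key), self.clock())

# Usage with a fake clock and a generator that returns v1, then v2.
now = [0]
versions = iter(["v1", "v2"])
cache = ReportCache(lambda key: next(versions), max_age_seconds=10,
                    clock=lambda: now[0])
first = cache.get("top-authors")   # generated on the first request
now[0] = 11                        # the cached copy is now expired
stale = cache.get("top-authors")   # outdated version, served immediately
cache.refresh_pending()
fresh = cache.get("top-authors")
print(first, stale, fresh)  # v1 v1 v2
```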

PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author

NAME VARIANTS USE CASE
“Who are the most popular authors in terms of downloads?”

https://openknowledge.worldbank.org/most-popular/author

3 name variants:
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco

SOLUTION FOR NAME VARIANTS
include all name variants in Solr query:
author_mtdt:
(Ferreira, Francisco H. G.) OR
(Ferreira, Francisco H.G.) OR
(Ferreira, Francisco)
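
Such a filter can be generated from a list of variants. This is a sketch with hypothetical helper names. It quotes each variant as a phrase, on the assumption that every name should match as one complete value rather than as separate words.

```python
def escape_phrase(value):
    """Escape backslashes and double quotes inside a quoted Solr phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def variants_filter(field, variants):
    """Build field:("v1" OR "v2" ...) to match any of the name variants."""
    phrases = " OR ".join(f'"{escape_phrase(v)}"' for v in variants)
    return f"{field}:({phrases})"

fq = variants_filter("author_mtdt", [
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
])
print(fq)
```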

ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID):
index them and search on them instead of names

www.atmire.com
Thank you!
Questions?
