Leslie Johnston Keynote, Best Practices Exchange 2011

From Records to Data:
It’s Not Just About
Collections Any More

Leslie Johnston, Library of Congress
Best Practices Exchange 2011

What are the Biggest
Insights that we have
Learned in Fifteen Years of
Building Digital Collections?

Researchers do not use digital
collections the same way that
they use analog collections

We Can Never Guess Every
Way that Our Collections Will
Be Used

Stewardship organizations
have, until recently, spoken of
“collections” or “content” or
“records” or even “files,” but
not data.

We Have Data in our Libraries,
Archives and Museums?

Yes.

Data is not just generated by
satellites, identified during
experiments, or collected
during surveys.

Datasets are not just scientific and business
tables and spreadsheets: our collections are
now considered data.

They are the building blocks for interpretation
and discovery that transform and combine
them into entities that we may not recognize.

More and more researchers want to use
collections as a whole, mining and organizing
the information in novel ways.

Researchers use algorithms to mine the rich
information and tools to create pictures that
translate that information into knowledge.

Researchers may want to interact with a
collection of artifacts, or they may want to
work with a data corpus.

Consider the Digging Into Data
Challenge
The repositories available for research include not only
scientific information—astronomy, geology, physics, biology,
social science surveys—but images, film, sound,
newspapers, maps, art, archaeology, architecture and
government records.

http://www.diggingintodata.org/

What Constitutes “Big Data”?
The definition of Big Data is fluid. It is a moving target
(whatever cannot be easily manipulated with common tools)
and it is specific to the organization (whatever one
institution can manage and steward within its own
infrastructure). One researcher’s or organization’s concept
of a large data set is small to another.

Not too long ago, an organization would be surprised to
need 10 TB of storage for a large digital collection. Now a
collection can increase by 10 TB in a single week.

We still have collections. But what we also
have is Big Data, which requires us to rethink
the infrastructure that is needed to support
Big Data services. Our community used to
expect researchers to come to us, ask us
questions about our collections, and use our
digital collections in our environment.

Now our collections are, more often than not,
self-serve.

Case Study: Web Archives
• Web Archives, such as the one at the
Library of Congress, may comprise
billions of files.
• When we began archiving election web
sites, we imagined users browsing
through the web pages, studying the
graphics or use of phrases or links. But
when our first researchers came to the
Library, they wanted to know about all
those topics, but they used scripts to
query for them and sort them into
categories. They were not much
interested in reading web pages.

http://www.loc.gov/webarchiving/

Case Study: Historic Newspapers
• The Chronicling America collection
has over 4 million page images from
historic newspapers with OCR from
organizations in 25 states.
• The site gets approximately 4 million
views per day.
• Some researchers want to search
for stories in historic newspapers.
• Some researchers want to mine
newspaper OCR for trends across
time periods and geographic areas.
• Requests have come in to analyze
all 4 million page images.

http://chroniclingamerica.loc.gov/

Case Study: Twitter
• The Twitter archive has tens of billions
of tweets in it.
• Research requests have included users
looking for their own Twitter history, the
study of the geographic spread of news,
the study of the spread of epidemics,
and the study of the transmission of
new uses of language.
[Slide graphic; recoverable labels: social science,
visualization, social media status, events, personal,
privacy, commercial]

Can each of our organizations support real-
time querying of billions of full-text
items? Can we provide tools for collection
analysis and visualization? Can we support
the frequent downloading by researchers of
collections that may be over 200 TB each?

These are among the questions that all of our
institutions are grappling with as we build
large digital collections and discover new
ways in which they can be used.
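
To put the 200 TB download question in perspective, here is a rough back-of-the-envelope sketch. The link speeds are illustrative assumptions, not figures from any institution, and the function name `transfer_days` is made up for this example. It assumes decimal terabytes and a link that is fully available to a single transfer, which real networks rarely are.

```python
def transfer_days(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate days needed to move `size_tb` terabytes over a network link.

    size_tb    -- collection size in decimal terabytes (1 TB = 10**12 bytes)
    link_gbps  -- nominal link speed in gigabits per second (assumed value)
    efficiency -- fraction of the link actually usable (1.0 = ideal)
    """
    bits = size_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400                        # seconds -> days


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to show the order of magnitude.
    for gbps in (0.1, 1, 10):
        days = transfer_days(200, gbps)
        print(f"{gbps:>4} Gbps: about {days:.1f} days for 200 TB")
```

Even under these ideal assumptions, a single 200 TB download occupies a 1 Gbps link for more than two weeks, which is why "frequent downloading" is an infrastructure question and not only a policy one.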

So what are our
institutions doing
about preservation
and access to our
Big Collections and
Big Data?

Collaboration
www.digitalpreservation.gov/ndsa

The National Digital Stewardship Alliance is an
initiative of the National Digital Information
Infrastructure and Preservation Program at the
Library of Congress, with almost 100 member
organizations that share a sense of dedication to
digital preservation, and want to work
collaboratively across the community.

The NDSA operates through five working groups:
Content; Standards and Practices; Infrastructure;
Innovation; and Outreach.

Tool Development

All stewardship organizations can and should
participate in the development and use of open
access tools for use across the community.

NDIIPP is revising its Tools and Services
Directory to include a broader range of projects,
some of which are always looking for other
organizations to contribute.

http://www.digitalpreservation.gov/partners/resources/tools

As an Example…

Seeing and Sharing Digital Cultural
Heritage Collections Differently
with ViewShare/Recollection

biggish ideas

› heterogeneous data
› one big distributed collection
› open distributed infrastructure
› mindset: records -> data

the ViewShare idea
digital cultural heritage collections
include temporal, locative, and
categorical data that could be
tapped to interact with and
understand those collections
more dynamically.

the challenges
› we all have different kinds of
metadata
› that data is in different kinds of
systems
› much of that data is messy
› much of that data is not in the
format we might wish it was

ingest collection
descriptions from
spreadsheets, MODS
records, or Atom and
RSS

Augment: derive
ISO dates,
latitude and
longitude
coordinates, and
break apart
data
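
As a rough illustration of what an "augment" step can involve, here is a minimal sketch. It is not ViewShare's implementation. The field names (`date`, `place`, `subjects`), the list of date formats, the semicolon separator, and the tiny place lookup table are all assumptions made up for this example; a real tool would use a geocoding service and far more tolerant date parsing. "Break apart data" is read here as splitting a multi-valued field into separate values, which is one plausible interpretation.

```python
from datetime import datetime

# Assumed input formats; real collection metadata is usually messier.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")

# Hypothetical stand-in for a geocoder: place name -> (latitude, longitude).
PLACES = {"Washington, D.C.": (38.9072, -77.0369)}


def to_iso_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_values(text, sep=";"):
    """Break a delimited field into a list of trimmed, non-empty values."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def augment(record):
    """Return a copy of `record` with derived date, coordinates, and subject list."""
    out = dict(record)
    out["iso_date"] = to_iso_date(record.get("date", ""))
    out["lat_lon"] = PLACES.get(record.get("place", ""))  # None if unknown
    out["subject_list"] = split_values(record.get("subjects", ""))
    return out


if __name__ == "__main__":
    row = {"date": "March 4, 1861", "place": "Washington, D.C.",
           "subjects": "inaugurations; presidents ; speeches"}
    print(augment(row))
```

Values that cannot be parsed come back as `None` instead of raising an error, so messy rows can be flagged for review without stopping the whole load.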

design views:
graphical interface
for assembling
views

publish views on the site or embed
views with one line of JavaScript into
any HTML document.

share data and views
share not only the end results, but
also the raw data for others to
create their own views.

data use and re-use

recent work
› support for public/private views and data
› beta support for OAI and CONTENTdm data
loading
› full open source release on SourceForge:
http://sourceforge.net/projects/loc-recollect/

what’s next?
› viewshare.org public launch on
November 1, 2011
› big data sets: in a while
› remix across data sets: long view

contact us
› Let us know through the web site if you
are interested in participating in the
NDSA
› Let us know if there is a tool or service that
is missing from our directory
› visit http://recollection.zepheira.com/ to get
a sneak peek at ViewShare
› email NDIIPPaccess@loc.gov if you are
interested in a ViewShare account

Questions?

Leslie Johnston
lesliej@loc.gov
