
Repository Service

Thib Guicherd-Callin edited this page Dec 6, 2023 · 16 revisions

This page reflects the upcoming Alpha-7 release. The Alpha-6 version is here.

Artifacts

The LOCKSS repository provides a store of artifacts, which are the individual files (plus metadata) that the system stores, audits and repairs, and serves or replays. An artifact may be either a complete HTTP response, including the HTTP response headers, or a plain file. The former are used to preserve web sites, are usually collected by the LOCKSS crawler or other pluggable crawlers, and are suitable for replaying the site. Plain files are typically stored and retrieved directly by external clients as part of a PLN infrastructure. The audit/repair mechanism works the same for both. The term payload always refers to the content exclusive of any HTTP headers (if present).

The repository comprises an artifact index, which stores identifying information and provides lookup and search, and a datastore. The index is currently implemented in Solr, and the datastore is a collection of large WARC files, each containing the data for multiple artifacts. HTTP response artifacts are stored as WARC response records; plain file artifacts are stored as WARC resource records.

The artifact object returned by various endpoints reflects information in the index: the identifying tuple (below), a unique artifact uuid which can be used to retrieve the content and perform other operations on the artifact, and some metadata. (Strictly speaking, the artifact object represents only information in the index, but we also use the term loosely to include the content which is stored in the datastore.)

Artifacts are identified by a tuple: (namespace, AUID, URI, version). See the Glossary.

  • namespace can (and should) be omitted in most circumstances. It defaults to "lockss".
  • auid identifies the AU to which the Artifact belongs.
  • uri is the name of the Artifact. In crawled AUs it's a URL; in non-crawled AUs (Named AUs) it may be any arbitrary, client-supplied string.
  • version is an integer which is assigned automatically - each addition of an Artifact with the same (namespace, AUID, URI) creates a new version.

Each artifact also has a unique identifier (UUID) string, generated by the system, which is used to refer to it in some of the API endpoints.
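
The identifying tuple and the automatic version numbering can be illustrated with a small in-memory model. This is only a sketch: the names ArtifactId and ToyIndex are hypothetical, it assumes version numbers start at 1 and increase by one per addition (this page does not say), and it says nothing about how the Solr-backed index is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
from uuid import uuid4

DEFAULT_NAMESPACE = "lockss"  # default namespace stated above


@dataclass(frozen=True)
class ArtifactId:
    """Identifying tuple plus the system-generated UUID (hypothetical model)."""
    auid: str
    uri: str
    version: int
    uuid: str
    namespace: str = DEFAULT_NAMESPACE


class ToyIndex:
    """Toy stand-in for the artifact index; not the real implementation."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str, str], List[ArtifactId]] = {}

    def add(self, auid: str, uri: str,
            namespace: str = DEFAULT_NAMESPACE) -> ArtifactId:
        # Each addition with the same (namespace, AUID, URI) gets a new version.
        existing = self._versions.setdefault((namespace, auid, uri), [])
        # Assumption: versions start at 1 and increase by 1.
        artifact = ArtifactId(auid, uri, len(existing) + 1, str(uuid4()), namespace)
        existing.append(artifact)
        return artifact

    def latest(self, auid: str, uri: str,
               namespace: str = DEFAULT_NAMESPACE) -> Optional[ArtifactId]:
        # Most recently added version, or None if the name is unknown.
        versions = self._versions.get((namespace, auid, uri), [])
        return versions[-1] if versions else None
```

Adding the same (AUID, URI) twice to a ToyIndex yields two ArtifactId objects with different UUIDs and consecutive version numbers, and latest() returns the second.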

Storing/retrieving artifacts

Artifact data are stored in the repository by sending a request with a multipart entity containing identifying info (the tuple above), an optional HTTP response header (status line and response headers), and an arbitrarily large part for the payload. Artifacts may be retrieved as a similar multipart entity (deprecated), or more efficiently as a response that can be streamed directly into an application. The presence or absence of the HTTP response header determines (on store) or indicates (on fetch) whether the artifact is an HTTP response or a plain file.

The repository REST API, and the Java client library we provide, accommodate arbitrarily large files by streaming the payload part directly between disk or application (or, in the case of crawlers, network connection to publishers' sites) and the multipart message being sent or received. Other Java (and Python?) libraries exist which operate similarly, and clients that don't want the maximum artifact size to be limited by available memory should use them or implement similar mechanisms.
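
For clients that implement their own mechanism, the essential idea is to move the payload in fixed-size chunks so that memory use does not grow with artifact size. The following is a minimal sketch of that idea, not the LOCKSS client library: stream_copy is a hypothetical helper, and the SHA-256 digest is an assumption chosen for illustration (this page does not state which digest algorithm the repository uses).

```python
import hashlib
from typing import BinaryIO, Tuple


def stream_copy(src: BinaryIO, dst: BinaryIO,
                chunk_size: int = 64 * 1024) -> Tuple[int, str]:
    """Copy src to dst in chunks; return (bytes copied, SHA-256 hex digest)."""
    digest = hashlib.sha256()  # assumption: algorithm chosen for illustration
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:          # empty read means end of stream
            break
        dst.write(chunk)
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()
```

Here src could be any binary file-like object, such as an open HTTP response, and dst a file opened in "wb" mode; at most chunk_size bytes are held in memory at a time.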

"Store" is accomplished with POST /artifacts to create a new artifact, followed by PUT /artifacts/{artifactUuid} to commit the artifact. Uncommitted artifacts may be looked up by exact name match, and their content may be read, but by default they will not be included in the result set of searches that may return multiple matches.

This two-stage store/commit is motivated by client processes (such as the LOCKSS crawler) that may need to examine/validate the artifact before deciding whether to store it permanently. If the artifact fails validation it may be deleted. If neither committed nor deleted after some time (default 4 hours) it will be automatically deleted.

"Fetch" is also a two-step operation. First an artifact UUID is obtained, either via an artifact lookup/search (with GET /aus/{auid}/artifacts or GET /artifacts), or from a previously returned artifact (e.g., from POST /artifacts). Then the UUID is used in GET /artifacts/{artifactUuid}/response, GET /artifacts/{artifactUuid}/payload, or GET /artifacts/{artifactUuid} to retrieve the actual data.

The same lookup/search endpoints are used to find a single artifact, by fully-specifying the artifact with AUID, URI, and (possibly implicit) version, as well as to search for multiple artifacts, by being less specific. The former will return 0 or 1 artifacts, the latter 0 or N artifacts.

REST endpoints

The repository API doc generated from the API specification omits many details and descriptions that are needed to use the interface. The descriptions below are intended to fill in the gaps.

Endpoints that return lists or collections of objects (Namespaces, AUIDs, Artifacts) may return a page of data containing partial results; see REST-APIs.

POST /artifacts adds an artifact to the repository. The request body is multipart/form-data containing the following parts:

  • artifactProps - a JSON map containing:
    • namespace - optional
    • auid - required
    • uri - required
    • collectionDate - optional. A long integer, default "now". Intended to represent the date/time the file was originally collected from its source
  • httpResponseHeader - optional, the header part of an HTTP response (status line and response headers) iff this artifact represents an HTTP response, otherwise omitted. Note: the presence or absence of this part determines whether a response artifact or a resource artifact will be created.
  • payload - the content bytes.

If successful, an Artifact is returned in the response. That artifact's UUID may be used to commit, read or delete the artifact. If not committed within 4 hours (the default), the artifact will be deleted.
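
The layout of the add-artifact request body can be sketched as follows. The part names (artifactProps, httpResponseHeader, payload) and property names come from the list above; everything else is an assumption made for illustration: the function name build_add_artifact_body is hypothetical, and the per-part Content-Type values are guesses rather than documented requirements. For brevity the sketch builds the body in memory, which a client handling large artifacts should avoid in favour of streaming.

```python
import json
import uuid
from typing import Optional, Tuple


def build_add_artifact_body(
    auid: str,
    uri: str,
    payload: bytes,
    http_response_header: Optional[bytes] = None,
    namespace: Optional[str] = None,
    collection_date: Optional[int] = None,
) -> Tuple[str, bytes]:
    """Return (Content-Type header value, multipart/form-data body)."""
    props = {"auid": auid, "uri": uri}           # required properties
    if namespace is not None:
        props["namespace"] = namespace           # optional
    if collection_date is not None:
        props["collectionDate"] = collection_date  # optional long integer

    # (part name, part Content-Type, data); the Content-Types are assumptions.
    parts = [("artifactProps", "application/json",
              json.dumps(props).encode("utf-8"))]
    if http_response_header is not None:
        # Including this part makes the artifact an HTTP response artifact.
        parts.append(("httpResponseHeader",
                      "application/http;msgtype=response",
                      http_response_header))
    parts.append(("payload", "application/octet-stream", payload))

    boundary = uuid.uuid4().hex
    chunks = []
    for name, content_type, data in parts:
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n'
             f"Content-Type: {content_type}\r\n\r\n").encode("ascii"))
        chunks.append(data)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("ascii"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)
```

Omitting http_response_header produces a body with only the artifactProps and payload parts, which per the note above would create a resource (plain file) artifact.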

Imports artifacts from an archive; currently only WARC archives are supported.

These optional query args are supported:

  • namespace - optional, default "lockss"
  • excludeStatusPattern - regular expression matching status codes of response records not to import. (E.g., use "(4|5).." to exclude error responses from a crawler-generated WARC)
  • storeDuplicate - if "true", duplicate artifacts will be stored anyway.

The request body is multipart/form-data containing the following parts:

  • auid
  • archive - the WARC file

An artifact will be stored for each response or resource record in the WARC. If excludeStatusPattern is supplied, response records with a matching HTTP status code will not be stored. By default, if an artifact with the same namespace/auid/URI already exists in the repository, a new artifact will not be stored if its content would be identical to the (latest) already existing version. Set storeDuplicate=true to suppress the duplicate check. The response is a sequence of ImportStatus objects describing each artifact added to the repository.
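
The effect of excludeStatusPattern can be previewed locally before submitting an import. The sketch below is illustrative only: is_excluded is a hypothetical helper, and it assumes the pattern must match the entire status code. The regular expression dialect and matching mode used by the server are not stated on this page.

```python
import re


def is_excluded(status_code: int, exclude_status_pattern: str) -> bool:
    """Return True if a response record with this status would be skipped."""
    # Assumption: the pattern is matched against the whole status code.
    return re.fullmatch(exclude_status_pattern, str(status_code)) is not None


# The example pattern from the text: exclude 4xx and 5xx responses.
pattern = "(4|5).."
kept = [code for code in (200, 301, 404, 503) if not is_excluded(code, pattern)]
```

With the pattern "(4|5)..", the 404 and 503 records are excluded and the 200 and 301 records are kept.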

PUT /artifacts/{artifactUuid} commits a just-added artifact. Artifacts must be committed in order to become permanent.

  • artifactUuid - the UUID of the artifact returned by the add request.
  • committed - "true" to commit the artifact.
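
A commit request might be constructed as in the following sketch. The endpoint PUT /artifacts/{artifactUuid} and the committed parameter come from this page; the function name build_commit_request, the base URL, and the choice to send committed as a query parameter are assumptions, and authentication is not shown. The request is only built, not sent.

```python
import urllib.parse
import urllib.request


def build_commit_request(base_url: str, artifact_uuid: str) -> urllib.request.Request:
    """Build (but do not send) a request that commits an artifact."""
    path = "/artifacts/" + urllib.parse.quote(artifact_uuid, safe="")
    # Assumption: 'committed' is passed as a query parameter.
    query = urllib.parse.urlencode({"committed": "true"})
    url = f"{base_url.rstrip('/')}{path}?{query}"
    return urllib.request.Request(url, method="PUT")


# Hypothetical base URL and UUID, for illustration only.
req = build_commit_request("http://localhost:8080/", "abc-123")
```

Passing req to urllib.request.urlopen would send it; a client would do this only after validating the artifact returned by the add request.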

GET /aus/{auid}/artifacts looks up artifacts in a single AU and returns a paged list of all those that match all of the supplied criteria. The normal way to look up a single artifact by name is to supply just an AUID and URI, and allow version to default to 'LATEST'. This (along with the default namespace) uniquely identifies an artifact; it will return an array containing zero (if not found) or one artifact.

  • auid - required.
  • namespace - optional, default "lockss".
  • uri - optional. If included restricts the results to those artifacts in the AU with the specified URI.
  • uriPrefix - optional. If included restricts the results to those artifacts in the AU whose URI begins with the specified string.
  • version - optional. If omitted, or the string 'LATEST', only the most recently stored version of each otherwise matching artifact will be included in the result. If the string 'ALL', all versions of each otherwise matching artifact will be included. If an integer, a uri must also be specified, and if an artifact with that uri and version exists it will be returned.
  • includeUncommitted - optional, default false. If true the result will include any matching uncommitted artifacts.
  • limit - optional, the maximum number of artifacts returned in a page. If omitted, defaults to the configured global default (which defaults to 1000).
  • continuationToken - optional, the continuationToken returned from the previous request. If supplied, any other artifact-specifying parameters will be ignored.
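
The parameters above might be assembled into a lookup URL as sketched below. The parameter names come from the list above and the path from GET /aus/{auid}/artifacts; the function name build_au_artifacts_url and the base URL are hypothetical, and the sketch assumes every parameter other than auid is sent as a query parameter. AUIDs and URIs commonly contain reserved characters, so both are percent-encoded.

```python
import urllib.parse
from typing import Optional, Union


def build_au_artifacts_url(
    base_url: str,
    auid: str,
    uri: Optional[str] = None,
    uri_prefix: Optional[str] = None,
    version: Optional[Union[int, str]] = None,   # int, "LATEST" or "ALL"
    include_uncommitted: bool = False,
    limit: Optional[int] = None,
    continuation_token: Optional[str] = None,
    namespace: Optional[str] = None,
) -> str:
    """Build the URL for an artifact lookup within one AU."""
    if isinstance(version, int) and uri is None:
        raise ValueError("an integer version requires a uri")
    params = {}
    if namespace is not None:
        params["namespace"] = namespace
    if uri is not None:
        params["uri"] = uri
    if uri_prefix is not None:
        params["uriPrefix"] = uri_prefix
    if version is not None:
        params["version"] = str(version)
    if include_uncommitted:
        params["includeUncommitted"] = "true"
    if limit is not None:
        params["limit"] = str(limit)
    if continuation_token is not None:
        params["continuationToken"] = continuation_token
    path = "/aus/" + urllib.parse.quote(auid, safe="") + "/artifacts"
    query = urllib.parse.urlencode(params)
    return base_url.rstrip("/") + path + ("?" + query if query else "")
```

Supplying only auid and uri corresponds to the single-artifact lookup described above, since version defaults to 'LATEST' on the server side. To fetch a following page, the same function can be called again with the continuation_token returned by the previous request.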

GET /artifacts looks up artifacts across AUs and returns those that match all of the supplied criteria. The interpretation of the parameters is the same as for the preceding endpoint, except:

  • version - optional, must be either ALL or LATEST, defaults to ALL.

GET /artifacts/{artifactUuid}/response retrieves an artifact's data as an HTTP response, with status line, headers and (optionally) body. This HTTP response is the body of the REST response; it is incumbent upon the client to parse the response into headers and body.

  • artifactUuid - required.
  • includeContent - optional, determines whether the payload will be included in the response. Some processes need only the HTTP response headers; some don't know whether they need the content until after examining the response headers. As the content may be arbitrarily large, including it when not needed may be inefficient. Hence this enum:
    • ALWAYS (default) - include content (payload).
    • NEVER - do not include the content.
    • IF_SMALL - include the payload if it's below a threshold size. (Mitigates the cost of a second request in cases where including the payload is cheap. Useful when it's not known whether the payload will be needed until the headers are examined.)

The response will have Content-Type: application/http;msgtype=response. Normally (for response artifacts) the body will be the HTTP response collected by the crawler (with some additional X-Lockss headers). For resource artifacts, the header part will be synthesized with the Content-Type, Content-Length and X-LockssRepo-Artifact-Digest reflecting the artifact data, preceded by a 200 status line. Depending on includeContent, the response body may be elided.
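
Because the client must split the application/http body into its parts itself, a parsing step along the following lines is needed. This is a minimal sketch with a hypothetical function name, parse_http_response. It assumes CRLF line endings and that the whole message fits in memory; a client handling arbitrarily large payloads would instead read the status line and headers from a stream and leave the body unread until needed.

```python
import email
from email.message import Message
from typing import Tuple


def parse_http_response(raw: bytes) -> Tuple[int, str, Message, bytes]:
    """Split a raw HTTP response into (status, reason, headers, body)."""
    # The header section ends at the first blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, _, header_block = head.partition(b"\r\n")

    # Status line looks like: HTTP/1.1 200 OK
    fields = status_line.decode("iso-8859-1").split(" ", 2)
    status = int(fields[1])
    reason = fields[2] if len(fields) > 2 else ""

    # The email parser handles RFC 822 style header fields.
    headers = email.message_from_bytes(header_block)
    return status, reason, headers, body


# Made-up example message, for illustration only.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
status, reason, headers, body = parse_http_response(raw)
```

When the request was made with includeContent=NEVER, the same function returns an empty body, and the caller can decide from the headers whether to fetch the payload separately.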

GET /artifacts/{artifactUuid}/payload retrieves an artifact's raw data (payload) only.

  • artifactUuid - required.
  • includeContent - optional, as above (though of questionable utility here)

The response body is the raw artifact content, with no headers. The REST response headers Content-Type (if known), Content-Length and X-Lockss-Payload-Digest reflect the artifact content.

GET /artifacts/{artifactUuid} retrieves an artifact's data (HTTP response status and headers and optionally payload) as a multipart response. This endpoint is deprecated and will be removed: efficiently handling large multipart responses in the client has proven difficult.

  • artifactUuid - required.
  • includeContent - optional, determines whether the payload part will be included in the response. Some processes need only the HTTP response headers; some don't know whether they need the content until after examining the response headers. As the content may be arbitrarily large, including it when not needed may be inefficient. Hence this enum:
    • ALWAYS (default) - include content (payload).
    • NEVER - do not include the content.
    • IF_SMALL - include the payload if it's below a threshold size. (Mitigates the cost of a second request in cases where including the payload is cheap. Useful when it's not known whether the payload will be needed until the headers are examined.)

The response is a multipart/form-data containing these parts:

  • artifactProps - a JSON map containing the identification info of the artifact, plus some metadata:
    • namespace
    • auid
    • uri
    • version
    • uuid
    • collectionDate
    • contentLength
    • contentDigest
  • httpResponseHeader - the header part of an HTTP response (status line and response headers) iff this artifact is an HTTP response, otherwise omitted.
  • payload - the actual data.

deletes the artifact with the matching artifactUuid. If the artifact hasn't been committed, the space will be reclaimed; if it has been committed, the space likely won't be reclaimed.

returns an AuSize object with the total payload size of all artifacts in the AU and of all latest-version artifacts in the AU, and the disk space consumed by the AU.

returns a paged list of all AUIDs of artifacts in the repository. Note that this isn't necessarily the same as the list of all AUs that have been configured in this LOCKSS node - it will not include configured AUs in which no content has been stored, and may include formerly- (or never-) configured AUs.

starts or ends "bulk mode", which allows a batch of artifacts to be added at much greater speed by updating the index only once at the end of the batch.

  • auid - required.
  • namespace - optional, default "lockss".
  • op - either "start" or "finish"

returns a paged list of all the namespaces.

returns a RepositoryInfo containing characteristics and capacity of storage spaces, etc.

Replay engine support

These endpoints are used to support replay engines; they're likely not of interest to most users. The first two return CDX records containing a "filename" which is actually an encoded string describing a WARC record in a particular WARC file. Supplying this "filename" to the third endpoint will return a WARC file containing the desired record.

looks up an artifact and returns a CDX record suitable for use by OpenWayback

  • namespace - required.
  • q - required, an OpenWayback query string. Supported fields are url and type (urlquery/prefixquery).
  • count - optional, maximum number of results returned in a page.
  • start_page - optional, page number of results to return (1-based).

looks up an artifact and returns a CDX record suitable for use by PyWb.

  • namespace - required.
  • uri - required.
  • matchType - optional, either "exact" or "prefix". Default is "exact".
  • output - optional, specifies the format in which the results should be formatted. Either "cdx" or "json"; default is "cdx".
  • closest - optional, a timestamp. If supplied the results will be sorted by temporal proximity to the target timestamp.
  • sort - NYI.
  • fl - NYI.
  • limit - NYI.

retrieves a synthetic WARC file containing the requested record.

  • filename - required, should be the "filename" from (one of) the CDX record(s) returned from one of the two previous endpoints.

To Do:

  • Describe atomicity, commit, bulk