Crawler Service
Crawler Service is an extensible service managing various methods of content collection. It provides a REST API which allows clients to request and monitor crawls using a variety of collection mechanisms. ("Crawl" in this context means any process which collects content from an external source. "External crawl" is any crawl not performed by the classic LOCKSS crawler.) The alpha7 release supports the classic LOCKSS crawler and Wget. Upcoming releases will support crawlers such as Crawljax and Heritrix, as well as a loadable plugin architecture that allows PLN admins to integrate additional crawlers (similar to the architecture for publisher plugins). Any crawler that produces WARC files as output can be integrated.
In release 2.0-alpha7 all external crawls are initiated by clients, and the content is imported into a Named Archival Unit (see PLN Usage). Thus the same crawl must be initiated on each node in order to ingest the content into a LOCKSS network (analogous to direct deposit). A future release will extend publisher plugins to be able to define the parameters of external crawls, which will allow them to be run automatically on all nodes, as with classic crawls. Until that happens, the model for using external crawlers is similar to that for direct deposit: obtain an AUID for a Named AU, configure the AU, then request a crawl. See Wget Crawler below.
The CrawlerService API doc generated from the API specification omits many details and descriptions that are needed to use the interface. The descriptions below are intended to fill in the gaps.
All fields and arguments not listed as optional are required. Unless otherwise indicated, all request and response bodies are JSON representations of the designated object. Timestamps are standard *nix milliseconds since the epoch.
The Crawler API accepts and returns three primary objects:
- CrawlDesc contains the parameters necessary to invoke a crawl. It is used in both requests and status responses.
- CrawlJob describes the status of a queued or executing crawl job. It contains the request info but no details of the actual crawl.
- CrawlStatus contains the detailed crawler-specific progress of a running or completed crawl.
A CrawlDesc has the following fields (a sample request body appears after this list):
- auid - the AUID of the AU into which the content should be ingested.
- crawlKind - either "newContent", for a crawl that follows links, or "repair", which collects only those URLs explicitly specified.
- crawlerId - Currently one of "classic" or "wget".
- forceCrawl - optional; if true, any conditions that would otherwise prevent the crawl (such as crawl windows or a too-recent crawl) are ignored.
- refetchDepth - optional, defaults to -1
- priority - optional, defaults to value in global config (which defaults to 10). Used to order the crawl queue.
- crawlList - list of URLs to crawl (and follow links from, for new content crawls)
- crawlDepth - optional, maximum depth to which to follow links.
- extraCrawlerData - optional, a map of crawler-specific argument names and values.
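Put together, a CrawlDesc requesting a new-content crawl with the classic crawler might look like the following sketch. The AUID and start URL are placeholders, the optional fields may be omitted, and crawlList is shown as a JSON array of URLs per the field description above:

```json
{
  "auid": "<AUID of the target AU>",
  "crawlKind": "newContent",
  "crawlerId": "classic",
  "crawlList": ["http://example.com/"],
  "forceCrawl": false,
  "priority": 10
}
```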
A CrawlJob has the following fields (a sample response appears after this list):
- crawlDesc - the CrawlDesc used to request the crawl.
- requestDate - timestamp when the crawl request was received.
- jobId - a unique identifier created by the system when the request is received.
- jobStatus - a jobStatus indicating the current state of the crawl job.
- startDate - timestamp when the crawl job started running.
- endDate - timestamp when the crawl job finished.
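A CrawlJob returned for such a request might look roughly like the following; the values and the exact shape of the jobStatus object are illustrative, and startDate and endDate remain unset until the job starts and finishes:

```json
{
  "crawlDesc": {
    "auid": "<AUID of the target AU>",
    "crawlKind": "newContent",
    "crawlerId": "classic",
    "crawlList": ["http://example.com/"]
  },
  "requestDate": 1589240000000,
  "jobId": "<jobId assigned by the service>",
  "jobStatus": {"statusCode": "QUEUED", "msg": "Pending"},
  "startDate": null,
  "endDate": null
}
```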
The fields of a CrawlStatus that are filled in vary depending on the information available from the particular crawler. Several of the fields are Counters which, depending on system configuration, may contain just a count, or a count and a link to a pager of the actual values.
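As a purely illustrative sketch (the field names here are hypothetical, not taken from the API specification), a Counter inside a CrawlStatus might appear either as a bare count or as a count plus a link to the values:

```json
{
  "jobId": "<jobId>",
  "fetchedItems": {"count": 1532, "itemsLink": "/crawls/<jobId>/fetched"},
  "excludedItems": {"count": 47}
}
```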
The /jobs endpoints manipulate and query the queue of CrawlJobs: enqueuing, deleting, and querying the status of queued or running crawls. The /crawls endpoints query detailed information about individual crawls, analogous to the crawl status info available from the classic UI. Example curl requests appear after the list of endpoints.
POST /jobs enqueues a crawl job. The request body should be a CrawlDesc. The response is a CrawlJob.
DELETE /jobs deletes all queued and running crawl jobs.
GET /jobs returns a pager of all queued and running CrawlJobs.
DELETE /crawls/{jobId} deletes the enqueued or running crawl with the specified jobId.
GET /crawls/{jobId} returns a CrawlStatus with detailed information about the selected crawl.
GET /crawls returns a pager of the CrawlStatus of all known crawl jobs.
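A minimal curl sketch of these endpoints, assuming the Crawler Service endpoint and credentials shown in the variables below (host, port, and credentials depend on your installation, and the CrawlDesc body is abbreviated):

```sh
CRAWLER="http://localhost:24660"     # Crawler Service REST endpoint (assumed host/port)
CRED="lockss-user:lockss-password"   # REST credentials (placeholder)

# Enqueue a crawl job; the response is a CrawlJob containing the assigned jobId.
curl -s -u "$CRED" -X POST "$CRAWLER/jobs" \
  -H "Content-Type: application/json" \
  -d '{"auid": "<AUID>", "crawlKind": "newContent", "crawlerId": "classic",
       "crawlList": ["http://example.com/"]}'

# List all queued and running crawl jobs.
curl -s -u "$CRED" "$CRAWLER/jobs"

# Fetch the detailed CrawlStatus of one crawl.
curl -s -u "$CRED" "$CRAWLER/crawls/<jobId>"

# Delete one enqueued or running crawl, or all of them.
curl -s -u "$CRED" -X DELETE "$CRAWLER/crawls/<jobId>"
curl -s -u "$CRED" -X DELETE "$CRAWLER/jobs"
```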
The Wget Crawler invokes wget with user-supplied arguments to crawl a site and build a WARC file, then imports the contents of the WARC file into the repository. The resulting AU will be polled and can be browsed using ServeContent or another replay engine.
To use wget to crawl an AU, perform the following steps on each node on which the AU will be preserved. See this script for exact details; a condensed curl sketch appears after the list.
- Choose a handle for the AU, to uniquely identify the AU. Any string is allowed; the same handle must be used on each node.
- Obtain the AU's AUID using POST /auids on any box in the PLN.
- If the AU has not already been configured, configure it by sending PUT /aus/{auid} to the Configuration Service. This step should not be repeated if the AU is already configured, e.g., if the crawl is rerun or a different crawl into the same AU is run: the current mechanism stores some state information as non-definitional parameters in the AU configuration, and configuring the AU a second time would overwrite that state. The configuration sent in the request should look like {"auId": "<auid>", "auConfig": {"handle": "<handle>", "features": "crawledAu"}}.
- Wait for a few seconds to ensure the AU has been created.
- Request a crawl by sending POST /jobs to the Crawler Service. The body should be a CrawlDesc containing at least auid, crawlKind="newContent", crawlerId="wget", and crawlList (a semicolon-separated list of start URLs). See the script for a suggested value for extraCrawlerData. If the request is accepted, the resulting jobId will be contained in the returned CrawlJob.
- The status of the crawl can be monitored in Crawler Service's UI, under Daemon Status / Crawl Status.
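The following condensed sketch shows the sequence with curl. Hostnames, ports, and credentials are placeholders, the AUID is assumed to have already been obtained via POST /auids, crawlList is shown as a JSON array of start URLs per the field list above, and extraCrawlerData is omitted (see the referenced script for the authoritative version and a suggested extraCrawlerData value):

```sh
CFG="http://localhost:24620"         # Configuration Service REST endpoint (assumed host/port)
CRAWLER="http://localhost:24660"     # Crawler Service REST endpoint (assumed host/port)
CRED="lockss-user:lockss-password"   # REST credentials (placeholder)
HANDLE="my-wget-collection"          # handle chosen for the Named AU
AUID="..."                           # AUID returned by POST /auids for this handle

# Configure the AU (skip if already configured; see the caveat above).
# The AUID may need URL-encoding when used in the request path.
curl -s -u "$CRED" -X PUT "$CFG/aus/$AUID" \
  -H "Content-Type: application/json" \
  -d "{\"auId\": \"$AUID\",
       \"auConfig\": {\"handle\": \"$HANDLE\", \"features\": \"crawledAu\"}}"

# Give the AU a few seconds to be created.
sleep 5

# Request the wget crawl; the response is a CrawlJob containing the jobId.
curl -s -u "$CRED" -X POST "$CRAWLER/jobs" \
  -H "Content-Type: application/json" \
  -d "{\"auid\": \"$AUID\",
       \"crawlKind\": \"newContent\",
       \"crawlerId\": \"wget\",
       \"crawlList\": [\"http://example.com/\"]}"
```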