jq module to process Wikidata JSON format
This git repository contains a module for the jq data transformation language to process entity data from Wikidata or other Wikibase instances serialized in its JSON format.
Several methods exist to get entity data from Wikidata. This module is designed to process entities in their JSON serialization, especially in large numbers. For other use cases, consider a dedicated client such as wikidata-cli.
Installation requires jq version 1.5 or newer.
Put wikidata.jq in a place where jq can find it as a module.
One way to do so is to check out this repository to directory ~/.jq/wikidata/:
mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
The shortest way to use functions of this jq module is to include the module directly. Try processing a single Wikidata entity (see below for details about per-item access):
wget http://www.wikidata.org/wiki/Special:EntityData/Q42.json
jq 'include "wikidata"; .entities[].labels|reduceLabels' Q42.json
It is recommended to put Wikidata entities in a newline-delimited JSON file:
jq -c '.entities[]' Q42.json > entities.ndjson
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
More complex scripts are better put into a .jq file:
include "wikidata";
.labels|reduceLabels
The file can then be processed this way:
jq -f script.jq entities.ndjson
Wikidata JSON dumps are made available at https://dumps.wikimedia.org/wikidatawiki/entities/. Current dumps exceed 35 GB even in their most compressed form. Each dump file contains one large JSON array, so it is best converted into a stream of JSON objects for further processing.
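The conversion from one large array into a stream of objects can be sketched with jq's built-in streaming functions alone. This is a generic jq idiom shown on made-up sample data, not necessarily how the module's ndjson function is implemented; the file names sample-dump.json and sample.ndjson are placeholders.

```shell
# Tiny stand-in for a dump: a single JSON array of entities.
# The two entities are invented for illustration only.
cat > sample-dump.json <<'EOF'
[
{"id":"Q1","type":"item"},
{"id":"Q2","type":"item"}
]
EOF

# --stream parses the input incrementally into path/value events.
# truncate_stream drops the leading array index from each path and
# fromstream reassembles one complete object per array element.
jq -nc --stream 'fromstream(1|truncate_stream(inputs))' sample-dump.json > sample.ndjson

cat sample.ndjson
```

Because the array is never loaded as a whole, memory use stays bounded by the size of a single entity rather than the size of the file.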
With a fast and stable internet connection it's possible to process the dump on the fly like this:
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
| bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id
JSON data for single entities can be obtained via the Entity Data URL. Examples:
- https://www.wikidata.org/wiki/Special:EntityData/Q42.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006-F1.json
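The examples above share one pattern: the identifier followed by .json is appended to a fixed prefix. A minimal sketch of that pattern in plain jq follows; entity_url is a hypothetical helper name introduced here for illustration and is not part of the module.

```shell
# Hypothetical helper (not part of the module): print the Entity Data URL
# for each identifier passed as an argument, one URL per line.
entity_url() {
  printf '%s\n' "$@" |
    jq -rR '"https://www.wikidata.org/wiki/Special:EntityData/\(.).json"'
}

entity_url Q42 L3006 L3006-F1
```

The -R option makes jq read each input line as a raw string, and -r prints the resulting URL without JSON quoting.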
The module function entity_data_url creates these URLs from Wikidata
identifier strings. The resulting data is wrapped in a JSON object; unwrap it with
.entities|.[]:
curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
As mentioned above, it is better to use wikidata-cli for accessing small sets of items:
wd d Q42
To get sets of items that match given criteria, use either SPARQL or the MediaWiki API modules wbsearchentities and/or wbgetentities.
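As a rough sketch of the wbgetentities route, a request URL for several identifiers can be assembled with plain jq. The helper name wbgetentities_url is hypothetical and not part of the module; the sketch only builds the URL and performs no network request.

```shell
# Hypothetical helper (not part of the module): build a wbgetentities
# request URL for all identifiers passed as arguments.
wbgetentities_url() {
  printf '%s\n' "$@" |
    # Collect the raw input lines, join them with "|" as the API expects
    # for multiple ids, and percent-encode the result with @uri.
    jq -rRn '[inputs] | join("|") | @uri
      | "https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=\(.)"'
}

wbgetentities_url Q42 Q1

# Fetching and unwrapping would then look like this (not run here):
#   curl -s "$(wbgetentities_url Q42 Q1)" | jq -c '.entities[]'
```

The response wraps the entities in an object in the same way as the Entity Data URL, so the same unwrapping step applies.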
Use function reduceEntity or more specific functions (reduceInfo, reduceItem, reduceProperty, reduceLexeme) to reduce the JSON data structure without loss of essential information.
Further select only some specific fields if needed:
jq '{id,labels}' entities.ndjson
Applies reduceInfo and one of reduceItem, reduceProperty, reduceLexeme.
reduceEntity
Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.
reduceItem
Simplifies labels, descriptions, aliases, and claims of a property.
reduceProperty
.labels|reduceLabels
.descriptions|reduceDescriptions
.aliases|reduceAliases
.sitelinks|reduceSitelinks
Simplifies lemmas, forms, and senses of a lexeme entity.
reduceLexeme
.forms|reduceForms
.senses|reduceSenses
Removes unnecessary fields .id, .hash, .type, .property and simplifies
values for each claim.
.claims|reduceClaims
Reduces a single claim value.
.claims.P26[]|reduceClaim
...
Only lexemes have forms.
.forms|reduceForms
reduceInfo
Removes additional information fields pageid, ns, title, lastrevid, and modified.
To remove selected fields, see the jq function del.
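Removing selected fields with del can be sketched in plain jq as follows. The sample entity and its values are invented for illustration, and the filter only imitates the field list named above; it is not the module's implementation of reduceInfo.

```shell
# One made-up entity in newline-delimited form (all values are placeholders).
cat > info-sample.ndjson <<'EOF'
{"pageid":1,"ns":0,"title":"Q42","lastrevid":2,"modified":"2020-01-01T00:00:00Z","type":"item","id":"Q42","labels":{"en":{"language":"en","value":"Douglas Adams"}}}
EOF

# del takes one or more path expressions and removes those fields;
# all other fields are passed through unchanged.
jq -c 'del(.pageid, .ns, .title, .lastrevid, .modified)' info-sample.ndjson > info-reduced.ndjson

cat info-reduced.ndjson
```

Deleting a field that does not exist is not an error, so the same filter can be applied to entities that lack some of these fields.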
The module function ndjson converts a stream containing an array of
entities into a list of entities:
bzcat latest-all.json.bz2 | jq -n --stream 'include "wikidata"; ndjson'
Alternative, possibly more performant methods to process an array of entities exist, for example with standard Unix tools:
bzcat latest-all.json.bz2 | head -n-1 | tail -n+2 | sed 's/,$//'
The source code is hosted at https://github.com/nichtich/jq-wikidata.
Bug reports and feature requests are welcome!
Made available under the MIT License by Jakob Voß.