1 Introduction

Natural joins are fundamental in the relational algebra, and generally the most costly operations to implement. A poor implementation choice can lead to unaffordable query times, so the implementation of joins has been a concern since the birth of the relational model. Apart from efficient algorithms to join two tables (i.e., pair-wise joins), database management systems sought optimized strategies (e.g., [43]) to join several tables (i.e., multijoins), where differences between good and bad plans can be huge in terms of efficiency. Multijoins were typically handled as sequences of pair-wise joins: a query plan was a binary expression tree where the leaves were the tables to join and the internal nodes were the (pair-wise) joins to carry out. The main optimization concern was to avoid huge intermediate results, much larger than the final outputs, at internal nodes of the expression tree.

The concept of a worst-case optimal (wco) algorithm [8] was coined to define a multijoin algorithm that does not produce those huge intermediate results. Formally, a wco algorithm takes time \(\tilde{O}(Q^*)\), where \(Q^*\) is the largest output size of the query on any database instance with the same table sizes as the given one (\(\tilde{O}(Q^*)\) allows multiplying \(Q^*\) by terms that do not depend, or depend only logarithmically, on the database size). Atserias et al. [8] proved that no multijoin algorithm based on pair-wise joins can be wco. Several wco join algorithms have been proposed since then [21, 27, 35,36,37, 39, 45].

Leapfrog Triejoin (LTJ) [45] is probably the simplest and most popular wco algorithm. At a high level, it can be regarded as reducing the multijoin by one attribute at a time, instead of by one relation at a time as in the pair-wise-join based query plans. LTJ chooses a suitable order in which the joined attributes will be eliminated (which means finding all their possible values in the output and branching on the subset of the output matching each such value). To proceed efficiently, LTJ needs the rows of each relation stored in a trie (or digital tree) where the root-to-leaf attribute order is consistent with the chosen attribute elimination order. Even though LTJ is wco with any elimination order, it turns out that, just like with the traditional query plans, there can be large performance differences when choosing different orders [17, 45]. This means, first, that choosing a good order is essential and, second, that LTJ needs tries storing each relation in every possible order of its attributes, that is, d! tries for a relation with d attributes that can participate in joins.

This high space requirement shows up, in one form or another, in all the existing wco algorithms, and has become an obstacle to their full adoption in database systems. Wco algorithms are of particular interest in graph databases, which can be regarded as labeled graphs, or as a single relational table with three attributes: source node, label, and target node. Standard query languages for graph databases like SPARQL [16] feature most prominently basic graph patterns (BGPs), which essentially are a combination of multijoins and simple selections. The concept of wco algorithms, as well as LTJ, can be translated into solving BGPs on graph databases [17]. This is very relevant because typical BGPs correspond to large and complex multijoins by non-key attributes [1, 17, 20, 39], where non-wco algorithms can be orders of magnitude slower than wco ones [1]. Still, LTJ needs \(3! = 6\) copies of the database in the form of tries, which even for three attributes is sufficiently space-demanding to discourage its full adoption (for example, one may restrict the query language to force labels to be constant in queries, so that only two tries are needed).

The implementation of various wco indices for graph databases seems to confirm that large space usage will be the price for featuring wco query times. For example, a wco version of Jena [17] doubles the space of the original non-wco version. Efficient wco implementations like Jena LTJ [17] and MillenniumDB [47] use around 14 times the space required to store the graph triples in raw form. The most popular systems for graph databases, like Jena [17], Virtuoso [13], RDF-3X [34], or Blazegraph [44], for example, give up on worst-case optimality in order to use “only” 5 to 7.5 times the size of a plain triple storage.

1.1 Our contribution

In this paper we show that, by using compact data structures, it is possible to achieve worst-case optimality with an index that is as fast as the fastest classical ones (and sometimes even faster), while using much less space than the orders-of-magnitude-slower classic indices: as little as 2.3 times the space of the raw triple data. In more detail:

  1. We show how to implement the 6-trie wco LTJ algorithm in little space by adapting compact data structures for ordinal trees [18], in a way that requires only one bit, instead of one pointer, per trie edge. We further reduce space by storing only partial tries, using trie switching [5] to retain full functionality. The resulting structure, which we call CompactLTJ, uses 16%–18% of the space of classic LTJ implementations that store the 6 tries (MillenniumDB, Jena LTJ), and 30%–46% of the space used by other non-wco systems (Virtuoso, RDF-3X, Blazegraph). Our index matches the query time performance of the fastest wco system (MillenniumDB), while outperforming the others—particularly the non-wco systems—by a factor of 30–40.

  2. We explore the use of adaptive variable elimination orders in LTJ, which recompute the best order as the join proceeds and better estimations are available. We further use an estimator for the next variable to bind that turns out to be more accurate. The combination obtains much more stable times than the traditional global-order strategy. For example, it makes CompactLTJ up to twice as fast as MillenniumDB to obtain the first 1000 results.

  3. We incorporate dynamism into CompactLTJ, allowing the insertion and deletion of triples in the graph. By resorting to a recent technique to represent dynamic arrays and bitvectors so that the performance adapts to the frequency of the updates [30], we make CompactLTJ dynamic while retaining the performance of the static version, even under very high update frequencies (e.g., just 30% slower when receiving 1,000 updates between each pair of consecutive queries, and much less under more typical regimes).

  4. We make available public versions of CompactLTJ that can be directly used by practitioners. These include our best static and dynamic variants and address practical issues like handling large outputs and using actual strings for IRIs and literals instead of numeric identifiers. The mapping from strings to internal identifiers and back is done through new compact data structures for string dictionaries, which add a modest amount of extra space and query time to the previous figures, both for static and dynamic variants of CompactLTJ.

Recent research [5] has shown that it is possible to go further in space reduction, so as to simulate the LTJ data structures within 0.6 to 1.0 times the size of the raw triple data. This significant reduction has a cost in terms of time performance, however: we show in the experiments that CompactLTJ is 30–60 times faster than these compressed data structures. We also show that other recent indices that offer beyond-wco query time guarantees, like Graphflow [26], ADOPT [48], and EmptyHeaded [1], do outperform CompactLTJ on particularly difficult queries, but again use 3–8 times more space. The techniques we develop in this paper could be used to develop more compact versions of those more powerful indices as well.

An early partial version of this paper appeared in Proc. GRADES’24 [4]. This extended version includes, most prominently, the trie-switching technique to further reduce space, the implementation of dynamism, and the handling of string identifiers, all with their corresponding experiments. We also describe a public software for practitioners.

2 Preliminary Concepts

2.1 Graph joins

2.1.1 Edge-Labeled Graphs

Let \(\mathcal {U}\) be a totally ordered, countably infinite set of constants, which we call the universe. In the RDF model [24], an edge-labeled graph is a finite set of triples \(G \subseteq \mathcal {U}^3\), where each triple \((s, p, o)\in \mathcal {U}^3\) encodes the directed edge \(s \xrightarrow {p} o\) from vertex s (the subject) to vertex o (the object), with edge label p (the predicate). We call \(N=|G|\) the number of triples in G. We also call \(\textrm{dom}(G) = \{s,p,o~|~(s,p,o) \in G\}\) the subset of \(\mathcal {U}\) used as constants in G. For any element \(u\in \mathcal {U}\), let \(u+1\) denote the successor of u in the total order \(\mathcal {U}\). We also denote \(U = \max \textrm{dom}(G)\). For simplicity, we will assume that the constants in \(\mathcal {U}\) have been mapped to integers in the range \([1\mathinner {.\,.}U]\), and will even assume \(\mathcal {U}= [1\mathinner {.\,.}U]\).

Example 1

Fig. 1 shows an example graph and the corresponding mapping of the constants in \(\mathcal {U}\) to integers. Nodes are Physics researchers and the Nobel prize. Labels indicate researchers advised by others (\(\textsf{adv}\)), that were nominated to the Nobel prize (\(\textsf{nom}\)), and that won it (\(\textsf{win}\)). \(\square \)

Fig. 1

A labeled graph G with its string to integer mapping.

2.1.2 Basic Graph Patterns (BGPs)

A graph G is often queried to find patterns of interest, that is, subgraphs of G that are homomorphic to a given pattern Q. Unlike the graph G, which is formed only by constants in \(\mathcal {U}\), a pattern Q can also contain variables, formally defined as follows. Let \(\mathcal {V}\) denote an infinite set of variables, such that \(\mathcal {U}\cap \mathcal {V}=\emptyset \). A triple pattern t is a tuple \((s,p,o) \in (\mathcal {U}\cup \mathcal {V})^3\), and a basic graph pattern (BGP) is a finite set \(Q \subseteq (\mathcal {U}\cup \mathcal {V})^3\) of triple patterns. Each triple pattern in Q is an atomic query over the graph, equivalent to equality-based selections on a single ternary relation. Thus, a BGP corresponds to a full conjunctive query (i.e., a join query plus simple selections) over the relational representation of the graph.

Let \(\text {vars}(Q)\) denote the set of variables used in pattern Q. The evaluation of Q over a graph G is then defined to be the set of mappings \(Q(G):= \{ \mu : \text {vars}(Q) \rightarrow \textrm{dom}(G) \mid \mu (Q) \subseteq G\}\), called solutions, where \(\mu (Q)\) denotes the image of Q under \(\mu \), that is, the result of replacing each variable \(x \in \text {vars}(Q)\) in Q by \(\mu (x)\).

Example 2

A triple pattern, \((\textsf{Nobel},\textsf{win},x)\), on the graph G of Fig. 1, aims to bind variable x to all the values that make the triple (or edge) occur in G, namely \(\textsf{Thorne}\), \(\textsf{Bohr}\), \(\textsf{Thomson}\), and \(\textsf{Strutt}\). For example, in the first case, the triple \((\textsf{Nobel},\textsf{win},\textsf{Thorne})\) is in G. Formally, and already mapping strings to integers, the query with this single triple is \(Q_1=\{(6,9,x)\}\), and its evaluation on G is \(Q_1(G) = \{ \langle \mu (x) = 4\rangle , \langle \mu (x) = 1\rangle , \langle \mu (x) = 3\rangle , \langle \mu (x) = 2\rangle \}\).

Consider now the query formed by the triple patterns \((\textsf{Nobel},\textsf{win},x)\), \((\textsf{Nobel},\textsf{win},y)\), and \((x,\textsf{adv},y)\), which looks for pairs of Nobel winners where one was advised by the other. The answers are the pairs \((\textsf{Bohr},\textsf{Thomson})\) and \((\textsf{Thomson},\textsf{Strutt})\). On a relational table \(T_G\) with attributes (s, p, o), this query corresponds to the relational algebra formula

$$\begin{aligned} & \rho (x/o) (\sigma _{s=\textsf{Nobel},p=\textsf{win}} (T_G)) \bowtie \\ & \rho (y/o) (\sigma _{s=\textsf{Nobel},p=\textsf{win}} (T_G)) \bowtie \\ & \rho (x/s,y/o) (\sigma _{p=\textsf{adv}} (T_G)). \end{aligned}$$

In our formalism, and translating to integers again, our query is \(Q_2 = \{ (6,9,x), (6,9,y), (x,7,y)\}\). Its evaluation is \(Q_2(G) = \{ \langle \mu (x) = 1, \mu (y) = 3\rangle , \langle \mu (x) = 3, \mu (y) = 2\rangle \}\). \(\square \)
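As a sanity check of these semantics, \(Q(G)\) can be computed by brute force, trying every mapping of the variables into \(\textrm{dom}(G)\). The sketch below (Python, for illustration) uses the integer triples of the example graph as read off the trie arrays shown later in Example 10; treat the exact edge set as an assumption of this sketch.

```python
from itertools import product

# The graph of Fig. 1 as integer triples (read off Example 10's arrays);
# 6 = Nobel, 7 = adv, 8 = nom, 9 = win, 1 = Bohr, 2 = Strutt, 3 = Thomson, 4 = Thorne.
G = {(1, 7, 3), (3, 7, 2), (4, 7, 5), (5, 7, 1),
     (6, 8, 1), (6, 8, 2), (6, 8, 3), (6, 8, 4), (6, 8, 5),
     (6, 9, 1), (6, 9, 2), (6, 9, 3), (6, 9, 4)}

def evaluate(Q, G):
    """Brute-force BGP evaluation: try every mapping mu of vars(Q) into dom(G)
    and keep those with mu(Q) a subset of G."""
    dom = {c for t in G for c in t}
    variables = sorted({c for t in Q for c in t if isinstance(c, str)})
    solutions = []
    for values in product(dom, repeat=len(variables)):
        mu = dict(zip(variables, values))
        image = {tuple(mu.get(c, c) for c in t) for t in Q}  # mu(Q)
        if image <= G:
            solutions.append(mu)
    return solutions

Q1 = [(6, 9, 'x')]
Q2 = [(6, 9, 'x'), (6, 9, 'y'), ('x', 7, 'y')]
```

Running `evaluate` on \(Q_1\) and \(Q_2\) reproduces the evaluations given above; of course, real engines never enumerate all mappings, which is exactly what LTJ (Section 2.2.2) avoids.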

2.2 Worst-case optimal joins

2.2.1 The AGM bound

A well-established bound to analyze join algorithms is the AGM bound, introduced by Atserias et al. [8], which sets a limit on the maximum output size for a natural join query. Let Q denote such a query and D a relational database instance. The AGM bound of Q over D, denoted \(Q^*\), is the maximum number of tuples generated by evaluating Q over any database instance \(D'\) containing a table \(R'\) for each table R of D, with the same attributes and \(|R'| \le |R|\) tuples. Though BGPs extend natural joins with self joins, constants in \(\mathcal {U}\), and the multiple use of a variable in a triple pattern, the AGM bound can still be applied to them by regarding each triple pattern as a relation formed by the triples that match its constants [17].

Given a join query (or BGP) Q and a database instance D, a join algorithm enumerates Q(D), the solutions for Q over D. A join algorithm is worst-case optimal (wco) if it has a running time in \(\tilde{O}(Q^*)\), which is \(O(Q^*)\) multiplied by terms that do not depend, or depend only polylogarithmically, on |D|. Atserias et al. [8] proved that there are queries Q for which no plan using only pair-wise joins can be wco.

Example 3

The query \(Q_2\) of Ex. 2 is a so-called triangle query. When it has no constants, the maximum output size of a triangle query on a graph of N edges is \(O(N^{3/2})\). All pairwise-join strategies used in relational databases take time \(\Theta (N^2)\) to solve the triangle query on certain graphs, and thus are not wco. A wco algorithm must solve the triangle query in time \(\tilde{O}(N^{3/2})\). \(\square \)

We describe next the most frequently implemented wco algorithm.

Fig. 2

The six tries corresponding to the graph of Fig. 1. The \(\tau \) marks on some nodes correspond to Ex. 5.

2.2.2 Leapfrog TrieJoin (LTJ)

We describe a popular wco algorithm, Leapfrog Triejoin [45], originally designed for natural joins in relational databases, as it is adapted for BGP matching on labeled graphs [17]. This algorithm relies on the trie representation of the graph edges (i.e., the triples). To work properly, LTJ requires \(3!=6\) different tries to be stored, each representing the triples in a specific order of their components. The reason for this requirement will become clear later. We call these orders spo, sop, pos, pso, osp, and ops. For each triple \((s,p,o) \in G\), there is a corresponding root-to-leaf path labeled s, p, and o, in the spo trie. Similarly, there is a path labeled s, o, and p in the sop trie, and so on for the remaining orders. Consequently, each trie has height 3 and N leaves.

Example 4

Fig. 2 shows the 6 tries corresponding to the graph of Fig. 1; the root of each trie indicates the order. Disregard for now the marks \(\tau \) on some nodes. \(\square \)

Let \(Q = \{t_1, \ldots , t_q\}\) be a BGP and \(\text {vars}(Q) = \{x_1, \ldots , x_v\}\) its set of variables. LTJ uses a variable elimination approach. The algorithm carries out \(v = |\text {vars}(Q)|\) iterations, handling one particular variable of \(\text {vars}(Q)\) at a time. This involves defining a total order on \(\text {vars}(Q)\), which we call a variable elimination order (VEO).

Each triple pattern \(t_i\), for \(i=1,\ldots , q\), is associated with a suitable trie \(\tau _i\). The root-to-leaf path in \(\tau _i\) must start with the constants that appear in \(t_i\), and the rest of its levels must visit the variables of \(t_i\) in an order that is consistent with the VEO chosen for Q. This is why we need tries in the 6 orders.

The algorithm starts at the root of every \(\tau _i\) and descends by the children that correspond to the constants in \(t_i\). We then proceed to the variable elimination phase. Assume the order of the variables is \(\langle x_1,\ldots ,x_v\rangle \) and let \(Q_j \subseteq Q\) be the triple patterns that contain variable \(x_j\). Starting with the first variable, \(x_1\), LTJ finds each \(c\in \textrm{dom}(G)\) such that, for every \(t \in Q_1\), c is a child of the current node of the trie \(\tau \) of t (if the trie \(\tau \) of t is consistent with the VEO, then the children of its current node contain precisely the possible values c for \(x_1\)).

During the execution, we keep a mapping \(\mu \) with the partial solution of Q built so far. As we find each constant c suitable for \(x_1\), we bind \(x_1\) to c, that is, we set \(\mu = \langle x_1:= c\rangle \) and branch on this value c. In this branch, we go down by c in all the tries \(\tau \) of triples \(t \in Q_1\). We now repeat the same process with \(Q_2\), finding suitable constants d for \(x_2\) and extending the mapping to \(\mu = \langle x_1:= c, x_2:= d\rangle \), and so on. Once we have bound all variables in this way, \(\mu \) is a solution for Q (this happens many times because we branch on every binding to c, d, etc.). When it has considered all the bindings c for some variable \(x_j\), LTJ backtracks and continues with the next binding for \(x_{j-1}\). When this process finishes, the algorithm has reported all the solutions for Q.

Example 5

Let us follow LTJ over the solution of query \(Q_2 = \{ t_1 = (6,9,x), t_2 = (6,9,y), t_3 = (x,7,y)\}\) of Ex. 2, using the VEO \(\langle x,y\rangle \). From the tries of Fig. 2, we will use spo as the tries \(\tau _1 = \tau _2\) for the triple patterns \(t_1\) and \(t_2\), and pso as the trie \(\tau _3\) for the triple pattern \(t_3\) (had we chosen the VEO \(\langle y,x\rangle \), \(\tau _3\) would have been pos). We then descend by the constants 6 and 9 in both \(\tau _1\) and \(\tau _2\), and by 7 in the trie \(\tau _3\). We reach the nodes marked \(\tau _1\), \(\tau _2\), and \(\tau _3\) in the figure.

We now bind variable x, whose candidates descend from the current nodes of \(\tau _1\) and \(\tau _3\). Their children in common are 1, 3, and 4. LTJ branches with each of those candidates. The branch with \(\langle \mu (x) = 1\rangle \) descends by the corresponding nodes in \(\tau _1\) and \(\tau _3\); we mark the reached nodes \(\tau _1^x\) and \(\tau _3^x\). We now bind variable y, whose candidates descend from the current nodes in \(\tau _2\) and \(\tau _3\). Their only common child is 3, so we complete the binding \(\langle \mu (x)=1, \mu (y)=3 \rangle \); we mark the reached nodes \(\tau _2^y\) and \(\tau _3^y\) in the figure. Since we have bound all the variables, we deliver this binding as the first solution of \(Q_2(G)\). The other solution is found when branching with \(\langle \mu (x) = 3\rangle \) and then growing it to \(\langle \mu (x)=3, \mu (y)=2 \rangle \). Instead, the branch with binding \(\langle \mu (x) = 4\rangle \) does not produce any solution. \(\square \)

Operationally, the values c, d, etc. are found by intersecting the children of the current nodes in all the tries \(\tau _i\) for \(t_i \in Q_j\). LTJ carries out the intersection using the primitive \(\textsf{leap}\mathsf {(}v_i,c\mathsf {)}\), which finds the smallest constant \(c_i\ge c\) within the children of the current node \(v_i\) in trie \(\tau _{i}\); if there is no such value \(c_i\), \(\textsf{leap}\mathsf {(}v_i,c\mathsf {)}\) returns a special value \(\perp \).

The intersection process is as follows. Assume that we are handling variable \(x_j\) from the VEO. For each \(t_i \in Q_j\), we are at a specific node \(v_i\) in trie \(\tau _i\). The goal is to find the values that appear in the list of children of all such nodes \(v_i\). We keep the next candidate to be in the intersection, \(c_{\min }\), which is initialized to c. For each trie \(\tau _i\), we update \(c \leftarrow \textsf{leap}\mathsf {(}v_i, c\mathsf {)}\). After traversing all nodes \(v_i\), if \(c_{\min }=c\), we return \(c_{\min }\) as the next value in the intersection. Otherwise, we reset \(c_{\min }\) to c and restart the scan of trie nodes.
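The round-based intersection just described can be sketched over plain sorted arrays; in this simplification \(\textsf{leap}\) is a binary search and the constants are assumed to be positive integers:

```python
from bisect import bisect_left

def leap(lst, c):
    """Smallest value >= c in the sorted list lst, or None (the bottom of the text)."""
    i = bisect_left(lst, c)
    return lst[i] if i < len(lst) else None

def leapfrog_intersect(lists):
    """Intersect sorted lists: keep a candidate c, leap over every list,
    and emit c whenever a full round leaves it unchanged."""
    result = []
    c = 0                      # next candidate (constants are positive)
    while True:
        c_min = c
        for lst in lists:      # c <- leap(v_i, c) on each child list
            c = leap(lst, c)
            if c is None:      # some list is exhausted: intersection ends
                return result
        if c_min == c:         # all lists contain c: next intersection value
            result.append(c)
            c += 1             # continue the search after c
        # otherwise restart the round with the larger candidate c
```

On the child lists of the current nodes of \(\tau_1\) and \(\tau_3\) in Example 5 ([1, 2, 3, 4] and [1, 3, 4, 5]), this returns the candidates 1, 3, and 4.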

Every time a value c is returned, we know that c appears as a child of every \(v_i\). We then launch, as explained, a recursive branch with the new binding \(\langle x_j:= c\rangle \). Upon returning from that recursive branch, we remove the binding \(\langle x_j:= c\rangle \) from \(\mu \) and continue looking for other elements in the intersection. The intersection terminates when some \(\textsf{leap}\) returns \(\perp \), in which case the recursive call returns to the intersection for \(x_{j-1}\) (or LTJ finishes if \(j=1\)).

Operation \(\textsf{leap}\) uses exponential search [9, 45]: for each current node \(v_i\) we record where the previous \(\textsf{leap}\) ended within its child list, and each new exponential search starts from that position. As a result, each \(\textsf{leap}\) runs in \(O(\lg {\ell })\) time, where \(\ell \) is the distance between the two latest ending positions and \(\lg \) is the logarithm in base 2. While any polylogarithmic time guarantees that LTJ runs in wco time [45], exponential search is particularly effective in practice because \(\textsf{leap}\) is faster when it advances by a short distance.
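A minimal sketch of this bookkeeping: each child list remembers where the previous search ended and gallops from there. It assumes, as in LTJ, that successive search keys never decrease; the class name is ours.

```python
from bisect import bisect_left

class ChildList:
    """Sorted child list whose leap() uses exponential (galloping) search
    starting from where the previous leap ended."""
    def __init__(self, values):
        self.values = values   # sorted child labels
        self.pos = 0           # ending position of the previous leap

    def leap(self, c):
        """Smallest value >= c at or after the remembered position, or None.
        Assumes c never decreases between calls."""
        v, lo, n = self.values, self.pos, len(self.values)
        if lo >= n:
            return None
        step = 1
        while lo + step < n and v[lo + step] < c:
            step *= 2                                  # gallop: 1, 2, 4, ...
        i = bisect_left(v, c, lo, min(lo + step + 1, n))
        self.pos = i                                   # remember for next call
        return v[i] if i < n else None
```

The cost of each call is logarithmic in the distance advanced, matching the \(O(\lg \ell)\) bound above.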

Algorithm 1 shows the pseudocode for LTJ. It builds on just three primitives on the tries:

  1. \(\textsf{child}(v,i)\), which descends to the ith child of node v,

  2. \(\textsf{degree}(v)\), which computes the degree of a node, and

  3. \(\textsf{access}(v,i)\), which reads the value of the ith child of v.

Operations \(\textsf{child}\) and \(\textsf{access}\) are required at line 2 of \(\textsf{LTJ}\) and line 6 of \(\textsf{leapfrog}\), while \(\textsf{degree}\) and \(\textsf{access}\) are required by \(\textsf{leap}\). Note that returning to the parent node, in line 8 of \(\textsf{leapfrog}\), can be done by just remembering it before descending by c in line 6. Note also that, in this line 6, we do not really need to find the children that lead to c because we have already found them in \(\textsf{seek}\). Storing those nodes is also useful to speed up the exponential searches of \(\textsf{leap}\) in line 4 of \(\textsf{seek}\), so as to start from where the previous search ended.

Algorithm 1

Evaluating the BGP \(Q=\{t_1,\ldots ,t_q\}\) with trie nodes \(\tau _1,\ldots ,\tau _q\) and variable ordering \(\langle x_1,\ldots ,x_v\rangle \). Symbol ‘:’ in line 7 denotes concatenation.

2.3 Variable Elimination Orders (VEOs)

Veldhuizen [45] showed that if \(\textsf{leap}\) runs in polylogarithmic time, then LTJ is wco no matter the VEO chosen, as long as the tries used have the right attribute order. In practice, however, the VEO plays a fundamental role in the efficiency of the algorithm [17, 45]. A VEO yielding a large number of intermediate solutions that are later discarded during LTJ execution will be worse than one that avoids exploring many such alternatives. One would prefer, in general, to first eliminate selective variables (i.e., the ones that yield a smaller candidate set when intersecting).

A heuristic to generate a good VEO in practice [5, 17, 47] computes, for each variable \(x_j\), its minimum weight

$$\begin{aligned} w_j = \min \{ w_{ij} ~|~ x_j \text { appears in triple } t_i \}, \end{aligned}$$
(1)

where \(w_{ij}\) is the weight of \(x_j\) in \(t_i\). The VEO sorts the variables in increasing order of \(w_j\), with a couple of restrictions: (i) each new variable should share some triple pattern with a previous variable, if possible; (ii) variables appearing only once in Q (called lonely) must be processed at the end.

To compute \(w_{ij}\), we (temporarily) choose a trie for \(t_i\) in which \(x_j\) appears right after the constants of \(t_i\), and descend in that trie by the constants. The number of children of the trie node v we have reached is the desired weight \(w_{ij}\): the size of the list that \(t_i\) contributes to the intersection when eliminating \(x_j\).
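The following sketch implements one reasonable reading of this heuristic; the weight values and triple identifiers in the test usage are hypothetical, and ties are broken by variable name for determinism.

```python
def choose_veo(weights, triples_of):
    """Global VEO heuristic: eliminate variables by increasing minimum
    weight w_j (Eq. 1), preferring variables that share a triple pattern
    with an already chosen one (restriction i), and leaving lonely
    variables, which appear in a single triple pattern, last (restriction ii).
    weights: {variable: {triple id: w_ij}};
    triples_of: {variable: set of triple ids it appears in}."""
    w = {x: min(ws.values()) for x, ws in weights.items()}   # Eq. (1)
    lonely = {x for x in w if len(triples_of[x]) == 1}
    order, used = [], set()
    remaining = set(w)
    while remaining:
        # prefer non-lonely variables connected to a previous variable
        cands = {x for x in remaining - lonely
                 if not order or triples_of[x] & used}
        if not cands:                       # fall back: disconnected, then lonely
            cands = (remaining - lonely) or remaining
        x = min(cands, key=lambda y: (w[y], y))
        order.append(x)
        used |= triples_of[x]
        remaining.remove(x)
    return order
```

For instance, with a lonely variable z and a cheaper variable y, the order starts at y, continues with the connected x, and leaves z last.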

In this paper we explore the use of adaptive VEOs, which are defined progressively as the query processing advances, and may differ for each different binding of the preceding variables. ADOPT [48] is the first system combining LTJ with adaptive VEOs. The next variables to bind are chosen using reinforcement learning, by partially exploring possibly upcoming orders, and balancing the cost of exploring with that of the obtained improvements. We will compute adaptive VEOs, instead, simply as a variant of the formula presented above for global VEOs [17].

Other systems go even further in this beyond-wco path. Building on the well-known Yannakakis’ instance-optimal algorithm for acyclic queries [51], EmptyHeaded [1] applies a so-called Generalized Hypertree Decomposition [15], which decomposes cyclic queries into a tree where the nodes are cyclic components, so as to solve the nodes using a wco algorithm [36] and then apply Yannakakis’ algorithm on the resulting acyclic query on the intermediate results. Graphflow [26], Umbra [33], and Free Join [49] are examples of systems that integrate wco joins with pairwise joins in order to generate hybrid plans for evaluating graph queries. Other approaches like Tetris [21], Minesweeper [38] and Panda [2] also offer guarantees finer than just wco.

2.4 Trie switching

Fig. 3

A partial trie configuration for the six tries of Fig. 2. The set of red edges and nodes (which are also doubly-circled) corresponds to a second set of children that descend by another attribute from the same first-level nodes.

Trie switching [5, Sec. 7.2.1] is a mechanism to decrease the space required by the six tries of LTJ, at the expense of possibly increasing query times by a small margin. The idea is that some parts of some tries are redundant with others and can be deleted. Consider the tries spo and pso. Once we have descended by instantiated values of p and s in pso, reaching node v, and need to operate on (e.g., intersect) the children o of node v, we can instead switch to the equivalent node \(v'\) in the trie spo, by descending by s and p from its root. The children of v and \(v'\) are the same, and therefore we can omit the children o of v in the trie pso.

By using trie switching we can then store some partial tries, from which we switch to others when needed. For example, we can store only the tries spo, ps, pos, op, osp, and so, thereby saving 3 of the 6 last levels of the tries, which are the biggest (each last level has exactly N nodes). Further, a clever implementation can share the first level of the tries spo and so, pos and ps, and osp and op, further saving 3 of the 6 first levels of the tries (of total size \(|\textrm{dom}(G)|\)).
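A dictionary-based sketch of the switching mechanism (the triple set is the example graph, read off Example 10's arrays; treat it as an assumption of this sketch): when the partial trie so needs the missing predicate level, we re-enter the full trie osp from its root.

```python
def build_trie(triples, order):
    """Nested-dict trie whose levels follow the given component order;
    a two-component order yields a partial trie (last level omitted)."""
    trie = {}
    for t in triples:
        node = trie
        for pos in order:
            node = node.setdefault(t[pos], {})
    return trie

# The graph of Fig. 1 as integer triples.
triples = [(1, 7, 3), (3, 7, 2), (4, 7, 5), (5, 7, 1),
           (6, 8, 1), (6, 8, 2), (6, 8, 3), (6, 8, 4), (6, 8, 5),
           (6, 9, 1), (6, 9, 2), (6, 9, 3), (6, 9, 4)]

spo = build_trie(triples, (0, 1, 2))   # full trie
osp = build_trie(triples, (2, 0, 1))   # full trie
so  = build_trie(triples, (0, 2))      # partial trie: p level omitted

def predicates(s, o):
    """Trie switching: the partial trie so lacks the p level, so we
    re-descend by o and then s in the full trie osp to list predicates."""
    assert o in so[s]                  # the partial trie certifies (s, o) exists
    return sorted(osp[o][s])
```

For example, the predicates connecting subject 6 to object 1 are obtained from osp, even though so stores no third level.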

Example 6

Fig. 3 shows the partial tries corresponding to our running example. \(\square \)

Fig. 4

The LOUDS representation of the trie spo of Fig. 2. Beside each node, with d children, we show its encoding \(0^d1\). On the bottom left, the levelwise concatenation of the encodings. The LOUDS representation concatenates all the levels. The bottom right shows our shorter representation, which uses \(0^{d-1}1\) instead of \(0^d1\) and removes the leaves.

Overall, we expect a space saving between 1/4 and 1/2 with the use of partial tries (we will obtain around 1/3 in practice in this paper). In exchange, some slowdown is to be expected due to the need to switch between tries, reentering another trie from the root instead of directly navigating to the children of the current node.

3 CompactLTJ: Leapfrog Triejoin on Compact Tries

We now introduce our compact representation of the LTJ tries, and combine them with techniques that improve the performance of the original proposal. Our index, CompactLTJ, represents separately the trie topology and the edge labels.

3.1 Trie topology

The Level-Order Unary Degree Sequence (LOUDS) [18] is a representation of n-node tree topologies using just \(2n+o(n)\) bits. It is obtained by traversing the tree levelwise (with each level traversed left to right). We append the encoding \(\textsf{0}^d\textsf{1}\) of each traversed node to a bit sequence T, where d is the number of children of the node. The final sequence T represents the tree using two bits per node: a \(\textsf{0}\) in the encoding of its parent and a \(\textsf{1}\) to terminate its own encoding. A bitvector representation of T then needs \(2n+o(n)\) bits, and allows navigating the tree in constant time.

Example 7

Fig. 4 shows the LOUDS representation of the trie spo of Fig. 2 (ignore the bottom-right part and node names for now). \(\square \)

Our trie topologies are particular in that all the leaves have the same depth, 3. Therefore, every internal node at depths 0–2 has children, and thus we can reduce its encoding to \(\textsf{0}^{d-1}\textsf{1}\). The leaves need not be encoded, which further saves space. In the original encoding, then, every node with d children spends \(d+1\) bits (leaves, with \(d=0\), included). In our new code, each node spends d bits (so leaves, with \(d=0\), contribute nothing and disappear). Thus, we save one bit per node, n bits overall, and therefore halve the original space [18].

Lemma 1

Our representation uses \(n-1\) bits on a trie of n nodes.

Proof

An internal node with d children is encoded as \(0^{d-1}1\) and leaves are not encoded; therefore we store exactly as many bits as edges in the trie, that is, \(n-1\) bits. \(\square \)
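The encoding is easy to generate with a levelwise traversal. The sketch below rebuilds the bit sequence of the spo trie used later in Example 8; the node names (s1, p5, etc.) are ours and only the degrees matter.

```python
from collections import deque

def compact_louds(children):
    """Levelwise (BFS) traversal that encodes each internal node with
    d children as 0^(d-1) 1 and omits the leaves entirely."""
    bits, queue = [], deque(['root'])
    while queue:
        v = queue.popleft()
        kids = children.get(v, [])
        if kids:                    # internal node: emit its code, enqueue kids
            bits.append('0' * (len(kids) - 1) + '1')
            queue.extend(kids)
    return ''.join(bits)

# Topology of the spo trie of Fig. 2 (degrees: 5; 1,1,1,1,2; 1,1,1,1,5,4).
children = {
    'root': ['s1', 's3', 's4', 's5', 's6'],
    's1': ['p1'], 's3': ['p2'], 's4': ['p3'], 's5': ['p4'], 's6': ['p5', 'p6'],
    'p1': ['o1'], 'p2': ['o2'], 'p3': ['o3'], 'p4': ['o4'],
    'p5': ['o5', 'o6', 'o7', 'o8', 'o9'], 'p6': ['o10', 'o11', 'o12', 'o13'],
}
T = compact_louds(children)
```

The trie has \(n = 1+5+6+13 = 25\) nodes, and T indeed has \(n-1 = 24\) bits, as Lemma 1 states.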

Our encoding also simplifies the traversal compared to the original LOUDS [18]. We will use the position preceding the encoding of a node as its trie identifier v (e.g., \(v=0\) for the root). The navigation to children makes use of the primitive \(\textsf{select}(T,j)\), which is the position of the jth occurrence of bit \(\textsf{1}\) in T. This primitive can be supported in O(1) time using just o(n) additional bits of space on top of T [11, 28] (see Appendix A for a description of this algorithm). With this primitive, we can navigate our representation as follows.

Lemma 2

For every \(v \ge 0\) and \(i \ge 1\), it holds

$$\begin{aligned} \textsf{child}(v,i) = \textsf{select}(T,v+i). \end{aligned}$$

Proof

Bitvector T lists, in levelwise order, all the trie nodes, using one \(\textsf{1}\)-terminated code per node. Since the code consists of exactly one bit per child, T also lists all the trie edges, in levelwise order, using one bit per edge. The targets of those edges also form a levelwise enumeration of the nodes, just missing the root. It follows that the edge leading to the ith child of v is at position \(T[v+i]\), and its target is the \((v+i+1)\)th node in levelwise order (since the first is the root). The identifier of that node is the position preceding its code in T, which is the position of the \((v+i)\)th \(\textsf{1}\) in T. \(\square \)

Example 8

The bottom-right of Fig. 4 shows our more compact representation. For the trie spo, we have:

$$\begin{aligned} T= & \mathsf {00001 ~ 111101 ~ 1111000010001}. \end{aligned}$$

For example, the identifier of the root is \(v=0\) and that of its fifth child is \(u = \textsf{child}(0,5)=\textsf{select}(T,0+5)=9\). The encoding of node \(u=9\) is at \(T[u+1\mathinner {.\,.}u+\textsf{degree}(u)] = T[10 \mathinner {.\,.}11] = \textsf{01}\). See v and u in Fig. 4. \(\square \)

In order to implement \(\textsf{leap}\), we also need to determine the number of children of a node v. This is the distance from v to the next \(\textsf{1}\) in T:

$$\begin{aligned} \textsf{degree}(v) = \textsf{selectnext}(T, v+1)-v, \end{aligned}$$

where the primitive \(\textsf{selectnext}(T,k)\) gives the position of the leftmost occurrence of \(\textsf{1}\) in \(T[k\mathinner {.\,.}]\). This primitive can also be computed in O(1) time using o(n) additional bits of space; see again Appendix A.

Example 9

Continuing with Ex. 8, the first child of \(u=9\) is \(w=\textsf{child}(9,1)=\textsf{select}(T,9+1)=15\). Node \(w=15\) has \(\textsf{degree}(15)=\textsf{selectnext}(T,15+1)-15=20-15=5\) children. See w in Fig. 4. \(\square \)
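The navigation primitives above can be replayed on the bitvector of Ex. 8. The following is a minimal sketch, assuming 1-indexed positions and a naive O(n) scan for \(\textsf{select}\) and \(\textsf{selectnext}\) (the actual structures answer these queries in O(1) time with o(n) extra bits):

```python
T = "000011111011111000010001"  # bitvector of trie spo (Example 8)

def select(T, j):
    """Position of the j-th 1 in T (1-indexed); naive linear scan."""
    count = 0
    for pos, bit in enumerate(T, start=1):
        if bit == "1":
            count += 1
            if count == j:
                return pos

def selectnext(T, k):
    """Position of the leftmost 1 in T[k..]."""
    for pos in range(k, len(T) + 1):
        if T[pos - 1] == "1":
            return pos

def child(v, i):                    # Lemma 2
    return select(T, v + i)

def degree(v):                      # distance from v to the next 1
    return selectnext(T, v + 1) - v

u = child(0, 5)                     # fifth child of the root
w = child(u, 1)                     # first child of u
print(u, w, degree(w))              # 9 15 5, as in Examples 8 and 9
```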

3.2 Node identifiers

The node identifiers are stored in a compact array L, each one using \(\lceil \lg {U} \rceil \) bits. The identifiers in L are deployed in the same levelwise order of the edges T, so the identifiers of the children of node v are all consecutive, in \(L[v+1 \mathinner {.\,.}v+\textsf{degree}(v)]\). This yields

$$\begin{aligned} \textsf{access}(v,i)=L[v+i] \end{aligned}$$

and allows implementing \(\textsf{leap}\) efficiently by using exponential search from the current position. In our representation, we define \(\textsf{leap}\mathsf {(}[i,j],c\mathsf {)}\) as the smallest \(k \in [i\mathinner {.\,.}j]\) such that \(L[k] \ge c\), or \(j+1\) if no such k exists. With this convenient notation, the children of node v are searched as \(\textsf{leap}\mathsf {(}[v+1,v+\textsf{degree}(v)],c\mathsf {)}\).

Example 10

For our trie spo in Fig. 4, the index would store

$$\begin{aligned} T= & \mathsf {00001 ~ 111101 ~ 1111000010001}\\ L= & \mathsf {13456 ~ 777789 ~ 3251123451234} \end{aligned}$$

where, for example, the fifth (\(i=5\)) child of the root (\(v=0\)) descends by \(L[0+5]=\textsf{6}\) (to \(u=9\), as shown before). The first child of u, by \(L[9+1] = \textsf{8}\), leads to \(w=15\). The children of w have labels \(L[16\mathinner {.\,.}20] = \textsf{12345}\). It then holds, for example, that \(\textsf{leap}\mathsf {(}[16,20],4\mathsf {)}=19\). \(\square \)

3.3 xCompactLTJ: Using partial tries

We implement trie switching as follows. Consider the tries spo and sop in Fig. 2. With trie switching, we choose to represent only spo and the partial trie so (see Fig. 3); from the latter we can switch to trie osp if we need to access the predicates.

Example 11

Fig. 4 shows how the three levels of trie spo are represented. The first level, corresponding to s, is represented with bits \(\textsf{00001}\) and identifiers \(\textsf{13456}\), exactly as the first level of so (and sop, which is not represented). To represent so, then, we reuse the first level of spo, and store only the second level, o, of the trie:

where the gray nodes are not represented (see Fig. 2 again, or the red nodes at the top of Fig. 3). \(\square \)

As explained, in case we want to descend from a leaf of so to the predicates, we reenter the trie osp with the current values of o and s, in that order. This is the case where the use of partial tries may entail some time overhead.

In total, we represent 12 trie levels instead of 18. We call xCompactLTJ the CompactLTJ version using partial tries.

3.4 UnCompactLTJ: A non-compact variant

As a baseline to determine the slowdown incurred with compact representations like LOUDS, compared to classical ones, we introduce a version called unCompactLTJ, which is a minimal non-compact trie representation.

The unCompactLTJ index stores an array P of pointers, one per internal node, deployed in the same levelwise order of LOUDS. For each internal node v we store \(P[v+1] = \textsf{child}(v, 1)\), that is, a pointer to its first child, knowing that the others are consecutive, that is, \(\textsf{child}(v,i) = P[v+i]\). Each pointer uses \(\lceil \lg n \rceil \) bits, as it is a position within an array of n elements. The number of children of a node that was reached as the kth in levelwise order (i.e., \(v = P[k]\), assuming \(P[0]=0\) for the root) is \({\textsf{degree}(v)=P[k+1]-P[k]}\). Its array L of edge labels is identical to that of \({ CompactLTJ} \), so \(\textsf{access}(v,i) = L[v+i]\) still holds.

Example 12

For our same Ex. 8 we have

$$\begin{aligned} P= & \langle 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 20, 24 \rangle \end{aligned}$$

(where \(24=|T|\) is a terminator). The root 0 has \(P[1]-P[0]=5\) children, at positions \(P[1\mathinner {.\,.}5]\). Its fifth child descends by label \(L[0+5]=\textsf{6}\) to node \(P[0+5]=9\), which has two children, since \(P[6]-P[5]=2\). Its first child descends by label \(L[P[5]+1] = L[9+1] = \textsf{8}\) to node \(P[9+1]=15\). Node 15 has \(P[11]-P[10]=5\) children, with labels \(L[16\mathinner {.\,.}20]=\textsf{12345}\). \(\square \)
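Example 12 can be replayed with the following sketch. It encodes our reading of the example: each node is handled through both its levelwise rank k and its identifier \(v = P[k]\) (with \(P[0]=0\) for the root), and P coincides with the tabulated positions of the 1s in the LOUDS bitvector T. The rank bookkeeping below is thus an illustration, not the paper's exact code:

```python
P = [0, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 20, 24]       # P[12] = |T| = 24
L = [None, 1,3,4,5,6, 7,7,7,7,8,9, 3,2,5,1,1,2,3,4,5,1,2,3,4]

def child(v, i):          # identifier of the i-th child of the node with id v
    return P[v + i]

def degree(k):            # number of children of the k-th node in level order
    return P[k + 1] - P[k]

# The root (rank 0, id 0) has 5 children; its fifth child has rank 5, id P[5] = 9.
u_rank, u = 5, child(0, 5)
# The first child of u has rank u + 1 = 10 and id P[10] = 15, label L[u+1] = 8.
w_rank, w = u + 1, child(u, 1)
print(u, degree(u_rank), w, L[u + 1], degree(w_rank))    # 9 2 15 8 5
```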

In exchange for nearly doubling the space of CompactLTJ, unCompactLTJ has explicit pointers just like classical data structures, and does not spend time in computing addresses. As we show in the experiments, unCompactLTJ is only marginally faster than CompactLTJ, so the slowdown due to using compact data structures is mild. Still, unCompactLTJ uses half the space of Jena LTJ [17], a classic index that supports LTJ using the six tries (implemented as B+-trees).

3.5 Adaptive VEOs

Our orthogonal contribution is the study of improved VEOs on our compact LTJ tries, which deviate from the VEO strategy defined in Section 2.3. The first improvement is the use of adaptive VEOs; the second is a better \(w_{ij}\) estimator.

In previous work using the VEO described in Section 2.3, the VEO is fixed before running LTJ. The selectivity of each variable \(x_j\) is estimated beforehand, by assuming it will be the first variable to eliminate. In this case, Eq. (1) takes the minimum of the number of children in all the trie nodes we must intersect, as an estimation of the size of the resulting intersection. The estimation is much looser on the variables that will be eliminated later, because the children to intersect can differ a lot for each value of \(x_j\).

We then consider an adaptive version of the heuristic: we use the described technique to determine only the first variable to eliminate. Say we choose \(x_j\). Then, for each distinct binding \(x_j:= c\), the corresponding branch of LTJ will run the VEO algorithm again in order to determine the second variable to eliminate, now considering that \(x_j\) has been replaced by c in all the triples \(t_i\) where it appears. This should produce a more accurate estimation of the intersection sizes.

In the adaptive setting, we no longer check that the new variable shares a triple with a previously eliminated one. That check aimed to capture the fact that such triples become more selective once some of their positions are bound, but now we know exactly the size of those progressively bound triples. The lonely variables are still processed at the end.

3.6 CompactLTJ*: Better VEO predictors

The CompactLTJ index uses the original estimator based on the number of children of v, which is easily computed in constant time as \(w_{ij} = \textsf{degree}(v)\). We now define an alternative version, CompactLTJ*, which computes \(w_{ij}\) as the number of leaf descendants of v. This is \(w_{ij}=N\) if v is in the first level (i.e., the root), and \(w_{ij}=\textsf{degree}(v)\) if v is in the third level (i.e., just above the leaves). For the second level, we compute in constant time

$$\begin{aligned} w_{ij} = \textsf{child}(v+\textsf{degree}(v),1)-\textsf{child}(v,1). \end{aligned}$$

We argue that the number of descendants may be a more accurate estimation of the total work that is ahead if we bind \(x_j\) in \(t_i\), as opposed to the children, which yield the number of distinct values \(x_j\) will take without looking further.
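On the toy trie of Ex. 8, the leaf-descendant estimator can be sketched as follows (naive select/selectnext stand in for the constant-time structures, and the helper name leaves_below is ours):

```python
T = "000011111011111000010001"  # bitvector of trie spo (Example 8)

def select(T, j):
    count = 0
    for pos, bit in enumerate(T, start=1):
        if bit == "1":
            count += 1
            if count == j:
                return pos

def selectnext(T, k):
    for pos in range(k, len(T) + 1):
        if T[pos - 1] == "1":
            return pos

def child(v, i):
    return select(T, v + i)

def degree(v):
    return selectnext(T, v + 1) - v

def leaves_below(v):
    """Leaves under a second-level node v: the codes of v's grandchildren
    span exactly this many positions of T."""
    return child(v + degree(v), 1) - child(v, 1)

# Node u = 9 has two children with 5 and 4 leaves: 9 leaves in total.
print(leaves_below(9))  # 9
```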

4 Dynamic CompactLTJ

A shortcoming of most compact indices that support wco query times on BGPs [3,4,5] is that they are static, that is, they do not support insertions or deletions of triples in the graph. There is a good reason for this: while basic bitvector operations like \(\textsf{select}\) and \(\textsf{selectnext}\) can be supported in constant time in the static case [11, 18, 28], a lower bound of \(\Omega (\lg n/\lg \lg n)\) on the operation time when bit flips are allowed [14] imposes a significant gap between static and dynamic solutions. A gap of 10x does show up in practice [12, 41] and permeates all compact data structures, as bitvectors are basic components of most of them.

A recent development called adaptive dynamic bitvectors [30], however, offers a new tradeoff on this gap that is especially relevant in our application: if a bitvector receives, on average, one bit insertion or deletion per q query operations (like \(\textsf{select}\) or \(\textsf{selectnext}\)), then all the operations run in \(O(\lg (n/q))\) amortized time. Significant time improvements over a classic dynamic implementation are reported for \(q \ge 1{,}000\) or so, and times very close to those of static implementations for about \(q \ge 10{,}000\).

Now consider our bitvector T. A graph query Q running in time \(\tilde{O}(Q^*)\) implies, essentially, that T will receive \(\tilde{O}(Q^*)\) bitvector queries, which is typically a massive amount: in our benchmark, we carry out about a billion \(\textsf{select}\) operations per query limited to 1,000 results, and nearly 40 billion without limiting the results. This implies that, even if T received one update after every graph query Q, q would be between \(10^9\) and \(10^{11}\), and thus we could expect a performance very close to that of the static scenario. Further, we expect far fewer updates per graph query in practical applications. For example, Wikidata receives about 6,000–15,000 queries and 200–500 updates per minute, which corresponds roughly to 0.01 to 0.1 updates per graph query.

We now describe how the solution for bitvectors [30] is applied on T and integrated with an analogous solution for arrays, which is applied on L. The solution is adapted to the queries we need in order to obtain a dynamic version of CompactLTJ that performs almost as well as the static one. We will show that our solution is robust even under much more demanding conditions, where it receives up to 1,000 updates per graph query.

4.1 Insertions and deletions of triples

The insertion/deletion of a triple (spo) in/from G requires updating the six tries.

Example 13

Assume we want to add to the graph of Fig. 1 the fact that \(\textsf{Thomson}\) also advised \(\textsf{Rutherford}\). This is expressed as a new triple \((\textsf{Rutherford},\textsf{adv},\textsf{Thomson})\) that we add to G. Let us assign \(\textsf{Rutherford}\) the integer identifier 10, so the triple we wish to add is \((s,p,o) = (10,7,3)\). This triple, reading its components in the 6 possible orders, must be inserted in the six tries. In particular, we must insert (3, 10, 7) in the trie osp. \(\square \)

For insertion, we traverse the trie from the root. We first descend in the trie as much as possible with the elements of the triple t to insert, using \(\textsf{leap}\) to find the correct child. Once \(\textsf{leap}\) indicates that the child to follow does not exist, we start inserting the missing nodes, until the whole triple t is inserted. Algorithm 2 shows the details, which we explain by following the insertion of our example tuple. It makes use of the primitive \(\textsf{insert}(A,c,i)\) on T and L, which inserts c at position i in A; we describe its implementation later.

Example 14

The representation of trie osp from Fig. 2 is as follows

$$\begin{aligned} T= & \mathsf {00001 ~ 010101101 ~ 1011011010111} \\ L= & \mathsf {12345 ~ 563616646 ~ 7897897898978} \end{aligned}$$

We will insert tuple \(t=(3,10,7)\) in this trie. Variable exists in Alg. 2 indicates that the node to follow (initially \(v=0\), the root) exists. We take the first element of t, \(c=3\). Line 5 computes \(v' = \textsf{leap}\mathsf {(}[1,5],c\mathsf {)}=3\), where \(\textsf{access}(0,3)=c\), meaning that the child with identifier c exists. Line 7 descends to that child, \(v = \textsf{child}(0,3) = 9\). The second tuple component, \(c=10\), is sought again in line 5 with \(v'=\textsf{leap}\mathsf {(}[10,11],c\mathsf {)} = \bot \), meaning that c is larger than every child of v. Line 10 establishes its insertion point, \(v' = 12\). It then adds the new child c to the existing node v: it inserts a \(\textsf{0}\) at position \(v'-1=11\) of T and \(c=10\) at position \(v'=12\) of L, obtaining

$$\begin{aligned} T= & \mathsf {00001 ~ 01010\underline{0}1101 ~ 1011011010111} \\ L= & \mathsf {12345 ~ 563616\underline{\overset{\displaystyle 1}{0}}646 ~ 7897897898978} \end{aligned}$$

where we have underlined the insertions. Note that, in case the insertion point \(v'\) falls within the children of v, line 12 instead inserts the \(\textsf{0}\) at position \(v'\) of T. Line 14 descends to the newly created node, \(v \leftarrow \textsf{child}(9,3)=24\), whose encoding does not yet exist in T, and exists is set to false.

Since the current node does not exist, the third tuple component, \(c=7\), is inserted in lines 17 and 18, at position \(v+1=25\) of T and L: we create a leaf by inserting a \(\textsf{1}\) in T and a c in L. Line 19 descends by that inserted position to keep inserting the remaining elements of t, though in our example we have finished. The final result is

$$\begin{aligned} T= & \mathsf {00001 ~ 01010\underline{0}1101 ~ 101101101\underline{1}0111} \\ L= & \mathsf {12345 ~ 563616\underline{\overset{\displaystyle 1}{0}}646 ~ 789789789\underline{7}8978} \end{aligned}$$

Fig. 5 highlights the path inserted on the osp trie. \(\square \)

Fig. 5
figure 5

The osp trie of Fig. 2 after inserting triple (3, 10, 7).

Algorithm 2
figure b

Inserting tuple t in trie \(\tau \).
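Algorithm 2 itself is not reproduced in this text; the following runnable sketch replays Example 14 on plain 1-indexed Python lists, with naive select/selectnext standing in for the compact dynamic structures and a case analysis that follows our reading of the example (not the paper's exact pseudocode):

```python
T = [None] + [int(b) for b in "00001" "010101101" "1011011010111"]  # trie osp
L = [None] + [1,2,3,4,5, 5,6,3,6,1,6,6,4,6, 7,8,9,7,8,9,7,8,9,8,9,7,8]

def select(j):                      # position of the j-th 1 in T (naive)
    c = 0
    for p in range(1, len(T)):
        c += T[p]
        if c == j:
            return p

def selectnext(k):                  # leftmost 1 in T[k..] (naive)
    for p in range(k, len(T)):
        if T[p] == 1:
            return p

def degree(v):
    return selectnext(v + 1) - v

def insert(t):
    """Descend while the path of t exists, then create the missing nodes."""
    v, exists = 0, True
    for pos, c in enumerate(t):
        last = (pos == len(t) - 1)
        if exists:
            d = degree(v)
            vp = next((k for k in range(v + 1, v + d + 1) if L[k] >= c), None)
            if vp is not None and L[vp] == c:   # child labeled c exists: descend
                if not last:
                    v = select(vp)
                continue
            if vp is None:                      # c is larger than all children
                vp = v + d + 1
                T.insert(vp - 1, 0)             # code 0^{d-1}1 becomes 0^d 1
            else:                               # c falls among the children
                T.insert(vp, 0)
            L.insert(vp, c)
            exists = False
        else:                                   # keep growing the fresh path
            vp = v + 1
            T.insert(vp, 1)                     # unary code for the new node v
            L.insert(vp, c)
        if not last:
            v = select(vp)                      # target of the edge at position vp

insert((3, 10, 7))
print("".join(map(str, T[1:])))     # matches the final T of Ex. 14
```

Running it yields exactly the underlined bitvector and label sequence of Ex. 14.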

We proceed in reverse order to delete a triple t: we enter the trie recursively, looking for each component c using \(\textsf{leap}\), until reaching the leaf v corresponding to t. Then we remove nodes as we return from the recursion. To remove the leaf we use the primitives \(\textsf{delete}(T,v)\) and \(\textsf{delete}(L,v)\), which remove the element at position v from a bitvector or array and will be described later. There are three cases: (i) \(T[v]=\textsf{0}\), (ii) \(T[v-1\mathinner {.\,.}v]=\textsf{01}\), (iii) \(T[v]=\textsf{1}\) and (\(v=1\) or \(T[v-1]=\textsf{1}\)). In case (i), the encoding of the parent node is \(\textsf{0}^d \textsf{1}\) with \(d>0\), so we are not removing its only child. To obtain the new correct encoding, \(\textsf{0}^{d-1} \textsf{1}\), we perform \(\textsf{delete}(T,v)\) and \(\textsf{delete}(L,v)\), and the deletion process terminates. Case (ii) is similar, but there is a \(\textsf{1}\) at T[v], so to obtain \(\textsf{0}^{d-1} \textsf{1}\) we must do \(\textsf{delete}(T,v-1)\) instead of \(\textsf{delete}(T,v)\) (plus \(\textsf{delete}(L,v)\)). In case (iii), the encoding of the parent is just \(\textsf{1}\), so we are deleting its only child. We then perform \(\textsf{delete}(T,v)\) and \(\textsf{delete}(L,v)\), and must keep removing the current node v as we return from the recursion.

4.2 Updates with partial tries

Partial tries may not require updates upon insertions and deletions in the graph. For example, if we inserted the triple (pos) in the full trie pos and Alg. 2 did not create a new node in the second level for (po), then we do not need to insert the pair (op) in the partial trie op. In case we do need to insert (op), we reuse the work done when inserting in the full trie osp and found the insertion point \(v'\) in the first level.

Example 15

The trie op in Fig. 3 is represented with just the sequences \(T'\) and \(L'\) corresponding to the red edges, which are complementary to those for osp shown in Ex. 14.

where the elements in gray are not represented. In Ex. 14 we had obtained position \(v'=3\) for the node in the first level. To insert in op, we would continue the insertion in \(T'\) and \(L'\) at position \(v = \textsf{child}(0,3)=11\). In this case, the insertion is not necessary because the pair \((p,o)=(7,3)\) already exists in pos (see Fig. 2). Indeed, if we tried to insert the pair here, we would compute \(v' = \textsf{leap}\mathsf {(}[12,14],7\mathsf {)}=12\), and since \(\textsf{access}(v,1)=L'[12]=7\), we would not insert it.

If, instead, o (and thus (op) and (po)) did not exist, o would be inserted in the first level of osp. We would then simply insert in the second level of op at position \(\textsf{child}(0,v')+1\), inserting \(\textsf{1}\) in \(T'\) and p in \(L'\) of op. \(\square \)

Deletions are analogous. We determine that we must delete (op) from op if the deletion of (pos) from pos deleted the node representing (po).

4.3 Dynamic representation of T and L

Adaptive dynamic bitvectors [30] represent a bit sequence as a balanced binary tree with three kinds of nodes:

  • dynamic leaf: allocates space for b elements (for a parameter b) and supports updates. Queries on the leaf are solved by sequential scanning.

  • static leaf: stores subsequences with more than b elements and does not support updates. It precomputes some extra information to speed up queries, for example to solve \(\textsf{select}\) in constant time.

  • internal node: stores pointers to both children and various statistics about the subtree rooted at the node. For example, size records the number of elements represented in the subtree and ones records the number of 1s.

To insert or delete a bit at a given position, the algorithm descends from the root of the binary tree guided by the size of the nodes, until reaching a leaf. If the leaf is dynamic, the update is applied. Otherwise, the leaf is static and the update is not possible. By a procedure called split, the leaf is recursively halved until the update reaches a dynamic leaf and then can be applied on it.

Queries proceed analogously. For example, \(\textsf{select}\) proceeds from the root to a leaf guided by the number of ones in the subtrees. When reaching a leaf, it completes the query by scanning if the leaf is dynamic, or using the precomputed structures if it is static; see Alg. 3. When a large number of queries traverses an internal node, the node is flattened into a static leaf, so as to speed up further queries.

Algorithm 3
figure c

Running \(\textsf{select}(j)\), \(\textsf{selectnext}(i)\), and \(\textsf{leap}\mathsf {(}[b,e],c\mathsf {)}\) on a binary tree representing a dynamic bitvector, rooted at v.

We will use adaptive dynamic bitvectors to represent our bitvector T of Section 3. Their implementation also provides a version that handles arrays of values, which we use to represent our array of identifiers L. In our compact tries, inserting/deleting a trie node boils down to updating the same positions in T and L. We therefore modify their implementation so as to store T and L aligned together in the leaves of the same binary tree. This is advantageous because we typically need to descend the binary tree only once, and it also saves some space.

The implementation of dynamic bitvectors [30] already provides the query \(\textsf{select}\), which we use to implement \(\textsf{child}\), but we had to incorporate support for \(\textsf{selectnext}\) and \(\textsf{leap}\). See Alg. 3 again.

  • \(\textsf{selectnext}(v,i)\) is solved by running a top-down traversal, where in each internal node we go right if i is on the right child. If i is on the left child, we first try to find the answer on the left child, and if it is not there, we search the right child from the first position. If the recursion ends in a dynamic leaf, \(\textsf{selectnext}\) is solved by a sequential scan; if it ends on a static leaf, it is solved in constant time with its precomputed structures.

  • \(\textsf{leap}\mathsf {(}[b,e],c\mathsf {)}\) is invoked only when \(L[b\mathinner {.\,.}e]\) is increasing. To compute it efficiently, we maintain in each internal node v the field last, which stores the last value of L in the subtree rooted at v. We descend from the root, left or right as long as \([b\mathinner {.\,.}e]\) is completely included in one child. If we arrive at an internal node v whose children split \([b\mathinner {.\,.}e]\), then we continue to the left child if \(c \le v.\mathrm {left.last}\), otherwise we continue to the right child. Further, we immediately return from v without an answer if its subtree is completely contained in \([b\mathinner {.\,.}e]\) and \(v.\textrm{last} < c\). In the leaves (both static and dynamic), the answer is found by running an exponential search.
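A toy rendition of these tree traversals, with a fixed shape and none of the adaptive machinery of [30] (no splitting, flattening, or rebalancing), may clarify how the size and ones counters guide \(\textsf{select}\) and \(\textsf{selectnext}\):

```python
class Leaf:
    def __init__(self, bits):
        self.bits, self.size, self.ones = bits, len(bits), sum(bits)

    def select(self, j):               # position of the j-th 1 (sequential scan)
        c = 0
        for p, b in enumerate(self.bits, start=1):
            c += b
            if c == j:
                return p

    def selectnext(self, i):           # leftmost 1 at or after position i
        for p in range(i, self.size + 1):
            if self.bits[p - 1] == 1:
                return p
        return None

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.size = left.size + right.size
        self.ones = left.ones + right.ones

    def select(self, j):               # descend guided by the ones counters
        if j <= self.left.ones:
            return self.left.select(j)
        return self.left.size + self.right.select(j - self.left.ones)

    def selectnext(self, i):           # descend guided by the size counters
        if i > self.left.size:         # i lies in the right subtree
            r = self.right.selectnext(i - self.left.size)
        else:
            r = self.left.selectnext(i)        # try the left subtree first
            if r is not None:
                return r
            r = self.right.selectnext(1)       # else restart at the right child
        return None if r is None else self.left.size + r

# Bits 0010 0000 0101: ones at positions 3, 10, 12.
B = Node(Leaf([0, 0, 1, 0]), Node(Leaf([0, 0, 0, 0]), Leaf([0, 1, 0, 1])))
print(B.select(2), B.selectnext(4))    # 10 10
```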

5 Experimental Results

We compare our compact indexing schemes with various state-of-the-art alternatives, in terms of space usage and time for evaluating various types of BGPs.

Our experiments ran on an Intel(R) Xeon(R) CPU E5-2630 at 2.30GHz, with 6 cores, 15 MB cache, 378 GB RAM.

5.1 Datasets and queries

We run a benchmark over the Wikidata graph [46], which we chose for its scale, diversity, prominence, data model (i.e., labeled edges) and real-world query logs [10, 23]. For now we assume that node and label identifiers are integers in a contiguous range, which can be obtained after a suitable preprocessing of the graph; we consider later how to deal with the actual strings. Our Wikidata graph features \(N=958{,}844{,}164\) triples, which take 10.7 GB if stored in plain form using 32 bits for the identifiers.

We consider a real-world query log [23]. In search of challenging examples, we downloaded queries that gave timeouts, and selected queries with a single BGP, obtaining 1,295 unique queries. Those are classified into three categories: (I) 520 BGPs formed by a single triple pattern, which mostly measure the retrieval performance of the index; (II) 580 BGPs with more than one triple but only one variable appearing in more than one triple, which measure the performance of joins but do not distinguish good from bad VEOs (as long as the join variable is eliminated first, of course); (III) 195 complex BGPs, where the performance of different VEOs can be compared.

Example 16

Query \(Q_1\) from Ex. 2 is of type I, whereas query \(Q_2\) is of type III. An example of a query of type II is \(Q = \{ (6,9,x), (x,7,y) \}\), which finds Nobel prize winners and their advisors. \(\square \)

All queries were run with a timeout of 10 minutes and a limit of 1000 results (as originally proposed for WGPB [17]). This measures the time the systems need to display a reasonable number of results. We also compare the systems without the limit of results, which measures throughput in cases where we need all the results. The space of the indices is measured in bytes per triple (bpt); a plain 32-bit storage requires 12 bpt.

We leave the experiments on dynamic representations to the end of the section. Our dynamic CLTJ variants perform identically to the static ones when there are no updates, thanks to the use of dynamic adaptive bitvectors and sequences.

5.2 Compact LTJ variants

Table 1 compares the indices CompactLTJ, xCompactLTJ, and unCompactLTJ described in Section 3, calling them respectively \(\texttt {CLTJ}\), \(\texttt {xCLTJ}\), and \(\texttt {UnCLTJ}\). The versions \(\texttt {CLTJ*}\), \(\texttt {xCLTJ*}\), and \(\texttt {UnCLTJ*}\), in turn, use the VEO predictor described in Section 3.6. All of them compute the VEO in traditional form (“global VEO”) and in adaptive form (Section 3.5). No variant gave any timeout in this experiment.

Table 1 Space and query times of the compact LTJ variants, limiting results to 1000, with global and adaptive VEOs.

The space of the \(\texttt {CLTJ}\) index is just 3.3 times the size of the raw data encoded as a set of n 32-bit triples, whereas \(\texttt {UnCLTJ}\) uses 4.7 times the size (i.e., 42% more than \(\texttt {CLTJ}\)). The reward for using that 42% extra space is not significant, which shows that the space reduction obtained with \(\texttt {CLTJ}\) comes at essentially no loss in time performance. On the other hand, \(\texttt {xCLTJ}\) uses just 2.3 times the size of the raw data, a 30% space reduction over \(\texttt {CLTJ}\). This space reduction comes at a price in time, however: the adaptive variant of \(\texttt {xCLTJ*}\) is 50% slower than that of \(\texttt {CLTJ*}\) on average, and over twice as slow in the median. The extra time is due to the need to switch tries.

While the medians of all the variants are in the range of 0.5–1.5 milliseconds per query, some query strategies yield much more stable times, and thus lower averages. The large difference between average and median query times shows that, although many queries are solved fast, others take much longer, and it is important to handle those better. In particular, combining adaptive VEOs with the modified VEO predictor (Section 3.6) reduces the average query times by almost an order of magnitude, to around 40–80 milliseconds. Using adaptive VEOs alone produces a very modest improvement, and using the modified VEO predictor with global VEOs only halves the time; the sharp improvement is obtained by combining both techniques.

In the sequel we will use only the variants \(\texttt {xCLTJ*}\), \(\texttt {CLTJ*}\), and \(\texttt {UnCLTJ*}\) with adaptive VEOs.

5.3 Comparison with other systems

We now put our results in context by comparing our compact LTJ indices with various graph database systems:

  • Wco systems: Systems that guarantee the AGM bound.

    • Ring [5], a recent compressed in-memory representation that simulates all the 6 tries in a single data structure. Ring-large and Ring-small correspond to the versions called Ring and C-Ring, respectively, in their paper.

    • MillDB [47]: A recently developed open-source graph database. We use here a specialized version that stores six tries in the form of B+-trees and supports full LTJ, with a sophisticated (yet global) VEO. We run MillDB over a RAM disk to avoid using external memory.

    • Jena LTJ [17]: An implementation of LTJ on top of Apache Jena TDB. All six different orders on triples are indexed in B+-trees, so the search is always wco.

  • Non-wco systems: Older systems not reaching the AGM bound, yet well established and optimized.

    • RDF-3X [34]: Indexes a single table of triples in a compressed clustered B+-tree. The triples are sorted and those in each tree leaf are differentially encoded. It handles triple patterns by scanning ranges of triples and uses a query optimizer over pair-wise joins.

    • Virtuoso [13]: The graph database hosting the public DBpedia endpoint, among others. It provides a column-wise index of quads with an additional graph (g) attribute, with two full orders (psog, posg) and three partial indices (so, op, gs) optimized for patterns with constant predicates. It supports nested loop joins and hash joins.

    • Blazegraph [44]: The graph database system hosting the official Wikidata Query Service [23]. We run the system in triples mode, with B+-trees indexing orders spo, pos, and osp. It supports nested-loop joins and hash joins.

  • Beyond-wco systems: Recent systems combining wco and non-wco strategies. We could run only one at this scale.

    • UmbraDB [33]: A system based on relational tables whose query plans use binary joins but may introduce wco plans for some sub-queries. Those plans are executed with LTJ using hash-based tries that are built on the fly as needed. Queries are compiled into executable multithreaded code. Because this system builds most indexes at query time, we measured the memory usage of the process.

We exclude Graphflow [26], Kùzu [19], DuckDB [42], ADOPT [48], and EmptyHeaded [1] because our server does not have enough memory to build or run them (they did not even run on another server with 768 GB of RAM). Most of those are beyond-wco systems. Section 5.5 compares them on harder queries and a smaller graph, where their stronger join strategies can be put in action.

In all systems, the code was compiled with \(\mathsf {{g++}}\) with flags \(\mathsf {{-std=c++11}}\) and \({\mathsf {-O3}}\); some alternatives have extra flags to enable third party libraries. Systems are configured per vendor recommendations.

Table 2 shows the resulting time, space, and timeouts. A first observation is that, while the Ring variants use considerably less space than our smaller variant, \(\texttt {xCLTJ}\) (2.3–3.8 times less space, even less than the raw data), this comes at a considerable price in time performance: the Ring variants are 30–40 times slower than \(\texttt {xCLTJ*}\) on average, and 6–18 times slower in the median. While the small space of the \(\texttt {Ring}\) variants can be crucial to operate in main memory where other representations do not fit, \(\texttt {xCLTJ*}\) (and \(\texttt {CLTJ*}\)) are much faster alternatives when they fit main memory. This is in part because the compressed representation of the Ring takes \(O(\lg n)\) time to access any value, whereas CLTJ variants access them in constant time, and in part because the Ring’s simulation of trie traversal operations is considerably more complex.

Interestingly, the \(\texttt {CLTJ*}\) variants are faster than non-compact wco systems that use 6 tries represented in classic form: MillDB and Jena LTJ. The faster one, MillDB, uses 4 times the space of \(\texttt {CLTJ*}\) and is twice as slow on average and 50 times slower in the median. The classic non-wco systems are somewhat smaller, but still 2–3 times larger than \(\texttt {xCLTJ*}\) and 60–120 times slower on average. UmbraDB outperformed the non-wco systems and Jena LTJ in time, but it is dominated in time and space by MillDB.

Table 2 Space and average query times of various systems, limiting results to 1000.
Table 3 The best performing indices, separated by query type, limiting outputs to 1000 results. Times are in msec.

Table 3 shows how the times distribute across the three query types. It is interesting that MillDB is much slower than the \(\texttt {CLTJ*}\) variants only for query types I and II, which are the easy ones, whereas the average times on the hardest queries, of type III, are closer (and MillDB outperforms \(\texttt {xCLTJ*}\)). This observation, and the consistently larger median times, suggest that MillDB performs some internal setup per query that requires several tens of milliseconds. We return to this point next.

On the other hand, some non-wco indices are competitive with MillDB (and outperform Jena-LTJ) on type I queries, but worsen on type II, and worsen much more on type III, as expected from theory. Similarly, UmbraDB is outperformed by the best non-wco systems by a factor of 2 on type I queries, matches them on type II queries, and sharply outperforms them on type III queries, showing that it handles complex queries well. Still, it is 2–5 times slower than MillDB for all query types (4 times slower in total).

Table 4 Space and query times (in sec) of compact LTJ variants, with Gl(obal) and Ad(aptive) VEOs, not limiting the results. Timeouts count queries exceeding 10 min.
Fig. 6
figure 6

Average time per type-III query as a function of the output size, for both variants of \(\texttt {CLTJ*}\).

5.4 Not limiting the number of results

The case without limits in the number of answers is shown in Table 4. The times are much higher and thus the scale measures seconds. An important difference is that adaptiveness has almost no impact on the times. One reason for this is that now the cost to report so many results dominates the overall query time, thereby reducing the relative impact of using better or worse techniques to produce them.

To confirm this intuition, Fig. 6 shows the average query times as a function of their output size, for \(\texttt {CLTJ*}\) with global and adaptive VEOs. As can be seen, the adaptive variant is much more robust than the global one. When the result is so large that the time to output it dwarfs the time taken to obtain it, both lines become similar. This is typically the case of queries with several lonely variables, whose binding is (wisely) left to the end, so the query must output their Cartesian product. When we average over all the queries, as in Table 4, the average time is dominated by those queries with massive outputs.

The fact that a much larger fraction of the time is spent in outputting results also makes \(\texttt {xCLTJ*}\) similar in average query times to \(\texttt {CLTJ*}\), while still being 30% smaller. Its median times are still twice as large, though.

Tables 5 and 6 show the results of the best performing variants (in time or space), globally and by query type. MillDB fares better than with the limit, becoming similar to the \(\texttt {CLTJ*}\) variants and outperforming them on queries of type I and II, arguably because of the better locality of reference of the B+-trees to report many results. On queries of type III, where the query plan matters most, they all perform similarly.

5.5 Harder queries

We have compared the systems on real-life queries and at large scale. We now study in more depth the ability of various systems to handle particular query shapes that can be difficult, on a smaller dataset. From our previous systems, we include the best wco systems and the beyond-wco system UmbraDB. We also include new systems that could not be run at full scale:

Table 5 Space and query times of the best performing indices, not limiting the results.
Table 6 The best performing indices, separated by query type, without limiting the results.
  • DuckDB [42]: A non-wco relational query engine that stores the tables in columnar form on disk. It uses a vectorized, pipelined execution model to implement optimized binary join plans.

  • Kùzu [19]: A graph query engine that indexes property graphs using unsorted adjacency lists. It uses cost-based dynamic programming to produce plans that mix wco and pairwise joins. For the joins, hash tables are created on the fly from the unsorted lists.

  • Graphflow [26]: The predecessor of Kùzu, which uses instead in-memory sorted adjacency lists.

  • ADOPT [48]: The first wco algorithm using adaptive VEOs on LTJ. It uses exploratory search and reinforcement learning to find near-optimal orders, using actual execution times as feedback on their suitability. We include variants using one and 70 threads.

  • EmptyHeaded [1]: An implementation of a more general algorithm than LTJ, which applies a generalized hypertree decomposition [15] on the queries and uses a combination of wco algorithms [36] and Yannakakis’ algorithm [51]. It offers worst-case time guarantees that are stronger than the AGM bound. Triples are stored in 6 tries (all orders) in main memory.

Table 7 Space in bpt and median time in seconds (timeout is 1800) for various systems on graph \(\texttt {soc\texttt {-}LiveJournal1}\).

Those systems use too much memory on our Wikidata graph. For example, Graphflow stores one structure per predicate, which makes it usable only with few predicates: on a subset containing \(< 10\%\) of our Wikidata graph [5], it failed to build even on a machine with 730 GB of Java heap space. ADOPT did not build correctly either. EmptyHeaded runs, but uses 1810 bpt, over 10 times more than Jena LTJ. DuckDB and Kùzu exceeded the main memory space at query time, even on a machine with 768 GB of RAM.

In this section we compare them over a smaller graph used in previous work [39], \(\texttt {soc\texttt {-}LiveJournal1}\), the largest from the Stanford Large Network Dataset Collection [22], with 68,993,773 unlabeled edges. We test different query shapes (see previous work for a detailed description [39]) including trees (\(\mathsf {1{-}tree}\), \(\mathsf {2{-}tree}\), \(\mathsf {2{-}comb}\)), paths (\(\mathsf {3{-}path}\), \(\mathsf {4{-}path}\)), paths connecting cliques (\(\mathsf {2{-}3{-}lollipop}\), \(\mathsf {3{-}4{-}lollipop}\)), cliques (\(\mathsf {3{-}cliques}\), \(\mathsf {4{-}cliques}\)), and cycles (\(\mathsf {3{-}cycles}\), \(\mathsf {4{-}cycles}\)). We include 10 queries for each tree, path, and lollipop shape, and 1 for each clique and cycle. This is the same benchmark used for ADOPT [48], except that we do not force the clique and cycle variables to be different, and we choose as the constant any random value such that the query has occurrences. We set a 30-minute timeout and do not limit the number of results.

Since there are no labels, the Ring variants need not store the data for predicates, and the compact LTJ solutions store only two orders, \(\textsc {pso}\) and \(\textsc {pos}\) (ps and pos with partial tries). Graphflow is tested on the cliques and cycles only because the implementation does not support constants in the BGPs.

Table 7 shows spaces and times. Interestingly, \(\texttt {CLTJ*}\) and \(\texttt {UnCLTJ*}\) get close to the space of the compressed Ring solutions, and \(\texttt {xCLTJ*}\) uses significantly less. Graphflow, ADOPT/Kùzu and EmptyHeaded/DuckDB use 2, 3, and over 4 times more space, respectively, than \(\texttt {CLTJ*}\). MillDB uses 5.5 and UmbraDB uses over 8 times more space.

The tree and path queries are solved in around a millisecond (and in many cases a tenth of a millisecond) by the \(\texttt {CLTJ*}\) variants. The largest version, \(\texttt {UnCLTJ*}\), is only 5%–15% faster than \(\texttt {xCLTJ*}\) (with only one exception, where it is 45% faster) and almost twice as large. The next best systems (the larger Ring, ADOPT, and MillDB, all of them wco) are many times slower than the \(\texttt {CLTJ*}\) variants in all cases.

In more detail, the slower Ring is 7 to 130 times slower than \(\texttt {xCLTJ*}\), and the faster Ring is 2.5 to 11 times slower. DuckDB, the non-wco system, consistently takes between half a second and one second on all these queries, which is 3–4 orders of magnitude slower than \(\texttt {xCLTJ*}\). MillDB, the best wco system from previous experiments, is much faster than DuckDB, but still 1–2 orders of magnitude slower than \(\texttt {xCLTJ*}\). ADOPT, the remaining wco system, is 3–4 orders of magnitude slower than \(\texttt {xCLTJ*}\) on these queries (parallelization does not help in this case). The performance of beyond-wco systems is varied. The most stable one is UmbraDB, which takes 28–85 milliseconds on those queries, being 2–3 orders of magnitude slower than \(\texttt {xCLTJ*}\). Kùzu also fares well (though not as well as UmbraDB) on some queries, but much worse on others, being 3–4 orders of magnitude slower than \(\texttt {xCLTJ*}\). Only EmptyHeaded shows times comparable to those of \(\texttt {xCLTJ*}\) (at most 4 times slower) on the smaller queries, but it is still up to 3 orders of magnitude slower on the larger ones.

The lollipop shapes are harder to solve, but the \(\texttt {CLTJ*}\) variants still handle the larger one in less than 15 seconds, and are an order of magnitude faster than all the other alternatives, except UmbraDB, which is only 1.7 times slower than \(\texttt {xCLTJ*}\) on the smaller shape, and MillDB, which is about 2.5 times slower in both shapes. The next best performing systems are the large Ring and the parallel ADOPT.

EmptyHeaded finally takes over on the hardest shapes, cliques and cycles, where it is 3–6 times faster than Graphflow, 7–8 times faster than the parallel ADOPT, 30 times faster than UmbraDB, 50 times faster than the \(\texttt {CLTJ*}\) variants, and 110 times faster than MillDB, when those other systems do not time out. We note that the \(\texttt {CLTJ*}\) variants are still faster than sequential ADOPT on these shapes.

5.6 The dynamic case

We now show how efficient our update algorithms are and, more importantly, how enabling dynamism affects the query time performance of our compact representations. We build the indices on a randomly chosen \(80\%\) of the Wikidata graph, and the remaining \(20\%\) is used for insertions. On those indices, we run the 1,295 queries from the Wikidata query log in different scenarios, without limiting the number of results. The scenarios vary the number of updates per query: from 1000 queries between each pair of updates to 1000 updates between each pair of queries. Each update can be an insertion or deletion of a triple; the type of operation and the triple are chosen at random.

Table 8 shows the individual insertion and deletion times for some systems, which are averaged from the updates performed under the scenario of 1000 updates per query. We compare \(\texttt {xCLTJ*}\) with MillDB and Virtuoso; the other systems have considerably higher update times. The update times of \(\texttt {xCLTJ*}\) are around 10 milliseconds, outperforming those of Virtuoso; MillDB is instead three times faster.

Table 8 Average time in milliseconds for each type of update in different systems.
Fig. 7 Average time performance of some indices with different numbers of updates per query, without limiting the number of results (\(x<1\) means 1/x queries per update). The configuration with 0 updates is run on the static data structure; the others on the dynamic one.

Fig. 7 illustrates the query time performance obtained when queries are mixed with updates. For \(\texttt {xCLTJ*}\), the version without updates (i.e., the static case) yields an average time of 11.3 seconds per query. The performance worsens as the number of updates increases, but only mildly, up to 14.4 seconds with 1000 updates per query. Hence, even in highly dynamic scenarios with huge amounts of updates, the average query times worsen by less than 30%. The times of Virtuoso, which uses fully dynamic data structures (not adaptive ones, like \(\texttt {xCLTJ*}\)), stay over 30 seconds per query. The times of MillDB, which also uses a fully dynamic data structure (a B+-tree), increase at about the same pace as those of \(\texttt {xCLTJ*}\), probably due to progressively more fragmented B+-tree leaves (the structure built on 80% of the triples with bulk-loading produces full leaves).

We also measure the space per triple of the dynamic version after all the updates and queries are performed. Notice that the cases are not fully comparable because the represented triples are not the same. In general, the dynamic version uses around \(10\%\) more space than the static one. As the number of updates increases to 1000, however, this overhead rises to \(18\%\). Even in such an extreme scenario, the space of \(\texttt {xCLTJ*}\) on the whole Wikidata graph would increase to 33 bpt. The conclusions we have drawn over the static version remain essentially unaltered in the dynamic case.

6 A Complete System

There is a gap between a proof-of-concept research prototype and a system that can be readily used by practitioners. Although we do not aim for full-fledged software, we have invested in closing this gap.

6.1 String identifiers

An important aspect that prototypes neglect is that the node and label identifiers in RDF are strings (IRIs and literals). Assuming that both are integers in a contiguous range greatly simplifies developments and lets researchers focus on the most important aspects of complex query processing. A real system, however, must map the query strings into those integer identifiers, and the identifiers of the resulting triples into their corresponding strings. While translating the query strings to identifiers is not time-critical, because they are short, the efficiency of translating the resulting triples is relevant in queries that output many results. Of the systems we have compared with, MillDB, Jena LTJ, RDF-3X, Virtuoso, Blazegraph, UmbraDB, Kùzu, and DuckDB do handle strings, while the Ring, Graphflow, ADOPT, and EmptyHeaded do not.

Storing the string identifiers is challenging when we aim at compact representations, because the strings may add up to a significant size if stored in uncompressed form. In our dataset of Section 5.1, the strings in plain form occupy 12.4 GB, more than the triples in integer-encoded form. In a static scenario, it has been shown [5] that compressed string dictionaries [25] can represent those strings in just 3.68 additional bytes per triple. The translation of each output triple takes 7–14 microseconds (the effect would be hardly noticeable in Table 2, for example, as times would grow by 7–14 milliseconds). While their static dictionary can be plugged directly into CompactLTJ, the dynamic version of CompactLTJ requires a dynamic compact dictionary where strings can be inserted and deleted. This is challenging because most of the compact static dictionary representations [25] cannot be maintained upon updates.

Static dictionaries may store the strings in some convenient order (typically lexicographic) and let the string identifiers be their position in that order. This is not possible in a dynamic dictionary, as one would need to update a massive number of identifiers in the collection when a string is inserted or deleted. For this reason, we store both the strings \(s_i\) and their identifiers \(k_i\). Using Front Coding, we represent a dictionary \(\mathcal {D} = \{\langle s_0, k_0\rangle , \langle s_1, k_1\rangle , \dots , \langle s_n, k_n\rangle \}\), listed in lexicographic order, as follows:

$$\begin{aligned} FC [0]&= k_0 \cdot s_0,\\ FC [i]&= k_i \cdot \texttt {lcp}(s_{i-1}, s_i) \cdot \texttt {suffix}(s_i, \texttt {lcp}(s_{i-1}, s_i)), \end{aligned}$$

where \(\cdot \) represents byte concatenation, \(\texttt {lcp}(s_i, s_j)\) is the longest common prefix between strings \(s_i\) and \(s_j\), and \(\texttt {suffix}(s_i, j)\) is the suffix of \(s_i\) starting after position j. The integers for the identifiers \(k_i\) and \(\texttt {lcp}\) are encoded with VByte [50], while \(\texttt {suffix}\) is stored in plain form.

To retrieve the string \(s_i\) we need to read all the previous ones. To reduce that cost, we split \(\mathcal {D}\) into buckets of maximum length \(\sigma _{\max }\), which is then the maximum cost to retrieve a string by starting from the first string \(s_0\) of its bucket (note that, to scan a bucket, we must iteratively obtain all its strings \(s_i\), but the time spent is \(\sigma _{\max } \le \sum _i |s_i|\)).
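As an illustration, here is a minimal sketch of front-coding one bucket and recovering its i-th string by a left-to-right scan. The struct and function names are hypothetical, and the integers are kept as plain fields rather than VByte-encoded, to focus on the lcp/suffix logic:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One front-coded entry: (identifier, lcp with previous string, suffix).
struct FcEntry {
    uint32_t id;       // identifier k_i
    uint32_t lcp;      // longest common prefix with the previous string
    std::string tail;  // suffix of s_i after the shared prefix
};

// Encode a lexicographically sorted list of (string, id) pairs as a bucket.
std::vector<FcEntry> fc_encode(
        const std::vector<std::pair<std::string, uint32_t>>& dict) {
    std::vector<FcEntry> bucket;
    for (size_t i = 0; i < dict.size(); ++i) {
        uint32_t l = 0;
        if (i > 0) {  // compute lcp(s_{i-1}, s_i)
            const std::string& prev = dict[i - 1].first;
            const std::string& cur  = dict[i].first;
            while (l < prev.size() && l < cur.size() && prev[l] == cur[l]) ++l;
        }
        bucket.push_back({dict[i].second, l, dict[i].first.substr(l)});
    }
    return bucket;
}

// Recover the i-th string by scanning the bucket from its first entry,
// rebuilding each intermediate string incrementally.
std::string fc_decode(const std::vector<FcEntry>& bucket, size_t i) {
    std::string s;
    for (size_t j = 0; j <= i; ++j)
        s = s.substr(0, bucket[j].lcp) + bucket[j].tail;
    return s;
}
```

The decoding cost is proportional to the bucket scan, matching the \(O(\sigma _{\max })\) retrieval bound once buckets are capped at length \(\sigma _{\max }\).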

To easily find the identifier of a string \(s \in {\mathcal {D}}\), a binary search tree is maintained with one bucket per leaf, in left-to-right order. Each internal node stores a pointer to the bucket stored at the leftmost leaf of its right child; note that the string \(s_0\) of the bucket is stored in plain form. By comparing s with the strings \(s_0\) of the pointed buckets, we descend from the root of the tree to the leaf storing the bucket where s belongs, and then scan that bucket. The total search time to obtain the identifier k of s (or determine that \(s \not \in {\mathcal {D}}\)) is then \(O(|s|\lg n + \sigma _{\max })\).

The most pressing operation on dictionaries, however, is the opposite: mapping an identifier k to its corresponding string s (this has to be done for each result). To optimize that operation, we store an array \(A[1\mathinner {.\,.}n]\), where A[k] stores a pointer to the bucket that contains \(\langle s, k \rangle \). In that way, s is reconstructed by just scanning one bucket until finding the identifier \(k_i = k\), and then retrieving \(s_i\), in time \(O(\sigma _{\max })\).

The dictionary supports string deletions and insertions. Deleting the string s with identifier k implies removing its pair \(\langle s_i, k_i \rangle = \langle s,k\rangle \) from the bucket pointed to by A[k]. After the removal we must reencode \(s_{i+1}\), now with respect to \(s_{i-1}\). The bucket is essentially rewritten, in time \(O(\sigma _{\max })\). The entry A[k] is now obsolete and the identifier k is free to be used by another insertion. A linked list of those free identifiers is maintained using the same obsolete entries A[k] to store the “next” pointers. Hence, after the removal of k, A[k] becomes the first element of that list. For simplicity we do not enforce policies of minimum bucket sizes, except that we remove the leaf corresponding to an empty bucket (and remove an internal node too). Deletion costs \(O(\sigma _{\max })\) time.
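The free-identifier list costs no extra space because it is threaded through the obsolete entries of A themselves. A minimal sketch of this allocator, with hypothetical names, 1-based identifiers, and 0 acting as the null pointer:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Free-list sketch: A[k] holds a bucket index while identifier k is
// live; once k is freed, the same entry A[k] stores the next free
// identifier instead, so the free list needs no extra space.
struct IdAllocator {
    std::vector<uint32_t> A;  // A[k-1]: bucket index or next free id
    uint32_t free_head = 0;   // first free identifier (0 = list empty)
    uint32_t n = 0;           // highest identifier handed out so far

    uint32_t allocate() {
        if (free_head == 0) {         // no freed ids: extend A
            A.push_back(0);           // placeholder bucket index
            return ++n;               // fresh identifier n+1
        }
        uint32_t k = free_head;
        free_head = A[k - 1];         // pop: A[k] held the next free id
        return k;                     // reuse identifier k
    }

    void release(uint32_t k) {        // k's string was deleted
        A[k - 1] = free_head;         // push k onto the free list
        free_head = k;
    }
};
```

Freed identifiers are reused in LIFO order; only when the list is empty does a fresh identifier \(n+1\) extend A.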

To insert a new string s we first find an identifier k for it. This is the first element of the linked list (which is then removed from the list), or the value \(n+1\) if the list is empty (and now \(n+1\) is the number of strings). We use the binary tree to find the bucket into which s must be inserted, and make A[k] point to it. We scan the bucket up to the insertion point, and rewrite it from there. Concretely, if s falls between \(s_{i-1}\) and \(s_{i}\), we insert \(\langle s,k\rangle \) after \(s_{i-1}\), encoding s with respect to \(s_{i-1}\), and reencode \(s_i\) with respect to s. In case the bucket overflows, we split it in half and create a new internal node as the parent of both halves. The insertion cost is \(O(|s|\lg n + \sigma _{\max })\).

Fig. 8 Space/time trade-off in complete systems handling strings, limiting results to 1000. To emulate a realistic dynamic scenario for the CLTJ variants, we use their query times when there are 0.1 updates per query, as in Wikidata.

In our experimental setup, the dynamic dictionary adds an extra space of 6.81 bpt, and transforms the identifiers of each output triple to strings in around 8.6 microseconds.

Fig. 8 compares the systems that handle strings, measuring query times of the CLTJ variants under a realistic dynamic scenario of 0.1 updates per query, as in Wikidata (recall Section 4). The space usage of \(\texttt {xCLTJ*}\) rises to 37.13 bpt, which is still about 2/3 of the space used by Virtuoso, the smallest system handling strings, and 4 times less than the space of MillDB. It is still 53 times faster than Virtuoso and 5% faster than MillDB. Our slightly larger alternative, \(\texttt {CLTJ*}\), now uses 53.51 bpt, which is still 19% less than Virtuoso and 3 times less than MillDB. In exchange, it is 40% faster than \(\texttt {xCLTJ*}\), 75 times faster than Virtuoso, and 48% faster than MillDB.

6.2 Public software

The code is publicly available at this GitHub repository: https://github.com/adriangbrandon/cltj. The software includes the benchmarks of our experimental evaluation. For each query, the benchmark outputs the identifier of the query, the number of results, and the elapsed time in nanoseconds, separated by semicolons. We provide the bash scripts used to build the indices and execute the benchmark queries.

In addition, we developed an API to simplify the usage of our software. To demonstrate how to use the API, we provide one example of a command-line tool through which you can interact with the interface directly, and others that show how to use the API from C++ code. More information about how to use the API is available in the Readme.md file of the repository. The datasets used in the experiments are available at https://zenodo.org/records/15117967.

7 Conclusions

We have shown that it is possible to implement the Leapfrog Triejoin (LTJ) algorithm, which solves Basic Graph Patterns on graph databases in worst-case-optimal (wco) time, within affordable space usage and without giving up on time performance. More precisely, we introduced a representation we call CompactLTJ, which uses one bit per trie edge instead of one pointer, while supporting trie navigation functionality in time similar to a classic pointer-based representation. Further, we implemented trie switching, which allows us to store partial tries that retain the same CompactLTJ functionality and good performance, while slashing its space usage.

The fastest classic LTJ implementation we are aware of, MillenniumDB [47], uses about 14 times the space needed to represent the graph triples in plain form (i.e., each as three 32-bit integers). Our smallest CompactLTJ variant reduces this factor to 2.3—a 5.5-fold space reduction—while retaining MillenniumDB’s time performance, and surpassing it in many cases. Other classic representations, many of which are non-wco, use 2–3 times the space used by CompactLTJ and are 30–40 times slower.

We also implemented a dynamic version of CompactLTJ, which enables efficient insertion and deletion of triples with little overhead on top of the space and time of the static implementation. This breaks a long-standing limitation of compact representations and puts CompactLTJ on par with the functionality needed in real scenarios.

These results can change the landscape of indices for graph databases, as they show that it is feasible to implement the wco LTJ algorithm in memory within little space—much less than what is used by popular non-wco systems, in both static and dynamic scenarios. We have also explored some techniques—adaptive variable elimination orders and new predictors of the cost of choosing a variable—that make CompactLTJ considerably more robust on the bad cases of the standard solution.

Finally, we enabled CompactLTJ to handle string identifiers, not just integers, as most RDF systems do. We left a public implementation that allows researchers and practitioners to use our best static and dynamic variants, handling integer or string identifiers, under various convenient modes of operation. We expect this to be useful for benchmarking, for research, and for using CompactLTJ in actual deployments.

7.1 Future work

Our experiments showed that more sophisticated “beyond-wco” indices, like Graphflow [26], ADOPT [48], and EmptyHeaded [1], were faster than CompactLTJ on some query shapes that are very hard to handle. A promising future work direction is to implement those query strategies on top of compact data structures, which could lead to even stronger indices that are space-affordable (recall that those stronger systems exceeded a generous amount of main memory when run over our dataset, so reducing their space usage is very valuable). An analogous challenge is to use compact data structures to represent factorized databases [40], using wco or non-wco strategies.

We remark that CompactLTJ runs in main memory and would not be disk-friendly. While its compactness makes it fit in memory for larger datasets, a relevant future work direction is to design compact representation formats for disk or distributed memory, where compactness translates into fewer I/Os or communication at query resolution time.

Finally, we plan to extend CompactLTJ to handle property graphs, where nodes have attributes of various types, with their own query operators. Extending the supported functionality to regular path queries (RPQs) is another significant step. While this has already been done using compact data structures [6, 7, 32], the integration with BGPs into conjunctive RPQs (CRPQs) is a significant challenge.