1 Introduction

Constructing optimal alignments between a trace and a process model is a key task in conformance checking. Unfortunately, the algorithmic complexity of alignments is a major bottleneck in practice. It can be shown that computing optimal alignments on safe and sound workflow nets is \(\textsf{PSPACE}\)-complete. One approach to overcome the intractability is to consider (syntactic) restrictions on the process models and to make use of the additional structure to speed up the alignment computation. Along these lines, we recently showed that computing optimal alignments on process trees is in \(\textsf{NP}\) and we gave a novel Mixed Integer Linear Programming (MILP) formulation which outperforms the state-of-the-art alignment algorithms in PM4Py [20]. In this work, we reconsider process trees, but with the further restriction that each activity label occurs at most once in the process tree (i.e., process trees with unique labels).

In many real-life scenarios, process models have a tree-like structure, meaning that the full process decomposes into subprocesses that are interconnected in a tree-like fashion. In process mining, this kind of process model has been formalized as the concept of process trees. Process trees have gained considerable popularity in the process mining community, most importantly because they form the basis for a widely used family of mining algorithms, the so-called Inductive Miner [15]. It is fair to say that process trees provide a good trade-off between expressiveness and computational efficiency.

But there is more to the story: process trees produced by the Inductive Miner have unique activity labels, meaning that each activity label occurs at most once in the process tree. This clearly is a strong restriction, but it is precisely this assumption that makes the Inductive Miner tractable in practice. For us, it was reason enough to reconsider the alignment problem on process trees and ask whether the unique label property can be exploited to speed up the alignment computation even further, in particular: can alignments on process trees with unique labels be computed in polynomial time?

In this paper, we answer this question affirmatively. We give a new efficient (polynomial-time) dynamic programming algorithm to compute optimal alignments between a trace and a process tree with unique labels. This places the alignment problem for process trees with unique labels in \(\textsf{P}\). Our key observation is that the unique label property allows us to handle the parallel operator efficiently. The parallel operator models independent parallel computation and corresponds to a parallel gateway in BPMN or to the shuffle operator in formal languages. In fact, without the restriction to unique labels, the parallel operator requires the exploration of an exponential number of possible alignments, which brings us to the realm of \(\textsf{NP}\) for general process trees. Besides the parallel operator, we further use the unique label property to speed up the computation for the sequence operator: we show how to restrict the set of possible splits of a trace with respect to an optimal alignment, which saves a large number of recursive calls in the dynamic programming algorithm. We implemented our new algorithmic approach as a proof of concept based on the PM4Py ecosystem [4] and evaluated it on a set of real-life benchmark logs. Our experiments show that the dynamic programming algorithm is competitive with the state-of-the-art alignment algorithms in PM4Py and even outperforms them in some cases. This underlines our belief that the structure of process models should be better taken into account when solving the alignment problem in practice.

2 Related Work

Alignments [3] are the state-of-the-art technique for conformance checking [7, 8]. Besides the (textbook) algorithm based on \(A^*\), several algorithmic approaches have been explored to compute alignments, e.g., see [5, 11, 16] for a technique based on Linear Programming (LP) to improve the \(A^*\)-heuristics, or [21] for an approximation scheme based on Mixed Integer Linear Programming (MILP). Other approaches use decomposition techniques to tackle large process model instances, e.g., see [1].

Process trees were first applied by [2, 6] in the context of genetic process discovery. Since then, process trees have proven to be a modeling language with a good balance between expressiveness and algorithmic simplicity. In particular, they form the basis of one of the most popular process discovery algorithms, the so-called Inductive Miner [13, 14, 15]. Consequently, optimized algorithms for alignment computations on process trees have been studied. Most notably, [19] proposed an approximation algorithm which performs well on many process trees, but which does not guarantee optimality in all cases. We also refer to our MILP formulation for the alignment problem on process trees [20].

Finally, alignments for process trees have been studied much earlier in the context of the error correction problem for regular languages with shuffle operator, e.g., see [18] (under a different name). Our new algorithmic approach can be transferred to this field as well, where, to the best of our knowledge, the unique label property has not been studied before.

3 Preliminaries

Let \(\mathbb N\) (\(\mathbb N_0\)) be the set of natural numbers excluding 0 (including 0). For an n-tuple \(a \in A_1 \times \dots \times A_n\), \(\pi _i(a)\) denotes the projection on its ith element, i.e., \(\pi _i:A_1\times \dots \times A_n\rightarrow A_i,(a_1,\dots ,a_n)\mapsto a_i\).

Definition 1

(Alphabet). An alphabet \(\varSigma \) is a finite, non-empty set of labels (also referred to as activities).

Definition 2

(Sequence). Sequences with index set I over a set A are denoted by \(\sigma =\langle a_i\rangle _{i\in I}\in A^I\). The length of a sequence \(\sigma \) is written as \(|\sigma |\) and the set of all finite sequences over A is denoted by \(A^*\). For a sequence \(\sigma =\langle a_i\rangle _{i\in I}\in A^I\), \(\sum \sigma \) is a shorthand for \(\sum _{i\in I}a_i\). The restriction of a sequence \(\sigma \in A^*\) to a set \(B\subseteq A\) is the subsequence \(\sigma \vert _{B}\) of \(\sigma \) consisting of all elements in B. A function \(f:A\rightarrow B\) can be applied to a sequence \(\sigma \in A^*\) via the recursive definition \(f(\langle \rangle ):=\langle \rangle \) and \(f(\langle a\rangle \cdot \sigma ):=\langle f(a)\rangle \cdot f(\sigma )\). For a sequence of tuples \(\sigma \in (A^n)^*\), \(\pi ^*_i(\sigma )\) denotes the sequence of the ith elements of its tuples, i.e., \(\pi ^*_i(\langle \rangle ):=\langle \rangle \) and \(\pi ^*_i(\langle (a_1,\dots ,a_n)\rangle \cdot \sigma ):=\langle \pi _i(a_1,\dots ,a_n)\rangle \cdot \pi ^*_i(\sigma )=\langle a_i\rangle \cdot \pi ^*_i(\sigma )\). As an important extension of \(\pi _i^*\), we write \(\pi _i^B\) for the composition of \(\pi _i^*\) with the restriction to B, i.e., \(\pi ^B_i(\sigma ):=\pi _i^*(\sigma )\vert _{B}\).
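To make the projection notation concrete, the following small Python snippet illustrates \(\sigma \vert _B\), \(\pi _i^*\), and \(\pi _i^B\) on a toy sequence of moves; the helper names are purely illustrative and do not correspond to any PM4Py function.

```python
# Illustrative helpers (names are ours, not from the paper or PM4Py).

def restrict(sigma, B):
    """sigma|_B: keep only the elements of sigma that lie in B."""
    return [a for a in sigma if a in B]

def project(sigma, i):
    """pi_i^*(sigma): take the i-th component (1-based) of every tuple."""
    return [t[i - 1] for t in sigma]

def project_restrict(sigma, i, B):
    """pi_i^B(sigma): project to the i-th components, then restrict to B."""
    return restrict(project(sigma, i), B)

# Example: a sequence of moves over Sigma = {a, b} with skip symbol '>>'.
SIGMA = {"a", "b"}
gamma = [("a", "a"), ("b", ">>"), (">>", "a")]

print(project(gamma, 1))                  # ['a', 'b', '>>']
print(project_restrict(gamma, 1, SIGMA))  # ['a', 'b']   (= pi_1^Sigma)
print(project_restrict(gamma, 2, SIGMA))  # ['a', 'a']   (= pi_2^Sigma)
```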

We identify languages of traces \(\mathcal L \subseteq \varSigma ^*\) with sets of (observed) behavior of a (business) process. Each trace corresponds to a single process execution (also known as a case). The symbols in the trace correspond to the events or activities that occurred. In this article, we study process trees as a modeling mechanism for business processes. Each process tree \(T\) defines a language \(\mathcal {L}(T)\subseteq \varSigma ^*\) of possible process behaviors. Before we give the definition, we recall a central operator which captures independent parallel computations.

Definition 3

(Shuffle \(\mathbin {\shuffle }\)). For \(x,y \in \varSigma ^*\), the shuffle of x and y is

$$\begin{aligned} x \mathbin {\shuffle } y :=\lbrace x_1 y_1 x_2 y_2 \cdots x_n y_n \,|\,n \in \mathbb N,\ x = x_1 \cdots x_n,\ y = y_1 \cdots y_n,\ x_i, y_i \in \varSigma ^*\rbrace . \end{aligned}$$

Let \(\mathcal L_1, \mathcal L_2 \subseteq \varSigma ^*\). The shuffle of \(\mathcal L_1\) and \(\mathcal L_2\) is defined as

$$\begin{aligned} \mathcal L_1 \mathbin {\shuffle } \mathcal L_2 :=\bigcup _{x_1 \in \mathcal L_1,\,x_2 \in \mathcal L_2} x_1 \mathbin {\shuffle } x_2. \end{aligned}$$
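For illustration, a naive recursive enumeration of the shuffle of two small traces might look as follows (illustrative code only; it enumerates all interleavings and is exponential in the trace lengths).

```python
def shuffle(x, y):
    """Naive recursive enumeration of all interleavings of the traces x and y."""
    if not x:
        return [tuple(y)]
    if not y:
        return [tuple(x)]
    with_x = [(x[0],) + rest for rest in shuffle(x[1:], y)]
    with_y = [(y[0],) + rest for rest in shuffle(x, y[1:])]
    return with_x + with_y

# Example: <a, b> shuffled with <c> yields <a,b,c>, <a,c,b>, <c,a,b>.
print(shuffle(("a", "b"), ("c",)))
```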

Definition 4

(Process Trees). Let \(\varSigma \) be an alphabet and let \(\tau \notin \varSigma \) be the silent activity. The set of process trees (over \(\varSigma \)) is defined recursively:

  • each activity \(a\in \varSigma \) and the silent activity \(\tau \) is a process tree,

  • \({{\,\mathrm{\rightarrow }\,}}(T_1,\dots ,T_n)\), \({{\,\mathrm{\times }\,}}(T_1,\dots ,T_n)\), \({{\,\mathrm{\circlearrowleft }\,}}(T_1,T_2)\), and \({{\,\mathrm{\wedge }\,}}(T_1,\dots ,T_n)\) are process trees with \(T_1,\dots ,T_n\), \(n\in \mathbb N\) being process trees as well.

The symbols \({{\,\mathrm{\rightarrow }\,}}\) (sequence), \({{\,\mathrm{\times }\,}}\) (exclusive choice), \({{\,\mathrm{\circlearrowleft }\,}}\) (loop), and \({{\,\mathrm{\wedge }\,}}\) (parallel) are process tree operators. The language of a process tree \(T\) is denoted by \(\mathcal {L}(T)\) and is also recursively defined where

  • \(\mathcal {L}(\tau )=\lbrace \langle \rangle \rbrace \) and \(\mathcal {L}(a)=\lbrace \langle a\rangle \rbrace \),

  • \(\mathcal {L}({{\,\mathrm{\rightarrow }\,}}(T_1,\dots ,T_n))=\mathcal {L}(T_1)\cdot \ldots \cdot \mathcal {L}(T_n)\),

  • \(\mathcal {L}({{\,\mathrm{\times }\,}}(T_1,\dots ,T_n))=\mathcal {L}(T_1)\cup \ldots \cup \mathcal {L}(T_n)\),

  • \(\mathcal {L}({{\,\mathrm{\circlearrowleft }\,}}(T_1,T_2))=\mathcal {L}(T_1)\cdot (\mathcal {L}(T_2)\cdot \mathcal {L}(T_1))^*\), and

  • \(\mathcal {L}({{\,\mathrm{\wedge }\,}}(T_1,\dots ,T_n))=\mathcal {L}(T_1)\mathbin {\shuffle }\ldots \mathbin {\shuffle }\mathcal {L}(T_n)\).

In order to simplify notation in this article, from now on, we consider the process tree operators \(\lbrace {{\,\mathrm{\rightarrow }\,}}, {{\,\mathrm{\times }\,}}, {{\,\mathrm{\circlearrowleft }\,}}, {{\,\mathrm{\wedge }\,}}\rbrace \) in their binary form only. This also allows us to use infix notation, e.g., \(T_1 {{\,\mathrm{\rightarrow }\,}}T_2\) instead of \({{\,\mathrm{\rightarrow }\,}}(T_1, T_2)\). This is no restriction, since the general n-ary version can easily be rewritten using binary operators (all operators are associative). For a process tree \(T\), let \( Letters (T) \subseteq \varSigma \) denote the set of all labels occurring in \(T\). Inductively, \( Letters (T)\) is defined as: \( Letters (\tau ) = \emptyset \), \( Letters (a) = \lbrace a\rbrace \) for \(a \in \varSigma \), and for all binary operators we have

$$\begin{aligned} & Letters (T_1 {{\,\mathrm{\rightarrow }\,}}T_2) = Letters (T_1 {{\,\mathrm{\times }\,}}T_2) = Letters (T_1 {{\,\mathrm{\wedge }\,}}T_2) = \\ & \quad \qquad \qquad \qquad \qquad \qquad Letters (T_1 {{\,\mathrm{\circlearrowleft }\,}}T_2) = Letters (T_1) \cup Letters (T_2). \end{aligned}$$

A process tree \(T\) has unique labels if for all binary operators \(\textsf {op} \in \lbrace {{\,\mathrm{\rightarrow }\,}}, {{\,\mathrm{\times }\,}}, {{\,\mathrm{\wedge }\,}}, {{\,\mathrm{\circlearrowleft }\,}}\rbrace \) and all subtrees \((T_1 \textsf { op } T_2)\) that occur in \(T\) we have \( Letters (T_1) \cap Letters (T_2) = \emptyset \). Note that the silent activity \(\tau \) can occur multiple times in process trees \(T\) with unique labels.
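For concreteness, a possible Python encoding of (binary) process trees, the Letters function, and the unique-label check could look as follows; the data structure is illustrative and not the one used in PM4Py or in our implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Leaves are either an activity label (a string) or the silent activity tau (None).
# Inner nodes carry a binary operator from {'->', 'X', '/\\', 'O'} and two subtrees.

@dataclass(frozen=True)
class Tree:
    op: Optional[str] = None          # None for leaves
    label: Optional[str] = None       # activity label, or None for tau
    children: Tuple["Tree", ...] = ()

def letters(t: Tree) -> set:
    """Letters(T): the set of activity labels occurring in T (tau is excluded)."""
    if t.op is None:
        return set() if t.label is None else {t.label}
    return set().union(*(letters(c) for c in t.children))

def has_unique_labels(t: Tree) -> bool:
    """T has unique labels iff the label sets of the two subtrees of every
    binary operator in T are disjoint."""
    if t.op is None:
        return True
    left, right = t.children
    return (letters(left).isdisjoint(letters(right))
            and has_unique_labels(left) and has_unique_labels(right))

# Example: ->( a, /\( b, c ) ) has unique labels; ->( a, a ) does not.
a, b, c = Tree(label="a"), Tree(label="b"), Tree(label="c")
T = Tree(op="->", children=(a, Tree(op="/\\", children=(b, c))))
print(letters(T), has_unique_labels(T))                     # {'a', 'b', 'c'} True
print(has_unique_labels(Tree(op="->", children=(a, a))))    # False
```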

Definition 5

(Moves, Alignments). Let \(\varSigma \) be an alphabet and let \({\gg }\) be a fresh symbol not in \(\varSigma \). We use \({\gg }\) to indicate a skip in the trace or model and define \(\varSigma _{{\gg }} :=\varSigma \cup \lbrace {\gg }\rbrace \) as the alphabet extended by the skip-symbol \({\gg }\). We define \( Moves (\varSigma )\subseteq \varSigma _{\gg }\times \varSigma _{\gg }\) as the set of all moves over \(\varSigma \) given by

$$\begin{aligned} Moves (\varSigma ) :=& \qquad \lbrace (a,a) |a \in \varSigma \rbrace &\,& \textit{synchronous moves} \\ & \cup \lbrace (a, {\gg }) |a \in \varSigma \rbrace &\,& {\textit{model moves}} \\ & \cup \lbrace ({\gg }, a) |a \in \varSigma \rbrace &\,& {\textit{log moves}}. \end{aligned}$$

An alignment \(\gamma \in Moves (\varSigma )^*\) between \(w \in \varSigma ^*\) and a process tree \(T\) is a sequence of moves \(\gamma = \langle m_1, \dots , m_n\rangle \) such that \(\pi ^\varSigma _1(\gamma )= w\) and \(\pi ^\varSigma _2(\gamma ) \in \mathcal {L}(T)\).

In other words, \(\gamma \) forms an alignment if the first components of each move in \(\gamma \) yield the trace w (when we remove all skip symbols \({\gg }\)) and the second components yield a trace in the language of the process tree \(T\) (again without skip symbols). Intuitively, we aim to modify the trace w (the first component) such that it becomes a trace in the language of the process tree \(T\) (the second component). From this point of view, a log move \((a,{\gg })\) deletes the symbol a from w while a model move \(({\gg },b)\) inserts the symbol b into the trace w.

We determine the costs of an alignment \(\gamma \) by summing up the costs c(m) of the individual moves m in \(\gamma \), i.e., the total costs are \(\sum c(\gamma )\), where synchronous moves have cost 0 and, with respect to the standard cost function, log and model moves have cost 1 (other cost functions are possible). The set of all alignments between a trace w and a process tree \(T\) is denoted by \(\varGamma (w,T)\). An optimal alignment \(\gamma _ opt \in \varGamma (w,T)\) is an alignment with minimal costs \(\sum c(\gamma _ opt )\) among all alignments in \(\varGamma (w,T)\).
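As a small worked example under the standard cost function, the following illustrative snippet checks the two defining conditions of an alignment (here against a model given by an explicit, finite language) and sums up the move costs.

```python
SKIP = ">>"

def is_alignment(gamma, trace, model_traces):
    """gamma is an alignment of `trace` w.r.t. a model whose language is
    given, for illustration, as an explicit set `model_traces`."""
    log_side = [a for a, _ in gamma if a != SKIP]
    model_side = [b for _, b in gamma if b != SKIP]
    return log_side == list(trace) and tuple(model_side) in model_traces

def cost(gamma):
    """Standard cost function: synchronous moves cost 0, log/model moves cost 1."""
    return sum(0 if a == b else 1 for a, b in gamma)

# Trace <a, c> against a model with language {<a, b>}:
gamma = [("a", "a"), ("c", SKIP), (SKIP, "b")]   # sync move, log move, model move
print(is_alignment(gamma, ["a", "c"], {("a", "b")}), cost(gamma))  # True 2
```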

4 Structure of Process Tree Alignments

Process trees have an inductive definition which lends itself to recursive algorithms. We next show that this inductive structure carries over to the set of alignments as well.

For a trace \(w \in \varSigma ^*\) of length n, \(w = \langle w_1, w_2, \dots , w_n\rangle \), we call a mapping \(\varphi :\lbrace 1, \dots , n\rbrace \rightarrow \lbrace 1,2\rbrace \) a factorization of w. For a factorization \(\varphi \) of w we define \(\varphi _1 \in \varSigma ^*\) as the trace that results by concatenating all symbols \(w_i\) with \(\varphi (i) = 1\). Likewise, \(\varphi _2 \in \varSigma ^*\) denotes the trace that results by concatenating all symbols \(w_i\) with \(\varphi (i) = 2\). For the special case where \(n = 0\), we only have a single factorization \(\varphi = \emptyset \) with \(\varphi _1 = \varphi _2 = \langle \rangle \). We write \(\varPhi (w)\) to denote the set of all factorizations of w. Note the connection between factorizations and the shuffle operator: for \(w, w_1, w_2 \in \varSigma ^*\) we have \(w \in w_1 \mathbin {\shuffle } w_2\) if and only if there exists a factorization \(\varphi \) of w such that \(\varphi _1 = w_1\) and \(\varphi _2 = w_2\). In this sense, the factorization can be seen as a kind of inverse of the shuffle operator.
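The connection between factorizations and the shuffle operator can be made explicit by a naive enumeration (illustrative code, exponential in |w|): every assignment of positions to \(\lbrace 1,2\rbrace \) yields one factorization \((\varphi _1, \varphi _2)\), and w is an interleaving of \(\varphi _1\) and \(\varphi _2\).

```python
from itertools import product

def factorizations(w):
    """All mappings phi: {1,...,n} -> {1,2}, returned as pairs (phi_1, phi_2)."""
    n = len(w)
    for phi in product((1, 2), repeat=n):
        phi1 = [w[i] for i in range(n) if phi[i] == 1]
        phi2 = [w[i] for i in range(n) if phi[i] == 2]
        yield phi1, phi2

# For w = <a, b> there are 2^2 = 4 factorizations.
for phi1, phi2 in factorizations(["a", "b"]):
    print(phi1, phi2)
# ['a', 'b'] []   /   ['a'] ['b']   /   ['b'] ['a']   /   [] ['a', 'b']
```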

Theorem 1

(Structure of Alignments over Process Trees). Let \(T_1\) and \(T_2\) be process trees and \(w \in \varSigma ^*\) be a trace. Then the following holds.

$$\begin{aligned} \varGamma (w, T_1 {{\,\mathrm{\rightarrow }\,}}T_2) &= \bigcup _{w_1 \cdot w_2 = w} \varGamma (w_1, T_1) \cdot \varGamma (w_2, T_2) \end{aligned}$$
(1)
$$\begin{aligned} \varGamma (w, T_1 {{\,\mathrm{\times }\,}}T_2) &= \varGamma (w, T_1) \cup \varGamma (w, T_2) \end{aligned}$$
(2)
$$\begin{aligned} \varGamma (w, T_1 {{\,\mathrm{\wedge }\,}}T_2) &= \bigcup _{\varphi \in \varPhi (w)} \lbrace \gamma \in \varGamma (\varphi _1, T_1) \mathbin {\shuffle } \varGamma (\varphi _2, T_2) \,|\,\pi ^\varSigma _1(\gamma ) = w\rbrace \end{aligned}$$
(3)
$$\begin{aligned} \varGamma (w, T_1 {{\,\mathrm{\circlearrowleft }\,}}T_2) &= \smash {\bigcup _{k\in \mathbb N_0}} \lbrace \varGamma (w_0, T_1) \cdot \varGamma (y_1, T_2) \cdot \varGamma (z_1, T_1) \cdots \varGamma (y_k, T_2) \cdot \varGamma (z_k, T_1) |\nonumber \\ & \qquad \qquad w = w_0 y_1z_1 \dots y_k z_k , w_0, y_i, z_i \in \varSigma ^*, 1\le i\le k \rbrace \end{aligned}$$
(4)

Proof

Ad (1): Let \(T= T_1 {{\,\mathrm{\rightarrow }\,}}T_2\) and \(w \in \varSigma ^*\) be a trace. We show that \(\varGamma (w,T) = \bigcup _{w_1 \cdot w_2 = w} \varGamma (w_1, T_1) \cdot \varGamma (w_2, T_2)\). The direction \(\supseteq \) is obvious, so let us focus on the direction \(\subseteq \). Let \(\gamma \in \varGamma (w,T)\). Since \(T= T_1 {{\,\mathrm{\rightarrow }\,}}T_2\), we find \(y_1 \in \mathcal {L}(T_1)\) and \(y_2 \in \mathcal {L}(T_2)\) such that \(\pi ^\varSigma _2(\gamma ) = y_1 \cdot y_2\). Hence, we can write \(\gamma = \gamma _1 \cdot \gamma _2\) with \(\pi ^\varSigma _2(\gamma _1) = y_1\) and \(\pi ^\varSigma _2(\gamma _2) = y_2\). Define \(w_1 = \pi ^\varSigma _1(\gamma _1)\) and \(w_2 = \pi ^\varSigma _1(\gamma _2)\). Then we have \(w = w_1 \cdot w_2\) and \(\gamma _1 \in \varGamma (w_1, T_1)\) and \(\gamma _2 \in \varGamma (w_2, T_2)\).

Ad (2): Straightforward.

Ad (3): For \(\supseteq \), observe that for a projection operator \(\pi \) (of the form \(\pi _i^B\)) and sequences x, y we have \(\pi (x \mathbin {\shuffle } y) \subseteq \pi (x) \mathbin {\shuffle } \pi (y)\). For the direction \(\subseteq \), let \(\gamma \in \varGamma (w,T)\). Let \(y = \pi ^\varSigma _2(\gamma )\). Since \(T= T_1 {{\,\mathrm{\wedge }\,}}T_2\), we find a factorization \(\varphi \) of y such that \(y_1 :=\varphi _1 \in \mathcal {L}(T_1)\) and \(y_2 :=\varphi _2 \in \mathcal {L}(T_2)\). We lift this factorization to a factorization of \(\gamma \) by assigning to each log move m in \(\gamma \) the value 2 (the choice of 2 is arbitrary and we could have chosen 1 as well). Call the resulting factorization \(\psi \) and let \(\gamma _1 = \psi _1\) and \(\gamma _2 = \psi _2\). Let \(w_1 = \pi ^\varSigma _1(\gamma _1)\) and \(w_2 = \pi ^\varSigma _1(\gamma _2)\). Then \(w \in w_1 \mathbin {\shuffle } w_2\), since \(w = \pi ^\varSigma _1(\gamma )\) and \(\gamma \in \gamma _1 \mathbin {\shuffle } \gamma _2\). Moreover, \(\gamma _1 \in \varGamma (w_1, T_1)\) and \(\gamma _2 \in \varGamma (w_2, T_2)\) since \(\pi ^\varSigma _2(\gamma _1) = y_1\) and \(\pi ^\varSigma _2(\gamma _2) = y_2\) (we have only assigned new log moves to the second component of the alignment). This concludes the argument for the parallel operator.

Ad (4): We can get a decomposition analogously as for the sequence operator (1) using the semantics of the loop operator \(T_1{{\,\mathrm{\circlearrowleft }\,}}T_2\) as \(\mathcal {L}(T_1) \cdot (\mathcal {L}(T_2) \cdot \mathcal {L}(T_1))^*\).    \(\square \)

From Theorem 1 we can derive a recursive algorithm for computing an optimal alignment between a trace w and a process tree \(T\). Let \( Cost (w,T)\) denote the minimal costs of an alignment in \(\varGamma (w,T)\), i.e.,

$$\begin{aligned} Cost (w,T) = \min \lbrace \sum c(\gamma ) |\gamma \in \varGamma (w,T)\rbrace . \end{aligned}$$

Then, we have the following recursive procedure for computing \( Cost (w,T)\).

Theorem 2

(Recursive Computation of Alignment Costs). Let \(T_1\) and \(T_2\) be process trees and \(w \in \varSigma ^*\) a trace. Then the following holds.

$$\begin{aligned} Cost (w, T_1 {{\,\mathrm{\rightarrow }\,}}T_2) &= \min _{w_1 \cdot w_2 = w} \lbrace Cost (w_1, T_1) + Cost (w_2, T_2) \rbrace \\ Cost (w, T_1 {{\,\mathrm{\times }\,}}T_2) &= \min \lbrace Cost (w, T_1),\, Cost (w, T_2) \rbrace \\ Cost (w, T_1 {{\,\mathrm{\wedge }\,}}T_2) &= \min _{\varphi \in \varPhi (w)} \lbrace Cost (\varphi _1, T_1) + Cost (\varphi _2, T_2) \rbrace \\ Cost (w, T_1 {{\,\mathrm{\circlearrowleft }\,}}T_2) &= \min _{\begin{array}{c} k\in \mathbb N_0,\\ w = w_0 y_1 z_1 \dots y_k z_k \end{array}} \Big \lbrace Cost (w_0, T_1) + \sum _{i=1}^{k} \big ( Cost (y_i, T_2) + Cost (z_i, T_1) \big ) \Big \rbrace \end{aligned}$$

Proof

This follows immediately from Theorem 1. Note that for the case of the parallel operator (3) we have dropped the condition \(\pi _1^{\varSigma }(\gamma ) = w\). This can be justified as follows. If \(\varphi \in \varPhi (w)\) and \(\gamma \in \gamma _1 \mathbin {\shuffle } \gamma _2\) with \(\gamma _1 \in \varGamma (\varphi _1, T_1)\) and \(\gamma _2 \in \varGamma (\varphi _2, T_2)\), then the costs of all alignments in \(\gamma _1 \mathbin {\shuffle } \gamma _2\) are the same (they consist of the same moves) and for at least one of them we have \(\pi _1^{\varSigma }(\gamma ) = w\).    \(\square \)

The missing base cases for \( Cost (w,T)\) can be determined easily.

Theorem 3

(Alignment Costs for Base Cases). Let \(w \in \varSigma ^*\) be a trace and \(a \in \varSigma \). Then \( Cost (w,\tau ) = |w|\). Moreover, \( Cost (w,a) = |w| + 1\) if a does not occur in w and \( Cost (w,a) = |w| -1\) otherwise.

5 A Dynamic Programming Algorithm

The recursive computation of the optimal alignment costs \( Cost (w,T)\) can be turned into a dynamic programming algorithm. We use the formulae from Sect. 4 and avoid recomputation for identical subproblems by storing the results of \( Cost (w,T)\) in a table which we denote by \( CostTable \), see Algorithm 1.

Algorithm 1. Dynamic Programming Algorithm to Compute \( Cost (w,T)\)
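Since the pseudocode of Algorithm 1 is not reproduced here, the following minimal Python sketch illustrates the underlying memoized recursion; it follows Theorems 2 and 3 directly, handles the parallel operator by naively enumerating all factorizations (and is therefore still exponential), and omits the loop operator, which is treated below. All names are illustrative and do not correspond to our actual implementation.

```python
from functools import lru_cache
from itertools import product

# Trees are nested tuples (op, T1, T2) with op in {'->', 'X', '/\\', 'O'};
# leaves are an activity label (str) or None for the silent activity tau.

@lru_cache(maxsize=None)                           # memoization = CostTable
def cost(w, t):
    """Optimal alignment costs Cost(w, T); w is a tuple of activity labels."""
    if t is None:                                  # tau: delete every letter
        return len(w)
    if isinstance(t, str):                         # single activity a (Theorem 3)
        return len(w) - 1 if t in w else len(w) + 1
    op, t1, t2 = t
    if op == "X":                                  # exclusive choice
        return min(cost(w, t1), cost(w, t2))
    if op == "->":                                 # sequence: try every split
        return min(cost(w[:i], t1) + cost(w[i:], t2) for i in range(len(w) + 1))
    if op == "/\\":                                # parallel: every factorization
        options = []
        for phi in product((1, 2), repeat=len(w)):
            w1 = tuple(a for a, p in zip(w, phi) if p == 1)
            w2 = tuple(a for a, p in zip(w, phi) if p == 2)
            options.append(cost(w1, t1) + cost(w2, t2))
        return min(options)
    raise NotImplementedError("loop operator handled separately, see below")

# Example: trace <a, c, b> against ->( a, /\( b, c ) ) needs no edits at all.
T = ("->", "a", ("/\\", "b", "c"))
print(cost(("a", "c", "b"), T))   # 0
```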

As presented, Algorithm 1 has exponential runtime. First, the number of recursive calls required for the loop operator \(T_1 {{\,\mathrm{\circlearrowleft }\,}}T_2\) corresponds to the (exponential) number of decompositions of w into subtraces \(w = w_0 y_1 z_1 \ldots y_k z_k\). However, this blowup can be avoided. Consider a graph on the positions of w, \(n:=|w|\), with an edge from position p to position q, \(0 \le p \le q \le n\), with costs

$$\begin{aligned} \min _{p\le r\le q} \lbrace Cost (w[p,r], T_2) + Cost (w[r,q], T_1) \rbrace . \end{aligned}$$

These are the costs of aligning the subtrace w[p,q] of w against the process tree \(T_2 {{\,\mathrm{\rightarrow }\,}}T_1\). In turn, a path from p to n corresponds to a partition of the suffix w[p,n] of w into segments (each edge yields one segment) where each segment is aligned against \(T_2 {{\,\mathrm{\rightarrow }\,}}T_1\). Specifically, the costs of a cost-minimal path from p to n are the costs of an optimal alignment of w[p,n] against \((\mathcal {L}(T_2) \cdot \mathcal {L}(T_1))^*\). Hence, these costs can be determined efficiently with a shortest-path algorithm. This yields polynomial runtime for the loop operator.
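A sketch of this shortest-path construction is given below (illustrative code; `align_cost` stands for a cost oracle such as the memoized recursion sketched above). Since all edges go from left to right, a simple right-to-left dynamic program over the positions suffices in place of Dijkstra's algorithm.

```python
def loop_cost(w, t1, t2, align_cost):
    """Cost(w, T1 O T2) via shortest paths over the positions of w.

    `align_cost(v, t)` is assumed to return Cost(v, t) for a subtrace v
    (e.g. the memoized recursion sketched above).  Positions run from 0 to n,
    and w[p:q] denotes the segment between positions p and q.
    """
    n = len(w)

    def edge(p, q):
        # Cost of aligning w[p:q] against T2 -> T1 (one loop iteration).
        return min(align_cost(w[p:r], t2) + align_cost(w[r:q], t1)
                   for r in range(p, q + 1))

    # dist[p] = cost of aligning the suffix w[p:] against (L(T2) . L(T1))*.
    # All edges point forward, so a right-to-left DP computes shortest paths.
    dist = [0] * (n + 1)
    for p in range(n - 1, -1, -1):
        dist[p] = min(edge(p, q) + dist[q] for q in range(p + 1, n + 1))

    # Finally choose the prefix w[:p] that is aligned against T1 directly.
    return min(align_cost(w[:p], t1) + dist[p] for p in range(n + 1))

# Tiny demo with leaf trees T1 = 'a', T2 = 'b' and the leaf costs of Theorem 3.
def leaf_cost(v, a):
    return len(v) - 1 if a in v else len(v) + 1

print(loop_cost(("a", "b", "a"), "a", "b", leaf_cost))   # 0 (trace is in a(ba)*)
```

Plugging `loop_cost` into the loop case of the memoized recursion removes the exponential enumeration of decompositions.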

The second problematic case is the parallel operator \(T_1 {{\,\mathrm{\wedge }\,}}T_2\). Here, we have to consider all factorizations of the trace w into two subtraces \(w_1\) and \(w_2\) which is an exponential number (in the length of w). In contrast to the loop operator, this exponential search cannot be avoided in general (unless \(\textsf{P}= \textsf{NP}\)). However, for process trees with unique labels, the situation is different.

Process Trees with Unique Labels. Let us reconsider the case \(T= T_1 {{\,\mathrm{\wedge }\,}}T_2\) where

$$\begin{aligned} Cost (w,T) = \min _{\varphi \in \varPhi (w)} \lbrace Cost (\varphi _1, T_1) + Cost (\varphi _2, T_2) \rbrace , \end{aligned}$$

from Theorem 2 for process trees with unique labels. Let \(L_1 = Letters (T_1)\) and \(L_2 = Letters (T_2)\) be the sets of labels occurring in \(T_1\) and \(T_2\), respectively. We claim that we can restrict the set of factorizations to a singleton. Indeed, each letter \(w_i\) of w either belongs to \(L_1\) or to \(L_2\) (or to neither of them). For example, if \(w_i = a\) and we know that \(a \in L_1\), then it cannot reduce the alignment costs if we assign a to \(T_2\). In fact, in \(T_2\) we have to delete a anyway (log move \((a,{\gg })\)) and we could do exactly the same in \(T_1\) (without increasing costs). In other words, without loss of generality we can assume that \(\varphi (i) = 1\). We can argue analogously for letters in \(L_2\). If the letter a at position i belongs neither to \(L_1\) nor to \(L_2\), then we can assign it to \(T_1\) or \(T_2\) without changing the alignment costs (in both cases, a deletion move \((a,{\gg })\) is unavoidable). Arbitrarily, we assign such letters to \(T_2\). In conclusion, for the single factorization \(\varphi ^\star \) with \(\varphi ^\star (i)=1\) if \(w_i \in L_1\) and \(\varphi ^\star (i)=2\) otherwise, we have:

$$\begin{aligned} Cost (w,T) = Cost (\varphi ^\star _1, T_1) + Cost (\varphi ^\star _2, T_2). \end{aligned}$$
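Hence, under the unique label property the parallel case collapses to a single deterministic split of the trace; a minimal sketch of this split (illustrative helper names) is given below.

```python
def split_parallel(w, letters_t1):
    """The single factorization phi* for T = T1 /\ T2 with unique labels:
    letters of T1 go to the left part, everything else to the right part."""
    w1 = tuple(a for a in w if a in letters_t1)
    w2 = tuple(a for a in w if a not in letters_t1)
    return w1, w2

# Example: Letters(T1) = {a, b}; letters outside both subtrees (here 'x') go right.
print(split_parallel(("a", "x", "c", "b"), {"a", "b"}))   # (('a', 'b'), ('x', 'c'))
```

The parallel case of the recursion then simply returns the sum of the two subcosts for \(\varphi ^\star _1\) and \(\varphi ^\star _2\).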

With these adaptations, Algorithm 1 becomes polynomial-time. To see this, let w be the input trace and let \(T\) denote the input tree. Let \(n=|w|\). A first observation is that the total number of entries in \( CostTable \) is bounded by \(\mathcal O(|T| \cdot n^2)\). This is because each entry \((v,T')\) in \( CostTable \) is determined by \(T'\) together with a segment w[i,j], \(1\le i \le j \le n\), of the original input trace w (indeed, v either is the segment w[i,j] itself or the restriction of w[i,j] to letters that occur in \(T'\)). This is in contrast to the case of process trees with non-unique labels, where the shuffle operator would produce an exponential number of recursive calls for its subtrees (and the corresponding traces v could not easily be described as segments of the original input trace). With the same argument, \(\mathcal O(|T| \cdot n^2)\) is a bound on the number of recursive calls of the function \(\textsc {Cost}(w,T)\).

Secondly, we bound the runtime of each call of \(\textsc {Cost}(w,T)\) (excluding the recursive calls). Going through the different cases, it can be seen that the most expensive step is the shortest-path computation for the loop operator. Here, we compute a cost-minimal path on a graph with \(\mathcal O(|v|)\) nodes. Since \(|v| \le n\), and since shortest paths can be computed in time quadratic in the number of vertices (e.g., using Dijkstra's algorithm), the total runtime of a call of \(\textsc {Cost}(w,T)\) is bounded by \(\mathcal O(n^2)\) (not counting the recursive calls it spawns). Altogether this yields a runtime bound of \(\mathcal O (|T| \cdot n^4)\).

Theorem 4

(Dynamic Programming Algorithm for Process Trees with Unique Labels). The costs \( Cost (w,T)\) of an optimal alignment between a process tree \(T\) with unique labels and a trace w can be computed efficiently in time \(\mathcal O(|T| \cdot |w|^4)\).

6 Evaluation

We implemented our novel alignment algorithm in Python (see Footnote 1) and compared its runtime against the available algorithms in PM4Py [4] on a set of real-life event logs. We would like to discuss one further algorithmic idea which led to a significant speed-up on the benchmarks. Consider the sequence operator \(T_1 {{\,\mathrm{\rightarrow }\,}}T_2\) and recall that

$$\begin{aligned} Cost (w,T) = \min \limits _{w_1 \cdot w_2 = w} \lbrace Cost (w_1, T_1) + Cost (w_2, T_2) \rbrace . \end{aligned}$$

A naive implementation checks n splits of w into subtraces \(w_1\) and \(w_2\), where \(n = |w|\). Let \(L = Letters (T_1)\) and \(R = Letters (T_2)\) be the sets of labels occurring in the (left) subtree \(T_1\) and the (right) subtree \(T_2\), respectively. Because of the unique label property, \(L \cap R = \emptyset \). Let us label the letters in w with L if they belong to L and with R if they belong to R. Call the resulting \(\lbrace L,R\rbrace \)-trace \( decomp (w)\). Of course, w could contain letters that belong neither to L nor to R. Such letters can be removed in a preprocessing step (they incur deletion costs anyway). Hence, we can assume that \( decomp (w)\) and w have the same length. We claim that it is sufficient to check only the following split positions of the trace w: \( seg = \lbrace 1, n\rbrace \cup \lbrace i: decomp (w)_i = L \wedge decomp (w)_{i+1} = R\rbrace \).

To see why, let \(i \in seg \) be a position with a flip from L- to R-labels. Then, by definition, this position is followed by R-labels. The alignment costs can only increase for a split within the upcoming R-segment, because R-labels assigned to the left subtree \(T_1\) have to be deleted. Moreover, if the R-segment is followed by an L-segment, then it pays to include as many L-labels as possible before the next split (since L-labels necessarily incur deletion costs in the right subtree \(T_2\)). Hence, the next optimal split can only be after the last L-label (either at the end of the trace or right before the next R-label).
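A sketch of this restricted split computation (illustrative helper names; positions are 1-based as above, and w is assumed to be preprocessed so that every letter belongs to L or R):

```python
def candidate_splits(w, letters_left):
    """Split positions 'seg' for Cost(w, T1 -> T2) under unique labels.

    Assumes the non-empty trace w has been preprocessed so that every letter
    belongs either to Letters(T1) (label 'L') or to Letters(T2) (label 'R').
    """
    decomp = ["L" if a in letters_left else "R" for a in w]
    n = len(w)
    seg = {1, n}
    seg.update(i for i in range(1, n)            # 1-based position i
               if decomp[i - 1] == "L" and decomp[i] == "R")
    return sorted(seg)

# Example: w = <a, b, x, y, a> with L = {a, b}, R = {x, y}.
print(candidate_splits(("a", "b", "x", "y", "a"), {"a", "b"}))
# [1, 2, 5]  -- split after position 2 (L->R flip), plus positions 1 and 5
```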

We compared our algorithm (Dynamic) against the standard (\(A^*\)-based) algorithm for computing alignments in PM4Py (Standard) and an approximation algorithm in PM4Py tailored to process trees (Approx). For each algorithm and trace variant, we took the best out of 5 repetitions (i.e., the minimum time required to compute the costs of an optimal alignment). To visualize the results, we computed performance factors for each trace variant: we took the best runtime and divided the runtime of each of the three algorithms by this optimal runtime (trace-variant-wise). For instance, a performance factor of 2 indicates that the algorithm took twice as long as the best algorithm for the given trace. We set a timeout of 65 s (instead of, say, 60 s) to compensate for overhead and to give each algorithm a fair chance to finish its computation within one minute. If a computation hits the timeout in one of the repetitions, the algorithm is considered to have failed on the trace/model pair. In the charts below, we plot the empirical CDF of the performance factors for each of the three algorithms. If the frequencies of performance factors of an algorithm do not sum up to 1, this indicates that the algorithm ran into timeouts on a certain fraction of the instances.

Log Data and Results. The general picture is that our algorithm (Dynamic) is very close to the approximation algorithm (Approx) and, in almost all cases, clearly outperforms the standard algorithm (Standard). Let us start with the BPI Challenge 2019 event log [12]. We used the Inductive Miner [15] to discover process trees with different noise thresholds (0%, 10%, 25% and 50%) and aligned 1000 randomly chosen trace variants from the log against the resulting process trees. The CDF of the performance factors is depicted in Fig. 1. The analogous graph for the Hospital Billing event log [17] can be found in Fig. 2.

On the BPI Challenge 2017 event log [10], our algorithm is slightly superior to Approx for noise thresholds of 10% and 50%, while it is slightly below the performance of Approx for thresholds of 0% and 25%. On the BPI Challenge 2012 event log [9], our algorithm is slightly superior to Approx for noise thresholds of 0% and 10%, while it is slightly below the performance of Approx for thresholds of 25% and 50%. This is with respect to the CDF of performance factors. To give some further results, Table 1 depicts the median computation times of the three algorithms for the runs on the BPI Challenge 2012 and 2017 event logs with respect to the different noise thresholds (0%, 10%, 25%, 50%).

Table 1. Median computation times (in seconds) of Dynamic, Approx, and Standard in PM4Py for the BPI Challenge 2012 and 2017 event logs with different noise thresholds.
Fig. 1. CDF of the performance factors of Dynamic, Approx, and Standard in PM4Py on the BPI Challenge 2019 event log [12] with different noise thresholds (one panel per threshold: pt00, pt10, pt25, pt50; x-axis: performance factor, y-axis: cumulative probability).

Fig. 2. CDF of the performance factors of Dynamic, Approx, and Standard in PM4Py on the Hospital Billing event log [17] with different noise thresholds (one panel per threshold: pt00, pt10, pt25, pt50; x-axis: performance factor on a logarithmic scale, y-axis: cumulative probability).

7 Conclusion

We proved that the alignment problem for process trees with unique labels can be solved in polynomial time using dynamic programming. A proof-of-concept implementation in Python demonstrates that our algorithm is competitive with (and in some cases outperforms) the existing techniques of the PM4Py library. We also discussed how the algorithm can be further optimized in practice.

This article is part of a broader research agenda in which we try to better understand the structure and algorithmic complexity of the alignment problem. We identified an interesting, and practically relevant, class of process models for which the alignment problem can be solved in polynomial time. This is the exception rather than the rule, since the alignment problem has high complexity in general (\(\textsf{PSPACE}\)-hard for sound workflow nets). Our work leads to many questions for future research. For example, it would be interesting to study relaxations of the unique label property and the influence of such parameters on the complexity. Also, it would be interesting to see how restrictive the unique label property really is. Can we characterize the event logs that can be defined using process trees with unique labels? And, as sound workflow nets with unique labels are more powerful than process trees with unique labels (in terms of modeling power), what is the complexity of the alignment problem for sound workflow nets with unique labels?