1 Introduction

The methods from the scientific field of AI, and in particular Machine Learning (ML), are increasingly applied to tasks in socially sensitive domains. Due to their predictive power, ML models are used within banks for credit risk assessment [1], aid admissions decisions for new students within universities [2], and aid bail decision-making within courts [3]. Algorithmic decisions in these settings can have far-reaching impacts, potentially increasing disparities within society. Numerous notorious examples exist of algorithms causing harm in this regard. In 2015, Google Photos' new image recognition model classified some black individuals as gorillas [4], which led to the removal of the category within Google Photos. A report by Amnesty International concluded that the Dutch Tax & Customs Administration used a model for fraud prediction that discriminated against people with multiple nationalities [5].

ML should clearly be applied responsibly, which has given rise to a field that considers the fairness of algorithmic decisions. Fair ML is a field within AI concerned with assessing and developing fair ML models. Fairness in this sense closely relates to equality between groups and individuals. The main notion within the field is that models should not be biased, that is, have a tendency to over- or underperform for certain (groups of) individuals. This notion of bias is different from the canonical definition of bias in statistics, i.e. the difference between an estimator’s expected value and the true value. Essentially, similar individuals should be treated similarly, and decisions should not lead to unjust discrimination. Non-discrimination laws that apply to AI exist within the EU [6, 7].

An additional property that responsible ML models should have is that they are interpretable. Models whose decisions can be explained are preferred, as they aid decision-making processes affecting real people. In a loan application setting, users have the right to know how a decision came about [8]. The field of Explainable Artificial Intelligence (XAI) is concerned with building models that are interpretable and explainable. As legislation is often worked out as a set of rules, we expect Decision Trees (DTs) to form a significant portion of the critical algorithms used in governance.

Inherently, ML models use data. There is thus also a tension between the use of these models and privacy, especially for socially sensitive tasks. Individuals have several rights when it comes to data storage, such as the right to be removed from a database [6]. It is also beneficial for entities to guarantee privacy so that more individuals trust the entity with their data. Some data storage practices are discouraged, such as the collection of several protected attributes [6]. These attributes, and thus the storage practices around them, are sensitive. Examples include the religion, marital status, and gender of individuals. In industrial settings, numerous data leaks have occurred. Social media platforms are especially notorious for privacy violations, with Facebook even incurring data breaches on multiple occasions [9, 10]. The report by Amnesty International also concluded that, in the Dutch childcare benefits scandal, the Dutch Tax & Customs Administration failed to safely handle the sensitive private data of thousands of individuals, while also using a biased model [5]. This work investigates these three pillars of Responsible AI, proposing a novel method that lies at the intersection of these three themes.

To assess and improve fairness precisely, one needs the sensitive attributes of the individuals that an ML model was trained on. However, these are often absent or only available to a limited degree, due to privacy considerations. Exactly here lies the focal point of this work: the assessment of the fairness of ML models, while respecting the privacy of the individuals in the dataset. These antagonistic goals make for a novel, highly constrained, and hence difficult problem. A focus is placed on DTs, a class of interpretable models from XAI, since these types of models are likely to be used in critical tasks involving humans due to the GDPR (in particular Art. 22) [6] and its national implementations. There are thus four goals we try to optimize in this work: fairness, privacy, interpretability, and predictive performance.

1.1 Research questions

The main goal of this work is to develop a method that can estimate the fairness of an interpretable model with high accuracy while respecting privacy. A method, named Privacy-Aware Fairness Estimation of Rules (PAFER), is proposed that can estimate the fairness of a class of interpretable models, DTs, while respecting privacy. The method is thus at the intersection of these three responsible AI pillars. The research questions (RQs), along with their research subquestions (RSQs), are:

RQ1: What is the optimal privacy mechanism that preserves privacy and minimizes average Statistical Parity error?

RSQ1.1: Is there a statistically significant mean difference in Absolute Statistical Parity error between the Laplacian mechanism and the Exponential mechanism?

RQ2: Is there a statistically significant difference between the Statistical Parity errors of PAFER compared to other benchmarks for varying Decision Tree hyperparameter values?

RSQ2.1: At what fractional minleaf value is PAFER significantly better at estimating Statistical Parity than a random baseline?

RSQ2.2: At what fractional minleaf value is a model-level querying approach significantly better at estimating Statistical Parity than PAFER?

1.2 Outline

The remainder of the paper is organized as follows. Section 2 provides the theoretical background, followed by Section 3, which covers the related literature. Section 4 describes the novel method that is proposed in this work. Subsequently, Section 5 describes the performed experiments, their results, and a thorough analysis. Finally, Section 6 concludes with limitations and future directions.

2 Preliminaries

This section discusses work related to the research objectives and provides background to the performed research. Fairness theory is described in Section 2.1, Section 2.2 provides background on interpretable models, and Section 2.3 explains notions of privacy.

2.1 Fairness definitions

Fairness in an algorithmic setting relates to the way an algorithm handles different (groups of) individuals. Unjust discrimination is often the subject when examining the behavior of algorithms with respect to groups of individuals. For this work, only fairness definitions relating to supervised ML were studied, as this is the largest research area within algorithmic fairness.

In 2016, the number of papers related to fairness surged, partly due to new regulations such as the European GDPR [6] and partly due to a popular article by ProPublica which examined racial disparities in recidivism prediction software [11]. Because of the young age of the field and the sudden rise in activity, numerous definitions of fairness have been proposed since. Most of the definitions simultaneously hold multiple names; this section aims to include as many of these names as possible for each definition.

The performance-oriented nature of the ML research field accelerated the development of fairness metrics, quantifying the fairness for a particular model. The majority of the definitions can therefore also be seen, or rewritten, as a measuring stick for the fairness of a supervised ML model. This measurement may be on a scale, which is the case for most group fairness definitions, or binary, which is the case for some causal fairness definitions.

The fairness definitions, namely the mathematical measures of fairness, can be categorized into group fairness, individual fairness and causal fairness. Considering the space limitations and the relevance to our work, in this section, we will focus on group fairness and provide the definitions of the most prominent measures used in the literature. Group fairness is the most popular type of fairness definition as it relates most closely to unjust discrimination. Individuals are grouped based on a sensitive, or protected attribute, \(A\), which partitions the population. Some attributes are protected by law, for example, gender, ethnicity and political opinions. This partition is often binary, for instance when \(A\) denotes a privileged and unprivileged group. In this subsection, we assume a binary partition for ease of notation but all mentioned definitions can be applied to \(\mathcal {K}\)-order partitions. An example of such an intersectional fairness definition is provided in Section 2.1.4.

The setting for these definitions is often the binary classification setting where \(Y \in \{0, 1\}\), with \(Y\) as the outcome. This is partly due to ease of notation, but more importantly, the binary classification setting is common in impactful prediction tasks. Examples of impactful prediction tasks are granting or not granting a loan [1], accepting or not accepting students to a university [2] and predicting recidivism after a certain period [3]. In each setting, a clear favorable (1) and unfavorable (0) outcome can be identified. Thus, we assume the binary classification setting in the following definitions.

2.1.1 Statistical parity

Statistical Parity (SP) is a decision-based definition, which compares the different positive prediction rates for each group [12]. SP, also known as demographic parity, equal acceptance rate, total variation or the independence criterion, is by far the most popular fairness definition. The mathematical definition is:

$$\begin{aligned} {\text {SP}} = p(\hat{Y}=1|A=1) - p(\hat{Y}=1|A=0), \end{aligned}$$
(1)

where \(\hat{Y}\) is the decision of the classifier. An example of SP would be the comparison of the acceptance rates of males and females to a university.

Note that (1) is the SP-difference but the SP-ratio also exists. US law adopts this definition of SP as the 80%-rule [13]. The 80%-rule states that the ratio of the acceptance rates must not be smaller than 0.8, i.e. 80%. Formally:

$$\begin{aligned} {\text {80\%-rule}} \; = 0.8 \le \frac{p(\hat{Y}=1|A=1)}{p(\hat{Y}=1|A=0)} \le 1.25, \end{aligned}$$
(2)

where the fraction is the SP-ratio. SP is easy to compute and merely uses the model’s predictions. SP therefore does not require labelled data. These advantages make it one of the most used fairness definitions.
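To make the computation concrete, the following Python sketch (our own illustration with toy data) computes the SP-difference of (1) and checks the 80%-rule via the SP-ratio of (2).

```python
import numpy as np

def statistical_parity(y_pred, a):
    """SP-difference of (1) and SP-ratio of (2) for binary predictions y_pred
    and a binary sensitive attribute a (1 = privileged, 0 = unprivileged)."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    rate_priv = y_pred[a == 1].mean()    # p(Y_hat = 1 | A = 1)
    rate_unpriv = y_pred[a == 0].mean()  # p(Y_hat = 1 | A = 0)
    sp_diff = rate_priv - rate_unpriv
    sp_ratio = rate_priv / rate_unpriv   # the fraction used in the 80%-rule
    return sp_diff, sp_ratio

# Toy data: 6 privileged and 6 unprivileged individuals.
y_hat = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
a     = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
diff, ratio = statistical_parity(y_hat, a)
print(f"SP-difference: {diff:.2f}")
print(f"80%-rule satisfied: {0.8 <= ratio <= 1.25}")
```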

2.1.2 Equalized odds

Another, also very common, fairness definition is the Equalized Odds (EOdd) metric [14]. It is also known as disparate mistreatment or the separation criterion. EOdd requires that the probabilities of being correctly positively classified and the probabilities of being incorrectly positively classified are equal across groups. Thus, the definition is twofold; both false positive classification probability and true positive classification probability should be equal across groups. Formally:

$$\begin{aligned} {\text {EOdd}} = p(\hat{Y}=1|Y=y,A=1) - p(\hat{Y}=1|Y=y,A=0), \; \; y \in \{0, 1\}. \end{aligned}$$
(3)

An example of applying EOdd would be to require that both white people and people of color have equal probability of being predicted to not recidivate, under both ground truth conditions separately. An advantage of EOdd over SP is that a perfect predictor, i.e. \(\hat{Y} = Y\), always satisfies EOdd.

2.1.3 Equality of opportunity

A relaxation of EOdd is the fairness definition Equality of Opportunity (EOpp) [14]. It only requires equality of the probabilities of correctly predicting the positive class across groups. In other words, where EOdd requires that both true positive and false positive classification rates are equal across groups, EOpp only requires the former. Formally:

$$\begin{aligned} {\text {EOpp} = p(\hat{Y}=1|Y=1,A=1) - p(\hat{Y}=1|Y=1,A=0)}. \end{aligned}$$
(4)

An example of applying EOpp would be to require only that white people and people of color have equal probability of being predicted to not recidivate, given that they did not actually end up recidivating. An advantage of EOpp is that it is not a bi-objective and is thus more easily optimized for than EOdd.
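The EOdd gaps of (3) and the EOpp gap of (4) can be computed analogously; below is a minimal sketch, assuming every combination of ground-truth value and group is non-empty.

```python
import numpy as np

def positive_rate(y_pred, y_true, a, y_value, group):
    """Estimate p(Y_hat = 1 | Y = y_value, A = group) from arrays."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))
    mask = (y_true == y_value) & (a == group)
    return y_pred[mask].mean()

def equalized_odds_gaps(y_pred, y_true, a):
    """The two EOdd gaps of (3), one per ground-truth value y."""
    return {y: positive_rate(y_pred, y_true, a, y, 1)
               - positive_rate(y_pred, y_true, a, y, 0) for y in (0, 1)}

def equality_of_opportunity_gap(y_pred, y_true, a):
    """The EOpp gap of (4): the y = 1 case only."""
    return (positive_rate(y_pred, y_true, a, 1, 1)
            - positive_rate(y_pred, y_true, a, 1, 0))
```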

2.1.4 Intersectional fairness

An extension of these previous definitions is to write them such that they can generalize to \(\mathcal {K}\)-order partitions. For SP this amounts to:

$$\begin{aligned} \text {SP} = \min \left( \frac{p(\hat{Y}=1|A=a)}{p(\hat{Y}=1|A=b)}\right) , \, \, a,b \in \{0, 1, 2, \dots , k-1\}, a \ne b. \end{aligned}$$
(5)

An example of applying this formulation of SP is a comparison between university acceptance rates for white men, white women, black men and black women. Note that this formulation also ensures that the SP value is in [0, 1], as the fraction is arranged such that the smallest ‘acceptance rate’ is in the numerator and the largest is in the denominator. A similar generalisation can be done for other group fairness metrics but we omit those here for the sake of brevity.
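A minimal sketch of this generalised SP for a k-valued sensitive attribute encoded as group labels; since the smallest pairwise ratio equals the smallest positive prediction rate divided by the largest, the computation reduces to two aggregates.

```python
import numpy as np

def intersectional_sp(y_pred, a):
    """Generalised SP of (5): the smallest pairwise ratio of positive prediction
    rates over the groups of a k-valued sensitive attribute a."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    rates = np.array([y_pred[a == group].mean() for group in np.unique(a)])
    return rates.min() / rates.max()

# Toy example with four intersectional groups (e.g., ethnicity x sex).
groups = ["wm", "wf", "bm", "bf", "wm", "wf", "bm", "bf"]
y_hat  = [1,    1,    0,    1,    1,    0,    1,    0]
print(intersectional_sp(y_hat, groups))  # 0.5: least-accepted vs. most-accepted group
```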

2.2 Interpretable models

This subsection outlines a class of models with inherently high interpretability, DTs, that are central to this work. The interpretability of a model is the degree to which the classifications and the decision-making mechanism can be interpreted. The field of XAI is concerned with building systems that can be interpreted and explained. Complex systems might need an explanation function that generates explanations for the outputs of the system. Some methods may inherently be highly interpretable, requiring no explanation method, such as DTs. Interpretability may be desired to ensure safety, gain insight, enable auditing or manage expectations.

2.2.1 Decision trees (DTs)

A DT is a type of rule-based system that can be used for classification problems. The structure of the tree is learned from a labelled dataset. DTs consist of nodes, namely branching nodes and leaf nodes. The topmost branching node is the root node. To classify an instance, one starts at the root node and follows the rules that apply to the instance from branching node to branching node until no more rules can be applied. One then reaches a decision node, also called a leaf node. Every node holds the instances that could reach that node; thus, the root node holds every instance. Decision nodes classify instances as the class that represents the most individuals within that node.

There are various effective ways to determine the structure of a DT, given a labelled dataset. The most common way is to use a heuristic function that indicates the splitting criterion in each branching node. These heuristic functions evaluate candidate splits so as to partition the data in the node such that each partition is as homogeneous as possible w.r.t. class. An example of such a heuristic is entropy, intuitively defined as the degree to which the class distribution in a partition is random. A greedy process then constructs the tree, picking the best split in each individual node. Popular methods include CART [15] and C4.5 [16].
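As an illustration of such a heuristic, the following sketch computes the entropy of a node and the information gain of a candidate split, the quantity a greedy learner such as C4.5 maximises at each branching node (our own toy implementation, not taken from [15, 16]).

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class distribution in a node (0 = pure node)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, left_mask):
    """Entropy reduction obtained by splitting a node into a left and right child."""
    y, left_mask = np.asarray(y), np.asarray(left_mask)
    left, right = y[left_mask], y[~left_mask]
    weighted_child_entropy = (len(left) * entropy(left)
                              + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted_child_entropy

# A greedy learner evaluates candidate splits and keeps the one with the largest gain.
y = [1, 1, 1, 0, 0, 0]
print(information_gain(y, np.array([True, True, True, False, False, False])))  # 1.0 (perfect split)
```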

Some scholars have taken the greedy construction approach further and introduced lookahead to find different local optima. There is currently no consensus on whether including lookahead results in significantly better-performing DTs compared to greedy construction algorithms [17, 18]. A lookahead method by Nunes et al. outperformed greedy approaches on datasets with more than 1000 instances, but their final lookahead step did not result in better performance [19].

Optimal DTs are a newer set of approaches that utilize methods from dynamic programming and mixed-integer optimisation [20, 21]. Their performance is generally better, as they approach the optimal tree more closely than greedily constructed DTs. However, their construction is computationally heavy.

DT construction algorithms based on evolutionary principles are another line of work that is computationally heavy. These methods seek to avoid sub-optimality by defining a fitness function (often based on the height of the DT and its accuracy) and mutating various aspects of the DT, such as the features, splitting criterion and regularisation methods. Popular methods in this area are HEAD-DT [22] and the method by Karabidja et al. [23].

Other methods exist, such as gradient-based optimisation and Bayesian approaches. However, interpretability has to be largely sacrificed for these approaches to work. Silva et al. aim to overcome this deficit for gradient-based approaches by converting the uninterpretable DT back to a univariate DT, selecting the features with the largest weights [24]. We also appreciate the work of Nuti & Rugama on Bayesian trees, which combines the Bayesian DT approach with a greedy construction algorithm to ensure interpretability [25].

Decision Tree Interpretability The interpretability of a DT is determined by several factors. The main factor is its height, the number of times the DT partitions the data. Very short Decision Trees are sometimes called Decision Stumps [26]. The minleaf DT hyperparameter also influences the interpretability of a DT. The minleaf value constrains the minimum number of instances that a leaf node must hold. The smaller the value, the more splits can be made before this constraint is reached. In this work, the minleaf value is expressed as a fraction of the total number of instances; we consider a value of 0.05 small and a value of 0.15 or higher large. Optimal DTs cannot be very tall due to their high computational cost. Greedy DTs can be terminated early in the construction process to maintain interpretability. Closely related to height is the number of decision nodes in the tree, which also influences interpretability: the more decision nodes a DT has, the more complex it is. Finally, DTs built with numeric features might become uninterpretable because they use the same numeric feature over and over, leading to unintuitive decision boundaries.
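In practice, these interpretability constraints correspond to standard hyperparameters. The sketch below uses scikit-learn's CART implementation as an assumed stand-in, where max_depth bounds the height and a fractional min_samples_leaf plays the role of the fractional minleaf value discussed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# A shallow, constrained CART tree: height at most 3, and every leaf must hold
# at least 5% of the training instances (a fractional minleaf of 0.05).
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=0.05, random_state=0)
tree.fit(X, y)

print(tree.get_n_leaves())  # few leaves -> few rules to inspect
print(export_text(tree))    # the full rule set remains short enough to read
```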

In general, DTs are interpretable because they offer visualisations and use rules, which are easy for humans to understand [27]. Major disadvantages of DTs include their inability to model linear relationships both accurately and interpretably at the same time, and their sensitivity to changes in the data. Still, their performance, especially that of ensembles of DTs, is state-of-the-art for prediction tasks on tabular data [28].

2.3 Privacy definitions

The final main pillar of responsible AI that this work discusses is privacy. Privacy, in general, is a term that can be used in multiple contexts. In its literal sense, privacy relates to one’s ability to make personal and intimate decisions without interference. In this work, however, privacy refers to the degree of control one has over others accessing personal information about themselves. This is also known as informational privacy [29]. The less personal data others access about an individual, the more privacy the individual has. This subsection discusses several techniques to increase data privacy.

2.3.1 Differential privacy (DP)

Differential Privacy (DP) [30] is a notion that gives mathematical guarantees on the membership of individuals in a dataset. In principle, it is a promise to any individual in a dataset, namely: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any analysis of the data, no matter what other analyses, datasets, or information sources are available’ [31]. More specifically, an adversary cannot infer if an individual is in the dataset. DP can be applied when sharing data, or an analysis of the data. ML models are ways of analysing data and therefore can also promise to adhere to DP. Another guarantee that DP makes is that it is immune to post-processing, i.e., DP cannot be undone [31].

Definition The promise of DP can be mathematically guaranteed up to a parameter \(\varepsilon \) that quantifies the maximum amount of information that can be disclosed from the dataset. A lower \(\varepsilon \) guarantees more privacy. This parameter, \(\varepsilon \), is called the privacy budget. The privacy budget cannot be negative; a small privacy budget is 0.1 or less and a large budget is 1 or more. The main means of guaranteeing the promise of DP is by perturbing the data, i.e., adding noise to the data, via a randomized mechanism, \(\mathcal {A}\). By ‘mechanism’, we mean any analysis (e.g., aggregate statistics) that can be performed on data. DP deals with randomized mechanisms, which are functions whose output changes stochastically for a given dataset. Because DP is based on membership inference, the formal definition compares two neighboring datasets, \(D\) and \(D'\), where \(D\) contains one more instance than \(D'\). For these datasets, \((\varepsilon , \delta )\)-DP formally requires:

$$\begin{aligned} p(\mathcal {A}(D) \in S) \le \exp (\varepsilon ) \cdot p(\mathcal {A}(D') \in S) + \delta , \quad \forall S \subseteq range(\mathcal {A}), \end{aligned}$$
(6)

where \(\mathcal {A}\) is a randomized mechanism with the domain of all possible \(D\), \(S\) is any subset of \(range(\mathcal {A})\), the set of all outcomes the mechanism can produce, and \(\delta \in [0, 1]\) is a parameter that allows for a controlled probability that \(\mathcal {A}\) does not satisfy \(\varepsilon \)-DP. Thus, if \(\delta = 0\), \(\varepsilon \)-DP is satisfied. We note that \(\delta \) is not used as a parameter in all mechanisms. Intuitively, (6) states that whether an individual is in the dataset should affect the ratio of the randomized outcome probabilities by at most \(\exp (\varepsilon )\), up to the slack \(\delta \).

Global Sensitivity What type of noise \(\mathcal {A}\) adds depends on the query, \(q(D)\). The query is often a data analysis question, e.g., how many women are in \(D\)? In this paper we will sometimes abuse notation and write \(\mathcal {A}(D, q(D), \varepsilon )\) as \(\mathcal {A}(q(D))\) when the other parameters are apparent from the context. How much noise is added depends on the difference that the inclusion of one worst-case individual in the dataset makes to the query answer. This is known as the sensitivity, \(\Delta q\): how sensitive a query answer is to a change in the data [30]. Formally:

$$\begin{aligned} \Delta q = \max _{D, D'} \; ||q(D) - q(D') ||_1 , \end{aligned}$$
(7)

where \(D\) and \(D'\) are defined as in (6). This definition of sensitivity is known as the \(\ell _1\)-sensitivity or the global sensitivity. The following paragraphs describe common DP mechanisms, including examples.

Randomized Response For answers with a binary response, Randomized Response may be used [32]. This procedure is \(\ln (3)\)-differentially private [31]. The procedure is as follows:

  1. Flip a coin.
  2. If it is heads, respond truthfully.
  3. Else, flip another coin.
  4. If it is heads, respond 0, else 1.

The responses 0 and 1 are placeholders for actual answers and should be mapped to the query appropriately. The procedure originates in the social sciences, where respondents might not be inclined to answer truthfully with regard to criminal activities. This procedure ensures that the respondents cannot be charged based on their answers.
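A minimal sketch of this coin-flip procedure for a binary answer:

```python
import random

def randomized_response(true_answer: int) -> int:
    """Randomized response for a binary answer (0 or 1): with probability 1/2
    the respondent answers truthfully, otherwise a second coin decides."""
    if random.random() < 0.5:                   # first coin: heads -> truthful answer
        return true_answer
    return 0 if random.random() < 0.5 else 1    # second coin: heads -> 0, else 1

# A respondent whose true answer is 1 still reports 0 about a quarter of the time.
print(sum(randomized_response(1) for _ in range(10000)) / 10000)  # roughly 0.75
```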

Laplace Mechanism Several techniques exist to randomize query answers, the most common one being the Laplacian mechanism [30], which is used for queries requesting real numbers. An example of such a query might be: ‘What is the average age of females in the dataset?’. The mechanism adds noise to the query answer, sampled from the Laplace distribution centered at 0 and with a scale equal to \(\frac{\Delta q}{\varepsilon }\). The Laplace mechanism can be formalised as:

$$\begin{aligned} \mathcal {A}(D, q(D), \varepsilon ) = q(D) + Lap(\frac{\Delta q}{\varepsilon }), \end{aligned}$$
(8)

where \(Lap(\frac{\Delta q}{\varepsilon })\) is the added Laplacian noise. The Laplacian mechanism is particularly useful for histogram queries, in which the counts partition the population in the database into disjoint groups. An example of such a histogram query might be: ‘How many women are born each year?’.
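A minimal sketch of (8) applied to such a histogram query, where adding or removing one individual changes exactly one count by 1, so \(\Delta q = 1\); the counts are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Perturb a (vector-valued) query answer with Laplace noise of scale
    sensitivity / epsilon, as in (8)."""
    scale = sensitivity / epsilon
    true_answer = np.asarray(true_answer, dtype=float)
    return true_answer + rng.laplace(loc=0.0, scale=scale, size=true_answer.shape)

# Histogram query 'how many women were born in each of these three years?'.
true_histogram = [120, 98, 134]
print(laplace_mechanism(true_histogram, sensitivity=1, epsilon=0.5))
```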

Exponential Mechanism A different noise scheme is the Exponential mechanism [33], used for categorical, utility-related queries. An example of such a query might be: ‘What is the most convenient date to schedule this event?’ For these sorts of queries, a small amount of noise may completely destroy the utility of the query answer. A utility function, \(u_D(r)\), is defined over the categories, \(r \in \mathcal {R}\), for a certain dataset D. The sensitivity of the exponential mechanism is defined with respect to the utility function, \(\Delta u\), not with respect to changes in \(r\). The exponential mechanism can be formally defined as:

$$\begin{aligned} p(\mathcal {A}(D, u, \mathcal {R}, \varepsilon ) = r) \propto \exp (\frac{\varepsilon u_D(r)}{2 \Delta u}). \end{aligned}$$
(9)

In other words, the probability of a category \(r\) being chosen is proportional to \(e^{\frac{\varepsilon u_D(r)}{2 \Delta u}}\), so the highest-utility categories are the most likely to be returned.
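A minimal sketch of the mechanism for the scheduling example, with hypothetical vote counts as the utility; shifting the utilities by their maximum does not change the probabilities but avoids numerical overflow.

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(categories, utilities, sensitivity, epsilon):
    """Sample a category with probability proportional to exp(eps*u / (2*Delta_u)), as in (9)."""
    utilities = np.asarray(utilities, dtype=float)
    scores = np.exp(epsilon * (utilities - utilities.max()) / (2 * sensitivity))
    probabilities = scores / scores.sum()
    return rng.choice(categories, p=probabilities)

# Toy query: the most convenient date, with (hypothetical) vote counts as utility.
# One person changes a vote count by at most 1, so the utility sensitivity is 1.
dates = ["Mon", "Tue", "Wed"]
votes = [10, 42, 17]
print(exponential_mechanism(dates, votes, sensitivity=1, epsilon=0.5))
```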

Gaussian Mechanism The Gaussian mechanism adds noise drawn from the Gaussian distribution \(\mathcal {N}(0, \sigma ^2)\). The mechanism is similar to the Laplacian mechanism in this sense. DP holds if \(\sigma \ge \sqrt{2 \ln (\frac{1.25}{\delta })}\frac{\Delta _2}{\varepsilon }\) [31]. The term \(\Delta _2\) is the global \(\ell _2\)-sensitivity; instead of using the \(\ell _1\)-norm in (7), \(\Delta _2\) uses the \(\ell _2\)-norm. The Gaussian mechanism can be deemed a more ‘natural’ type of noise, as it adds noise that is often assumed to be present in measurements. A disadvantage is that both \(\delta \) and \(\varepsilon \) must be in (0, 1), so pure \(\varepsilon \)-DP can never be met. A query that the Gaussian mechanism might be used for is: ‘What is the average transaction amount for second-hand cars?’. This value is likely to be normally distributed, and therefore fits the Gaussian mechanism.
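A minimal sketch of the Gaussian mechanism using the \(\sigma \) bound stated above; the query value and the \(\ell _2\)-sensitivity are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
    """Add Gaussian noise with the standard deviation bound stated above."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return float(true_answer) + rng.normal(loc=0.0, scale=sigma)

# Toy query: an average transaction amount, with an assumed L2-sensitivity of 1.
print(gaussian_mechanism(8250.0, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```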

3 Related work

This section discusses work related to the research objectives. Whereas the previous section discussed background related to only one pillar of Responsible AI, this section will highlight methods at the intersection of these fields. It concludes by relating the proposed method, PAFER, to the current landscape of methods.

3.1 Fair decision trees and rule sets

The earliest work on fair decision rules was done by Pedreschi et al. [34], who propose formal procedures to identify (potentially) discriminatory rules under both direct and indirect discrimination settings. They highlight the fact that task-related features can be correlated with sensitive features, leading to indirect discrimination [34].

Further, the earliest work regarding fair DTs was performed by Kamiran & Calders and is now known as Discrimination Aware Decision Trees (DADT). They proposed a Heuristic-Based DT that incorporates the homogeneity of the sensitive attribute into the splitting criterion [35]. DADT also performs post-processing such that certain decision nodes change their decision. This step is phrased as a KNAPSACK problem [36] and is also solved greedily.

In terms of optimal DTs, Linden et al. achieve excellent results with a method named DPFair [37]. Their work significantly improves the speed of the work of Jo et al., who formulate the optimal DT problem with an additional fairness objective [38], which is itself an extension of the work of Aghaei et al. [39].

Agarwal et al. [40] introduce an approach based on a zero-sum game between a hypothesis selector that finds the best-performing model and a fairness regulator that points out EOdd violations to it based on gradient descent. The equilibrium reached in the game is the best trade-off between EOdd and accuracy. The authors argue that the proposed method can be applied not only to DTs but also to other ML methods, such as Neural Networks.

A line of research exemplified by Grari et al. [41] aims to provide fairness for tree ensembles. Grari et al. propose an in-processing approach for gradient boosted trees, where the gradient of an adversarial neural network trying to predict the sensitive attribute is also considered during tree ensemble construction. Note that tree ensembles are not intrinsically interpretable and thus further works in this direction are beyond the scope of our paper.

3.2 Privacy-aware decision trees

There are three main works on the construction of DTs with DP guarantees; the rest of the field is more concerned with creating decision forests that have better performance. This holds in general, not only in a privacy-constrained setting. This subsection discusses the three works in chronological order. The setting that this body of work assumes is that a DT developer has limited access to the data via a curator to whom they can send queries. The answers to these queries are perturbed via a DP mechanism.

Blum et al. first introduced DTs with DP [42]. It was more of a proof-of-concept; the authors rewrote the information gain splitting criterion to make it differentially private. Querying the necessary quantities for each node and adding Laplacian noise to the answers ensures DP. For the leaf nodes, the class counts are queried, as is the case for all other approaches mentioned. The method, however, requires a large privacy budget to function effectively; with smaller budgets, the query answers become too noisy. Moreover, it cannot handle continuous features, yet it allows the height of the trees to be equal to the total number of features.

The improvement on this method came from offloading the bulk of the computation to the data curator [43]. The method proposed by Friedman & Schuster [43] simply queries for the quantities in each node and the best attribute to split on. The latter is used to construct the tree and the former to cleverly determine the termination of the tree construction. The improvement also stems from the fact that the method in [42] used overlapping queries that consumed the privacy budget inefficiently. This problem is not present in [43], where the queries for the nodes at each height are non-overlapping. Friedman & Schuster used the exponential mechanism, which relies on the sensitivity of the utility function, in this case, the splitting criterion. It is experimentally verified that the accuracy is highest when the criterion is the error rate. This method can handle continuous variables in theory, but including them in the training set severely degrades the predictive performance. Moreover, the maximum height of the DT is limited to five. The method still improved performance significantly, however, due to the smarter queries and noise addition.

DTs with privacy guarantees are best represented by the work of Mohammed et al. [44]. The method, named Private Decision tree Algorithm (PDA), uses the Exponential mechanism and queries the required quantities for greedily building up the DT [44]. This approach comes at the cost of a termination criterion that is less flexible than the one in [43]. Through experimental evaluation, a very robust termination criterion is determined, namely stopping at a height of four. Using this termination procedure, the method is experimentally shown to outperform the previous method. However, this method excludes the possibility of using continuous features, which is not a large downside as this is also discouraged for the approach in [43] that this method builds upon. For a deeper overview of DTs with privacy guarantees, the reader is referred to [45].

Table 1 Overview of methods that are similar to PAFER

3.3 Fair privacy-aware models

There is an emerging field within responsible AI that aims to improve fairness without accessing sensitive data. Prominent examples include Adversarially Reweighted Learning (ARL) [46] and Fair Related Features (FairRF) [47]. While we highly value this line of work, it does not allow for the evaluation or estimation of fairness, as the field assumes sensitive attributes are entirely unavailable. Therefore, we consider these methods to be insufficient for our purpose, as we aim to provide guarantees on the degree of fairness a model exhibits, e.g., adherence to the 80%-rule.

The method most closely related to ours is named AttributeConceal and was introduced by Hamman et al., who explore the idea of querying group fairness metrics [48]. The scenario they assume is that ML developers have some dataset without sensitive attributes for which they build models, and therefore query SP and EOdd from a data curator. They establish that if the developers have bad intentions, they can identify a sensitive attribute of an individual using one unrealistic query, or two realistic ones. The main idea is that the models for which they query fairness metrics differ only on one individual, giving away that individual's sensitive attribute via the answer. This result is then extended to any number of individuals. When the sizes of the groups differ greatly, i.e., \(|D_{A=0} |\ll |D_{A=1} |\), using compressed sensing [49], the number of queries is in \(O(|D_{A=0} |\log (\frac{N}{|D_{A=1} |}))\), with \(N = |D_{A=1} |+ |D_{A=0} |\) the total number of instances. The authors propose a mitigation strategy named AttributeConceal, using smooth sensitivity, a sensitivity notion that is based on the worst-case individual in the actual dataset, as opposed to the theoretical worst case of global sensitivity. DP is ensured for any number of queries by adding noise to each query answer. It is experimentally verified that, using AttributeConceal, an adversary can predict sensitive attributes merely as well as a random estimator.

As a post-processing method, Jagielski et al. [50] combine two fairness-enhancing approaches [14, 40]. They also consider the setting where only the protected attribute is required to remain private. They adapt both fairness-enhancing algorithms, optimizing for EOdd, to also adhere to DP. The hypothesis selector introduced in [40] is considered to adhere to DP if sensitive attributes are absent from its input. The fairness regulator, also inspired by [40], is made differentially private by adding Laplacian noise to the gradients of the gradient descent solver. The results of this approach are only satisfactory for large privacy budgets.

3.4 PAFER & related work

Table 1 shows methods from the domain of responsible AI that have goals similar to PAFER's. In general, we see a lack of fair, privacy-preserving methods for rule-based models, specifically DTs. Hamman et al. investigate the fairness of models in general without giving in on privacy [48], but the method lacks the validity and granularity needed for auditing. In their setting, the developers do not gain insight into what should be changed about their model to improve fairness. One class of models that lends itself well to this would be DTs, as these are modular and can be pruned, i.e., rules can be shortened or removed. DTs are the state-of-the-art for tabular data [28], and sensitive tasks are often prediction tasks on tabular data. A method that can identify unfairness in a privacy-aware manner for DTs would be interpretable, fair and differentially private, respecting some of the most important pillars of responsible AI. PAFER aims to fill this gap, querying the individual rules in a DT and hence enabling the detection of the causes of unfair treatment. The next section introduces the method.

4 Proposed method

In this section, we introduce PAFER, a novel method to estimate the fairness of DTs in a privacy-constrained manner. The following subsections dissect the proposed method, starting with Section 4.1, on the assumptions and specific scenarios for which the method is built. Subsequently, Section 4.2 provides a detailed description of the procedure, outlining the pseudocode and some theoretical properties.

4.1 Scenario

PAFER requires a specific scenario for its use. This subsection describes that scenario and discusses how common the scenario actually is.

Firstly, PAFER is made for an auditing setting, in the sense that it is assumed to be used at the end of a development cycle. PAFER does not mitigate bias; it merely estimates the fairness of the rules in a DT. Secondly, we assume that a developer has constructed a DT that makes binary decisions on a critical task (e.g., about people). Critical tasks are often binary classification problems; prominent examples include university acceptance decision making [2], recidivism prediction [11] and loan application evaluations [1]. It is also common to use a rule-based method for such a problem [52], as rules can explain the process to individuals affected by the decision. We further assume the developer may have had access to a dataset containing individuals and some task-specific features, but that this dataset does not contain a full specification of sensitive attributes at the instance level. The developer (or the algorithm auditor) wants to assess the fairness of their model using SP, which is a widely-used fairness metric. We lastly assume that a legal, trusted third party exists that knows these sensitive attributes at the instance level. Although it is common for such a party to exist, setting up this exchange is difficult in an age where data becomes more and more valuable [53]. However, since fair and interpretable sensitive-attribute-agnostic classifiers are currently lacking, this assumption is necessary. Based on these assumptions, the fairness of the DT can be assessed, using the third party and PAFER.

4.2 Privacy-aware fairness estimation of rules: PAFER

We propose Privacy-Aware Fairness Estimation of Rules (PAFER), a method based on DP [31] that enables the calculation of SP for DTs while guaranteeing privacy. PAFER sends specifically designed queries to a third party to estimate SP: one query for each decision-making rule and one query for the overall composition of the sensitive attributes. The size of each (un)privileged group, along with the total number of accepted individuals from each (un)privileged group, allows us to calculate the SP. PAFER uses the general SP definition, as found in Section 2.1.4, to allow for intersectional fairness analyses. Let \(\mathcal {X}\) be the data used to train a DT, with \(x_{i}^{j}\) the jth feature of the ith individual. Let a rule be of the form \(x^1 < 5 \, \wedge \, x^2 = True\). The query then asks for the distribution of the sensitive attributes over all individuals that satisfy \(x^1 < 5\) and \(x^2 = True\). In PAFER, each query is a histogram query, as a person cannot be both privileged and unprivileged. The query to determine the general sensitive attribute composition of all individuals can be seen as a query for an ‘empty’ rule, i.e., a rule that applies to everyone. It can also be seen as querying the root node of a DT.
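As a small illustration of such a rule query, the sketch below forms the histogram for the example rule on the curator's side; the column names (x1, x2, A) and the use of pandas are our own assumptions, not part of PAFER itself.

```python
import pandas as pd

def rule_histogram(curator_data: pd.DataFrame) -> pd.Series:
    """True answer to the query for the example rule x1 < 5 AND x2 = True:
    the sensitive-attribute histogram over all individuals covered by the rule.
    The curator perturbs this histogram with a DP mechanism before returning it."""
    covered = curator_data[(curator_data["x1"] < 5) & curator_data["x2"]]
    return covered["A"].value_counts()

# Hypothetical curator-side data that includes the sensitive attribute A.
df = pd.DataFrame({"x1": [3, 7, 2, 4], "x2": [True, True, False, True],
                   "A": ["priv", "unpriv", "priv", "unpriv"]})
print(rule_histogram(df))  # counts per sensitive group among the covered individuals
```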

4.2.1 PAFER and the privacy budget

A property of DTs is that exactly one rule applies to each instance. Therefore, PAFER can query each decision-making rule without having to split the privacy budget between these queries. Although SP is a global statistic, we query each decision-making rule individually. This is possible because some of the noise cancels out on aggregate and because, for DTs, the rules partition the instances, so the privacy budget can be shared across all decision-making rules. This intuition was also noted in [45].

Because PAFER queries every individual at least once, half of the privacy budget is spent on the query to determine the general sensitive attribute composition of all individuals, and the other half is spent on the remaining queries. Still, reducing the number of queries reduces the total amount of noise. PAFER therefore prunes non-distinguishing rules. A redundant rule can be formed when the splitting criterion of the DT improves but the split does not create a node with a different majority class.
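To make the rule-level queries and the budget split concrete, the following simplified sketch (our own illustration under the Laplacian mechanism, not the authors' pseudocode in Algorithm 1) assembles the generalised SP of (5) from one noisy histogram per favourable rule plus one noisy 'empty-rule' histogram; the clipping is a crude stand-in for the invalid-answer policies of Section 4.2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_histogram(counts, epsilon):
    """Curator-side histogram answer under the Laplacian mechanism (sensitivity 1)."""
    return np.asarray(counts, dtype=float) + rng.laplace(0.0, 1.0 / epsilon, len(counts))

def estimate_sp(rule_histograms, overall_histogram, epsilon):
    """rule_histograms: true sensitive-attribute counts per favourable rule.
    overall_histogram: counts for the 'empty' rule (the whole dataset).
    Half of the budget goes to the empty-rule query; because the rules are
    disjoint, the other half can be reused for every rule query."""
    group_sizes = np.clip(noisy_histogram(overall_histogram, epsilon / 2), 1, None)
    accepted = sum(noisy_histogram(h, epsilon / 2) for h in rule_histograms)
    accepted = np.clip(accepted, 0, group_sizes)  # stand-in for Section 4.2.3 policies
    rates = accepted / group_sizes                # positive prediction rate per group
    return rates.min() / rates.max()              # generalised SP of (5)

# Hypothetical DT with two favourable rules and a binary sensitive attribute.
print(estimate_sp(rule_histograms=[[40, 25], [30, 10]],
                  overall_histogram=[200, 180], epsilon=0.2))
```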

4.2.2 DP mechanisms for PAFER

Three commonly used DP mechanisms are apt for PAFER, namely the Laplacian mechanism, the Exponential mechanism and the Gaussian mechanism. The Laplacian mechanism is used to perform a histogram query and thus has a sensitivity of 1 [31]. The Exponential mechanism uses a utility function \(u_D(r) = q(D) - |q(D) - r|\), where r ranges from zero to the number of individuals that the rule applies to, and q(D) is the true query answer. Here, too, the sensitivity is 1, as the utility depends on the database only through the count q(D), which can differ by at most 1 between neighboring datasets [31]. The Gaussian mechanism is also used to perform a histogram query and has a sensitivity of 2, as it uses the \(\Delta _2\)-sensitivity.

4.2.3 Invalid answer policies

The Laplacian and Gaussian mechanisms add noise in such a way that invalid query answers may occur. A query answer is invalid if it is negative, or if it exceeds the total number of instances in the dataset. A policy for handling these invalid query answers must be chosen. In practice, these policies are mappings from invalid values to valid values. These mappings can be applied to a scalar or simultaneously to multiple values, e.g., a vector or histogram. We provide several options in this subsection.

Table 2 The proposed policy options for each type of invalid query answer

Table 2 shows the available options for handling invalid query answers. A policy consists of a mapping chosen from the first column and a mapping chosen from the second column of this table. The first column shows policies for negative query answers and the second column shows policies for query answers that exceed the number of individuals in the dataset. The ‘uniform’ policy replaces an invalid answer with the answer that would be obtained if the rule applied to the same number of individuals from each (un)privileged group. The ‘total - valid’ policy requires that all other values in the histogram are valid. This allows for the calculation of the missing value by subtracting the sum of the valid query answers from the total number of individuals that the query applies to.
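A hedged sketch of one such policy combination, pairing the 'uniform' mapping for negative counts with the 'total - valid' mapping for overshoots; the function is our own illustration of the options in Table 2.

```python
import numpy as np

def repair_histogram(noisy_counts, n_covered):
    """Map invalid noisy counts back to valid values for a query covering
    n_covered individuals: negative counts get the 'uniform' answer, counts above
    n_covered get 'total - valid' when all other counts in the histogram are valid."""
    counts = np.asarray(noisy_counts, dtype=float)
    k = len(counts)
    for i in range(k):
        others = np.delete(counts, i)
        if counts[i] < 0:
            counts[i] = n_covered / k                                   # 'uniform'
        elif counts[i] > n_covered and np.all((others >= 0) & (others <= n_covered)):
            counts[i] = n_covered - others.sum()                        # 'total - valid'
    return counts

print(repair_histogram([-3.2, 41.7], n_covered=60))  # [30.0, 41.7]
```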

Algorithm 1 PAFER

4.2.4 PAFER pseudocode

Algorithm 1 shows the pseudocode for PAFER.

4.2.5 Theoretical properties of PAFER

We theoretically determine a lower and upper bound of the number of queries that PAFER requires for a k-ary DT in Theorem 1. The lower bound is equal to two, and the upper bound is \({k}^{h - 1} + 1\), dependent on the height of the DT, h. Note that PAFER removes redundant rules to reduce the number of rules. The larger the number of rules, the more noise is added on aggregate.

Theorem 1

The number of queries required by PAFER to estimate SP for a k-ary DT is lower bounded by 2 and upper bounded by \({k}^{h - 1} + 1\).

Proof

Let D and A denote the dataset and the sensitive attribute in question, respectively. Assume that we have constructed a k-ary DT for a binary classification task. Further, let the height (i.e., depth) of this DT be h. To estimate SP for the sensitive attribute A, we need the total size of each sensitive group \(a \in range(A)\), namely \(|D_{A=a}|\), as well as the number of individuals from each sensitive group that is classified favorably by the DT, namely \(|D_{A=a, \hat{Y}=1}|\). By definition, the first statistic requires 1 histogram query. The latter statistic requires a query for each favorable decision rule in the tree. A binary tree having a single favourable rule is schematically shown in Fig. 3 in Appendix A. Only 1 histogram query is required for this tree; thus, the lower bound for the number of required queries for PAFER is \(1 + 1 = 2\). A perfectly balanced tree is illustrated in Fig. 4 in Appendix A. In this case, the number of favourable decision rules in the tree is \({k}^{h-1}\). This is because each split that creates two leaf nodes adds both a favourable and an unfavourable classification rule to the DT, and we are interested only in the rules having a favourable outcome. In a perfectly balanced k-ary tree trained for a binary classification task, the number of favourable-outcome rules is equal to the number of nodes at depth \(h - 1\). This amounts to \({k}^{h-1}\) histogram queries. The upper bound for the number of required queries for PAFER is thus \({k}^{h-1} + 1\). \(\square \)

5 Evaluation

This section evaluates the proposed method in the previous section, PAFER. Firstly, Section 5.1 describes the experimental setup, detailing the used datasets and the two experiments. Secondly, Section 5.2 displays and discusses the results of the experiments.

5.1 Experimental setup

This section describes the experiments that answer the research questions. The first subsection describes these datasets and details their properties. The subsections thereafter describe the experiments in order, corresponding to the research question they aim to answer.

5.1.1 Datasets

The datasets form the test bed on which the experiments can be performed. We chose four publicly available datasets, namely, Adult [54], COMPAS [11], German [55] and Taiwan [56]. They are all well known in the domain of fairness for ML, and can be considered benchmark datasets. Importantly, they vary in size and all model a binary classification problem, enabling the calculation of various fairness metrics. The datasets are publicly available and pseudonymized; every privacy concern is thus merely for the sake of argument. Table 3 shows some other important characteristics of each dataset.

Table 3 Properties of the four chosen publicly available datasets

Pre-processing This paragraph describes each pre-processing step for every chosen dataset. Some pre-processing steps were taken for all datasets. In every dataset, the sensitive attributes were separated from the training set. Every sensitive attribute except age was binarized, distinguishing between privileged and unprivileged groups. The privileged individuals were ‘white men’ who lived in their original country of birth, and the unprivileged individuals were those who were not male, not white or lived abroad. We now detail the pre-processing steps that are dataset-specific.

Adult. The Adult dataset comes with a predetermined train and test set. The same pre-processing steps were performed on each. Rows that contained missing values were removed. The “fnlwgt” column, which stands for “final weight”, was removed as it is a relic from a previously trained model and unrelated features might cause overfitting. The final number of rows was 30162 for the train set and 15060 for the test set.

Taiwan. The Taiwan loan default dataset has no missing values and contains 30000 instances. The training and test sets have 20000 and 10000 instances, respectively.

COMPAS. The COMPAS article analyzes two datasets, one for general recidivism and one for violent recidivism [11]. Only the dataset for general recidivism was used. This is a dataset with a large number of features (53), but by following the feature selection steps from the article, this number is reduced to eleven, of which three are sensitive attributes. The other pre-processing step in the article is to remove cases in which the arrest date and COMPAS screening date are more than thirty days apart. The features that contain dates are then converted to just the year, rounded down. Missing values are imputed with the median value for that feature; this ensures that no out-of-the-ordinary values are added to the dataset. The final number of instances was 4115 for the train set and 2057 for the test set, totalling 6172.

German. The German dataset has no missing values. The gender attribute is encoded in the marital status attribute, which required separation. The final number of rows is 667 for the train set and 333 for the test set, totalling 1000 rows.

5.1.2 Experiment 1: comparison of DP mechanisms for PAFER

Experiment 1 was constructed such that it answers RQ1: which DP mechanism is optimal for which privacy budget? The best-performing shallow DT was constructed for each dataset, using the CART algorithm [15] with grid search and cross-validation, optimizing for balanced accuracy. The height of the DT, the number of leaf nodes and the number of selected features were varied. The parameter space can be described as {2, 3, 4} \(\times \) {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} \(\times \) {sqrt, all, \(\log _2\)}, constituting tuples of (height, # leaf nodes, # selected features). The out-of-sample SP of each DT is provided in Table 4. The experiment was repeated fifty times with this same DT, such that the random noise introduced by the DP mechanisms could be averaged out. Initially, we considered the Laplacian, Exponential and Gaussian mechanisms for the comparison. However, after exploratory testing, we deemed the Gaussian mechanism to perform too poorly to be included; Table 5 shows some of these preliminary results. The performance of each mechanism was measured using the Average Absolute Statistical Parity Error (AASPE), defined as follows:

$$\begin{aligned} \textrm{AASPE} = \frac{1}{\mathrm {\# \, runs}} \sum _{i=1}^{\mathrm {\# \, runs}} | SP_i - \widehat{SP_i} |, \end{aligned}$$
(10)

where # runs is the number of times the experiment was repeated, and \(SP_i\) and \(\widehat{SP_i}\) are the true and estimated SP of the ith run, respectively. The metric was calculated out of sample, i.e., on the test set. The differences in performance were compared using an independent t-test. The privacy budget was varied such that forty equally spaced values were tested with \(\varepsilon \in (0, \frac{1}{2}]\); initial results showed that \(\varepsilon > \frac{1}{2}\) offered only marginal improvements. Table 5 shows a summary of the preliminary results for Experiment 1. Experiment 1 was performed for ethnicity, sex, and the two combined. The former two sensitive features were encoded as a binary feature, distinguishing between a privileged (white, male) and an unprivileged (non-white, non-male) group. The combined sensitive feature was encoded as a quaternary feature, crossing the two binary attributes, with white males as the privileged group. Whenever a query answer is invalid, as described in Section 4.2.3, a policy must be chosen for the calculation of the SP metric. In Experiment 1, the uniform answer policy was chosen, i.e., an invalid group count was replaced with the value it would have if the query’s total count were divided evenly over the sensitive groups. The proportion of invalid query answers, i.e., \(\frac{\mathrm {\# \; invalid \; answers}}{\mathrm {\# \; total \; answers}}\), was also tracked during this experiment. This invalid value ratio provides some indication of how much noise is added to the query answers.
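AASPE is simply the mean absolute estimation error over the repeated runs; a minimal sketch:

```python
import numpy as np

def aaspe(true_sp, estimated_sp):
    """Average Absolute Statistical Parity Error of (10): the mean absolute
    difference between the true and the estimated SP over repeated runs."""
    true_sp, estimated_sp = np.asarray(true_sp), np.asarray(estimated_sp)
    return float(np.mean(np.abs(true_sp - estimated_sp)))

# Three hypothetical runs of the same DT with freshly sampled DP noise.
print(aaspe([0.12, 0.12, 0.12], [0.10, 0.17, 0.08]))  # approximately 0.037
```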

Table 4 The out-of-sample SP of each constructed DT in Experiment 1
Table 5 Preliminary results for Experiment 1 with larger privacy budgets

5.1.3 Experiment 2: comparison of different DTs for PAFER

Experiment 2 was constructed in such a way that it answers RQ2: what is the effect of DT hyperparameters on the performance of PAFER? The minleaf value was varied such that eighty equally spaced values were tested with \(\texttt {minleaf} \in (0, \frac{1}{5}]\). In the initial results, shown in Table 6, when minleaf \(> \frac{1}{5}\), the same split was repeatedly chosen for each dataset. Even with minleaf \(< \frac{1}{2}\), there is still a risk that one numerical feature is split over and over, which hinders interpretability. Therefore, each numerical feature was categorized by generating five equal-width bins. The privacy budget was defined such that \(\varepsilon \in \{\frac{1}{20}, \frac{2}{20}, \frac{3}{20}, \frac{4}{20}, \frac{5}{20}\}\). The performance was again measured in terms of AASPE, out of sample, i.e., on the test set. The performance for each minleaf value was averaged over fifty potentially different DTs. The same invalid query answer policy was chosen as in Experiment 1, replacing each invalid query answer with the uniformly distributed answer. The performance of PAFER was compared with a baseline that uniformly randomly guesses an SP value in the interval [0, 1). A one-sided t-test determined whether PAFER significantly outperformed the random baseline.

Table 6 Preliminary results for Experiment 2

Experiment 2.1: Interaction between \(\varepsilon \) and minleaf hyperparameters The SP metric is also popular due to its legal use in the United States, where it is used to determine compliance with the 80%-rule [13]. Thus, the UAR (Unweighted Average Recall) of PAFER was calculated for each minleaf value, to obtain an indication of whether PAFER was able to effectively measure this compliance. UAR is the average of class-wise recall scores and has a chance-level baseline score of \(\frac{1}{K}\), where K is the number of classes. It is popularly used as a performance measure in classification tasks involving class imbalance and in machine learning competitions [57, 58]. The discretisation was done by rounding each estimation down to one decimal digit, thus creating ‘classes’ that the UAR could be calculated for. To gain more intuition about the interaction between \(\varepsilon \) and minleaf value, the following metric was calculated for each combination:

$$\begin{aligned} \textrm{UAR} - \textrm{AASPE} = \frac{1}{|C|} \sum _{c \in C} \frac{\# \, \textrm{true} \, c}{\# \, c} - \frac{1}{\mathrm {\# \, runs}} \sum _{i=1}^{\mathrm {\# \, runs}} | SP_i - \widehat{SP_i} |, \end{aligned}$$
(11)

where \(C\) is the set of classes. Ideally, AASPE is minimized and UAR is maximized, thus maximizing the metric shown in (11). Apart from the metric, the experimental setup was identical to Experiment 2; the same DTs were used for this experiment, only the metrics differed.

5.2 Results

This section describes the results of the experiments and also provides an analysis of the results. Results are ordered to match the order of the experiments.

5.2.1 Results for experiment 1

Fig. 1 A comparison of the Laplacian and Exponential DP mechanisms for different privacy budgets \(\varepsilon \) on the Adult, COMPAS and German datasets. When indicated, from the critical \(\varepsilon \) value to \(\varepsilon = \frac{1}{2}\), the Laplacian mechanism performs significantly better (\(p < .05\)) than the Exponential mechanism. The uncertainty is pictured in a lighter colour around the average

Fig. 2 A comparison of the Laplacian and Exponential DP mechanisms for different privacy budgets \(\varepsilon \) on the Taiwan dataset. When indicated, from the critical \(\varepsilon \) value to \(\varepsilon = \frac{1}{2}\), the Laplacian mechanism performs significantly better (\(p < .05\)) than the Exponential mechanism. The uncertainty is pictured in a lighter colour around the average

Figures 1 and 2 answer RQ1: the Laplacian mechanism outperforms the Exponential mechanism on eight out of the ten analyses. The Laplacian mechanism is significantly better even at very low privacy budgets (\(\varepsilon < 0.1\)). The Gaussian mechanism also proved to be no match for the Laplacian mechanism, even at large privacy budgets. The error of the Laplacian mechanism generally decreases steadily as the privacy budget increases. This is expected behavior: as the privacy budget increases, the amount of noise decreases. The Laplacian mechanism performs best on the Adult, COMPAS and Taiwan datasets, because their invalid value ratio is small, especially for \(\varepsilon > \frac{1}{10}\).

The Exponential mechanism performs relatively stably across analyses; however, its performance is generally poor, with errors even reaching the maximum possible error for the German dataset. This is probably due to the design of the utility function, which does not differentiate enough between good and bad answers. Moreover, the Exponential mechanism consistently adds even more noise because it guarantees valid query answers. The Laplacian mechanism does not give these guarantees, and thus relies less on the chosen policy, as described in Section 4.2.3. The Laplacian mechanism performs only somewhat decently on the intersectional analysis for the Adult dataset. This is because it is an easy prediction task; the Laplacian mechanism starts at a similarly low error.

The plots in Figs. 1 and 2 show that the invalid value ratio consistently decreases as the privacy budget increases. This behavior is expected, given that the amount of noise decreases as the privacy budget increases. The invalid value ratio is largest in the intersectional analyses, because the sensitive attribute is then quaternary. The difference between the invalid value ratio progression for the Adult, Taiwan and COMPAS datasets is small, whereas the difference between COMPAS and German is large. Thus, smaller datasets only become problematic for PAFER somewhere between 1000 and 6000 rows. Experiment 2 sheds further light on this question.

In the two cases where the Exponential mechanism is competitive with the Laplacian mechanism, the invalid value ratio is also large. When the dataset is small, the noise is large relative to the counts (the sensitivity is fixed), so the chance of invalid query answers increases. Note that the error is measured out-of-sample, so for the German dataset the histogram queries are performed on a dataset of size 333. This effect is also visible in the next experiment.

5.2.2 Results for experiment 2

Table 7 Summary results for Experiment 2, comparing the AASPE of PAFER and the random baseline with minleaf set to 0.01, under all tested dataset and sensitive attribute combinations (settings)
Table 8 The AASPE scores along with their uncertainties for all tested settings

Tables 7 and 8 show p-values and the corresponding summary results for Experiment 2.1, respectively, with the minleaf value set to 0.01. Table 7 clearly shows that PAFER generally significantly outperforms the random baseline for this minleaf value. When the minleaf value is reduced further to 0.001 (not reported in the tables), PAFER does not outperform the random baseline in most settings on the COMPAS dataset. This is due to the ‘small’ leaf nodes, but also to the small dataset size (\(N = 6000\)). Together they reduce the queried quantities even further, resulting in worse performance for PAFER: the (un)privileged group sizes per rule are closer to zero, which increases the probability of invalid query answers. PAFER thus performs more poorly with a small privacy budget, but also on less interpretable DTs. When the minleaf value of a DT is small, it generally has more branches and its branches are longer, as the CART algorithm is stopped later. Both factors worsen the interpretability of a DT [59].

Table 8 reports the mean and standard deviation of the AASPE scores over 50 runs for each setting. It shows that the external factors that negatively impact the performance of PAFER are a small dataset size and a large number of (un)privileged groups. The results for the German dataset are therefore omitted, as PAFER does not outperform the random baseline there. PAFER’s worse performance on smaller datasets and on less interpretable DTs is a clear limitation of the method.

For the sake of succinctness, the results and corresponding plots for Experiment 2.1 are given in Appendix B. This experiment also replicates some of the results of Experiments 1 and 2. The middle plots in Figs. 5 through 11 show that PAFER with the Laplacian mechanism performs better for larger privacy budgets. These plots also show the previously mentioned trade-off between interpretability and the performance of PAFER: the method performs worse for smaller minleaf values. Lastly, the performance is generally lower for the COMPAS dataset, which holds fewer instances. To sum up the experiments conducted in response to RSQ2.1: in nearly all trials, there was a significant difference in error between PAFER and the random baseline.

Inspired by the model-agnostic baseline approach in [48], we compare PAFER’s performance to a holistic SP calculation that combines all favourable rules into a single query using the Laplacian mechanism within the DP framework. Note that since this query is at the model level, it can be formulated as a model-agnostic query without knowing or having access to the model internals; our implementation of the DT model’s SP query via the combined favourable rules is merely for computational efficiency. Table 9 reports the ratio of the AASPE score of this coarse, model-level approach to the AASPE of rule-based PAFER. A model-level approach is expected to outperform PAFER because it needs fewer queries; moreover, due to the properties of DP and the fact that the rules partition the instances, relatively higher noise is expected per rule-based query. However, the results show that in many settings our fine-grained PAFER method not only approaches but even outperforms the coarse approach. This is especially true for shorter DTs, i.e. those with a larger minleaf value. We note that none of these higher performances were statistically significant (\(p < .05\)), as measured by an independent-samples t-test.
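The following Python sketch contrasts the two querying strategies compared in Table 9. The group totals are assumed to be publicly known and unit sensitivity is assumed; both are simplifications of the actual setup.

import numpy as np

def _noisy(count, epsilon, rng):
    # Laplace mechanism with assumed unit sensitivity.
    return count + rng.laplace(0.0, 1.0 / epsilon)

def sp_model_level(priv_pos, unpriv_pos, priv_n, unpriv_n, epsilon, rng):
    # One combined query per group over all favourably classified instances.
    return abs(_noisy(priv_pos, epsilon, rng) / priv_n
               - _noisy(unpriv_pos, epsilon, rng) / unpriv_n)

def sp_rule_level(per_rule_counts, priv_n, unpriv_n, epsilon, rng):
    # One query per favourable rule; because the rules partition the data,
    # the full budget can be spent on each query (parallel composition).
    priv_pos = sum(_noisy(p, epsilon, rng) for p, _ in per_rule_counts)
    unpriv_pos = sum(_noisy(u, epsilon, rng) for _, u in per_rule_counts)
    return abs(priv_pos / priv_n - unpriv_pos / unpriv_n)

rng = np.random.default_rng(0)
print(sp_model_level(300, 120, 1000, 800, epsilon=0.2, rng=rng))
print(sp_rule_level([(180, 70), (90, 30), (30, 20)], 1000, 800, epsilon=0.2, rng=rng))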

Table 9 Comparison between performing joined queries and rule-level queries for varying minleaf values
Table 10 Predictive performance of PAFER in terms of UAR and balanced accuracy

To sum up, the response to RSQ2.2 also depends on the sensitive attribute in question and on the dataset. The model-level querying approach significantly outperformed PAFER on the COMPAS dataset for the ethnicity and intersectional sensitive attributes, with a minleaf value of 0.05. For minleaf values \(> 0.05\), neither method significantly outperformed the other. In this case the results motivate the use of PAFER, as it adds fine-grained, rule-level fairness estimation while maintaining similar performance.

5.3 80%-rule analysis

Since the problem of estimating fairness is a regression task, all results so far have been reported in terms of AASPE. To ease the performance analysis, we discretise the predictions and the actual SP scores into bins of width 0.1, as mentioned in Section 5.1.3. In Table 10, we report the classification performance of PAFER in terms of balanced accuracy and UAR. While this classification can easily be used to analyse whether the corresponding DT adheres to the 80%-rule, we note that it is not a binary but a multi-class classification task: the number of classes depends on the range of the ground-truth SP, with a maximum of 11 classes.
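A minimal sketch of this discretisation and of an 80%-rule check follows. The bin edges follow the 0.1 width mentioned above, while the disparate-impact formulation of the 80%-rule is one common reading and not necessarily the exact criterion used alongside Table 10.

import numpy as np
from sklearn.metrics import recall_score

def sp_bin(sp):
    # Discretise an SP value in [0, 1] into one of at most 11 bins of width 0.1.
    return int(np.clip(np.floor(sp / 0.1), 0, 10))

def uar_over_bins(sp_true, sp_est):
    # UAR / balanced accuracy of the binned SP predictions.
    return recall_score([sp_bin(s) for s in sp_true],
                        [sp_bin(s) for s in sp_est], average="macro")

def passes_80_rule(rate_privileged, rate_unprivileged):
    # Disparate-impact reading: the lower acceptance rate must be at least
    # 80% of the higher one.
    lo, hi = sorted([rate_privileged, rate_unprivileged])
    return hi == 0 or lo / hi >= 0.8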

5.4 Rule-level auditing analysis

As PAFER provides bias analysis at the rule level to spot unfair rules, we illustrate it in an auditing scenario. For a DT constructed on the Adult dataset, three positively classifying rules were identified, as shown in Table 11. PAFER correctly identified that the first rule was unfavorable for one of the groups, as it caused a difference in acceptance rate of 9.4%; the method detected this risk of unwanted bias with an absolute SP error of 0.0075. This example shows that, with a modest privacy budget of 0.2, PAFER can aid policy making and identify pitfalls in protocols.

Table 11 Positively classifying rules of a DT constructed with a minleaf value of 0.02, on the Adult dataset
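A sketch of how such a rule-level audit could be automated from noisy counts is given below; the flagging threshold, unit sensitivity, and the assumption of known group totals are illustrative choices of ours.

import numpy as np

def audit_rules(rule_counts, priv_n, unpriv_n, epsilon, threshold=0.05, rng=None):
    # rule_counts: per positively classifying rule, the true numbers of covered
    # privileged and unprivileged individuals. Rules whose noisy per-rule
    # acceptance-rate gap exceeds `threshold` are flagged for review.
    rng = np.random.default_rng() if rng is None else rng
    flagged = []
    for i, (c_priv, c_unpriv) in enumerate(rule_counts):
        gap = abs((c_priv + rng.laplace(0.0, 1.0 / epsilon)) / priv_n
                  - (c_unpriv + rng.laplace(0.0, 1.0 / epsilon)) / unpriv_n)
        if gap > threshold:
            flagged.append((i, round(gap, 4)))
    return flagged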

6 Conclusion & future work

This section concludes the work with a summary in Section 6.1, and provides suggestions for future work in Section 6.2.

6.1 Summary

This work has shed light on the trade-offs between fairness, privacy and interpretability in the context of DTs as intrinsically interpretable models, by introducing a novel, privacy-aware fairness estimation method called PAFER. There is a natural tension between the estimation of fairness and privacy, given that sensitive attributes are required to calculate fairness. This applies also to interpretable, rule-based methods. The proposed method, PAFER, alleviates some of this tension.

PAFER should be applied to a DT in a binary classification setting, at the end of a development cycle. PAFER guarantees privacy using mechanisms from DP, allowing it to measure SP for DTs.

We showed that the minimum number of queries required by PAFER is 2. We also showed that the maximum number of queries for a k-ary DT of height h is \({k}^{h-1} + 1\).
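As a worked illustration of this bound (our own example): for a binary DT (\(k = 2\)) of height \(h = 4\), at most

$$\begin{aligned} k^{h-1} + 1 = 2^{3} + 1 = 9 \end{aligned}$$

queries are needed.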

In our experimental comparison of several DP mechanisms, PAFER proved capable of accurately estimating SP for low privacy budgets when used with the Laplacian mechanism. This confirms that calculating SP for DTs while respecting privacy is possible using PAFER.

Experiment 2 showed that the smaller the leaf nodes of the DT are, the worse the performance is. PAFER thus performs better on more interpretable DTs, since the smaller the minleaf value, the less interpretable the DT.

Future work can look into other types of DP mechanisms to use with PAFER, and other types of fairness metrics, e.g. EOdd.

6.2 Limitations & future work

This section describes avenues that could be explored further regarding PAFER, with an eye on the limitations that became apparent from the experimental results. We suggest an extension of PAFER to two other fairness metrics in Section 6.2.1, suggest examining the input parameters of the PAFER algorithm in Section 6.2.2, and discuss the use of Bayesian methods on aggregate data in Section 6.2.3.

6.2.1 Other fairness metrics

The most obvious research avenue for PAFER is extending it to support other fairness metrics. SP is a popular but simple metric that is not appropriate in every scenario. We therefore propose two other group fairness metrics that are suitable for PAFER; given the abundance of fairness metrics, multiple other suitable metrics are bound to exist.

The EOdd metric compares the acceptance rates across (un)privileged groups conditioned on the dataset labels. In our scenario (Section 4.1), we assume the dataset labels are known, as they are required for the construction of a DT. Therefore, by querying the sensitive attribute distributions of the favorably classifying rules only for those individuals for which \(Y = y\), PAFER can calculate EOdd. Since these groups are mutually exclusive, \(\varepsilon \) does not have to be shared among the queries. Since EOpp is a variant of EOdd, it can naturally also be measured using this approach. A downside is that the number of queries doubles, which hinders performance; however, this overhead is only a constant factor.
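A minimal Python sketch of this label-conditioned querying for EOdd follows. The dictionary layout, the Laplace noise with unit sensitivity, and reporting the maximum gap over labels are illustrative choices of ours rather than a specification of the proposed extension.

import numpy as np

def eodd_estimate(fav_counts, group_totals, epsilon, rng=None):
    # fav_counts[y][g]: favorably classified individuals of group g with label y,
    # aggregated over all favorably classifying rules (one query per label).
    # group_totals[y][g]: all individuals of group g with label y.
    rng = np.random.default_rng() if rng is None else rng
    gaps = []
    for y in (0, 1):
        rate_priv = ((fav_counts[y]["priv"] + rng.laplace(0.0, 1.0 / epsilon))
                     / group_totals[y]["priv"])
        rate_unpriv = ((fav_counts[y]["unpriv"] + rng.laplace(0.0, 1.0 / epsilon))
                       / group_totals[y]["unpriv"])
        gaps.append(abs(rate_priv - rate_unpriv))
    return max(gaps)  # y = 1 gives the TPR gap, y = 0 the FPR gap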

6.2.2 Other input parameters

Examining the input parameters of the PAFER estimation algorithm in Algorithm 1, two clear candidates for further research emerge: the DP mechanism \(\mathcal {A}\) and the audited model, the DT. The following two paragraphs discuss these options.

The Differential Privacy mechanism The performance of other DP mechanisms can be experimentally compared to the currently examined mechanisms, using the experimental setup of Experiment 1. Experiment 2 shows that there is still room for improvement, as a random guessing baseline is competitive with the Laplacian mechanism in certain settings.

The work of Hamman et al. [48] shows promising results for a simple SP query. They use a DP mechanism based on smooth sensitivity [60], a form of sensitivity that adapts the noise to the data while still guaranteeing DP. If this mechanism could be adapted for histogram queries, PAFER might improve in accuracy. Currently, PAFER performs poorly on less interpretable DTs; an improvement in accuracy might also enable PAFER to audit such DTs.

The audited model PAFER, as the name suggests, is currently only suited to rule-based systems, and in particular DTs. Further research could look into the applicability of PAFER to other rule-based systems, such as fuzzy-logic rule systems [61], rule lists [62] and association rule mining [63]. The main point of attention is the distribution of the privacy budget: for DTs, exactly one rule applies to each individual, so PAFER can query all rules; for other rule-based methods, this might not be the case.

It has long been established that Neural Networks can be converted to DTs [64]. Applying PAFER to DTs extracted from Neural Networks could therefore be a future research direction. However, the Neural Network must have a small number of parameters, or the associated DT would be very deep. Since deep DTs work worse with PAFER, the applicability is limited.

6.2.3 Bayesian methods on aggregate data

All experiments in this paper were conducted with a simulated third party that has access to all sensitive data at the instance level. This is technically and legally feasible only for a small set of sensitive attributes, such as ‘sex’ and ‘country of birth’, which are registered in national census databases. However, critically sensitive data, such as ethnicity, should not be kept at the individual level; it can instead be kept at an aggregate level following representative national surveys. Thus, to use aggregate data (e.g., univariate and bivariate counts) effectively with PAFER, future research can investigate the applicability of Bayesian methods and a minimal set of conditional probabilities/statistics for auditing decision rules used in governance.
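To make this direction slightly more concrete, the sketch below estimates the number of privileged individuals covered by a rule from aggregate bivariate statistics only. The conditional-independence assumption implicit in the estimate and all names are our own; this is not part of PAFER.

def estimate_group_count_in_rule(rule_feature_counts, group_given_feature):
    # rule_feature_counts[x]: how many instances covered by the rule have the
    # (non-sensitive) feature value x; group_given_feature[x]: aggregate
    # estimate of P(privileged | X = x), e.g. from a national survey.
    # Assumes the sensitive attribute is conditionally independent of rule
    # membership given X, which is a strong, purely illustrative assumption.
    return sum(n * group_given_feature[x] for x, n in rule_feature_counts.items())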