1 Introduction

The methods from the scientific field of AI, and in particular Machine Learning (ML), are increasingly applied to tasks in socially sensitive domains. Due to their predictive power, ML models are used within banks for credit risk assessment [1], aid admissions decisions for new students within universities [2], and aid bail decision-making within courts [3]. Algorithmic decisions in these settings can have far-reaching impacts, potentially increasing disparities within society. Numerous notorious examples exist of algorithms causing harm in this regard. In 2015, Google Photos' new image recognition model classified some black individuals as gorillas [4], which led to the removal of the category within Google Photos. A report by Amnesty International concluded that the Dutch Tax & Customs Administration used a model for fraud prediction that discriminated against people with multiple nationalities [5].

ML should clearly be applied responsibly, which has given rise to a field that considers the fairness of algorithmic decisions. Fair ML is a field within AI concerned with assessing and developing fair ML models. Fairness in this sense closely relates to equality between groups and individuals. The main notion within the field is that models should not be biased, that is, have a tendency to over- or underperform for certain (groups of) individuals. This notion of bias is different from the canonical definition of bias in statistics, i.e. the difference between an estimator’s expected value and the true value. Essentially, similar individuals should be treated similarly, and decisions should not lead to unjust discrimination. Non-discrimination laws that apply to AI exist within the EU [6, 7].

An additional property that responsible ML models should have is that they are interpretable. Models whose decisions can be explained are preferred, as they aid decision-making processes affecting real people. In a loan application setting, users have the right to know how a decision came about [8]. The field of Explainable Artificial Intelligence (XAI) is concerned with building models that are interpretable and explainable. As legislation is often worked out as a set of rules, we expect Decision Trees (DTs) to form a significant portion of the critical algorithms used in governance.

Inherently, ML models use data. There is thus also a tension between the use of these models and privacy, especially for socially sensitive tasks. Individuals have several rights when it comes to data storage, such as the right to be removed from a database [6]. It is also beneficial for entities to guarantee privacy so that more individuals trust the entity with their data. Some data storage practices are discouraged, such as the collection of several protected attributes [6]. These attributes, and thus the storage practices around them, are sensitive. Examples include the religion, marital status, and gender of individuals. In industrial settings, numerous data leaks have occurred. Social media platforms are especially notorious for privacy violations, with Facebook even incurring data breaches on multiple occasions [9, 10]. The report by Amnesty International also concluded that, in the Dutch childcare benefits scandal, the Dutch Tax & Customs Administration failed to safely handle the sensitive private data of thousands of individuals, while also using a biased model [5]. This work investigates these three pillars of Responsible AI, proposing a novel method that lies at the intersection of these three themes.

To assess and improve fairness precisely, one needs the sensitive attributes of the individuals that an ML model was trained on. However, these are often absent or only available to a limited degree, due to privacy considerations. Exactly here lies the focal point of this work: the assessment of the fairness of ML models, while respecting the privacy of the individuals in the dataset. These antagonistic goals make for a novel, highly constrained, and hence difficult problem. A focus is placed on DTs, a class of interpretable models from XAI, since these types of models are likely to be used in critical tasks involving humans due to the GDPR (in particular Art. 22) [6] and its national implementations. There are thus four goals we try to optimize in this work: fairness, privacy, interpretability, and predictive performance.

1.1 Research questions

The main goal of this work is to develop a method that can estimate the fairness of an interpretable model with high accuracy while respecting privacy. A method, named Privacy-Aware Fairness Estimation of Rules (PAFER), is proposed that can estimate the fairness of a class of interpretable models, DTs, while respecting privacy. The method is thus at the intersection of these three responsible AI pillars. The research questions (RQs), along with their research subquestions (RSQs), are:

RQ1: What is the optimal privacy mechanism that preserves privacy and minimizes average Statistical Parity error?

RSQ1.1: Is there a statistically significant mean difference in Absolute Statistical Parity error between the Laplacian mechanism and the Exponential mechanism?

RQ2: Is there a statistically significant difference between the Statistical Parity errors of PAFER compared to other benchmarks for varying Decision Tree hyperparameter values?

RSQ2.1: At what fractional minleaf value is PAFER significantly better at estimating Statistical Parity than a random baseline?

RSQ2.2: At what fractional minleaf value is a model-level querying approach significantly better at estimating Statistical Parity than PAFER?

1.2 Outline

The remainder of the paper is organized as follows. Section 2 provides the theoretical background, followed by Section 3, which covers the related literature. Section 4 describes the novel method that is proposed in this work. Subsequently, Section 5 describes the performed experiments, their results, and a thorough analysis. Finally, Section 6 concludes with limitations and future directions.

2 Preliminaries

This section discusses work related to the research objectives and provides background to the performed research. Fairness theory is described in Section 2.1, Section 2.2 provides background on interpretable models, and Section 2.3 explains notions of privacy.

2.1 Fairness definitions

Fairness in an algorithmic setting relates to the way an algorithm handles different (groups of) individuals. Unjust discrimination is often the subject when examining the behavior of algorithms with respect to groups of individuals. For this work, only fairness definitions relating to supervised ML were studied, as this is the largest research area within algorithmic fairness.

In 2016, the number of papers related to fairness surged, partly due to new regulations such as the European GDPR [6] and partly due to a popular article by ProPublica which examined racial disparities in recidivism prediction software [11]. Because of the young age of the field and the sudden rise in activity, numerous definitions of fairness have been proposed since. Most of the definitions simultaneously hold multiple names; this section aims to include as many of these names as possible for each definition.

The performance-oriented nature of the ML research field accelerated the development of fairness metrics, quantifying the fairness for a particular model. The majority of the definitions can therefore also be seen, or rewritten, as a measuring stick for the fairness of a supervised ML model. This measurement may be on a scale, which is the case for most group fairness definitions, or binary, which is the case for some causal fairness definitions.

The fairness definitions, namely the mathematical measures of fairness, can be categorized into group fairness, individual fairness and causal fairness. Considering the space limitations and the relevance to our work, in this section, we will focus on group fairness and provide the definitions of the most prominent measures used in the literature. Group fairness is the most popular type of fairness definition as it relates most closely to unjust discrimination. Individuals are grouped based on a sensitive, or protected attribute, \(A\), which partitions the population. Some attributes are protected by law, for example, gender, ethnicity and political opinions. This partition is often binary, for instance when \(A\) denotes a privileged and unprivileged group. In this subsection, we assume a binary partition for ease of notation but all mentioned definitions can be applied to \(\mathcal {K}\)-order partitions. An example of such an intersectional fairness definition is provided in Section 2.1.4.

The setting for these definitions is often the binary classification setting where \(Y \in \{0, 1\}\), with \(Y\) as the outcome. This is partly due to ease of notation, but more importantly, the binary classification setting is common in impactful prediction tasks. Examples of impactful prediction tasks are granting or not granting a loan [1], accepting or not accepting students to a university [2] and predicting recidivism after a certain period [3]. In each setting, a clear favorable (1) and unfavorable (0) outcome can be identified. Thus, we assume the binary classification setting in the following definitions.

2.1.1 Statistical parity

Statistical Parity (SP) is a decision-based definition, which compares the different positive prediction rates for each group [12]. SP, also known as demographic parity, equal acceptance rate, total variation or the independence criterion, is by far the most popular fairness definition. The mathematical definition is:

$$\begin{aligned} {\text {SP}} = p(\hat{Y}=1|A=1) - p(\hat{Y}=1|A=0), \end{aligned}$$
(1)

where \(\hat{Y}\) is the decision of the classifier. An example of SP would be the comparison of the acceptance rates of males and females to a university.

Note that (1) is the SP-difference but the SP-ratio also exists. US law adopts this definition of SP as the 80%-rule [13]. The 80%-rule states that the ratio of the acceptance rates must not be smaller than 0.8, i.e. 80%. Formally:

$$\begin{aligned} {\text {80\%-rule}} \; = 0.8 \le \frac{p(\hat{Y}=1|A=1)}{p(\hat{Y}=1|A=0)} \le 1.25, \end{aligned}$$
(2)

where the fraction is the SP-ratio. SP is easy to compute and merely uses the model’s predictions. SP therefore does not require labelled data. These advantages make it one of the most used fairness definitions.
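To make the computation concrete, the following Python sketch (our own illustration with toy data) computes the SP-difference of (1) and checks the 80%-rule via the SP-ratio of (2).

```python
import numpy as np

def statistical_parity(y_pred, a):
    """SP-difference of (1) and SP-ratio of (2) for binary predictions y_pred
    and a binary sensitive attribute a (1 = privileged, 0 = unprivileged)."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    rate_priv = y_pred[a == 1].mean()    # p(Y_hat = 1 | A = 1)
    rate_unpriv = y_pred[a == 0].mean()  # p(Y_hat = 1 | A = 0)
    sp_diff = rate_priv - rate_unpriv
    sp_ratio = rate_priv / rate_unpriv   # the fraction used in the 80%-rule
    return sp_diff, sp_ratio

# Toy data: 6 privileged and 6 unprivileged individuals.
y_hat = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
a     = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
diff, ratio = statistical_parity(y_hat, a)
print(f"SP-difference: {diff:.2f}")
print(f"80%-rule satisfied: {0.8 <= ratio <= 1.25}")
```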

2.1.2 Equalized odds

Another, also very common, fairness definition is the Equalized Odds (EOdd) metric [14]. It is also known as disparate mistreatment or the separation criterion. EOdd requires that the probabilities of being correctly positively classified and the probabilities of being incorrectly positively classified are equal across groups. Thus, the definition is twofold; both false positive classification probability and true positive classification probability should be equal across groups. Formally:

$$\begin{aligned} {\text {EOdd}} = p(\hat{Y}=1|Y=y,A=1) - p(\hat{Y}=1|Y=y,A=0), \; \; y \in \{0, 1\}. \end{aligned}$$
(3)

An example of applying EOdd would be to require that both white people and people of color have equal probability of being predicted to not recidivate, under both ground truth conditions separately. An advantage of EOdd over SP is that a perfect predictor, i.e. \(\hat{Y} = Y\), always satisfies EOdd.

2.1.3 Equality of opportunity

A relaxation of EOdd is the fairness definition Equality of Opportunity (EOpp) [14]. It only requires equality of the probabilities of correctly predicting the positive class across groups. In other words, where EOdd requires that both true positive and false positive classification rates are equal across groups, EOpp only requires the former. Formally:

$$\begin{aligned} {\text {EOpp} = p(\hat{Y}=1|Y=1,A=1) - p(\hat{Y}=1|Y=1,A=0)}. \end{aligned}$$
(4)

An example of applying EOpp would be to require only that white people and people of color have equal probability of being predicted to not recidivate, given that they did not actually end up recidivating. An advantage of EOpp is that it is not a bi-objective and is thus more easily optimized for than EOdd.
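The EOdd gaps of (3) and the EOpp gap of (4) can be computed analogously; below is a minimal sketch, assuming every combination of ground-truth value and group is non-empty.

```python
import numpy as np

def positive_rate(y_pred, y_true, a, y_value, group):
    """Estimate p(Y_hat = 1 | Y = y_value, A = group) from arrays."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))
    mask = (y_true == y_value) & (a == group)
    return y_pred[mask].mean()

def equalized_odds_gaps(y_pred, y_true, a):
    """The two EOdd gaps of (3), one per ground-truth value y."""
    return {y: positive_rate(y_pred, y_true, a, y, 1)
               - positive_rate(y_pred, y_true, a, y, 0) for y in (0, 1)}

def equality_of_opportunity_gap(y_pred, y_true, a):
    """The EOpp gap of (4): the y = 1 case only."""
    return (positive_rate(y_pred, y_true, a, 1, 1)
            - positive_rate(y_pred, y_true, a, 1, 0))
```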

2.1.4 Intersectional fairness

An extension of these previous definitions is to write them such that they can generalize to \(\mathcal {K}\)-order partitions. For SP this amounts to:

$$\begin{aligned} \text {SP} = \min \left( \frac{p(\hat{Y}=1|A=a)}{p(\hat{Y}=1|A=b)}\right) , \, \, a,b \in \{0, 1, 2, \dots , k-1\}, a \ne b. \end{aligned}$$
(5)

An example of applying this formulation of SP is a comparison between university acceptance rates for white men, white women, black men and black women. Note that this formulation also ensures that the SP value is in [0, 1], as the fraction is arranged such that the smallest ‘acceptance rate’ is in the numerator and the largest is in the denominator. A similar generalisation can be done for other group fairness metrics but we omit those here for the sake of brevity.
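A minimal sketch of this generalised SP for a k-valued sensitive attribute encoded as group labels; since the smallest pairwise ratio equals the smallest positive prediction rate divided by the largest, the computation reduces to two aggregates.

```python
import numpy as np

def intersectional_sp(y_pred, a):
    """Generalised SP of (5): the smallest pairwise ratio of positive prediction
    rates over the groups of a k-valued sensitive attribute a."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    rates = np.array([y_pred[a == group].mean() for group in np.unique(a)])
    return rates.min() / rates.max()

# Toy example with four intersectional groups (e.g., ethnicity x sex).
groups = ["wm", "wf", "bm", "bf", "wm", "wf", "bm", "bf"]
y_hat  = [1,    1,    0,    1,    1,    0,    1,    0]
print(intersectional_sp(y_hat, groups))  # 0.5: least-accepted vs. most-accepted group
```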

2.2 Interpretable models

This subsection outlines a class of models with inherently high interpretability, DTs, that are central to this work. The interpretability of a model is the degree to which the classifications and the decision-making mechanism can be interpreted. The field of XAI is concerned with building systems that can be interpreted and explained. Complex systems might need an explanation function that generates explanations for the outputs of the system. Some methods may inherently be highly interpretable, requiring no explanation method, such as DTs. Interpretability may be desired to ensure safety, gain insight, enable auditing or manage expectations.

2.2.1 Decision trees (DTs)

A DT is a type of rule-based system that can be used for classification problems. The structure of the tree is learned from a labelled dataset. DTs consist of nodes, namely branching nodes and leaf nodes. The topmost branching node is the root node. To classify an instance, one starts at the root node and follows the rules that apply to the instance from branching node to branching node until no more rules can be applied. One then reaches a decision node, also called a leaf node. Every node holds the instances that could reach that node; thus, the root node holds every instance. Decision nodes classify instances as the class that represents the most individuals within that node.

There are various effective ways to determine the structure of a DT, given a labelled dataset. The most common way is to use a heuristic function that indicates the splitting criterion in each branching node. These heuristic functions evaluate candidate splits so as to partition the data in the node such that each partition is as homogeneous as possible w.r.t. class. An example of such a heuristic is entropy, intuitively defined as the degree to which the class distribution in a partition is random. A greedy process then constructs the tree, picking the best split in each individual node. Popular methods include CART [15] and C4.5 [16].
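As an illustration of such a heuristic, the following sketch computes the entropy of a node and the information gain of a candidate split, the quantity a greedy learner such as C4.5 maximises at each branching node (our own toy implementation, not taken from [15, 16]).

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class distribution in a node (0 = pure node)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, left_mask):
    """Entropy reduction obtained by splitting a node into a left and right child."""
    y, left_mask = np.asarray(y), np.asarray(left_mask)
    left, right = y[left_mask], y[~left_mask]
    weighted_child_entropy = (len(left) * entropy(left)
                              + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted_child_entropy

# A greedy learner evaluates candidate splits and keeps the one with the largest gain.
y = [1, 1, 1, 0, 0, 0]
print(information_gain(y, np.array([True, True, True, False, False, False])))  # 1.0 (perfect split)
```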

Some scholars have taken the greedy construction approach further and introduced lookahead to find different local optima. There is currently no consensus on whether including lookahead results in significantly better-performing DTs compared to greedy construction algorithms [17, 18]. A lookahead method by Nunes et al. outperformed greedy approaches on datasets with more than 1000 instances, but their final lookahead step did not result in better performance [19].

Optimal DTs are a newer set of approaches that utilize methods from dynamic programming and mixed-integer optimisation [20, 21]. Their performance is generally better, as they approach the optimal tree more closely than greedily constructed DTs. However, their construction is computationally heavy.

DT construction algorithms based on evolutionary principles are another line of work that is computationally heavy. These methods seek to avoid sub-optimality by defining a fitness function (often based on the height of the DT and its accuracy) and mutating various aspects of the DT, such as the features, splitting criterion and regularisation methods. Popular methods in this area are HEAD-DT [22] and the method by Karabidja et al. [23].

Other methods exist, such as gradient-based optimisation and Bayesian approaches. However, interpretability has to be largely sacrificed for these approaches to work. Silva et al. aim to overcome this deficit for gradient-based approaches by converting the uninterpretable DT back to a univariate DT, selecting the features with the largest weights [24]. We also appreciate the work of Nuti & Rugama on Bayesian trees, which combines the Bayesian DT approach with a greedy construction algorithm to ensure interpretability [25].

Decision Tree Interpretability The interpretability of a DT is determined by several factors. The main factor is its height, the number of times the DT partitions the data. Very short Decision Trees are sometimes called Decision Stumps [26]. The minleaf DT hyperparameter also influences the interpretability of a DT. The minleaf value constrains the minimum number of instances that a leaf node must hold. The smaller the value, the more splits can be made before this constraint is reached. In this work, the minleaf value is expressed as a fraction of the total number of instances; we consider a value of 0.05 small and a value of 0.15 or higher large. Optimal DTs cannot be very tall due to their high computational cost. Greedy DTs can be terminated early in the construction process to maintain interpretability. Closely related to height is the number of decision nodes in the tree, which also influences interpretability: the more decision nodes a DT has, the more complex it is. Finally, DTs built with numeric features might become uninterpretable because they use the same numeric feature over and over, leading to unintuitive decision boundaries.
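In practice, these interpretability constraints correspond to standard hyperparameters. The sketch below uses scikit-learn's CART implementation as an assumed stand-in, where max_depth bounds the height and a fractional min_samples_leaf plays the role of the fractional minleaf value discussed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# A shallow, constrained CART tree: height at most 3, and every leaf must hold
# at least 5% of the training instances (a fractional minleaf of 0.05).
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=0.05, random_state=0)
tree.fit(X, y)

print(tree.get_n_leaves())  # few leaves -> few rules to inspect
print(export_text(tree))    # the full rule set remains short enough to read
```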

In general, DTs are interpretable because they offer visualisations and use rules, which are easy for humans to understand [27]. Major disadvantages of DTs include their inability to model linear relationships both accurately and interpretably at the same time, and their sensitivity to changes in the data. Still, their performance, especially that of ensembles of DTs, is state-of-the-art for prediction tasks on tabular data [28].

2.3 Privacy definitions

The final main pillar of responsible AI that this work discusses is privacy. Privacy, in general, is a term that can be used in multiple contexts. In its literal sense, privacy relates to one’s ability to make personal and intimate decisions without interference. In this work, however, privacy refers to the degree of control one has over others accessing personal information about themselves. This is also known as informational privacy [29]. The less personal data others access about an individual, the more privacy the individual has. This subsection discusses several techniques to increase data privacy.

2.3.1 Differential privacy (DP)

Differential Privacy (DP) [30] is a notion that gives mathematical guarantees on the membership of individuals in a dataset. In principle, it is a promise to any individual in a dataset, namely: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any analysis of the data, no matter what other analyses, datasets, or information sources are available’ [31]. More specifically, an adversary cannot infer if an individual is in the dataset. DP can be applied when sharing data, or an analysis of the data. ML models are ways of analysing data and therefore can also promise to adhere to DP. Another guarantee that DP makes is that it is immune to post-processing, i.e., DP cannot be undone [31].

Definition The promise of DP can be mathematically guaranteed up to a parameter \(\varepsilon \) that quantifies the maximum amount of information that can be disclosed from the dataset. A lower \(\varepsilon \) guarantees more privacy. This parameter, \(\varepsilon \), is called the privacy budget. The privacy budget cannot be negative; a small privacy budget is 0.1 or less and a large budget is 1 or more. The main means of guaranteeing the promise of DP is by perturbing the data, i.e., adding noise to the data, via a randomized mechanism, \(\mathcal {A}\). By ‘mechanism’, we mean any analysis (e.g., aggregate statistics) that can be performed on data. DP deals with randomized mechanisms, which are functions whose output changes stochastically for a given dataset. Because DP is based on membership inference, the formal definition compares two neighboring datasets, \(D\) and \(D'\), where \(D\) contains one more instance than \(D'\). For these datasets, \((\varepsilon , \delta )\)-DP formally requires:

$$\begin{aligned} p(\mathcal {A}(D) \in S) \le \exp (\varepsilon ) \cdot p(\mathcal {A}(D') \in S) + \delta , \quad \forall S \subseteq range(\mathcal {A}), \end{aligned}$$
(6)

where \(\mathcal {A}\) is a randomized mechanism with the domain of all possible \(D\), \(S\) is any subset of \(range(\mathcal {A})\), the set of all outcomes the mechanism can produce, and \(\delta \in [0, 1]\) is a parameter that allows for a controlled probability that \(\mathcal {A}\) does not satisfy \(\varepsilon \)-DP. Thus, if \(\delta = 0\), \(\varepsilon \)-DP is satisfied. We note that \(\delta \) is not used as a parameter in all mechanisms. Intuitively, (6) states that whether an individual is in the dataset should affect the ratio of the randomized outcome probabilities by at most \(\exp (\varepsilon )\), up to the slack \(\delta \).

Global Sensitivity What type of noise \(\mathcal {A}\) adds depends on the query, \(q(D)\). The query is often a data analysis question, e.g., how many women are in \(D\)? In this paper we will sometimes abuse notation and write \(\mathcal {A}(D, q(D), \varepsilon )\) as \(\mathcal {A}(q(D))\) when the other parameters are apparent from the context. How much noise is added depends on the difference that the inclusion of one worst-case individual in the dataset makes to the query answer. This is known as the sensitivity, \(\Delta q\): how sensitive a query answer is to a change in the data [30]. Formally:

$$\begin{aligned} \Delta q = \max _{D, D'} \; ||q(D) - q(D') ||_1 , \end{aligned}$$
(7)

where \(D\) and \(D'\) are defined as in (6). This definition of sensitivity is known as the \(\ell _1\)-sensitivity or the global sensitivity. The following paragraphs describe common DP mechanisms, including examples.

Randomized Response For answers with a binary response, Randomized Response may be used [32]. This procedure is \(\ln (3)\)-differentially private [31]. The procedure is as follows:

  1. Flip a coin.
  2. If it is heads, respond truthfully.
  3. Else, flip another coin.
  4. If it is heads, respond 0, else 1.

The responses 0 and 1 are placeholders for actual answers and should be mapped to the query appropriately. The procedure originates in the social sciences, where respondents might not be inclined to answer truthfully with regard to criminal activities. This procedure ensures that the respondents cannot be charged based on their answers.
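A minimal sketch of this coin-flip procedure for a binary answer:

```python
import random

def randomized_response(true_answer: int) -> int:
    """Randomized response for a binary answer (0 or 1): with probability 1/2
    the respondent answers truthfully, otherwise a second coin decides."""
    if random.random() < 0.5:                   # first coin: heads -> truthful answer
        return true_answer
    return 0 if random.random() < 0.5 else 1    # second coin: heads -> 0, else 1

# A respondent whose true answer is 1 still reports 0 about a quarter of the time.
print(sum(randomized_response(1) for _ in range(10000)) / 10000)  # roughly 0.75
```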

Laplace Mechanism Several techniques exist to randomize query answers, the most common one being the Laplacian mechanism [30], which is used for queries requesting real numbers. An example of such a query might be: ‘What is the average age of females in the dataset?’. The mechanism adds noise to the query answer, sampled from the Laplace distribution centered at 0 and with a scale equal to \(\frac{\Delta q}{\varepsilon }\). The Laplace mechanism can be formalised as:

$$\begin{aligned} \mathcal {A}(D, q(D), \varepsilon ) = q(D) + Lap(\frac{\Delta q}{\varepsilon }), \end{aligned}$$
(8)

where \(Lap(\frac{\Delta q}{\varepsilon })\) is the added Laplacian noise. The Laplacian mechanism is particularly useful for histogram queries, in which the counts partition the population in the database into disjoint groups. An example of such a histogram query might be: ‘How many women are born each year?’.
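A minimal sketch of (8) applied to such a histogram query, where adding or removing one individual changes exactly one count by 1, so \(\Delta q = 1\); the counts are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Perturb a (vector-valued) query answer with Laplace noise of scale
    sensitivity / epsilon, as in (8)."""
    scale = sensitivity / epsilon
    true_answer = np.asarray(true_answer, dtype=float)
    return true_answer + rng.laplace(loc=0.0, scale=scale, size=true_answer.shape)

# Histogram query 'how many women were born in each of these three years?'.
true_histogram = [120, 98, 134]
print(laplace_mechanism(true_histogram, sensitivity=1, epsilon=0.5))
```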

Exponential Mechanism A different noise scheme is the Exponential mechanism [33], used for categorical, utility-related queries. An example of such a query might be: ‘What is the most convenient date to schedule this event?’ For these sorts of queries, a small amount of noise may completely destroy the utility of the query answer. A utility function, \(u_D(r)\), is defined over the categories, \(r \in \mathcal {R}\), for a certain dataset D. The sensitivity of the exponential mechanism is defined with respect to the utility function, \(\Delta u\), not with respect to changes in \(r\). The exponential mechanism can be formally defined as:

$$\begin{aligned} p(\mathcal {A}(D, u, \mathcal {R}, \varepsilon ) = r) \propto \exp (\frac{\varepsilon u_D(r)}{2 \Delta u}). \end{aligned}$$
(9)

In other words, the probability of a category \(r\) being chosen is proportional to \(e^{\frac{\varepsilon u_D(r)}{2 \Delta u}}\), so the highest-utility categories are the most likely to be returned.
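A minimal sketch of the mechanism for the scheduling example, with hypothetical vote counts as the utility; shifting the utilities by their maximum does not change the probabilities but avoids numerical overflow.

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(categories, utilities, sensitivity, epsilon):
    """Sample a category with probability proportional to exp(eps*u / (2*Delta_u)), as in (9)."""
    utilities = np.asarray(utilities, dtype=float)
    scores = np.exp(epsilon * (utilities - utilities.max()) / (2 * sensitivity))
    probabilities = scores / scores.sum()
    return rng.choice(categories, p=probabilities)

# Toy query: the most convenient date, with (hypothetical) vote counts as utility.
# One person changes a vote count by at most 1, so the utility sensitivity is 1.
dates = ["Mon", "Tue", "Wed"]
votes = [10, 42, 17]
print(exponential_mechanism(dates, votes, sensitivity=1, epsilon=0.5))
```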

Gaussian Mechanism The Gaussian mechanism adds noise drawn from the Gaussian distribution \(\mathcal {N}(0, \sigma ^2)\). The mechanism is similar to the Laplacian mechanism in this sense. DP holds if \(\sigma \ge \sqrt{2 \ln (\frac{1.25}{\delta })}\frac{\Delta _2}{\varepsilon }\) [31]. The term \(\Delta _2\) is the global \(\ell _2\)-sensitivity; instead of using the \(\ell _1\)-norm in (7), \(\Delta _2\) uses the \(\ell _2\)-norm. The Gaussian mechanism can be deemed a more ‘natural’ type of noise, as it adds noise that is often assumed to be present in measurements. A disadvantage is that both \(\delta \) and \(\varepsilon \) must be in (0, 1), so pure \(\varepsilon \)-DP can never be met. A query that the Gaussian mechanism might be used for is: ‘What is the average transaction amount for second-hand cars?’. This value is likely to be normally distributed, and therefore fits the Gaussian mechanism.
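A minimal sketch of the Gaussian mechanism using the \(\sigma \) bound stated above; the query value and the \(\ell _2\)-sensitivity are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
    """Add Gaussian noise with the standard deviation bound stated above."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return float(true_answer) + rng.normal(loc=0.0, scale=sigma)

# Toy query: an average transaction amount, with an assumed L2-sensitivity of 1.
print(gaussian_mechanism(8250.0, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```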

3 Related work

This section discusses work related to the research objectives. Whereas the previous section discussed background related to only one pillar of Responsible AI, this section will highlight methods at the intersection of these fields. It concludes by relating the proposed method, PAFER, to the current landscape of methods.

3.1 Fair decision trees and rule sets

The earliest work on fair decision rules was done by Pedreschi et al. [34], who propose formal procedures to identify (potentially) discriminatory rules under both direct and indirect discrimination settings. They highlight the fact that task-related features can be correlated with sensitive features, leading to indirect discrimination [34].

Further, the earliest work regarding fair DTs was performed by Kamiran & Calders and is now known as Discrimination Aware Decision Trees (DADT). They proposed a Heuristic-Based DT that incorporates the homogeneity of the sensitive attribute into the splitting criterion [35]. DADT also performs post-processing such that certain decision nodes change their decision. This step is phrased as a KNAPSACK problem [36] and is also solved greedily.

In terms of optimal DTs, Linden et al. achieve excellent results with a method named DPFair [37]. Their work significantly improves the speed of the work of Jo et al., who formulate the optimal DT problem with an additional fairness objective [38], which is itself an extension of the work of Aghaei et al. [39].

Agarwal et al. [40] introduce an approach based on a zero-sum game between a hypothesis selector that finds the best-performing model and a fairness regulator that points out EOdd violations to it based on gradient descent. The equilibrium reached in the game is the best trade-off between EOdd and accuracy. The authors argue that the proposed method can be applied not only to DTs but also to other ML methods, such as Neural Networks.

A line of research exemplified by Grari et al. [41] aims to provide fairness for tree ensembles. Grari et al. propose an in-processing approach for gradient boosted trees, where the gradient of an adversarial neural network trying to predict the sensitive attribute is also considered during tree ensemble construction. Note that tree ensembles are not intrinsically interpretable and thus further works in this direction are beyond the scope of our paper.

3.2 Privacy-aware decision trees

There are three main works on the construction of DTs with DP guarantees; the rest of the field is more concerned with creating decision forests that have better performance. This holds in general, not only in a privacy-constrained setting. This subsection discusses the three works in chronological order. The setting that this body of work assumes is that a DT developer has limited access to the data via a curator to whom they can send queries. The answers to these queries are perturbed via a DP mechanism.

Blum et al. first introduced DTs with DP [42]. It was more of a proof-of-concept; the authors rewrote the information gain splitting criterion to make it differentially private. Querying the necessary quantities for each node and adding Laplacian noise to the answers ensures DP. For the leaf nodes, the class counts are queried, as is the case for all other approaches mentioned. The method, however, requires a large privacy budget to function effectively; with smaller budgets, the query answers become too noisy. Moreover, it cannot handle continuous features, yet it allows the height of the trees to be equal to the total number of features.

The improvement on this method came from offloading the bulk of the computation to the data curator [43]. The method proposed by Friedman & Schuster [43] simply queries for the quantities in each node and the best attribute to split on. The latter is used to construct the tree and the former to cleverly determine the termination of the tree construction. The improvement also stems from the fact that the method in [42] used overlapping queries that consumed the privacy budget inefficiently. This problem is not present in [43], where the queries for the nodes at each height are non-overlapping. Friedman & Schuster used the exponential mechanism, which relies on the sensitivity of the utility function, in this case, the splitting criterion. It is experimentally verified that the accuracy is highest when the criterion is the error rate. This method can handle continuous variables in theory, but including them in the training set severely degrades the predictive performance. Moreover, the maximum height of the DT is limited to five. The method still improved performance significantly, however, due to the smarter queries and noise addition.

DTs with privacy guarantees are best represented by the work of Mohammed et al. [44]. The method, named Private Decision tree Algorithm (PDA), uses the Exponential mechanism and queries the required quantities for greedily building up the DT [44]. This approach comes at the cost of a termination criterion that is less flexible than the one in [43]. Through experimental evaluation, a very robust termination criterion is determined, namely stopping at a height of four. Using this termination procedure, the method is experimentally shown to outperform the previous method. However, this method excludes the possibility of using continuous features, which is not a large downside as this is also discouraged for the approach in [43] that this method builds upon. For a deeper overview of DTs with privacy guarantees, the reader is referred to [45].

Table 1 Overview of methods that are similar to PAFER

3.3 Fair privacy-aware models

There is an emerging field within responsible AI that aims to improve fairness without accessing sensitive data. Prominent examples include Adversarially Reweighted Learning (ARL) [46] and Fair Related Features (FairRF) [47]. While we highly value this line of work, it does not allow for the evaluation or estimation of fairness, as the field assumes sensitive attributes are entirely unavailable. Therefore, we consider these methods to be insufficient for our purpose, as we aim to provide guarantees on the degree of fairness a model exhibits, e.g., adherence to the 80%-rule.

The method most closely related to ours is named AttributeConceal and was introduced by Hamman et al., who explore the idea of querying group fairness metrics [48]. The scenario they assume is that ML developers have some dataset without sensitive attributes for which they build models, and therefore query SP and EOdd from a data curator. They establish that if the developers have bad intentions, they can identify a sensitive attribute of an individual using one unrealistic query, or two realistic ones. The main idea is that the models for which they query fairness metrics differ only on one individual, giving away that individual's sensitive attribute via the answer. This result is then extended to any number of individuals. When the sizes of the groups differ greatly, i.e., \(|D_{A=0} |\ll |D_{A=1} |\), using compressed sensing [49], the number of queries is in \(O(|D_{A=0} |\log (\frac{N}{|D_{A=1} |}))\), with \(N = |D_{A=1} |+ |D_{A=0} |\) the total number of instances. The authors propose a mitigation strategy named AttributeConceal, using smooth sensitivity, a sensitivity notion that is based on the worst-case individual in the actual dataset, as opposed to the theoretical worst case of global sensitivity. DP is ensured for any number of queries by adding noise to each query answer. It is experimentally verified that, using AttributeConceal, an adversary can predict sensitive attributes merely as well as a random estimator.

As a post-processing method, Jagielski et al. [50] combine two fairness-enhancing approaches [14, 40]. They also consider the setting where only the protected attribute is required to remain private. They adapt both fairness-enhancing algorithms, optimizing for EOdd, to also adhere to DP. The hypothesis selector introduced in [40] is considered to adhere to DP if sensitive attributes are absent from its input. The fairness regulator, also inspired by [40], is made differentially private by adding Laplacian noise to the gradients of the gradient descent solver. The results of this approach are only satisfactory for large privacy budgets.

3.4 PAFER & related work

Table 1 shows methods from the domain of responsible AI that have goals similar to PAFER's. In general, we see a lack of fair, privacy-preserving methods for rule-based models, specifically DTs. Hamman et al. investigate the fairness of models in general without giving in on privacy [48], but the method lacks the validity and granularity needed for auditing. In their setting, the developers do not gain insight into what should be changed about their model to improve fairness. One class of models that lends itself well to this would be DTs, as these are modular and can be pruned, i.e., rules can be shortened or removed. DTs are the state-of-the-art for tabular data [28], and sensitive tasks are often prediction tasks on tabular data. A method that can identify unfairness in a privacy-aware manner for DTs would be interpretable, fair and differentially private, respecting some of the most important pillars of responsible AI. PAFER aims to fill this gap, querying the individual rules in a DT and hence enabling the detection of the causes of unfair treatment. The next section introduces the method.

4 Proposed method

In this section, we introduce PAFER, a novel method to estimate the fairness of DTs in a privacy-constrained manner. The following subsections dissect the proposed method, starting with Section 4.1, on the assumptions and specific scenarios for which the method is built. Subsequently, Section 4.2 provides a detailed description of the procedure, outlining the pseudocode and some theoretical properties.

4.1 Scenario

PAFER requires a specific scenario for its use. This subsection describes that scenario and discusses how common the scenario actually is.

Firstly, PAFER is made for an auditing setting, in the sense that it is assumed to be used at the end of a development cycle. PAFER does not mitigate bias; it merely estimates the fairness of the rules in a DT. Secondly, we assume that a developer has constructed a DT that makes binary decisions on a critical task (e.g., about people). Critical tasks are often binary classification problems; prominent examples include university acceptance decision making [2], recidivism prediction [11] and loan application evaluations [1]. It is also common to use a rule-based method for such a problem [52], as rules can explain the process to individuals affected by the decision. We further assume the developer may have had access to a dataset containing individuals and some task-specific features, but that this dataset does not contain a full specification of sensitive attributes at the instance level. The developer (or the algorithm auditor) wants to assess the fairness of their model using SP, which is a widely-used fairness metric. We lastly assume that a legal, trusted third party exists that knows these sensitive attributes at the instance level. Although it is common for such a party to exist, setting up this exchange is difficult in an age where data becomes more and more valuable [53]. However, since fair and interpretable sensitive-attribute-agnostic classifiers are currently lacking, this assumption is necessary. Based on these assumptions, the fairness of the DT can be assessed, using the third party and PAFER.

4.2 Privacy-aware fairness estimation of rules: PAFER

We propose Privacy-Aware Fairness Estimation of Rules (PAFER), a method based on DP [31] that enables the calculation of SP for DTs while guaranteeing privacy. PAFER sends specifically designed queries to a third party to estimate SP: one query for each decision-making rule and one query for the overall composition of the sensitive attributes. The size of each (un)privileged group, along with the total number of accepted individuals from each (un)privileged group, allows us to calculate the SP. PAFER uses the general SP definition, as found in Section 2.1.4, to allow for intersectional fairness analyses. Let \(\mathcal {X}\) be the data used to train a DT, with \(x_{i}^{j}\) the jth feature of the ith individual. Let a rule be of the form \(x^1 < 5 \, \wedge \, x^2 = True\). The query then asks for the distribution of the sensitive attributes over all individuals that satisfy \(x^1 < 5\) and \(x^2 = True\). In PAFER, each query is a histogram query, as a person cannot be both privileged and unprivileged. The query to determine the general sensitive attribute composition of all individuals can be seen as a query for an ‘empty’ rule, i.e., a rule that applies to everyone. It can also be seen as querying the root node of a DT.
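As a small illustration of such a rule query, the sketch below forms the histogram for the example rule on the curator's side; the column names (x1, x2, A) and the use of pandas are our own assumptions, not part of PAFER itself.

```python
import pandas as pd

def rule_histogram(curator_data: pd.DataFrame) -> pd.Series:
    """True answer to the query for the example rule x1 < 5 AND x2 = True:
    the sensitive-attribute histogram over all individuals covered by the rule.
    The curator perturbs this histogram with a DP mechanism before returning it."""
    covered = curator_data[(curator_data["x1"] < 5) & curator_data["x2"]]
    return covered["A"].value_counts()

# Hypothetical curator-side data that includes the sensitive attribute A.
df = pd.DataFrame({"x1": [3, 7, 2, 4], "x2": [True, True, False, True],
                   "A": ["priv", "unpriv", "priv", "unpriv"]})
print(rule_histogram(df))  # counts per sensitive group among the covered individuals
```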

4.2.1 PAFER and the privacy budget

A property of DTs is that exactly one rule applies to each instance. Therefore, PAFER can query each decision-making rule without having to split the privacy budget between these queries. Although SP is a global statistic, we query each decision-making rule individually. This is possible because some of the noise cancels out on aggregate and because, for DTs, the rules partition the instances, so the privacy budget can be shared across all decision-making rules. This intuition was also noted in [45].

Because PAFER queries every individual at least once, half of the privacy budget is spent on the query to determine the general sensitive attribute composition of all individuals, and the other half is spent on the remaining queries. Still, reducing the number of queries reduces the total amount of noise. PAFER therefore prunes non-distinguishing rules. A redundant rule can be formed when the splitting criterion of the DT improves but the split does not create a node with a different majority class.
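To make the rule-level queries and the budget split concrete, the following simplified sketch (our own illustration under the Laplacian mechanism, not the authors' pseudocode in Algorithm 1) assembles the generalised SP of (5) from one noisy histogram per favourable rule plus one noisy 'empty-rule' histogram; the clipping is a crude stand-in for the invalid-answer policies of Section 4.2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_histogram(counts, epsilon):
    """Curator-side histogram answer under the Laplacian mechanism (sensitivity 1)."""
    return np.asarray(counts, dtype=float) + rng.laplace(0.0, 1.0 / epsilon, len(counts))

def estimate_sp(rule_histograms, overall_histogram, epsilon):
    """rule_histograms: true sensitive-attribute counts per favourable rule.
    overall_histogram: counts for the 'empty' rule (the whole dataset).
    Half of the budget goes to the empty-rule query; because the rules are
    disjoint, the other half can be reused for every rule query."""
    group_sizes = np.clip(noisy_histogram(overall_histogram, epsilon / 2), 1, None)
    accepted = sum(noisy_histogram(h, epsilon / 2) for h in rule_histograms)
    accepted = np.clip(accepted, 0, group_sizes)  # stand-in for Section 4.2.3 policies
    rates = accepted / group_sizes                # positive prediction rate per group
    return rates.min() / rates.max()              # generalised SP of (5)

# Hypothetical DT with two favourable rules and a binary sensitive attribute.
print(estimate_sp(rule_histograms=[[40, 25], [30, 10]],
                  overall_histogram=[200, 180], epsilon=0.2))
```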

4.2.2 DP mechanisms for PAFER

Three commonly used DP mechanisms are apt for PAFER, namely the Laplacian mechanism, the Exponential mechanism and the Gaussian mechanism. The Laplacian mechanism is used to perform a histogram query and thus has a sensitivity of 1 [31]. The Exponential mechanism uses a utility function \(u_D(r) = q(D) - |q(D) - r|\), where r ranges from zero to the number of individuals that the rule applies to, and q(D) is the true query answer. Here, too, the sensitivity is 1, as the utility depends on the database only through the count q(D), which can differ by at most 1 between neighboring datasets [31]. The Gaussian mechanism is also used to perform a histogram query and has a sensitivity of 2, as it uses the \(\Delta _2\)-sensitivity.

4.2.3 Invalid answer policies

The Laplacian and Gaussian mechanisms add noise in such a way that invalid query answers may occur. A query answer is invalid if it is negative, or if it exceeds the total number of instances in the dataset. A policy for handling these invalid query answers must be chosen. In practice, these policies are mappings from invalid values to valid values. These mappings can be applied to a scalar or simultaneously to multiple values, e.g., a vector or histogram. We provide several options in this subsection.

Table 2 The proposed policy options for each type of invalid query answer

Table 2 shows the available options for handling invalid query answers. A policy consists of a mapping chosen from the first column and a mapping chosen from the second column of this table. The first column shows policies for negative query answers and the second column shows policies for query answers that exceed the number of individuals in the dataset. The ‘uniform’ policy replaces an invalid answer with the answer that would be obtained if the rule applied to the same number of individuals from each (un)privileged group. The ‘total - valid’ policy requires that all other values in the histogram are valid. This allows for the calculation of the missing value by subtracting the sum of the valid query answers from the total number of individuals that the query applies to.
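A hedged sketch of one such policy combination, pairing the 'uniform' mapping for negative counts with the 'total - valid' mapping for overshoots; the function is our own illustration of the options in Table 2.

```python
import numpy as np

def repair_histogram(noisy_counts, n_covered):
    """Map invalid noisy counts back to valid values for a query covering
    n_covered individuals: negative counts get the 'uniform' answer, counts above
    n_covered get 'total - valid' when all other counts in the histogram are valid."""
    counts = np.asarray(noisy_counts, dtype=float)
    k = len(counts)
    for i in range(k):
        others = np.delete(counts, i)
        if counts[i] < 0:
            counts[i] = n_covered / k                                   # 'uniform'
        elif counts[i] > n_covered and np.all((others >= 0) & (others <= n_covered)):
            counts[i] = n_covered - others.sum()                        # 'total - valid'
    return counts

print(repair_histogram([-3.2, 41.7], n_covered=60))  # [30.0, 41.7]
```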

Algorithm 1 PAFER

4.2.4 PAFER pseudocode

Algorithm 1 shows the pseudocode for PAFER.

4.2.5 Theoretical properties of PAFER

We theoretically determine a lower and upper bound of the number of queries that PAFER requires for a k-ary DT in Theorem 1. The lower bound is equal to two, and the upper bound is \({k}^{h - 1} + 1\), dependent on the height of the DT, h. Note that PAFER removes redundant rules to reduce the number of rules. The larger the number of rules, the more noise is added on aggregate.

Theorem 1

The number of queries required by PAFER to estimate SP for a k-ary DT is lower bounded by 2 and upper bounded by \({k}^{h - 1} + 1\).

Proof

Let D and A denote the dataset and the sensitive attribute in question, respectively. Assume that we have constructed a k-ary DT for a binary classification task. Further, let the height (i.e., depth) of this DT be h. To estimate SP for the sensitive attribute A, we need the total size of each sensitive group \(a \in range(A)\), namely \(|D_{A=a}|\), as well as the number of individuals from each sensitive group that is classified favorably by the DT, namely \(|D_{A=a, \hat{Y}=1}|\). By definition, the first statistic requires 1 histogram query. The latter statistic requires a query for each favorable decision rule in the tree. A binary tree having a single favourable rule is schematically shown in Fig. 3 in Appendix A. Only 1 histogram query is required for this tree; thus, the lower bound for the number of required queries for PAFER is \(1 + 1 = 2\). A perfectly balanced tree is illustrated in Fig. 4 in Appendix A. In this case, the number of favourable decision rules in the tree is \({k}^{h-1}\). This is because each split that creates two leaf nodes adds both a favourable and an unfavourable classification rule to the DT, and we are interested only in the rules having a favourable outcome. In a perfectly balanced k-ary tree trained for a binary classification task, the number of favourable-outcome rules is equal to the number of nodes at depth \(h - 1\). This amounts to \({k}^{h-1}\) histogram queries. The upper bound for the number of required queries for PAFER is thus \({k}^{h-1} + 1\). \(\square \)

5 Evaluation

This section evaluates the proposed method in the previous section, PAFER. Firstly, Section 5.1 describes the experimental setup, detailing the used datasets and the two experiments. Secondly, Section 5.2 displays and discusses the results of the experiments.

5.1 Experimental setup

This section describes the experiments that answer the research questions. The first subsection describes these datasets and details their properties. The subsections thereafter describe the experiments in order, corresponding to the research question they aim to answer.

5.1.1 Datasets

The datasets form the test bed on which the experiments can be performed. We chose four publicly available datasets, namely, Adult [54], COMPAS [11], German [55] and Taiwan [56]. They are all well known in the domain of fairness for ML, and can be considered benchmark datasets. Importantly, they vary in size and all model a binary classification problem, enabling the calculation of various fairness metrics. The datasets are publicly available and pseudonymized; every privacy concern is thus merely for the sake of argument. Table 3 shows some other important characteristics of each dataset.

Table 3 Properties of the four chosen publicly available datasets

Pre-processing This paragraph describes each pre-processing step for every chosen dataset. Some pre-processing steps were taken for all datasets. In every dataset, the sensitive attributes were separated from the training set. Every sensitive attribute except age was binarized, distinguishing between privileged and unprivileged groups. The privileged individuals were ‘white men’ who lived in their original country of birth, and the unprivileged individuals were those who were not male, not white or lived abroad. We now detail the pre-processing steps that are dataset-specific.

Adult. The Adult dataset comes with a predetermined train and test set. The same pre-processing steps were performed on each. Rows that contained missing values were removed. The “fnlwgt” column, which stands for “final weight”, was removed as it is a relic from a previously trained model and unrelated features might cause overfitting. The final number of rows was 30162 for the train set and 15060 for the test set.

Taiwan. The Taiwan loan default dataset has no missing values and contains 30000 instances. The training and test sets have 20000 and 10000 instances, respectively.

COMPAS. The COMPAS article analyzes two datasets, one for general recidivism and one for violent recidivism [11]. Only the dataset for general recidivism was used. This is a dataset with a large number of features (53), but by following the feature selection steps from the article, this number is reduced to eleven, of which three are sensitive attributes. The other pre-processing step in the article is to remove cases in which the arrest date and COMPAS screening date are more than thirty days apart. The features that contain dates are then converted to just the year, rounded down. Missing values are imputed with the median value for that feature; this ensures that no out-of-the-ordinary values are added to the dataset. The final number of instances was 4115 for the train set and 2057 for the test set, totalling 6172.

German. The German dataset has no missing values. The gender attribute is encoded in the marital status attribute, which required separation. The final number of rows is 667 for the train set and 333 for the test set, totalling 1000 rows.

5.1.2 Experiment 1: comparison of DP mechanisms for PAFER

Experiment 1 was constructed such that it answers RQ1: which DP mechanism is optimal for which privacy budget? The best-performing shallow DT was constructed for each dataset, using the CART algorithm [15] with grid search and cross-validation, optimizing for balanced accuracy. The height of the DT, the number of leaf nodes and the number of selected features were varied. The parameter space can be described as {2, 3, 4} \(\times \) {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} \(\times \) {sqrt, all, \(\log _2\)}, constituting tuples of (height, # leaf nodes, # selected features). The out-of-sample SP of each DT is provided in Table 4. The experiment was repeated fifty times with this same DT, such that the random noise introduced by the DP mechanisms could be averaged out. Initially, we considered the Laplacian, Exponential and Gaussian mechanisms for the comparison. However, after exploratory testing, we deemed the Gaussian mechanism to perform too poorly to be included; Table 5 shows some of these preliminary results. The performance of each mechanism was measured using the Average Absolute Statistical Parity Error (AASPE), defined as follows:

$$\begin{aligned} \textrm{AASPE} = \frac{1}{\mathrm {\# \, runs}} \sum _{i=1}^{\mathrm {\# \, runs}} | SP_i - \widehat{SP_i} |, \end{aligned}$$
(10)

where # runs is the number of times the experiment was repeated, and \(SP_i\) and \(\widehat{SP_i}\) are the true and estimated SP of the ith run, respectively. The metric was calculated out of sample, i.e., on the test set. The differences in performance were compared using an independent t-test. The privacy budget was varied such that forty equally spaced values were tested with \(\varepsilon \in (0, \frac{1}{2}]\); initial results showed that \(\varepsilon > \frac{1}{2}\) offered only marginal improvements. Table 5 shows a summary of the preliminary results for Experiment 1. Experiment 1 was performed for ethnicity, sex, and the two combined. The former two sensitive features were encoded as a binary feature, distinguishing between a privileged (white, male) and an unprivileged (non-white, non-male) group. The combined sensitive feature was encoded as a quaternary feature, crossing the two binary attributes, with white males as the privileged group. Whenever a query answer is invalid, as described in Section 4.2.3, a policy must be chosen for the calculation of the SP metric. In Experiment 1, the uniform answer policy was chosen, i.e., an invalid group count was replaced with the value it would have if the query’s total count were divided evenly over the sensitive groups. The proportion of invalid query answers, i.e., \(\frac{\mathrm {\# \; invalid \; answers}}{\mathrm {\# \; total \; answers}}\), was also tracked during this experiment. This invalid value ratio provides some indication of how much noise is added to the query answers.
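AASPE is simply the mean absolute estimation error over the repeated runs; a minimal sketch:

```python
import numpy as np

def aaspe(true_sp, estimated_sp):
    """Average Absolute Statistical Parity Error of (10): the mean absolute
    difference between the true and the estimated SP over repeated runs."""
    true_sp, estimated_sp = np.asarray(true_sp), np.asarray(estimated_sp)
    return float(np.mean(np.abs(true_sp - estimated_sp)))

# Three hypothetical runs of the same DT with freshly sampled DP noise.
print(aaspe([0.12, 0.12, 0.12], [0.10, 0.17, 0.08]))  # approximately 0.037
```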

Table 4 The out-of-sample SP of each constructed DT in Experiment 1
Table 5 Preliminary results for Experiment 1 with larger privacy budgets

5.1.3 Experiment 2: comparison of different DTs for PAFER

Experiment 2 was constructed in such a way that it answers RQ2: what is the effect of DT hyperparameters on the performance of PAFER? The minleaf value was varied such that eighty equally spaced values were tested with \(\texttt {minleaf} \in (0, \frac{1}{5}]\). In the initial results, shown in Table 6, when minleaf \(> \frac{1}{5}\), the same split was repeatedly chosen for each dataset. Even with minleaf \(< \frac{1}{2}\), there is still a risk that one numerical feature is split over and over, which hinders interpretability. Therefore, each numerical feature was categorized by generating five equal-width bins. The privacy budget was defined such that \(\varepsilon \in \{\frac{1}{20}, \frac{2}{20}, \frac{3}{20}, \frac{4}{20}, \frac{5}{20}\}\). The performance was again measured in terms of AASPE, out of sample, i.e., on the test set. The performance for each minleaf value was averaged over fifty potentially different DTs. The same invalid query answer policy was chosen as in Experiment 1, replacing each invalid query answer with the uniformly distributed answer. The performance of PAFER was compared with a baseline that uniformly randomly guesses an SP value in the interval [0, 1). A one-sided t-test determined whether PAFER significantly outperformed the random baseline.

Table 6 Preliminary results for Experiment 2

Experiment 2.1: Interaction between \(\varepsilon \) and minleaf hyperparameters The SP metric is also popular due to its legal use in the United States, where it is used to determine compliance with the 80%-rule [13]. Thus, the UAR (Unweighted Average Recall) of PAFER was calculated for each minleaf value, to obtain an indication of whether PAFER was able to effectively measure this compliance. UAR is the average of class-wise recall scores and has a chance-level baseline score of \(\frac{1}{K}\), where K is the number of classes. It is popularly used as a performance measure in classification tasks involving class imbalance and in machine learning competitions [57, 58]. The discretisation was done by rounding each estimation down to one decimal digit, thus creating ‘classes’ that the UAR could be calculated for. To gain more intuition about the interaction between \(\varepsilon \) and minleaf value, the following metric was calculated for each combination:

$$\begin{aligned} \textrm{UAR} - \textrm{AASPE} = \frac{1}{|C|} \sum _{c \in C} \frac{\# \, \textrm{true} \, c}{\# \, c} - \frac{1}{\mathrm {\# \, runs}} \sum _{i=1}^{\mathrm {\# \, runs}} | SP_i - \widehat{SP_i} |, \end{aligned}$$
(11)

where \(C\) is the set of classes. Ideally, AASPE is minimized and UAR is maximized, thus maximizing the metric shown in (11). Apart from the metric, the experimental setup was identical to Experiment 2; the same DTs were used for this experiment, only the metrics differed.

5.2 Results

This section describes the results of the experiments and also provides an analysis of the results. Results are ordered to match the order of the experiments.

5.2.1 Results for experiment 1

Fig. 1 A comparison of the Laplacian and Exponential DP mechanisms for different privacy budgets \(\varepsilon \) on the Adult, COMPAS and German datasets. When indicated, from the critical \(\varepsilon \) value to \(\varepsilon = \frac{1}{2}\), the Laplacian mechanism performs significantly better (\(p < .05\)) than the Exponential mechanism. The uncertainty is pictured in a lighter colour around the average

Fig. 2 A comparison of the Laplacian and Exponential DP mechanisms for different privacy budgets \(\varepsilon \) on the Taiwan dataset. When indicated, from the critical \(\varepsilon \) value to \(\varepsilon = \frac{1}{2}\), the Laplacian mechanism performs significantly better (\(p < .05\)) than the Exponential mechanism. The uncertainty is pictured in a lighter colour around the average

Figures 1 and 2 answer RQ1: the Laplacian mechanism outperforms the Exponential mechanism on eight out of the ten analyses. The Laplacian mechanism is significantly better even at very low privacy budgets (\(\varepsilon < 0.1\)). The Gaussian mechanism also proved to be no match for the Laplacian mechanism, even at large privacy budgets. The error of the Laplacian mechanism generally decreases steadily as the privacy budget increases. This is expected behavior: as the privacy budget increases, the amount of noise decreases. The Laplacian mechanism performs best on the Adult, COMPAS and Taiwan datasets, because their invalid value ratio is small, especially for \(\varepsilon > \frac{1}{10}\).

The Exponential mechanism performs relatively stably across analyses; however, its performance is generally poor, with errors even reaching the maximum possible error for the German dataset. This is probably due to the design of the utility function, which does not differentiate enough between good and bad answers. Moreover, the Exponential mechanism consistently adds even more noise because it guarantees valid query answers. The Laplacian mechanism does not give these guarantees, and thus relies less on the chosen policy, as described in Section 4.2.3. The Laplacian mechanism performs only somewhat decently on the intersectional analysis for the Adult dataset. This is because it is an easy prediction task; the Laplacian mechanism starts at a similarly low error.

The plots in Figs. 1 and 2 show that the invalid value ratio consistently decreases as the privacy budget increases. This behavior is expected, given that the amount of noise decreases as the privacy budget increases. The invalid value ratio is largest in the intersectional analyses, because the sensitive attribute is then quaternary. The difference between the invalid value ratio progression for the Adult, Taiwan and COMPAS datasets is small, whereas the difference between COMPAS and German is large. Thus, smaller datasets only become problematic for PAFER somewhere between 1000 and 6000 rows. Experiment 2 sheds further light on this question.

In the two cases where the Exponential mechanism is competitive with the Laplacian mechanism, the invalid value ratio is also large. When the dataset is small, the noise is large relative to the counts (the sensitivity is fixed), so the chance of invalid query answers increases. Note that the error is measured out-of-sample, so for the German dataset the histogram queries are performed on a dataset of size 333. This effect is also visible in the next experiment.

5.2.2 Results for experiment 2

Table 7 Summary results for Experiment 2, comparing the AASPE of PAFER and the random baseline with minleaf set to 0.01, under all tested dataset and sensitive attribute combinations (settings)
Table 8 The AASPE scores along with their uncertainties for all tested settings

Tables 7 and 8 show p-values and the corresponding summary results for Experiment 2.1, respectively, with the minleaf value set to 0.01. Table 7 clearly shows that PAFER generally significantly outperforms the random baseline for this minleaf value. When the minleaf value is reduced further to 0.001 (not reported in the tables), PAFER does not outperform the random baseline in most settings on the COMPAS dataset. This is due to the ‘small’ leaf nodes, but also to the small dataset size (\(N = 6000\)). Together they reduce the queried quantities even further, resulting in worse performance for PAFER: the (un)privileged group sizes per rule are closer to zero, which increases the probability of invalid query answers. PAFER thus performs more poorly with a small privacy budget, but also on less interpretable DTs. When the minleaf value of a DT is small, it generally has more branches and its branches are longer, as the CART algorithm is stopped later. Both factors worsen the interpretability of a DT [59].

Table 8 reports the mean and standard deviation of the AASPE scores over 50 runs for each setting. It shows that the external factors that negatively impact the performance of PAFER are a small dataset size and a large number of (un)privileged groups. The results for the German dataset are therefore omitted, as PAFER does not outperform the random baseline there. PAFER’s worse performance on smaller datasets and on less interpretable DTs is a clear limitation of the method.

For the sake of succinctness, the results and corresponding plots for Experiment 2.1 are given in Appendix B. This experiment also replicates some of the results of Experiments 1 and 2. The middle plots in Figs. 5 through 11 show that PAFER with the Laplacian mechanism performs better for larger privacy budgets. These plots also show the previously mentioned trade-off between interpretability and the performance of PAFER: the method performs worse for smaller minleaf values. Lastly, the performance is generally lower for the COMPAS dataset, which holds fewer instances. To sum up the experiments conducted in response to RSQ2.1: in nearly all trials, there was a significant difference in error between PAFER and the random baseline.

Inspired by the model-agnostic baseline approach in [48], we compare PAFER’s performance to a holistic SP calculation that combines all favourable rules into a single query using the Laplacian mechanism within the DP framework. Note that since this query is at the model level, it can be formulated as a model-agnostic query without knowing or having access to the model internals; our implementation of the DT model’s SP query via the combined favourable rules is merely for computational efficiency. Table 9 reports the ratio of the AASPE score of this coarse, model-level approach to the AASPE of rule-based PAFER. A model-level approach is expected to outperform PAFER because it needs fewer queries; moreover, due to the properties of DP and the fact that the rules partition the instances, relatively higher noise is expected per rule-based query. However, the results show that in many settings our fine-grained PAFER method not only approaches but even outperforms the coarse approach. This is especially true for shorter DTs, i.e. those with a larger minleaf value. We note that none of these higher performances were statistically significant (\(p < .05\)), as measured by an independent-samples t-test.
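The following Python sketch contrasts the two querying strategies compared in Table 9. The group totals are assumed to be publicly known and unit sensitivity is assumed; both are simplifications of the actual setup.

import numpy as np

def _noisy(count, epsilon, rng):
    # Laplace mechanism with assumed unit sensitivity.
    return count + rng.laplace(0.0, 1.0 / epsilon)

def sp_model_level(priv_pos, unpriv_pos, priv_n, unpriv_n, epsilon, rng):
    # One combined query per group over all favourably classified instances.
    return abs(_noisy(priv_pos, epsilon, rng) / priv_n
               - _noisy(unpriv_pos, epsilon, rng) / unpriv_n)

def sp_rule_level(per_rule_counts, priv_n, unpriv_n, epsilon, rng):
    # One query per favourable rule; because the rules partition the data,
    # the full budget can be spent on each query (parallel composition).
    priv_pos = sum(_noisy(p, epsilon, rng) for p, _ in per_rule_counts)
    unpriv_pos = sum(_noisy(u, epsilon, rng) for _, u in per_rule_counts)
    return abs(priv_pos / priv_n - unpriv_pos / unpriv_n)

rng = np.random.default_rng(0)
print(sp_model_level(300, 120, 1000, 800, epsilon=0.2, rng=rng))
print(sp_rule_level([(180, 70), (90, 30), (30, 20)], 1000, 800, epsilon=0.2, rng=rng))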

Table 9 Comparison between performing joined queries and rule-level queries for varying minleaf values
Table 10 Predictive performance of PAFER in terms of UAR and balanced accuracy

To sum up, the response to RSQ2.2 also depends on the sensitive attribute in question and on the dataset. The model-level querying approach significantly outperformed PAFER on the COMPAS dataset for the ethnicity and intersectional sensitive attributes, with a minleaf value of 0.05. For minleaf values \(> 0.05\), neither method significantly outperformed the other. In this case the results motivate the use of PAFER, as it adds fine-grained, rule-level fairness estimation while maintaining similar performance.

5.3 80%-rule analysis

Since the problem of estimating fairness is a regression task, all results so far have been reported in terms of AASPE. To ease the performance analysis, we discretise the predictions and the actual SP scores into bins of width 0.1, as mentioned in Section 5.1.3. In Table 10, we report the classification performance of PAFER in terms of balanced accuracy and UAR. While this classification can easily be used to analyse whether the corresponding DT adheres to the 80%-rule, we note that it is not a binary but a multi-class classification task: the number of classes depends on the range of the ground-truth SP, with a maximum of 11 classes.
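A minimal sketch of this discretisation and of an 80%-rule check follows. The bin edges follow the 0.1 width mentioned above, while the disparate-impact formulation of the 80%-rule is one common reading and not necessarily the exact criterion used alongside Table 10.

import numpy as np
from sklearn.metrics import recall_score

def sp_bin(sp):
    # Discretise an SP value in [0, 1] into one of at most 11 bins of width 0.1.
    return int(np.clip(np.floor(sp / 0.1), 0, 10))

def uar_over_bins(sp_true, sp_est):
    # UAR / balanced accuracy of the binned SP predictions.
    return recall_score([sp_bin(s) for s in sp_true],
                        [sp_bin(s) for s in sp_est], average="macro")

def passes_80_rule(rate_privileged, rate_unprivileged):
    # Disparate-impact reading: the lower acceptance rate must be at least
    # 80% of the higher one.
    lo, hi = sorted([rate_privileged, rate_unprivileged])
    return hi == 0 or lo / hi >= 0.8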

5.4 Rule-level auditing analysis

As PAFER provides bias analysis at the rule level to spot unfair rules, we illustrate it in an auditing scenario. For a DT constructed on the Adult dataset, three positively classifying rules were identified, as shown in Table 11. PAFER correctly identified that the first rule was unfavorable for one of the groups, as it caused a difference in acceptance rate of 9.4%; the method detected this risk of unwanted bias with an absolute SP error of 0.0075. This example shows that, with a modest privacy budget of 0.2, PAFER can aid policy making and identify pitfalls in protocols.

Table 11 Positively classifying rules of a DT constructed with a minleaf value of 0.02, on the Adult dataset
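A sketch of how such a rule-level audit could be automated from noisy counts is given below; the flagging threshold, unit sensitivity, and the assumption of known group totals are illustrative choices of ours.

import numpy as np

def audit_rules(rule_counts, priv_n, unpriv_n, epsilon, threshold=0.05, rng=None):
    # rule_counts: per positively classifying rule, the true numbers of covered
    # privileged and unprivileged individuals. Rules whose noisy per-rule
    # acceptance-rate gap exceeds `threshold` are flagged for review.
    rng = np.random.default_rng() if rng is None else rng
    flagged = []
    for i, (c_priv, c_unpriv) in enumerate(rule_counts):
        gap = abs((c_priv + rng.laplace(0.0, 1.0 / epsilon)) / priv_n
                  - (c_unpriv + rng.laplace(0.0, 1.0 / epsilon)) / unpriv_n)
        if gap > threshold:
            flagged.append((i, round(gap, 4)))
    return flagged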

6 Conclusion & future work

This section concludes the work with a summary in Section 6.1, and provides suggestions for future work in Section 6.2.

6.1 Summary

This work has shed light on the trade-offs between fairness, privacy and interpretability in the context of DTs as intrinsically interpretable models, by introducing a novel, privacy-aware fairness estimation method called PAFER. There is a natural tension between the estimation of fairness and privacy, given that sensitive attributes are required to calculate fairness. This applies also to interpretable, rule-based methods. The proposed method, PAFER, alleviates some of this tension.

PAFER should be applied to a DT in a binary classification setting, at the end of a development cycle. PAFER guarantees privacy using mechanisms from DP, allowing it to measure SP for DTs.

We showed that the minimum number of queries required by PAFER is 2. We also showed that the maximum number of queries for a k-ary DT of height h is \({k}^{h-1} + 1\).
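As a worked illustration of this bound (our own example): for a binary DT (\(k = 2\)) of height \(h = 4\), at most

$$\begin{aligned} k^{h-1} + 1 = 2^{3} + 1 = 9 \end{aligned}$$

queries are needed.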

In our experimental comparison of several DP mechanisms, PAFER proved capable of accurately estimating SP for low privacy budgets when used with the Laplacian mechanism. This confirms that calculating SP for DTs while respecting privacy is possible using PAFER.

Experiment 2 showed that the smaller the leaf nodes of the DT are, the worse the performance is. PAFER thus performs better on more interpretable DTs, since the smaller the minleaf value, the less interpretable the DT.

Future work can look into other types of DP mechanisms to use with PAFER, and other types of fairness metrics, e.g. EOdd.

6.2 Limitations & future work

This section describes avenues that could be explored further regarding PAFER, with an eye on the limitations that became apparent from the experimental results. We suggest an extension of PAFER to two other fairness metrics in Section 6.2.1, suggest examining the input parameters of the PAFER algorithm in Section 6.2.2, and discuss the use of Bayesian methods on aggregate data in Section 6.2.3.

6.2.1 Other fairness metrics

The most obvious research avenue for PAFER is extending it to support other fairness metrics. SP is a popular but simple metric that is not appropriate in every scenario. We therefore propose two other group fairness metrics that are suitable for PAFER; given the abundance of fairness metrics, multiple other suitable metrics are bound to exist.

The EOdd metric compares the acceptance rates across (un)privileged groups conditioned on the dataset labels. In our scenario (Section 4.1), we assume the dataset labels are known, as they are required for the construction of a DT. Therefore, by querying the sensitive attribute distributions of the favorably classifying rules only for those individuals for which \(Y = y\), PAFER can calculate EOdd. Since these groups are mutually exclusive, \(\varepsilon \) does not have to be shared among the queries. Since EOpp is a variant of EOdd, it can naturally also be measured using this approach. A downside is that the number of queries doubles, which hinders performance; however, this overhead is only a constant factor.
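A minimal Python sketch of this label-conditioned querying for EOdd follows. The dictionary layout, the Laplace noise with unit sensitivity, and reporting the maximum gap over labels are illustrative choices of ours rather than a specification of the proposed extension.

import numpy as np

def eodd_estimate(fav_counts, group_totals, epsilon, rng=None):
    # fav_counts[y][g]: favorably classified individuals of group g with label y,
    # aggregated over all favorably classifying rules (one query per label).
    # group_totals[y][g]: all individuals of group g with label y.
    rng = np.random.default_rng() if rng is None else rng
    gaps = []
    for y in (0, 1):
        rate_priv = ((fav_counts[y]["priv"] + rng.laplace(0.0, 1.0 / epsilon))
                     / group_totals[y]["priv"])
        rate_unpriv = ((fav_counts[y]["unpriv"] + rng.laplace(0.0, 1.0 / epsilon))
                       / group_totals[y]["unpriv"])
        gaps.append(abs(rate_priv - rate_unpriv))
    return max(gaps)  # y = 1 gives the TPR gap, y = 0 the FPR gap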

6.2.2 Other input parameters

Examining the input parameters of the PAFER estimation algorithm in Algorithm 1, two clear candidates for further research emerge: the DP mechanism \(\mathcal {A}\) and the audited model, the DT. The following two paragraphs discuss these options.

The Differential Privacy mechanism The performance of other DP mechanisms can be experimentally compared to the currently examined mechanisms, using the experimental setup of Experiment 1. Experiment 2 shows that there is still room for improvement, as a random guessing baseline is competitive with the Laplacian mechanism in certain settings.

The work of Hamman et al. [48] shows promising results for a simple SP query. They use a DP mechanism based on smooth sensitivity [60], a form of sensitivity that adapts the noise to the data while still guaranteeing DP. If this mechanism could be adapted for histogram queries, PAFER might improve in accuracy. Currently, PAFER performs poorly on less interpretable DTs; an improvement in accuracy might also enable PAFER to audit such DTs.

The audited model PAFER, as the name suggests, is currently only suited to rule-based systems, and in particular DTs. Further research could look into the applicability of PAFER to other rule-based systems, such as fuzzy-logic rule systems [61], rule lists [62] and association rule mining [63]. The main point of attention is the distribution of the privacy budget: for DTs, exactly one rule applies to each individual, so PAFER can query all rules; for other rule-based methods, this might not be the case.

It has long been established that Neural Networks can be converted to DTs [64]. Applying PAFER to DTs extracted from Neural Networks could therefore be a future research direction. However, the Neural Network must have a small number of parameters, or the associated DT would be very deep. Since deep DTs work worse with PAFER, the applicability is limited.

6.2.3 Bayesian methods on aggregate data

All experiments in this paper were conducted with a simulated third party that has access to all sensitive data at the instance level. This is technically and legally feasible only for a small set of sensitive attributes, such as ‘sex’ and ‘country of birth’, which are registered in national census databases. However, critically sensitive data, such as ethnicity, should not be kept at the individual level; it can instead be kept at an aggregate level following representative national surveys. Thus, to use aggregate data (e.g., univariate and bivariate counts) effectively with PAFER, future research can investigate the applicability of Bayesian methods and a minimal set of conditional probabilities/statistics for auditing decision rules used in governance.
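To make this direction slightly more concrete, the sketch below estimates the number of privileged individuals covered by a rule from aggregate bivariate statistics only. The conditional-independence assumption implicit in the estimate and all names are our own; this is not part of PAFER.

def estimate_group_count_in_rule(rule_feature_counts, group_given_feature):
    # rule_feature_counts[x]: how many instances covered by the rule have the
    # (non-sensitive) feature value x; group_given_feature[x]: aggregate
    # estimate of P(privileged | X = x), e.g. from a national survey.
    # Assumes the sensitive attribute is conditionally independent of rule
    # membership given X, which is a strong, purely illustrative assumption.
    return sum(n * group_given_feature[x] for x, n in rule_feature_counts.items())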