Analyzing Utility of Private Mechanisms
Introduction
Differentially private mechanisms release answers to queries on a sensitive database while protecting the privacy of individuals in it. These mechanisms add carefully calibrated noise to the true answers. Naturally, this means the private answers are less accurate than the true ones. In this post, we will see how to quantify the error incurred by differentially private mechanisms on various categories of queries.
For each query type, we formally define the query and its corresponding accuracy metric. We will then look at general strategies to analyze the accuracy of a differentially private mechanism for this query.
Let us start with some notation. Suppose we have a sensitive database \(D\) which consists of \(n\) items from the data universe \(U\). A mechanism \(M\) needs to release the answers for a query \(f\) (or a set of queries \(F\)) on this database with \((\epsilon, \delta)\)-differential privacy. Our task is to analyze the error between \(M_{f}(D)\) and \(f(D)\).
Numeric Queries
Definition
A numeric query \(f: U^{n} \rightarrow \mathbb{R}^{k}\) maps a database \(D\) of \(n\) items to a vector of \(k\) real numbers. Linear queries, counting queries, and histogram queries are specific instantiations of numeric queries. A counting query maps each item in \(D\) to a value in \(\{0, 1\}\) based on whether the item satisfies the query’s predicate. A histogram query counts the frequency of each type of item (from the data universe \(U\)) that is present in \(D\).
Accuracy Metric
A differentially private mechanism \(M\) is said to be \((\alpha, \beta)\)-accurate on a numeric query \(f\) if \(\Pr[|M_{f}(D) - f(D)| \geq \alpha] \leq \beta\). In other words, the probability of the difference between the mechanism’s output and the true output being at least \(\alpha\) is at most \(\beta\). Ideally, we want both \(\alpha\) and \(\beta\) to be small. If we want to evaluate the performance of \(M\) on a set of queries \(F\), we can modify the definition as \(\Pr[\max_{f \in F} |M_{f}(D) - f(D)| \geq \alpha] \leq \beta\).
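To make the metric concrete, here is a minimal Monte Carlo sketch that estimates \(\beta\) for a fixed \(\alpha\) under the Laplace mechanism on a single counting query. This is an illustration, not part of the formal development; `numpy` is assumed, and the query answer, \(\epsilon\), and \(\alpha\) are hypothetical toy values.

```python
import numpy as np

# Empirically estimate beta = Pr[|M_f(D) - f(D)| >= alpha] for the
# Laplace mechanism on a single counting query (global sensitivity 1).
rng = np.random.default_rng(0)
epsilon = 1.0        # privacy parameter (toy value)
true_answer = 42.0   # f(D) for some fixed database D (toy value)
alpha = 3.0          # error threshold to test (toy value)

# Draw many private answers M_f(D) = f(D) + Lap(1 / epsilon).
trials = 100_000
private_answers = true_answer + rng.laplace(scale=1.0 / epsilon, size=trials)

# Fraction of trials where the error is at least alpha.
beta_hat = np.mean(np.abs(private_answers - true_answer) >= alpha)
print(f"estimated beta: {beta_hat:.4f}")
```

For \(Lap(b)\) noise, \(\Pr[|Y| \geq \alpha] = e^{-\alpha/b}\), so with \(b = 1\) and \(\alpha = 3\) the estimate should land near \(e^{-3} \approx 0.0498\).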
Strategy
We first describe the strategy to bound the error \(\alpha\) for a single numeric query. We know that a mechanism’s error is the result of the noise it adds to the query to privately release the answer. Thus, if we bound the magnitude of noise being added, we can bound the error of the mechanism.
Note that the random noise added by the mechanism can be sampled from a single random variable, a sum of independent random variables, or a sum of dependent random variables. Each random variable can follow its own probability distribution. We denote the total noise by \(Y\), which we use interchangeably with the quantity \(|M_{f}(D) - f(D)|\).
It remains to define the error term \(\alpha\) for using the \((\alpha, \beta)\)-accuracy metric above. Since the mechanism’s error depends on the magnitude of \(Y\), we will state \(\alpha\) as a function of the parameters of the probability distributions used by the differentially private mechanism \(M\) to get \(Y\).
These distributions are parameterized using the global sensitivity \(GS_{f}\) of the query and the desired \(\epsilon, \delta\) privacy parameters (which can differ between the various distributions). It is also common practice to define \(\alpha\) in terms of the desired probability bound \(\beta\), e.g., including a \(\ln(\frac{1}{\beta})\) term in the expression for \(\alpha\). To summarize, \(\alpha\) is a function of \(GS_{f}, \epsilon_{i}\) and \(\delta_{i}\) for \(i\) probability distributions, and \(\beta\).
If \(Y\) is obtained using more than one distribution, we can apply an additive bound (corresponding to the probability distribution used in the mechanism) to evaluate the exact value of \(\alpha\). The example below shows how to use the additive bound from Chan et al. to define an \(\alpha\) that depends on multiple Laplace random variables.
Once we have successfully analyzed the error from one query, extending the analysis to \(k\) offline queries is straightforward. Suppose we want to analyze the error of \(k\) queries. We can change the \(\beta\) value of each query to be \(\beta' = \frac{\beta}{k}\). Applying a union bound over the \((\alpha, \beta')\)-probabilities of all the queries will provide us with an overall guarantee of \(\beta\).
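The union-bound argument can be sanity-checked numerically. The sketch below (assuming `numpy`, with toy values for \(k\), \(\epsilon\), and \(\beta\)) budgets \(\beta' = \frac{\beta}{k}\) per query and measures how often the maximum error over \(k\) independently noised queries exceeds the per-query threshold:

```python
import numpy as np

# Each of k queries gets independent Lap(1/epsilon) noise; we budget
# beta' = beta / k per query and union-bound over all k queries.
rng = np.random.default_rng(1)
k, epsilon, beta = 10, 1.0, 0.05
beta_prime = beta / k

# Per-query threshold: Pr[|Lap(b)| >= b * ln(1/beta')] = beta'.
b = 1.0 / epsilon
alpha = b * np.log(1.0 / beta_prime)

# Simulate the maximum error over k queries, many times.
trials = 100_000
errors = np.abs(rng.laplace(scale=b, size=(trials, k)))
failure_rate = np.mean(errors.max(axis=1) >= alpha)
print(f"per-query beta': {beta_prime}, overall failure rate: {failure_rate:.4f}")
```

The true failure probability here is \(1 - (1 - \beta')^{k} \approx 0.049\), just under the overall budget \(\beta = 0.05\), which is what the union bound guarantees.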
Example
Suppose we have a counting query \(f : U^{n} \rightarrow \mathbb{N}\) that counts the number of occurrences of a certain item in a database. For the sake of this example, assume we want to separately apply \(f\) to disjoint partitions \(D_{1}, D_{2}\) of a database \(D\) and release the combined count, i.e., we want to release \(f(D) = f(D_{1}) + f(D_{2})\).
To release the count with \(\epsilon\)-differential privacy, we can use the Laplace mechanism with scale parameter \(b = \frac{GS_{f}}{\epsilon}\) to add independent noise to each partition’s count. Note that the global sensitivity of the counting query is equal to 1, since changing a record in the database can change the count by 1 in the worst case. As the two partitions are disjoint, we can apply parallel composition to still get an overall privacy guarantee of \(\epsilon\). The private count is thus computed as \(M_{f}(D) = M_{f}(D_{1}) + M_{f}(D_{2})\), where each \(M_{f}(\cdot) = f(\cdot) + Lap(\frac{1}{\epsilon})\).
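As an illustration, here is a minimal sketch of this two-partition release (assuming `numpy`; the databases and target item are toy values introduced for the example):

```python
import numpy as np

rng = np.random.default_rng(2)

def private_count(partition, target, epsilon, rng):
    """Release the count of `target` in `partition` with epsilon-DP.

    The global sensitivity of a counting query is 1, so adding
    Lap(1/epsilon) noise suffices.
    """
    true_count = sum(1 for x in partition if x == target)
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Disjoint partitions D1, D2 of a toy database; by parallel composition
# the combined release is still epsilon-DP.
D1 = ["a", "b", "a", "c"]
D2 = ["a", "c", "c"]
epsilon = 0.5

released = private_count(D1, "a", epsilon, rng) + private_count(D2, "a", epsilon, rng)
print(f"true count: 3, released count: {released:.2f}")
```

Each call to `private_count` adds its own independent Laplace draw, so the released sum carries noise from two \(Lap(\frac{1}{\epsilon})\) variables, matching the setup analyzed below.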
Now that we have described the setting, let us apply the strategy described above to analyze the utility of the mechanism \(M\) on the database \(D\). We can see that \(Y\) is the sum of independent noise drawn from two Laplace distributions.
Using the additive bound for the Laplace distribution (refer to Corollary 2.9 of the Chan et al. paper for the proof), we can define \(\alpha\) and subsequently write the \((\alpha, \beta)\)-accuracy metric for our mechanism as follows: $$ \Pr\left[Y \geq \sqrt{\sum_{i} b_{i}^{2}}\ln\left(\frac{1}{\beta}\right)\right] \leq \beta $$ Here, \(b_{i}\) denotes the scale parameter of the individual Laplace distributions used in the mechanism. In our case, each \(b_{i} = \frac{1}{\epsilon}\). Substituting to get the final value of \(\alpha\): $$ \Pr\left[Y \geq \frac{\sqrt{2}}{\epsilon}\ln\left(\frac{1}{\beta}\right)\right] \leq \beta $$
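The final bound can also be sanity-checked by simulation. The sketch below (assuming `numpy`, with toy values \(\epsilon = 1\) and \(\beta = 0.05\)) draws \(Y\) as the sum of two independent \(Lap(\frac{1}{\epsilon})\) samples and measures how often \(|Y|\) reaches \(\alpha = \frac{\sqrt{2}}{\epsilon}\ln(\frac{1}{\beta})\):

```python
import numpy as np

# Monte Carlo check: Y is the sum of two independent Lap(1/epsilon)
# draws; we measure Pr[|Y| >= (sqrt(2)/epsilon) * ln(1/beta)].
rng = np.random.default_rng(3)
epsilon, beta = 1.0, 0.05
alpha = (np.sqrt(2) / epsilon) * np.log(1.0 / beta)

trials = 200_000
noise = rng.laplace(scale=1.0 / epsilon, size=(trials, 2)).sum(axis=1)
exceed_rate = np.mean(np.abs(noise) >= alpha)
print(f"alpha = {alpha:.3f}, empirical exceedance: {exceed_rate:.4f} (beta = {beta})")
```

For these parameters the empirical exceedance comes out below \(\beta\), consistent with the bound.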
Future updates to this post will include more query types, e.g., selection queries.
References
Vadhan, S. (2017). The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, 347-450.
Chan, T. H. H., Shi, E., & Song, D. (2011). Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3), 1-24.