Differential Privacy Definitions for Streaming Data
Introduction
Standard notions of differential privacy protect data privacy in a static setting, i.e., records are not inserted, deleted, or changed in the underlying database. However, real-world databases are often dynamic in nature. Applying differentially private mechanisms to data that can be updated over time requires modeling the underlying database differently. Instead of viewing a database as a fixed collection of items, a dynamic database models data as a stream of items arriving one after the other at each discrete time step.
Each item in the data stream can be attributed to a particular user. A single user can contribute multiple items to the data stream, and these items need not be unique. Differential privacy for streaming data has two definitions, event-level differential privacy and user-level differential privacy, depending on what constitutes an individual entity whose privacy must be protected. In this post, we explore the two definitions in detail, along with some nuances arising from their differences.
We start by formally defining a data stream. Consider the stream \(D = \{x_{1}, x_{2},\dots, x_n\}\), where an item \(x_{i} \in R \cup \{\bot\}\) arrives at time step \(i\). Here, \(R\) is the domain of items and \(\bot\) denotes an empty item, used when no record is inserted into the stream at time step \(i\). If each item is assigned to a user, we modify the stream to be \(D = \{(x_{1}, u_{1}), (x_{2}, u_{2}),\dots, (x_{n}, u_{n})\}\), where \(u_{i} \in U\) refers to the user who contributes the item \(x_i\) at time step \(i\). We denote the stream up to time step \(t\) as \(D_t\). Given a query function \(F(\cdot)\) that operates on a data stream, a differentially private mechanism \(M\) must release \(\tilde{F}(D_t)\) at each required time step \(t\) with minimal performance degradation while protecting privacy.
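As a running representation, here is a minimal Python sketch of this data model. It is our own illustration, not taken from the referenced papers; we encode \(\bot\) as None and preview the click stream used in the example below.

```python
# A stream over n time steps: at step i, an item x_i from the domain R
# (or None, standing in for the empty item ⊥) arrives with its user u_i.
D = [
    ("Link 1", "Visitor A"),
    ("Link 2", "Visitor B"),
    ("Link 2", "Visitor A"),
    ("Link 4", "Visitor C"),
]

# The (non-private) query F: number of non-empty items seen so far.
# The stream up to time step t, D_t, is simply the prefix D[:t].
def count_items(prefix):
    return sum(1 for item, _user in prefix if item is not None)

print(count_items(D[:2]))  # F(D_2) = 2
```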
Event-Level Differential Privacy
In event-level differential privacy (event-DP), the individual entity that needs to be protected is an item \(x_{i}\) in the data stream \(D\). Neighboring databases \(D, D'\) under event-DP hence differ in one item. An example would be to obtain \(D'\) by removing \(x_{i}\) from \(D\). The differentially private mechanism \(M\) needs to release almost indistinguishable outputs whether it is operating on \(D\) or \(D'\), i.e., whether or not the item \(x_i\) was part of the data stream. The level of indistinguishability is specified by the privacy parameter \(\epsilon\).
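For reference, \(M\) satisfies \(\epsilon\)-differential privacy if, for all sets of outputs \(S\) and all neighboring streams \(D, D'\),
\[
\Pr[M(D) \in S] \;\leq\; e^{\epsilon} \cdot \Pr[M(D') \in S].
\]
This is the standard definition (Dwork & Roth, 2014), instantiated with the streaming notion of neighbors; event-DP and user-DP differ only in which pairs \(D, D'\) count as neighbors.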
User-Level Differential Privacy
In user-level differential privacy (user-DP), the individual entity that needs to be protected is a user \(u\), who can contribute multiple items to the data stream. Neighboring databases \(D, D'\) under user-DP hence differ in one user and not one item. An example would be to obtain \(D'\) by removing all \(k\) items \(\{(x_{i_{1}}, u), \dots, (x_{i_{k}}, u)\}\) associated with the user \(u\) from \(D\). The differentially private mechanism \(M\) needs to release almost indistinguishable outputs whether it is operating on \(D\) or \(D'\), i.e., whether or not the user \(u\) was part of the data stream. The level of indistinguishability is once again specified by the privacy parameter \(\epsilon\).
Example: Counting Number of Clicks
Let us make the two definitions concrete with the help of an example. We want to count the number of links clicked by visitors on our website. Assume the domain of items \(R\) is the set of links \(\{\text{Link 1}, \text{Link 2}, \text{Link 3}, \text{Link 4}, \text{Link 5}\}\) and the domain of users \(U\) is the set of visitors \(\{\text{Visitor A}, \text{Visitor B}, \text{Visitor C}\}\). Suppose we record the data stream of clicks \(D = \{(\text{Link 1}, \text{Visitor A}), (\text{Link 2}, \text{Visitor B}), (\text{Link 2}, \text{Visitor A}), \allowbreak (\text{Link 4}, \text{Visitor C})\}\) over 4 time steps. We apply a differentially private mechanism \(M\) to release the number of clicks.
Under event-DP, a neighboring database \(D'\) can be obtained by removing an item from \(D\). For example, \(D' = \{(\text{Link 1}, \text{Visitor A}), (\bot, \bot), (\text{Link 2}, \text{Visitor A}), (\text{Link 4}, \allowbreak \text{Visitor C})\}\) is a neighboring database because it does not contain the item \((\text{Link 2}, \text{Visitor B})\). An adversary observing the results of \(M\) on either \(D, D'\) will not be able to determine (with probability characterized by \(\epsilon\)) whether \(\text{Link 2}\) was clicked at time step 2. Note that for event-DP we do not need to consider user attributes in the data stream. We can simply write \(D = \{\text{Link 1}, \text{Link 2}, \text{Link 2}, \text{Link 4}\}\) and \(D' = \{\text{Link 1}, \text{Link 2}, \allowbreak \text{Link 4}\}\). Additionally, we are protecting each individual click event, not the link itself: the same link can be clicked multiple times over the data stream.
Under user-DP, a neighboring database \(D'\) can be obtained by removing all items corresponding to a user from \(D\). For example, \(D' = \{(\bot, \bot), (\text{Link 2}, \allowbreak \text{Visitor B}), (\bot, \bot), (\text{Link 4}, \text{Visitor C})\}\) is a neighboring database because it does not contain the 2 items contributed by \(\text{Visitor A}\). An adversary observing the results of \(M\) on either \(D, D'\) will not be able to determine (with probability characterized by \(\epsilon\)) whether \(\text{Visitor A}\) was part of the data stream or not.
This example also makes it clear that event-DP is the special case of user-DP in which every user contributes exactly one item. User-DP is a stronger notion of privacy because it protects users who contribute multiple items to the data stream. A user-DP data stream can be converted to an event-DP data stream by removing user attributes as described above. For notational simplicity, we instead convert a user-DP data stream by setting \(u_{i} = i\), i.e., \(u_i\) is set to the value of the current time step. The converted event-DP data streams for the example above are \(D = \{(\text{Link 1}, 1), (\text{Link 2}, 2), (\text{Link 2}, 3), (\text{Link 4}, 4)\}\) and \(D' = \{(\bot, 1), (\text{Link 2}, 2), (\bot, \allowbreak 3), (\text{Link 4}, 4)\}\).
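A hedged sketch of this conversion in Python (the function name is ours):

```python
# Replace each user attribute with the time step itself (u_i = i),
# so every "user" in the converted stream contributes exactly one item.
def to_event_stream(stream):
    return [(item, i) for i, (item, _user) in enumerate(stream, start=1)]

D = [("Link 1", "Visitor A"), ("Link 2", "Visitor B"),
     ("Link 2", "Visitor A"), ("Link 4", "Visitor C")]
print(to_event_stream(D))
# [('Link 1', 1), ('Link 2', 2), ('Link 2', 3), ('Link 4', 4)]
```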
Global Sensitivity and Database Distance
Global sensitivity is defined for the query function \(F\) that operates on a data stream \(D\): it is the worst-case magnitude by which \(F(\cdot)\) changes when \(D\) is replaced by its neighbor \(D'\), over all possible pairs of neighboring databases \(D, D'\). Taking the example of counting the number of clicks, we shall compute the global sensitivity \(GS\) of the count function \(F\) under both the event-DP and user-DP definitions and compare them. As neighboring databases differ in one item under event-DP, \(GS_{F, E} = 1\): the number of clicks can change by at most 1. On the other hand, under user-DP, neighboring databases can differ in \(k\) items (as a user differs, not an item). Since global sensitivity is a worst-case metric, \(k\) can be thought of as the maximum number of contributions a user can make to the data stream. This implies \(GS_{F, U} = k\), i.e., the number of clicks can now change by at most \(k\). Unbounded global sensitivity under user-DP is a challenge for computing the privacy and utility guarantees of differentially private mechanisms; most proposed mechanisms therefore require \(GS_{F, U}\) to be bounded.
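Formally, writing \(D \sim D'\) for the neighboring relation under the chosen definition (event or user) and taking \(F\) to be real-valued as in our counting example,
\[
GS_{F} \;=\; \max_{D \sim D'} \left| F(D) - F(D') \right|,
\]
which is what the subscripts \(E\) and \(U\) above instantiate.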
We can also relate the database distances for data streams under the two definitions. This is especially useful when event-DP mechanisms are extended so they can be applied to user-DP situations. A common extension is to combine an event-DP mechanism with the group privacy property of differential privacy, when \(GS_{F, U}\) is bounded. Knowing the database distance relationship tells us what to divide the desired privacy guarantee \(\epsilon\) by; the event-DP mechanism is then run with this smaller privacy parameter. After applying group privacy, the resultant user-DP mechanism still satisfies \(\epsilon\)-differential privacy.
Denote the shortest distance between \(D, D'\) under event-DP by \(d_{E}(D, D')\). Similarly, we use \(d_{U}(D, D')\) for the distance between the databases under user-DP. For neighboring databases under event-DP, \(d_{E}(D, D') = 1\), as we calculate the distance in item space. Note that \(d_{U}(D, D') = 1\) for neighboring databases under user-DP as well, since there we calculate the distance in user space, not item space. To relate the two database distances, we need to fix \(D, D'\) and compute both distances by converting the user-DP data streams into event-DP ones. Using the count function \(F\) as an example again, we will see that \(d_{E}(D, D') \leq GS_{F, U} \cdot d_{U}(D, D')\).
Take the user-DP neighboring databases \(D = \{(\text{Link 1}, \text{Visitor A}), (\text{Link 2}, \allowbreak \text{Visitor B}), (\text{Link 2}, \text{Visitor A}), (\text{Link 4}, \text{Visitor C})\}\) and \(D' = \{(\bot, \bot), (\text{Link 2}, \allowbreak \text{Visitor B}), (\bot, \bot), (\text{Link 4}, \text{Visitor C})\}\). As described above, the converted event-DP data streams are \(D = \{(\text{Link 1}, 1), (\text{Link 2}, 2), (\text{Link 2}, 3), (\text{Link 4}, 4)\}\) and \(D' = \{(\bot, 1), (\text{Link 2}, 2), (\bot, 3), (\text{Link 4}, 4)\}\). Observe that \(GS_{F, U} = 2\) (Visitor A makes the maximum number of contributions, 2) and \(d_{U}(D, D') = 1\), while \(d_{E}(D, D') = 2\) since we compute that distance in item space. The event-DP mechanism will hence use the privacy parameter \(\frac{\epsilon}{2}\); with group privacy in place, the user-DP mechanism achieves a privacy guarantee of \(2 \cdot \frac{\epsilon}{2} = \epsilon\).
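Here is a minimal sketch of this reduction for the count query, assuming a known bound \(k\) on per-user contributions (the function name is ours; the Laplace scale follows from the event-level sensitivity of 1):

```python
import numpy as np

def user_dp_count(stream, epsilon, k):
    """Release a click count with user-level epsilon-DP, assuming each
    user contributes at most k items. We run the event-DP Laplace
    mechanism at budget epsilon / k; group privacy over the <= k
    differing items then yields k * (epsilon / k) = epsilon at the
    user level."""
    true_count = sum(1 for item, _user in stream if item is not None)
    event_epsilon = epsilon / k
    # Event-level global sensitivity of the count is 1.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / event_epsilon)

D = [("Link 1", "Visitor A"), ("Link 2", "Visitor B"),
     ("Link 2", "Visitor A"), ("Link 4", "Visitor C")]
print(user_dp_count(D, epsilon=1.0, k=2))
```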
Parallel Composition
Parallel composition is a property of differential privacy that event-DP mechanisms use to achieve better privacy guarantees. Suppose we have two differentially private mechanisms \(M_{1}, M_{2}\) that satisfy \(\epsilon_{1}\)- and \(\epsilon_{2}\)-differential privacy, respectively. If we take disjoint subsets \(D_{1}\) and \(D_{2}\) of the data domain, then parallel composition states that releasing \(M_{1}(D \cap D_{1})\) and \(M_{2}(D \cap D_{2})\) results in a privacy guarantee of \(\max(\epsilon_{1}, \epsilon_{2})\) instead of \(\epsilon_{1} + \epsilon_{2}\).
Event-DP mechanisms create disjoint partitions over the data stream by partitioning over time. Note that partitions need to be disjoint over the individual entities whose privacy is being protected. Since users can contribute multiple items to a stream, i.e., at different time steps, partitioning a user-DP data stream over time will not result in disjoint partitions. Taking the example of counting the number of clicks, we will see why this is the case.
The event-DP version of our data stream was \(D = \{(\text{Link 1}, 1), (\text{Link 2}, \allowbreak 2), (\text{Link 2}, 3), (\text{Link 4}, 4)\}\). We can partition this stream by placing time step \(1\) in partition \(D_1\) and time steps \(2, 3, 4\) in partition \(D_2\). These partitions are disjoint in item space because the intersection of \(D_{1}\) and \(D_2\) is empty. Now, consider \(D = \{(\text{Link 1}, \text{Visitor A}), (\text{Link 2}, \text{Visitor B}), (\text{Link 2}, \text{Visitor A}), (\text{Link 4}, \text{Visitor C})\}\) as our user-DP data stream. Partitioning over time as before, but this time looking at user space, we see that \(\text{Visitor A}\) appears in both \(D_{1}\) and \(D_2\).
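The following sketch (our own construction) makes the disjointness check concrete: partitioning by time is disjoint over click events, but not over users.

```python
D = [("Link 1", "Visitor A"), ("Link 2", "Visitor B"),
     ("Link 2", "Visitor A"), ("Link 4", "Visitor C")]

# Partition by time: step 1 in D1, steps 2-4 in D2.
D1, D2 = D[:1], D[1:]

# Item space: each (time step, item) event lands in exactly one partition.
events1 = {(t, item) for t, (item, _u) in enumerate(D1, start=1)}
events2 = {(t, item) for t, (item, _u) in enumerate(D2, start=2)}
print(events1 & events2)  # set() -- disjoint

# User space: Visitor A clicked at time steps 1 and 3, so the same
# protected entity appears on both sides of the partition.
users1 = {user for _item, user in D1}
users2 = {user for _item, user in D2}
print(users1 & users2)  # {'Visitor A'}
```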
Notions of Optimality
As described above, the global sensitivity \(GS_{F,U}\) of the count function \(F\) under user-DP equals the maximum number of contributions that can be made by a user. Calibrating noise to this worst case can severely impact the utility of differentially private mechanisms, since the sensitivity can be on the order of the size of the database. To see why, consider a new pair of neighboring databases under user-DP for counting the number of clicks. Suppose \(D_{i}\) contains a list of links solely contributed by one user \(u\). Its neighboring database is \(D_{i}' = \emptyset\), because all links are removed when the one user (\(u\)) is removed. The sensitivity in this case is exactly the size of the database, \(|D_{i}|\). Adding Laplace noise with scale \(\frac{GS}{\epsilon}\) in the differentially private mechanism \(M\) would severely degrade performance, especially on databases that are far from this worst case.
Consequently, current user-DP mechanisms use different notions of sensitivity and/or optimality to lessen performance degradation: local sensitivity and instance optimality. Like global sensitivity, local sensitivity refers to the magnitude by which \(F(\cdot)\) changes when a database is replaced by its neighbor, but this metric is defined for a fixed database \(D_{i}\) and its neighbors (not over all pairs). Instance optimality adds noise that depends on the specific database whose privacy needs to be protected. Worst-case optimality measures the expected performance of a fixed differentially private mechanism \(M\) over the set of all databases. Instance optimality, on the other hand, measures the expected performance on a fixed database \(D_{i}\) over a family of differentially private mechanisms.
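Formally, for a fixed database \(D_{i}\),
\[
LS_{F}(D_{i}) \;=\; \max_{D' \sim D_{i}} \left| F(D_{i}) - F(D') \right|,
\]
so that \(GS_{F} = \max_{D_{i}} LS_{F}(D_{i})\).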
Observe that we can always find a mechanism that has perfect utility on the specific database \(D_{i}\): for example, the constant mechanism \(M(\cdot) \equiv F(D_{i})\). This mechanism can have high error on other instances \(\bar{D}\) where \(F(\bar{D}) \neq F(D_{i})\). Relaxed notions of neighborhood optimality mitigate this problem. Instead of finding a mechanism that performs well only on \(D_{i}\), we try to find one that performs well over all databases in the neighborhood of \(D_{i}\), i.e., all databases \(D_{i}'\) whose distance \(d(D_{i}, D_{i}')\) from \(D_{i}\) is at most some constant \(\rho\).
However, neighborhood optimality alone is not enough for user-DP mechanisms. Neighboring databases in user space can still induce sensitivity on the order of the database size: to get \(D_{i}'\), we can add a user who contributes \(O(|D_{i}|)\) items. Hence, we instead consider down neighbors of \(D_{i}\), i.e., databases that are subsets of \(D_{i}\). Note that it is still possible for down neighbors to differ in \(O(|D_{i}|)\) items: if a user with \(O(|D_{i}|)\) items exists in the database, we can remove this user to get a valid down neighbor. But if there is no such user, down neighborhood optimality allows a differentially private mechanism to not add more noise than required for \(D_{i}\). Remember that this notion of down neighbors is only for utility, not privacy. User-DP mechanisms still need to produce nearly indistinguishable outputs on any neighboring \(D, D'\); instance optimality simply allows a mechanism to adaptively add noise based on the database being protected, minimizing performance degradation.
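To make the idea concrete, here is an illustrative Python sketch (our own construction, not an algorithm from the referenced papers) of the count query's sensitivity over down neighbors obtained by deleting at most \(\rho\) users: the count can only drop by the contributions of the users actually removed, so the worst case removes the heaviest contributors present in \(D_{i}\).

```python
from collections import Counter

def down_sensitivity_count(stream, rho=1):
    """Max change in the click count over down neighbors of `stream`
    obtained by deleting at most `rho` users (and all their items)."""
    contributions = Counter(user for item, user in stream if item is not None)
    # Deleting a user removes exactly their contribution count, so the
    # worst case deletes the rho heaviest contributors actually present.
    heaviest = sorted(contributions.values(), reverse=True)[:rho]
    return sum(heaviest)

D = [("Link 1", "Visitor A"), ("Link 2", "Visitor B"),
     ("Link 2", "Visitor A"), ("Link 4", "Visitor C")]
print(down_sensitivity_count(D))  # 2: Visitor A's two clicks
```

If \(D_{i}\) contains no heavy contributor, this quantity stays small even though the worst-case \(GS_{F,U}\) over all databases remains large, which is exactly the gap down neighborhood optimality exploits.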
References
Dwork, C., Naor, M., Pitassi, T., & Rothblum, G. N. (2010, June). Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing (pp. 715-724).
Dong, W., Luo, Q., & Yi, K. (2023). Continual observation under user-level differential privacy. In 2023 IEEE Symposium on Security and Privacy (SP).
McSherry, F. D. (2009, June). Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (pp. 19-30).
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211-407.
Fang, J., Dong, W., & Yi, K. (2022, November). Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (pp. 1009-1022).
Asi, H., & Duchi, J. C. (2020). Near instance-optimality in differential privacy. arXiv preprint arXiv:2005.10630.