Differential privacy/Docs/Infrastructure and framework decision-making process
Introduction and background
The Wikimedia Foundation (WMF) has been researching how we might deploy differential privacy (DP) since the beginning of 2021. We are the stewards of mountains (hundreds of terabytes) of private data, data that could provide valuable insights for researchers, editors, and other end users if safely released. The effort to deploy some form of DP on the platform is an attempt to balance user safety, strict privacy and data retention policies, and open data access guidelines.
Ultimately, if we succeed in providing tools for the widespread use of differential privacy within WMF, we will be able to stop releasing aggregated, k-anonymous data (which offers no meaningful measure of privacy and is potentially vulnerable to re-identification attacks) and start releasing future-proof DP data, with measurable and accountable metrics of how much privacy is actually lost in any given data release.
Over the last nine months, we have successfully built proof-of-concept prototypes, and we are now on the brink of several major decisions about scaling. Given that we want to be able to compute on data at a very large scale:
- Among the several systems that currently exist for large-scale, high-performance computing within WMF, which infrastructure and environment should we choose and why?
- Among the most developed differential privacy libraries/components, how should we decide on a language/computing framework?
- What are the technical and organizational constraints on those decision spaces?
Answering these questions will let us meaningfully direct our efforts toward a viable MVP data product. Rather than floundering in indecision and waiting for someone else to make DP work automatically on our platforms, we will be an early adopter of a modern private statistics system.
Infrastructures
Apache Spark + YARN on the analytics cluster
Pros | Cons |
---|---|
Documentation here
Liftwing + Kubernetes
Pros | Cons |
---|---|
Documentation here
Kubernetes on main cluster
Pros | Cons |
---|---|
Documentation here
Final judgement
After researching these systems and discussing them with members of Data Engineering, Analytics, SRE, and ML, we are coalescing around Spark + YARN on the analytics cluster as our infrastructure. Although the system is older and clunkier than a containerized Kubernetes cluster, it will meet our computing needs. The technical and political hurdles to using Liftwing or the main cluster as a compute resource are too great, and we already have a viable option.
Important things to note with this configuration are (1) ensuring that Spark worker nodes do not draw identical or correlated noise, and (2) returning completed computations rather than lazy computation graphs; a minimal sketch of both points follows.
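The sketch below illustrates both cautions with plain PySpark and NumPy; it is not our production code. The column names, noise scale, and seeding scheme are assumptions for the example; a real deployment would seed each partition from a cryptographically secure, independent source.

```python
# Minimal sketch of the two cautions above, assuming PySpark + NumPy.
# Column names, the noise scale, and the seeding scheme are illustrative only;
# production code would use cryptographically secure, independent seeds.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dp-noise-sketch").getOrCreate()

counts = spark.createDataFrame(
    [("FR", "frwiki", "Paris", 1200), ("DE", "dewiki", "Berlin", 950)],
    ["country", "project", "page", "views"],
)

def add_laplace_noise(partition_index, rows):
    # (1) Seed each partition independently, so that worker nodes never draw
    # identical or correlated noise. A generator seeded once on the driver
    # and shipped inside the closure would repeat the same draws everywhere.
    rng = np.random.default_rng(seed=partition_index)  # illustrative seeding
    for row in rows:
        noisy = row["views"] + rng.laplace(0.0, 10.0)  # scale chosen arbitrarily
        yield (row["country"], row["project"], row["page"], float(noisy))

noisy_counts = counts.rdd.mapPartitionsWithIndex(add_laplace_noise)

# (2) Return completed computations, not computation graphs: collect() forces
# the noise to be drawn exactly once and hands back materialized values,
# rather than a lazy plan that could be re-executed (and re-noised) later.
results = noisy_counts.collect()
```

Point (2) matters because each re-evaluation of a lazy plan draws fresh noise, so handing out the graph could release several noisy copies of the same statistic and quietly spend extra privacy budget.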
Frameworks
Flying solo: Developing something entirely from scratch
Pros | Cons |
---|---|
Partial components: OpenDP + Python
Pros | Cons |
---|---|
Documentation here
Partial components: PyDP + Python
Pros | Cons |
---|---|
Documentation here
Partial components: DP Accounting + Python
Pros | Cons |
---|---|
Documentation here
Partial components: Other Google DP library (likely Java)
Pros | Cons |
---|---|
Documentation here
Full pipelines: PipelineDP + Python
Pros | Cons |
---|---|
Documentation here
Full pipelines: Privacy on Beam + Golang
Pros | Cons |
---|---|
Documentation here
Full pipelines: Tumult Analytics
Pros | Cons |
---|---|
Documentation here
Final judgement
The likeliest outcome is a framework built from PyDP and DP Accounting that provides WMF with a custom solution assembled from several predefined building blocks (a library-agnostic sketch of this approach appears below). Initially, this solution might only apply to a couple of large-scale contexts (e.g. country-language-page-view tuples, editor/edit counts by country and language, finance data, etc.), with OpenDP's library handling smaller research datasets that do not require distributed computing.
After 6-12 months, it might be time to look at other open-source data pipeline frameworks, see where OpenDP and PipelineDP stand, and reconsider whether we want to use their products and/or contribute to their communities.
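To make the building-block idea concrete without committing to any particular library's API, here is a minimal, library-agnostic sketch of the two pieces such a framework needs: a Laplace-mechanism release primitive and a simple privacy accountant. In practice PyDP, DP Accounting, and (per the updates below) Tumult's engine would fill these roles; every name in this sketch is ours, not any library's.

```python
# Library-agnostic sketch of two DP building blocks: a noisy-count release
# and a basic privacy accountant. Function and class names are illustrative
# stand-ins, not the API of PyDP, DP Accounting, or Tumult Analytics.
import numpy as np


def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count via the Laplace mechanism (scale = sensitivity / epsilon)."""
    return true_count + np.random.laplace(0.0, sensitivity / epsilon)


class BasicAccountant:
    """Tracks total epsilon spent under basic (sequential) composition."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon


# Example: two releases at epsilon = 0.5 each, against a total budget of 1.5.
accountant = BasicAccountant(budget=1.5)
for true_count in (1200, 950):
    accountant.spend(0.5)
    print(laplace_count(true_count, epsilon=0.5))
print(f"epsilon spent so far: {accountant.spent}")
```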
Update: After posting the memo publicly, we were contacted by Tumult Labs about potentially using their DP engine. Discussions about the feasibility of using their engine are ongoing.
Update as of 2022: We decided to use Tumult Labs' differential privacy engine. It has been running successfully on WMF servers since March 2022 and was open-sourced in July 2022.
Constraints
In conversations with members of the ML/SRE team working on Liftwing, we were told that deploying DP on their infrastructure would not be possible.
While attending the OpenDP 2021 Community Meeting, we learned that OpenDP would not work on a distributed computing platform.
Minimum viable product (MVP) and timeline
Our MVP is a daily histogram made up of country-language-page-view tuples computed from the raw web requests of the day prior (~1.8 TB/day). The exact form of this daily data product is every tuple of the form (country, project, page, number of pageviews) whose pageview count is greater than a release threshold of 89 pageviews.
We will compute this daily on the analytics cluster, using Tumult Labs' analytics package to securely conduct differentially private aggregations and manage privacy budgets; a simplified sketch of the computation's shape appears below. In addition to these building blocks, we are also building a general-purpose logging apparatus that takes in information about differentially private data releases so that we can (qualitatively) make judgements about data releases in the aggregate.
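For orientation only, the sketch below shows the shape of that daily computation in plain PySpark and NumPy: aggregate raw requests into (country, project, page) counts, add Laplace noise, and release only the tuples that clear the threshold. It is not Tumult Analytics code; the column names, epsilon, sensitivity bound, and toy input are all assumptions standing in for the real webrequest pipeline.

```python
# Illustrative sketch only: the shape of the MVP aggregation in plain
# PySpark + NumPy. The production job uses Tumult Labs' analytics package;
# nothing below is that package's API. Column names, EPSILON, the sensitivity
# bound, and the toy input all stand in for the real webrequest data.
import numpy as np
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dp-pageview-mvp-sketch").getOrCreate()

webrequest = spark.createDataFrame(
    [("FR", "fr.wikipedia", "Paris"),
     ("FR", "fr.wikipedia", "Paris"),
     ("DE", "de.wikipedia", "Berlin")],
    ["country", "project", "page"],
)

EPSILON = 1.0            # per-release privacy budget (assumed value)
SENSITIVITY = 1.0        # assumes per-user contributions are bounded upstream
RELEASE_THRESHOLD = 89   # suppress tuples with <= 89 noisy pageviews

# 1. Aggregate raw requests into (country, project, page, pageviews) tuples.
counts = (webrequest
          .groupBy("country", "project", "page")
          .agg(F.count("*").alias("pageviews")))

# 2. Add Laplace noise to each count (scale = sensitivity / epsilon).
@F.udf("double")
def add_laplace(pageviews):
    return float(pageviews + np.random.laplace(0.0, SENSITIVITY / EPSILON))

noisy = counts.withColumn("noisy_pageviews", add_laplace(F.col("pageviews")))

# 3. Release only the tuples whose noisy count clears the threshold.
release = (noisy
           .filter(F.col("noisy_pageviews") > RELEASE_THRESHOLD)
           .select("country", "project", "page", "noisy_pageviews"))
```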
Our timeline for reaching this MVP is August/September 2022.