Differential privacy/Docs/Infrastructure and framework decision-making process
Introduction and background
The Wikimedia Foundation (WMF) has been researching how we might deploy differential privacy (DP) since the beginning of 2021. We are the stewards of mountains (hundreds of terabytes) of private data, data that could provide valuable insights for researchers, editors, and other end users if safely released. The effort to deploy some form of DP on the platform is an attempt to balance user safety, strict privacy and data retention policies, and open data access guidelines.
Ultimately, if we succeed in providing tools for the widespread use of differential privacy within WMF, we will be able to stop releasing aggregated, k-anonymous data (which offers no meaningful measure of privacy and is potentially vulnerable to re-identification attacks) and start releasing future-proof DP data, with measurable and accountable metrics of how much privacy is actually lost in any given data release.
Over the last nine months, we have successfully built proof-of-concept prototypes, and we are now on the brink of several major decisions about scaling. Given that we want to be able to compute on data at a very large scale:
- Among the several systems that currently exist for large-scale, high-performance computing within WMF, which infrastructure and environment should we choose and why?
- Among the most developed differential privacy libraries/components, how should we decide on a language/computing framework?
- What are the technical and organizational constraints on those decision spaces?
Answering these questions will let us meaningfully direct our efforts toward a viable MVP data product. Rather than floundering in indecision and waiting for someone else to make DP work automatically on our platforms, we will be an early adopter of a modern private statistics system.
Infrastructures
Apache Spark + YARN on the analytics cluster
Pros | Cons |
---|---|
Documentation here
Liftwing + Kubernetes
Pros | Cons |
---|---|
Documentation here
Kubernetes on main cluster
Pros | Cons |
---|---|
Documentation here
Final judgement
After researching these systems and discussing them with members of Data Engineering, Analytics, SRE, and ML, we are coalescing around Spark + YARN on the analytics cluster as our infrastructure. Although the system is older and clunkier than a containerized Kubernetes cluster, it will meet our computing needs. The technical and political hurdles to using Liftwing or the main cluster as a compute resource are too great, and we already have a viable option.
Important things to note with this configuration are (1) ensuring that Spark worker nodes do not draw identical or correlated noise, and (2) returning completed computations rather than lazy computation graphs; a minimal sketch of both points follows.
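The sketch below illustrates both cautions with plain PySpark and NumPy; it is not our production code. The column names, noise scale, and seeding scheme are assumptions for the example; a real deployment would seed each partition from a cryptographically secure, independent source.

```python
# Minimal sketch of the two cautions above, assuming PySpark + NumPy.
# Column names, the noise scale, and the seeding scheme are illustrative only;
# production code would use cryptographically secure, independent seeds.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dp-noise-sketch").getOrCreate()

counts = spark.createDataFrame(
    [("FR", "frwiki", "Paris", 1200), ("DE", "dewiki", "Berlin", 950)],
    ["country", "project", "page", "views"],
)

def add_laplace_noise(partition_index, rows):
    # (1) Seed each partition independently, so that worker nodes never draw
    # identical or correlated noise. A generator seeded once on the driver
    # and shipped inside the closure would repeat the same draws everywhere.
    rng = np.random.default_rng(seed=partition_index)  # illustrative seeding
    for row in rows:
        noisy = row["views"] + rng.laplace(0.0, 10.0)  # scale chosen arbitrarily
        yield (row["country"], row["project"], row["page"], float(noisy))

noisy_counts = counts.rdd.mapPartitionsWithIndex(add_laplace_noise)

# (2) Return completed computations, not computation graphs: collect() forces
# the noise to be drawn exactly once and hands back materialized values,
# rather than a lazy plan that could be re-executed (and re-noised) later.
results = noisy_counts.collect()
```

Point (2) matters because each re-evaluation of a lazy plan draws fresh noise, so handing out the graph could release several noisy copies of the same statistic and quietly spend extra privacy budget.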
Frameworks
Flying solo: Developing something entirely from scratch
Pros | Cons |
---|---|
Partial components: OpenDP + Python
Pros | Cons |
---|---|
Documentation here
Partial components: PyDP + Python
Pros | Cons |
---|---|
Documentation here
Partial components: DP Accounting + Python
Pros | Cons |
---|---|
Documentation here
Partial components: Other Google DP library (likely Java)
Pros | Cons |
---|---|
Documentation here
Full pipelines: PipelineDP + Python
Pros | Cons |
---|---|
Documentation here
Full pipelines: Privacy on Beam + Golang
Pros | Cons |
---|---|
Documentation here
Full pipelines: Tumult Analytics
Pros | Cons |
---|---|
Documentation here
Final judgement
The likeliest outcome is a framework built from PyDP and DP Accounting that provides WMF with a custom solution assembled from several predefined building blocks (a library-agnostic sketch of this approach appears below). Initially, this solution might only apply to a couple of large-scale contexts (e.g. country-language-page-view tuples, editor/edit counts by country and language, finance data, etc.), with OpenDP's library handling smaller research datasets that do not require distributed computing.
After 6-12 months, it might be time to look at other open-source data pipeline frameworks, see where OpenDP and PipelineDP stand, and reconsider whether we want to use their products and/or contribute to their communities.
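To make the building-block idea concrete without committing to any particular library's API, here is a minimal, library-agnostic sketch of the two pieces such a framework needs: a Laplace-mechanism release primitive and a simple privacy accountant. In practice PyDP, DP Accounting, and (per the updates below) Tumult's engine would fill these roles; every name in this sketch is ours, not any library's.

```python
# Library-agnostic sketch of two DP building blocks: a noisy-count release
# and a basic privacy accountant. Function and class names are illustrative
# stand-ins, not the API of PyDP, DP Accounting, or Tumult Analytics.
import numpy as np


def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count via the Laplace mechanism (scale = sensitivity / epsilon)."""
    return true_count + np.random.laplace(0.0, sensitivity / epsilon)


class BasicAccountant:
    """Tracks total epsilon spent under basic (sequential) composition."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon


# Example: two releases at epsilon = 0.5 each, against a total budget of 1.5.
accountant = BasicAccountant(budget=1.5)
for true_count in (1200, 950):
    accountant.spend(0.5)
    print(laplace_count(true_count, epsilon=0.5))
print(f"epsilon spent so far: {accountant.spent}")
```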
Update: After posting the memo publicly, we were contacted by Tumult Labs about potentially using their DP engine. Discussions about the feasibility of using their engine are ongoing.
Update as of 2022: We decided to use Tumult Labs' differential privacy engine. It has been running successfully on WMF servers since March 2022 and was open-sourced in July 2022.
Constraints
In conversations with members of the ML/SRE team working on Liftwing, we were told that deploying DP on their infrastructure would not be possible.
While attending the OpenDP 2021 Community Meeting, we learned that OpenDP would not work on a distributed computing platform.
Minimum viable product (MVP) and timeline
Our MVP is a daily histogram made up of country-language-page-view tuples computed from the raw web requests of the day prior (~1.8 TB/day). The exact form of this daily data product is every tuple of the form (country, project, page, number of pageviews) whose pageview count is greater than a release threshold of 89 pageviews.
We will compute this daily on the analytics cluster, using Tumult Labs' analytics package to securely conduct differentially private aggregations and manage privacy budgets; a simplified sketch of the computation's shape appears below. In addition to these building blocks, we are also building a general-purpose logging apparatus that takes in information about differentially private data releases so that we can (qualitatively) make judgements about data releases in the aggregate.
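For orientation only, the sketch below shows the shape of that daily computation in plain PySpark and NumPy: aggregate raw requests into (country, project, page) counts, add Laplace noise, and release only the tuples that clear the threshold. It is not Tumult Analytics code; the column names, epsilon, sensitivity bound, and toy input are all assumptions standing in for the real webrequest pipeline.

```python
# Illustrative sketch only: the shape of the MVP aggregation in plain
# PySpark + NumPy. The production job uses Tumult Labs' analytics package;
# nothing below is that package's API. Column names, EPSILON, the sensitivity
# bound, and the toy input all stand in for the real webrequest data.
import numpy as np
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dp-pageview-mvp-sketch").getOrCreate()

webrequest = spark.createDataFrame(
    [("FR", "fr.wikipedia", "Paris"),
     ("FR", "fr.wikipedia", "Paris"),
     ("DE", "de.wikipedia", "Berlin")],
    ["country", "project", "page"],
)

EPSILON = 1.0            # per-release privacy budget (assumed value)
SENSITIVITY = 1.0        # assumes per-user contributions are bounded upstream
RELEASE_THRESHOLD = 89   # suppress tuples with <= 89 noisy pageviews

# 1. Aggregate raw requests into (country, project, page, pageviews) tuples.
counts = (webrequest
          .groupBy("country", "project", "page")
          .agg(F.count("*").alias("pageviews")))

# 2. Add Laplace noise to each count (scale = sensitivity / epsilon).
@F.udf("double")
def add_laplace(pageviews):
    return float(pageviews + np.random.laplace(0.0, SENSITIVITY / EPSILON))

noisy = counts.withColumn("noisy_pageviews", add_laplace(F.col("pageviews")))

# 3. Release only the tuples whose noisy count clears the threshold.
release = (noisy
           .filter(F.col("noisy_pageviews") > RELEASE_THRESHOLD)
           .select("country", "project", "page", "noisy_pageviews"))
```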
Our timeline for reaching this MVP is August/September 2022.