Differential privacy/Completed/Country-project-page/Problem statement
Since February 2022, Tumult Labs has assisted the WMF Privacy Engineering team in building a differentially private algorithm that computes a histogram of views of Wikimedia pages, grouped by project, country, and page ID. This document describes the problem and lists the technical requirements that a working solution must satisfy.
Problem description
Input data
Wikimedia collects data about visits to its website pages in a table, pageview_actor. Each row of this table corresponds to a single visit to a page; conversely, each visit to a page creates a single row in this table. Each row of pageview_actor contains, among other things, the following information of interest:
- The title of the page (pageview_info["page_title"], hereafter called page_title).
- The ID of the page (page_id).
- The project associated with the page (pageview_info["project"], hereafter denoted project).
- The country from which the visit originated (geocoded_data["country"], hereafter denoted country).
- The date of the visit (as day, month, and year; we simply denote it date).
- A lossy fingerprinting field created by hashing a visitor's IP address and UserAgent string (actor_signature).
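For illustration, loading these fields of interest could look roughly like the following PySpark sketch (the database name, the nested-field access syntax, and the date formatting are assumptions, not the actual ingestion code):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Fields of interest from pageview_actor (the "wmf" database name is an assumption).
pageviews = (
    spark.table("wmf.pageview_actor")
    .where(F.col("is_pageview"))  # keep only "real" page views, see utility requirements
    .select(
        F.col("pageview_info").getItem("page_title").alias("page_title"),
        F.col("page_id"),
        F.col("pageview_info").getItem("project").alias("project"),
        F.col("geocoded_data").getItem("country").alias("country"),
        # year/month/day partition columns combined into a single date string.
        F.concat_ws("-", "year", "month", "day").alias("date"),
        F.col("actor_signature"),
    )
)
</syntaxhighlight>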
Desired output
The goal is to generate a histogram of page views, grouped by page ID, project, country of origin, and date: for each tuple <page_id, project, country, date>, we want to compute and publish the number of distinct individuals having visited the corresponding page from this specific country, on this particular date. Furthermore, we want to limit outlier contributions: we want to ensure that in each day, each individual contributes to at most k distinct <page_id, project, country, date> tuples, for some (to be defined) value of k.
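As a point of reference, the exact (non-private) version of this histogram could be computed as in the sketch below, reusing the pageviews dataframe from the previous example; here the number of distinct individuals is approximated by the number of distinct actor_signature values, and contribution bounding to k tuples is deferred to the privacy requirements section.

<syntaxhighlight lang="python">
from pyspark.sql import functions as F

# Exact (non-private) histogram: distinct actors per <page_id, project, country, date>.
# The published counts are instead computed with differential privacy, as described below.
exact_counts = (
    pageviews  # dataframe of fields of interest, as sketched above
    .groupBy("page_id", "project", "country", "date")
    .agg(F.countDistinct("actor_signature").alias("exact_count"))
)
</syntaxhighlight>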
Available auxiliary data
WMF already computes and publishes (via the REST API method) a histogram counting the number of rows in this table, grouped by page ID, project, and date. This histogram does not deduplicate multiple pageviews from the same person to the same page in the same day.
The full list of page IDs, associated projects, and possible countries of origin is also available on the Wikimedia infrastructure.
Requirements
The requirements for a working solution to this problem fall into three categories: utility requirements, privacy requirements, and operational requirements.
Utility requirements
The high-level goal is to support qualitative use cases; we expect the typical user to look through a list of pages rather than perform statistical analysis on these pages (statistical use cases will still be best supported by NDAs allowing researchers to work directly with the private data).
Thus, we want to avoid releasing misleading data. We translate this high-level goal into a technical objective: a median relative error below 5%, and a desired relative error below 50% for 95% of all published counts, with the relative error of a published count defined as:

<math>\text{relative error} = \frac{|\tilde{n} - n|}{\max(n, 1)}</math>

where <math>\tilde{n}</math> is the published (noisy) count and <math>n</math> is the corresponding exact count.
Due to the nature of the algorithm (described in more detail below), there are two additional kinds of error.
- Dropped data: some <page_id, project, country, date> combinations will be automatically dropped if the noisy count is small (below a release threshold τ). These are analogous to "false negatives" in other branches of statistics.
- Spurious data: some <page_id, project, country, date> combinations may be included in the output even though the true count is zero. These are analogous to "false positives" in other branches of statistics.
We want to release as much data as possible, given the previously described error minimization goal. We capture this objective by calculating the rate of dropped data. Let T be the set of all <page_id, project, country, date> tuples, each with a noisy count <math>\tilde{n}_t</math> and an exact count <math>n_t</math>. Then the drop rate is:

<math>\text{drop rate} = \frac{\left|\{t \in T : n_t > 0 \text{ and } \tilde{n}_t < \tau\}\right|}{\left|\{t \in T : n_t > 0\}\right|}</math>
We want to minimize the amount of spurious data. Again, let T be the set of all tuples, each with a noisy count <math>\tilde{n}_t</math> and an exact count <math>n_t</math>. We can then calculate the spurious rate as:

<math>\text{spurious rate} = \frac{\left|\{t \in T : n_t = 0 \text{ and } \tilde{n}_t \geq \tau\}\right|}{\left|\{t \in T : \tilde{n}_t \geq \tau\}\right|}</math>
For all metrics, we consider the "ground truth" to be the data after contribution bounding and deduplication: the exact counts also count distinct contributions, with at most k contributions from any individual during a single day.
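For illustration, these metrics could be computed along the following lines (a sketch, not the actual error reporting system; the joined column names and the use of max(n, 1) in the relative error are assumptions):

<syntaxhighlight lang="python">
from pyspark.sql import functions as F

def utility_metrics(exact_counts, noisy_counts, tau):
    """Median/95th-percentile relative error, drop rate and spurious rate,
    given exact and noisy counts keyed by <page_id, project, country, date>."""
    keys = ["page_id", "project", "country", "date"]
    joined = (
        exact_counts.join(noisy_counts, keys, "outer")
        .fillna(0, subset=["exact_count", "noisy_count"])
    )
    published = joined.where(F.col("noisy_count") >= tau)

    # Relative error of published counts; max(n, 1) avoids division by zero.
    rel_err = published.withColumn(
        "rel_err",
        F.abs(F.col("noisy_count") - F.col("exact_count"))
        / F.greatest(F.col("exact_count"), F.lit(1)),
    )
    median_err, p95_err = rel_err.approxQuantile("rel_err", [0.5, 0.95], 0.001)

    # Drop rate: tuples with a nonzero exact count that are not published.
    nonzero = joined.where(F.col("exact_count") > 0)
    drop_rate = nonzero.where(F.col("noisy_count") < tau).count() / nonzero.count()

    # Spurious rate: published tuples whose exact count is zero.
    spurious_rate = published.where(F.col("exact_count") == 0).count() / published.count()

    return {
        "median_relative_error": median_err,
        "p95_relative_error": p95_err,
        "drop_rate": drop_rate,
        "spurious_rate": spurious_rate,
    }
</syntaxhighlight>

The same function can be applied to dataframes restricted to a single continent, subcontinent, or country to produce the per-region breakdown mentioned below.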
We only want to count "real" page visits, and avoid counting e.g. redirects or bots. There is already an is_pageview field capturing this information in the input table, and we can assume that this filtering is done before loading the data onto the Tumult platform.
There is no consistency requirement: it is OK if the sum of counts over all countries for a given <page_id, project, date> tuple appears inconsistent with the data published via the REST API.
We also calculate these metrics at a continent, subcontinent, and country level to ensure that there is not disproportionate or catastrophic error in any geographic area.
Privacy requirements
The high-level goal is to protect the contribution of each user in each day with differential privacy. Since WMF does not track users (with cookies or a similar mechanism), we need a way to bound the number of contributions (to at most k, for some value of k) from a single user in each day that does not depend on the existence of a stable user identifier. There are two agreed-upon ways of doing so.

- Option 1: use actor_signature as a lossy fingerprint: we drop some rows to ensure that at most k rows are associated with each actor_signature in a single day.
- Option 2: use a client-side system to ensure that in each day, each user agent (browser) sends at most k rows that end up being counted in the histogram.

Both options implicitly make assumptions which might not be entirely correct, namely that a single user contributes to at most one actor_signature (with option 1) or one user agent (with option 2) in a single day. This might not hold in certain cases, but taking this into account would require more complex and invasive forms of user tracking, which is not an option.
The solution should be easy to migrate from option 1 to 2. In both cases, we can assume that the filtering has already been done when the data is loaded into the Tumult platform.
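For illustration, contribution bounding under option 1 could look roughly like the following sketch (not the production implementation; it assumes the pageviews dataframe from the input data section and keeps at most k randomly chosen distinct tuples per actor_signature and day):

<syntaxhighlight lang="python">
from pyspark.sql import Window, functions as F

def bound_contributions(pageviews, k):
    """Keep at most k distinct <page_id, project, country> tuples per
    actor_signature and day (option 1: actor_signature as a lossy fingerprint)."""
    deduplicated = pageviews.dropDuplicates(
        ["actor_signature", "date", "page_id", "project", "country"]
    )
    # Randomly rank each actor's tuples within a day, then keep the first k.
    ranked = deduplicated.withColumn("rand_order", F.rand())
    window = Window.partitionBy("actor_signature", "date").orderBy("rand_order")
    return (
        ranked
        .withColumn("rank", F.row_number().over(window))
        .where(F.col("rank") <= k)
        .drop("rank", "rand_order")
    )
</syntaxhighlight>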
Operational requirements
The pipeline must run in under 24 hours using the Spark resource settings for large jobs on WMF clusters. It must write its output to HDFS.
The error reporting system must be able to run in a reasonable amount of time, and validate that assumptions made at mechanism design time still hold over time, as the data drifts. We must document which parameters we chose and why, so that these can be easily revised over time if necessary.
Algorithm Overview
Parameters
- Privacy parameters: ε, δ
- Max daily pageviews per user: k
- Pageview filter threshold: m
- Suppression threshold: τ
- Date
Let C be a fixed set of countries (obtained from wmf.geoeditors_public_monthly).
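For concreteness, these parameters could be bundled into a small configuration object along these lines (the names, types, and the presence of a δ parameter are illustrative assumptions, not the pipeline's actual configuration):

<syntaxhighlight lang="python">
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseParameters:
    epsilon: float              # privacy parameter ε
    delta: float                # privacy parameter δ (if the chosen mechanism uses one)
    k: int                      # max daily pageviews per user
    pageview_filter: int        # min (not necessarily distinct) views for a candidate page
    suppression_threshold: int  # noisy counts below this value (τ) are dropped
    date: datetime.date         # day for which the histogram is computed
</syntaxhighlight>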
Algorithm
- Pre-processing: for the given date, create a dataframe of pageviews — i.e., a set of <page_id, project, country> tuples such that each user contributes at most k distinct page views per day.
- Construct a set of candidate pages P — <page_id, project> tuples — based on information that is publicly available.[1] Keep all pages that have received at least m (not necessarily distinct) page views that day.
- For each <page_id, project, country> in P × C:
  - Compute a noisy count.
  - If the noisy count is below the threshold τ, omit it.
- Return a dataframe of <page_id, project, country, noisy_count> tuples for all combinations whose noisy count is above the threshold.
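To make these steps concrete, here is a rough end-to-end sketch using the Laplace mechanism directly in PySpark. It illustrates the shape of the algorithm rather than the actual implementation (which runs on the Tumult platform), and the function and column names are assumptions.

<syntaxhighlight lang="python">
from pyspark.sql import functions as F

def dp_country_project_page(bounded_pageviews, candidate_pages, countries, epsilon, k, tau):
    """Illustrative DP histogram: noisy distinct-actor counts per
    <page_id, project, country>, suppressed below tau.

    bounded_pageviews is assumed to be deduplicated and contribution-bounded
    (at most k tuples per actor and day), so each individual changes at most k
    of the counts by 1, and Laplace noise of scale k / epsilon suffices for
    epsilon-DP. Noise generation here uses rand() for simplicity; a real DP
    implementation needs careful, secure noise sampling.
    """
    exact = (
        bounded_pageviews
        .groupBy("page_id", "project", "country")
        .agg(F.countDistinct("actor_signature").alias("exact_count"))
    )
    # The set of output keys (candidate pages x fixed country list) is built from
    # public information only, so it does not itself leak anything about individuals.
    keys = candidate_pages.crossJoin(countries)
    noisy = (
        keys.join(exact, ["page_id", "project", "country"], "left")
        .fillna(0, subset=["exact_count"])
        # Laplace(k / epsilon) noise via inverse-CDF sampling from U(-0.5, 0.5).
        .withColumn("u", F.rand() - F.lit(0.5))
        .withColumn(
            "noisy_count",
            F.col("exact_count")
            - F.lit(k / epsilon) * F.signum("u") * F.log(1 - 2 * F.abs(F.col("u"))),
        )
        .drop("u", "exact_count")
    )
    # Suppress small noisy counts.
    return noisy.where(F.col("noisy_count") >= tau)
</syntaxhighlight>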
- ↑ As mentioned previously, the total number of daily pageviews per page is already made public via a REST API. Instead of using the REST API itself (which is not practical due to rate limiting), a query is written directly on the pageview_actor table that produces a table equivalent to the information available via the REST API (based on the documentation).
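A PySpark approximation of such a query, based on the description above, might look like this (a sketch; the database name and the exact filtering applied by the real query are assumptions):

<syntaxhighlight lang="python">
from pyspark.sql import functions as F

def candidate_pages_for(spark, date, pageview_filter):
    """Candidate <page_id, project> pages for a given day: pages with at least
    pageview_filter (not necessarily distinct) page views, analogous to the
    per-page daily counts exposed by the REST API."""
    return (
        spark.table("wmf.pageview_actor")  # database name is an assumption
        .where(
            F.col("is_pageview")
            & (F.col("year") == date.year)
            & (F.col("month") == date.month)
            & (F.col("day") == date.day)
        )
        .groupBy("page_id", F.col("pageview_info").getItem("project").alias("project"))
        .count()
        .where(F.col("count") >= pageview_filter)
        .select("page_id", "project")
    )
</syntaxhighlight>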