User:Adamw/Draft/fiwiki.flaggedrevs work log

Experiment 1a: Training using Flagged Revisions as a proxy for damaging

Tracked in Phabricator: Task T166235

In T166235, we did an experiment to see if a model trained on edits accepted through the flagged revisions interface would do any better at finding damaging edits than a model trained on the Wiki Labels damaging data. The results were not promising, with ROC-AUC falling from 0.954 to 0.900.

Hypothesis


Is the data from the flagged revisions system of higher quality, and more relevant to the task of finding damaging edits, than the data keyed through Wiki Labels? If so, training on the flagged revisions data should give us a fitness boost.

Methodology


Zache (talk · contribs) gave us a Quarry script to find revisions approved through the Flagged Revisions system. A simplified query was eventually used to generate a list of all approved revisions, consisting of about 50,000 rows. We labeled these as good-faith and not damaging, and gave them an approved=1 label for good measure. These labeled revisions were union merged (see below) with 15,000 Wiki Labels observations that had been reserved as a training set, and the merged file became our training data. The remaining 5,000 Wiki Labels observations were used for testing model health. No cross-validation was performed.
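
The labeling step amounted to attaching the same fixed labels to every approved revision before the merge. A minimal sketch in Python (the file names and the exact observation fields are illustrative, not the actual script that was run):

import json

# Hypothetical input: one JSON object per line with at least a "rev_id" field,
# as produced by the Quarry query of approved revisions.
with open("fiwiki.approved_revisions.json") as infile, \
        open("fiwiki.approved_labeled.json", "w") as outfile:
    for line in infile:
        observation = json.loads(line)
        # Every flaggedrevs-approved revision gets the same labels.
        observation["damaging"] = False
        observation["goodfaith"] = True
        observation["approved"] = 1
        outfile.write(json.dumps(observation) + "\n")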

In hindsight, these flaggedrevs-approved revisions were not quite right, because each may have been only the final edit in a chain of edits under review. This was an omission; if we end up repeating an experiment like this, we should query for only those final revisions whose parent revision equals the starting revision of the reviewed chain.

A model was trained using a Makefile[1] tweaked to build a second fiwiki.flaggedrevs.damaging model with the same parameters as the production fiwiki.damaging model, except that it was fed the merged labels, including the flaggedrevs-approved changes, as its source of true classifications. Here are the test results from the two models:

Current champion damaging model:
revscoring model_info models/fiwiki.damaging.gradient_boosting.model
ScikitLearnClassifier
 - type: GradientBoosting
 - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
 - version: 0.3.0
 - trained: 2017-06-26T03:59:29.167423

Table:
	         ~False    ~True
	-----  --------  -------
	False     16727     2231
	True        113      904

Accuracy: 0.883
Precision:
	-----  -----
	False  0.993
	True   0.289
	-----  -----

Recall:
	-----  -----
	False  0.882
	True   0.89
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.548
	-----  -----

ROC-AUC:
	-----  -----
	False  0.95
	True   0.954
	-----  -----

Model trained on approved Flagged Revisions:
revscoring model_info models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
ScikitLearnClassifier
 - type: GradientBoosting
 - params: random_state=null, verbose=0, init=null, learning_rate=0.01, min_samples_split=2, subsample=1.0, warm_start=false, center=true, min_samples_leaf=1, scale=true, loss="deviance", presort="auto", min_weight_fraction_leaf=0.0, balanced_sample=false, n_estimators=700, balanced_sample_weight=true, max_features="log2", max_leaf_nodes=null, max_depth=5
 - version: 0.0.1
 - trained: 2017-07-25T20:50:13.806134

Table:
	         ~False    ~True
	-----  --------  -------
	False      4589      138
	True        137      121

Accuracy: 0.945
Precision:
	-----  -----
	False  0.971
	True   0.467
	-----  -----

Recall:
	-----  -----
	False  0.971
	True   0.469
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.437
	-----  -----

ROC-AUC:
	-----  ---
	False  0.9
	True   0.9
	-----  ---

Two new utilities were introduced to facilitate this work:

union_merge_observations takes multiple observation files and performs a set union over any observations of the same record. For revision observations, this merges all labels applied to each revision. This tool is now available in the revscoring repo.[2]
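
The real utility lives in the pull request cited above; the core idea is just a dictionary keyed by rev_id whose observations are unioned field by field, roughly like this sketch (file names are illustrative):

import json

def union_merge_observations(paths):
    """Merge several JSON-lines observation files, unioning the fields
    (labels) of observations that share a rev_id.  Sketch only; see the
    revscoring utility for the real behavior."""
    merged = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                observation = json.loads(line)
                merged.setdefault(observation["rev_id"], {}).update(observation)
    return list(merged.values())

# Example: combine the labeled flaggedrevs approvals with the Wiki Labels
# training observations into one training file.
for observation in union_merge_observations(
        ["fiwiki.approved_labeled.json", "fiwiki.wikilabels_train.json"]):
    print(json.dumps(observation))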

normalize_column_types casts values to an expected type. In this case it was needed because Quarry outputs integer 0/1 for boolean values, while our tools expect a true JSON boolean. We threw away this version of the tool because it wasn't worth the work to canonicalize it; if we end up needing it again one day, we may want to combine it with a data validation step.
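
For the record, the throwaway version did little more than the following sketch; the set of boolean columns is an assumption here:

import json
import sys

# Columns that Quarry emits as 0/1 but which our tools expect as JSON booleans.
BOOLEAN_COLUMNS = {"damaging", "goodfaith"}

for line in sys.stdin:
    observation = json.loads(line)
    for column in BOOLEAN_COLUMNS & observation.keys():
        observation[column] = bool(observation[column])
    sys.stdout.write(json.dumps(observation) + "\n")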

Experiment 1b: Refine data; omit multi-revision approvals, reverted edits, and some bots


TODO in a future iteration:

  • Omit all bots.
  • Include approvals that are part of a multi-revision chain, if all changes are by the same author. Perhaps all revisions in the chain should be included in our data set.
  • If we can break out of scoring pure revisions, the diff between the start and end of an approved chain is a high-confidence good edit.

Methodology


Filter to single-revision approvals


Zache pointed out that Flagged Revs is often used to approve more than one edit at a time (about 1/3 of approvals).[3] We can't be confident that all, or any, of these individual revisions are good-faith or non-damaging, only that the end product is an improvement. For example, a bad edit and its rollback might both be included, and the reviewer would still approve the final article state.

I used a simple condition: the beginning revision is the parent of the end revision. See the TODO on this work for how to correct some nuances that I missed, specifically that a chain of multiple edits by a single user stands a decent chance of being desirable, and we should try harder to include those approvals.
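
Expressed over the rows of the approvals query, the condition looks roughly like this sketch (the column names are illustrative, not the actual query output):

def is_single_revision_approval(approval):
    """True when the approved chain is a single edit, i.e. the reviewed-from
    (starting) revision is the direct parent of the approved revision."""
    return approval["start_rev_id"] == approval["end_rev_parent_id"]

# Hypothetical rows from the approvals query.
approvals = [
    {"end_rev_id": 101, "end_rev_parent_id": 100, "start_rev_id": 100},  # single edit: kept
    {"end_rev_id": 205, "end_rev_parent_id": 204, "start_rev_id": 200},  # multi-edit chain: dropped
]
single_revision_approvals = [a for a in approvals if is_single_revision_approval(a)]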

Filter out some bots


Any approvals by Zache and SeulojaBot are omitted from our set. I'm not totally clear on the reasoning, but I think these are bots reviewing other bots, and as such are edits we want to avoid.

Filter out later-reverted edits


We ran the "autolabel" script on our approved revisions and threw out anything with a "review reason" of "reverted edit". (TODO: link to an explanation of how that script works.)
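
Concretely this was just a filter over the autolabeled output, roughly as sketched below; the field names are assumptions and would need checking against the actual autolabel output format:

import json
import sys

# Drop observations that autolabel marked as later reverted.
for line in sys.stdin:
    observation = json.loads(line)
    review_reason = observation.get("autolabel", {}).get("review_reason")
    if review_reason == "reverted edit":
        continue
    sys.stdout.write(json.dumps(observation) + "\n")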

Prepare for intermediate database tables


I split this query into pieces to make it easier to follow, creating a temporary table to store intermediate results. This is a bit annoying in Quarry and I ended up cheating, but the basic steps to replicate this approach are:

Create a user database to allow for intermediate tables.

ssh tools-login.wmflabs.org
mysql --defaults-file=replica.my.cnf -h fiwiki.labsdb
create database u4974__ores_tmp_p;

Building the results purely through Quarry might have been possible, but it would have required some extra work to allow write access to our temporary table, so I took a shortcut and ran the bulk of the queries from the console, only using Quarry to perform the fetch step.[4][5]

We discover a data iceberg


In experiment 1a, I had missed that we were only parsing the newest approvals, those created since December 2016. Older approvals used a legacy log_params format, which hadn't been picked up by our query condition. Once we relaxed the condition to include the legacy format, we gained 160,000 more approvals to add to our data set. The new query also filters out multi-revision approvals and some bot approvals (those by Zache and SeulojaBot). Finally, we filtered out anything that was later reverted, according to the autolabel script.
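
For reference, relaxing the condition means accepting both shapes of log_params. A rough sketch of extracting the approved revision id in post-processing, assuming the newer entries are PHP-serialized arrays and the legacy entries are plain newline-separated values (both the key name and the field position are unverified assumptions; phpserialize is an external package):

from phpserialize import loads  # external package: phpserialize

def approved_rev_id(log_params):
    """Extract the approved revision id from a flaggedrevs review log entry,
    accepting both the newer serialized log_params and the legacy plain
    format.  The key name and field position are assumptions, not verified
    against the actual fiwiki logging table."""
    if log_params.startswith(b"a:"):
        # Newer entries: PHP-serialized array (assumed key name).
        return int(loads(log_params)[b"4::revid"])
    # Legacy entries: newline-separated values, revision id assumed first.
    return int(log_params.split(b"\n")[0])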

Results

Current champion damaging model:
ScikitLearnClassifier
 - type: GradientBoosting
 - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
 - version: 0.3.0
 - trained: 2017-06-26T03:59:29.167423

Table:
	         ~False    ~True
	-----  --------  -------
	False     16727     2231
	True        113      904

Accuracy: 0.883
Precision:
	-----  -----
	False  0.993
	True   0.289
	-----  -----

Recall:
	-----  -----
	False  0.882
	True   0.89
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.548
	-----  -----

ROC-AUC:
	-----  -----
	False  0.95
	True   0.954
	-----  -----

Model trained on approved Flagged Revisions (2nd iteration):
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

Table:
	         ~False    ~True
	-----  --------  -------
	False      4588      139
	True        138      120

Accuracy: 0.944
Precision:
	-----  -----
	False  0.971
	True   0.463
	-----  -----

Recall:
	-----  -----
	False  0.971
	True   0.465
	-----  -----

PR-AUC:
	-----  -----
	False  0.991
	True   0.401
	-----  -----

ROC-AUC:
	-----  -----
	False  0.878
	True   0.878
	-----  -----

References

  1. "Makefile for ORES Flagged Revisions experiment". Gist. Retrieved 2017-07-27. 
  2. "Data utils by adamwight · Pull Request #338 · wiki-ai/revscoring". GitHub. Retrieved 2017-07-27. 
  3. "⚓ T166235 Flagged revs approve model to fiwiki". phabricator.wikimedia.org. Retrieved 2017-08-03. 
  4. fiwiki_flaggedrevs_approvals.sql, 2017-08-01, retrieved 2017-08-02 
  5. "Fiwiki good diffs - Quarry". quarry.wmflabs.org. Retrieved 2017-08-02.