Research talk:Revision scoring as a service/Work log/2016-02-23
Tuesday, February 23, 2016
OK. Today I'm trying to do what we did for Urdu Wikipedia, but for Polish Wikipedia instead. Here's a list of 500K randomly sampled edits: http://quarry.wmflabs.org/query/7543
Prelabel is running now. Amir (talk) 18:24, 23 February 2016 (UTC)
OK. It's done:
(3.4)ladsgroup@ores-compute:~/editquality/datasets$ wc plwiki.prelabeled_revisions.500k_2015.tsv
  499736  1933819 13821243 plwiki.prelabeled_revisions.500k_2015.tsv
(3.4)ladsgroup@ores-compute:~/editquality/datasets$ cat plwiki.prelabeled_revisions.500k_2015.tsv | grep "True" | wc
  82484  264812 1720937
(3.4)ladsgroup@ores-compute:~/editquality/datasets$ cat plwiki.prelabeled_revisions.500k_2015.tsv | grep "reverted" | wc
  14861   59444  416108
So 16.5% of edits need review. That's good :) And 3% are reverted.
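Quick arithmetic behind those percentages, straight from the wc counts above:

# 82484/499736 = 16.5% flagged for review; 14861/499736 = 3.0% reverted
awk 'BEGIN { printf "needs review: %.1f%%\n", 82484 / 499736 * 100
             printf "reverted:     %.1f%%\n", 14861 / 499736 * 100 }'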
I sampled 5K to load into Wikilabels:
(
  echo "rev_id\tneeds_review\treason";
  (
    cat datasets/plwiki.prelabeled_revisions.500k_2015.tsv | \
      grep "True" | \
      shuf -n 2500; \
    cat datasets/plwiki.prelabeled_revisions.500k_2015.tsv | \
      grep "False" | \
      shuf -n 2500 \
  ) | \
  shuf \
) > datasets/plwiki.revisions_for_review.5k_2015.tsv
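A quick way to confirm the sample came out balanced (a sketch, assuming needs_review is the second tab-separated field):

# Should print ~2500 "True" and ~2500 "False", plus the one header line
cut -f2 datasets/plwiki.revisions_for_review.5k_2015.tsv | sort | uniq -c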
Using shuf, I extracted 20K revs to build the reverted model:
cat datasets/plwiki.sampled_revisions.500k_2015.tsv | \
  shuf -n 20000 > datasets/plwiki.sampled_revisions.20k_2015.tsv
Then we add "rev_id" as a header on the first line and check that "rev_id" hasn't accidentally ended up among the sampled revs. (check)
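A minimal sketch of that header/sanity check (assuming GNU sed and that the file has no header yet):

# Prepend the "rev_id" header line (GNU sed in-place insert)
sed -i '1i rev_id' datasets/plwiki.sampled_revisions.20k_2015.tsv
# "rev_id" should now occur exactly once -- this should print 1
grep -c "rev_id" datasets/plwiki.sampled_revisions.20k_2015.tsv

Then run label_reverted: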
cat datasets/plwiki.sampled_revisions.20k_2015.tsv | \
  ./utility label_reverted \
  --host https://pl.wikipedia.org \
  --revert-radius 3 \
  --verbose > datasets/plwiki.rev_reverted.20k_2015.tsv
It's labeling them.
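Once it finishes, a quick look at the label distribution (a sketch, assuming the reverted flag lands in the second column of the output TSV):

# Header, then a count of True/False labels
head -n 1 datasets/plwiki.rev_reverted.20k_2015.tsv
tail -n +2 datasets/plwiki.rev_reverted.20k_2015.tsv | cut -f2 | sort | uniq -c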
Now I'm extracting features:
cat datasets/plwiki.rev_reverted.20k_2015.tsv | \
  revscoring extract_features \
  editquality.feature_lists.plwiki.reverted \
  --host https://pl.wikipedia.org \
  --include-revid \
  --verbose > \
  datasets/plwiki.features_reverted.20k_2015.tsv
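When extraction is done, a simple consistency check on the output: every row should have the same number of tab-separated columns.

# Prints one line per distinct column count; ideally just one
awk -F'\t' '{ print NF }' datasets/plwiki.features_reverted.20k_2015.tsv | sort -n | uniq -c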
OK. I ran the tuning reports and it turned out RF (random forest) is the best. Strange. Everything I touch turns into RF :D
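The exact tuning invocation isn't in this log; roughly, it's revscoring tune pointed at the extracted features, along the lines of the sketch below. The params-config path, the roc_auc positional, and the flags here are assumptions; check revscoring tune -h for the real signature.

# Sketch only -- paths and flags are assumptions, not the logged command
cat datasets/plwiki.features_reverted.20k_2015.tsv | \
  revscoring tune \
  config/classifiers.params.yaml \
  editquality.feature_lists.plwiki.reverted \
  roc_auc \
  --label-type=bool > tuning_reports/plwiki.reverted.md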
Running with the best settings:
> revscoring train_test \
>   revscoring.scorer_models.RF \
>   editquality.feature_lists.plwiki.reverted \
>   --version 0.1.0 \
>   -p 'max_features="log2"' \
>   -p 'criterion="entropy"' \
>   -p 'min_samples_leaf=7' \
>   -p 'n_estimators=640' \
>   -s 'pr' -s 'roc' \
>   -s 'recall_at_fpr(max_fpr=0.10)' \
>   -s 'filter_rate_at_recall(min_recall=0.90)' \
>   -s 'filter_rate_at_recall(min_recall=0.75)' \
>   --balance-sample-weight \
>   --center --scale \
>   --label-type=bool > \
>   models/plwiki.reverted.rf.model
2016-02-23 22:21:47,424 INFO:revscoring.utilities.train_test -- Training model...
2016-02-23 22:22:08,411 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: RF
 - params: oob_score=false, scale=true, center=true, warm_start=false, criterion="entropy", random_state=null, max_leaf_nodes=null, class_weight=null, n_jobs=1, n_estimators=640, min_samples_leaf=7, min_weight_fraction_leaf=0.0, verbose=0, balanced_sample_weight=true, min_samples_split=2, max_depth=null, max_features="log2", bootstrap=true
 - version: 0.1.0
 - trained: 2016-02-23T22:22:08.411624

        ~False    ~True
-----  --------  -------
False      3723      155
True         50       66

Accuracy: 0.9486730095142714
PR-AUC: 0.327
Filter rate @ 0.9 recall: threshold=0.08, filter_rate=0.736, recall=0.905
Recall @ 0.1 false-positive rate: threshold=0.918, recall=0.017, fpr=0.0
Filter rate @ 0.75 recall: threshold=0.235, filter_rate=0.903, recall=0.75
ROC-AUC: 0.912
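For context, a few summary numbers derived from that confusion matrix (rows are true labels, columns are predictions):

# tn/fp/fn/tp read straight off the confusion matrix above;
# accuracy reproduces the reported 0.948673
awk 'BEGIN {
  tn = 3723; fp = 155; fn = 50; tp = 66
  printf "accuracy:  %.6f\n", (tn + tp) / (tn + fp + fn + tp)
  printf "precision: %.3f\n", tp / (tp + fp)
  printf "recall:    %.3f\n", tp / (tp + fn)
}'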
Look at "Recall @ 0.1 false-positive rate". Wooot! Amir (talk) 22:37, 23 February 2016 (UTC)
Size of the model:
(3.4)ladsgroup@ores-compute:~/editquality/models$ ls -Ssh | grep plwiki.reverted.rf.model
 17M plwiki.reverted.rf.model
OK. We are good to go! Amir (talk) 22:51, 23 February 2016 (UTC)