Research talk:Automated classification of article quality/Work log/2016-06-07

From Meta, a Wikimedia project coordination wiki

Tuesday, June 7, 2016

Working on ruwiki stuff today.

$ cat ruwiki.observations.first_labelings.20160501.json | json2tsv label | sort | uniq
fa
ga
I
II
III
IV
sa
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"fa"' | wc
   1155   12110  263524
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"ga"' | wc
   1759   16733  328268
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"I"' | wc
   4486   43542  871572
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"II"' | wc
  14371  136840 2732236
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"III"' | wc
  56042  538541 10415274
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"IV"' | wc
  75315  701607 12855088
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"sa"' | wc
   1432   13978  282051

So, the smallest class ("fa") has 1,155 observations, which means we can sample about 1.1k observations per class and keep the training set balanced. --EpochFail (talk) 15:54, 7 June 2016 (UTC)
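For reference, here is a minimal Python sketch of the same counting step, plus a downsample to the smallest class. It assumes the observations file holds one JSON object per line with a "label" field (which is what the `json2tsv label` call suggests); the `rev_id` field and the helper names `count_labels` and `balanced_sample` are made up for illustration, and this is not necessarily how revscoring's `--balance-sample` works internally.

```python
import json
import random
from collections import Counter, defaultdict


def count_labels(lines):
    """Count observations per label in JSON-lines input (assumed format)."""
    return Counter(json.loads(line)["label"] for line in lines if line.strip())


def balanced_sample(lines, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for line in lines:
        if line.strip():
            by_label[json.loads(line)["label"]].append(line)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return {label: rng.sample(group, n) for label, group in by_label.items()}


# Tiny made-up input, only to show the assumed shape of the data.
demo = [
    '{"rev_id": 1, "label": "fa"}',
    '{"rev_id": 2, "label": "IV"}',
    '{"rev_id": 3, "label": "IV"}',
    '{"rev_id": 4, "label": "III"}',
    '{"rev_id": 5, "label": "III"}',
    '{"rev_id": 6, "label": "III"}',
]
demo_counts = count_labels(demo)
demo_sample = balanced_sample(demo)
```

On the real file, passing `open("ruwiki.observations.first_labelings.20160501.json")` instead of `demo` should give the per-class totals directly, without one `grep` per label.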


$ make models/ruwiki.wp10.rf.model 
cat datasets/ruwiki.features_wp10.8k.tsv | \
        revscoring train_test \
                revscoring.scorer_models.RF \
                wikiclass.feature_lists.ruwiki.wp10 \
                --version 0.0.1 \
                -p 'n_estimators=501' \
                -p 'min_samples_leaf=8' \
                -s 'table' -s 'accuracy' -s 'roc' -s 'f1' \
                --balance-sample \
                --center --scale > \
        models/ruwiki.wp10.rf.model
2016-06-07 17:33:53,641 INFO:revscoring.utilities.train_test -- Training model...
2016-06-07 17:34:08,186 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: RF
 - params: random_state=null, scale=true, verbose=0, min_samples_leaf=8, n_estimators=501, n_jobs=1, center=true, criterion="gini", bootstrap=true, balanced_sample=true, min_samples_split=2, balanced_sample_weight=false, warm_start=false, class_weight=null, max_features="auto", max_depth=null, min_weight_fraction_leaf=0.0, oob_score=false, max_leaf_nodes=null
 - version: 0.0.1
 - trained: 2016-06-07T17:34:08.180792

Table:
	       ~I    ~II    ~III    ~IV    ~fa    ~ga    ~sa
	---  ----  -----  ------  -----  -----  -----  -----
	I      36     46      24      3     23     47     39
	II     28     75      31      6      8     20     37
	III     7     48     117     36      2      0     22
	IV      1      7      57    157      1      0      5
	fa      6      1       1      1    158     50      4
	ga     19      5       7      3     50    143     16
	sa      6     12       3      0      0     17    207

Accuracy: 0.561
ROC-AUC:
	-----  -----
	'I'    0.73
	'II'   0.782
	'III'  0.868
	'IV'   0.956
	'fa'   0.939
	'ga'   0.888
	'sa'   0.956
	-----  -----

F1:
	---  -----
	II   0.376
	III  0.496
	I    0.224
	IV   0.724
	ga   0.55
	fa   0.683
	sa   0.72
	---  -----

That looks useful. It seems we have a low F1 for "I" (0.224). I'd guess that this rating falls between "ga" and "II", which is where most of its misclassifications land in the table above. --EpochFail (talk) 19:03, 7 June 2016 (UTC)
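As a sanity check on the numbers above, here is a short Python sketch that recomputes accuracy and per-class F1 from the confusion table and lists the classes that "I" is most often confused with. It assumes rows are the actual labels and the "~" columns are the predictions; the helper names are illustrative and not part of revscoring.

```python
LABELS = ["I", "II", "III", "IV", "fa", "ga", "sa"]

# Confusion table copied from the train_test output above.
# Assumed orientation: rows = actual label, columns = predicted label.
TABLE = [
    [36, 46, 24, 3, 23, 47, 39],
    [28, 75, 31, 6, 8, 20, 37],
    [7, 48, 117, 36, 2, 0, 22],
    [1, 7, 57, 157, 1, 0, 5],
    [6, 1, 1, 1, 158, 50, 4],
    [19, 5, 7, 3, 50, 143, 16],
    [6, 12, 3, 0, 0, 17, 207],
]


def accuracy(table):
    """Fraction of observations on the diagonal."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def per_class_f1(labels, table):
    """F1 = 2*TP / (row total + column total) for each class."""
    scores = {}
    for i, label in enumerate(labels):
        row_total = sum(table[i])
        col_total = sum(row[i] for row in table)
        scores[label] = 2 * table[i][i] / (row_total + col_total)
    return scores


def top_confusions(labels, table, label, k=2):
    """The k classes that observations of `label` are most often predicted as."""
    i = labels.index(label)
    others = [(labels[j], n) for j, n in enumerate(table[i]) if j != i]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


acc = accuracy(TABLE)
f1 = per_class_f1(LABELS, TABLE)
confused_with = top_confusions(LABELS, TABLE, "I")

print(round(acc, 3))     # 0.561
print(round(f1["I"], 3)) # 0.224
print(confused_with)     # [('ga', 47), ('II', 46)]
```

The recomputed values match the reported accuracy and F1, and the two largest off-diagonal cells in the "I" row are "ga" and "II", which is consistent with the guess that "I" sits between those two ratings.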