Research talk:Automated classification of article quality/Work log/2016-06-07
Tuesday, June 7, 2016
Working on ruwiki stuff today.
$ cat ruwiki.observations.first_labelings.20160501.json | json2tsv label | sort | uniq
fa
ga
I
II
III
IV
sa
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"fa"' | wc
   1155   12110  263524
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"ga"' | wc
   1759   16733  328268
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"I"' | wc
   4486   43542  871572
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"II"' | wc
  14371  136840 2732236
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"III"' | wc
  56042  538541 10415274
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"IV"' | wc
  75315  701607 12855088
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"sa"' | wc
   1432   13978  282051
So, the rarest class, "fa", has only 1155 matching lines, which means we can get about 1.1k observations per class and keep the whole sample balanced. --EpochFail (talk) 15:54, 7 June 2016 (UTC)
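The balancing idea can be sketched in Python as follows. This is only an illustration, not the script that built the dataset: it assumes the file holds one JSON object per line with a "label" field (suggested by the `json2tsv label` call), and the function name `balanced_sample`, the `seed` parameter, and the `rev_id` field in the demo are hypothetical.

```python
import json
import random
from collections import defaultdict


def balanced_sample(lines, label_field="label", seed=0):
    """Downsample JSON-lines observations so every class has the same count.

    Returns (per_class, sample), where per_class is the size of the
    rarest class and sample is a shuffled list of observation dicts.
    """
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obs = json.loads(line)
        by_label[obs[label_field]].append(obs)

    if not by_label:
        return 0, []

    # The rarest class caps how many observations each class can contribute.
    per_class = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sample)
    return per_class, sample


if __name__ == "__main__":
    # Tiny synthetic input; "rev_id" is a made-up field for illustration.
    demo = (
        ['{"rev_id": %d, "label": "fa"}' % i for i in range(3)]
        + ['{"rev_id": %d, "label": "IV"}' % i for i in range(3, 13)]
    )
    per_class, sample = balanced_sample(demo)
    print(per_class, len(sample))
```

With the counts above, the cap would be set by "fa" (1155 lines), which gives roughly 8k observations across the seven classes.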
$ make models/ruwiki.wp10.rf.model
cat datasets/ruwiki.features_wp10.8k.tsv | \
	revscoring train_test \
		revscoring.scorer_models.RF \
		wikiclass.feature_lists.ruwiki.wp10 \
		--version 0.0.1 \
		-p 'n_estimators=501' \
		-p 'min_samples_leaf=8' \
		-s 'table' -s 'accuracy' -s 'roc' -s 'f1' \
		--balance-sample \
		--center --scale > \
	models/ruwiki.wp10.rf.model
2016-06-07 17:33:53,641 INFO:revscoring.utilities.train_test -- Training model...
2016-06-07 17:34:08,186 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: RF
 - params: random_state=null, scale=true, verbose=0, min_samples_leaf=8, n_estimators=501, n_jobs=1, center=true, criterion="gini", bootstrap=true, balanced_sample=true, min_samples_split=2, balanced_sample_weight=false, warm_start=false, class_weight=null, max_features="auto", max_depth=null, min_weight_fraction_leaf=0.0, oob_score=false, max_leaf_nodes=null
 - version: 0.0.1
 - trained: 2016-06-07T17:34:08.180792

Table:
         ~I    ~II    ~III    ~IV    ~fa    ~ga    ~sa
---    ----  -----  ------  -----  -----  -----  -----
I        36     46      24      3     23     47     39
II       28     75      31      6      8     20     37
III       7     48     117     36      2      0     22
IV        1      7      57    157      1      0      5
fa        6      1       1      1    158     50      4
ga       19      5       7      3     50    143     16
sa        6     12       3      0      0     17    207

Accuracy: 0.561

ROC-AUC:
-----  -----
'I'    0.73
'II'   0.782
'III'  0.868
'IV'   0.956
'fa'   0.939
'ga'   0.888
'sa'   0.956
-----  -----

F1:
---  -----
II   0.376
III  0.496
I    0.224
IV   0.724
ga   0.55
fa   0.683
sa   0.72
---  -----
That looks useful. The F1 for "I" is low (0.224); I'd guess that this rating falls somewhere between "ga" and "II". --EpochFail (talk) 19:03, 7 June 2016 (UTC)
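To check the guess about "I" against the numbers, the reported metrics can be recomputed from the confusion table. The sketch below assumes rows are true labels and columns (the "~" headers) are predicted labels; under that reading the computed accuracy and F1 match the reported values. The helper names are illustrative and are not part of revscoring.

```python
LABELS = ["I", "II", "III", "IV", "fa", "ga", "sa"]

# Confusion table copied from the train_test output.
# Assumption: rows are true labels, columns are predicted labels.
TABLE = [
    [36, 46, 24, 3, 23, 47, 39],    # I
    [28, 75, 31, 6, 8, 20, 37],     # II
    [7, 48, 117, 36, 2, 0, 22],     # III
    [1, 7, 57, 157, 1, 0, 5],       # IV
    [6, 1, 1, 1, 158, 50, 4],       # fa
    [19, 5, 7, 3, 50, 143, 16],     # ga
    [6, 12, 3, 0, 0, 17, 207],      # sa
]


def accuracy(table):
    """Share of observations on the diagonal (correct predictions)."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def f1_scores(table, labels):
    """Per-class F1, written as 2*TP / (actual count + predicted count)."""
    scores = {}
    for i, label in enumerate(labels):
        tp = table[i][i]
        actual = sum(table[i])                    # row sum: true members
        predicted = sum(row[i] for row in table)  # column sum: predictions
        denom = actual + predicted
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def top_confusions(table, labels, label, n=3):
    """The n classes that true members of `label` are most often assigned to."""
    i = labels.index(label)
    misses = [(labels[j], c) for j, c in enumerate(table[i]) if j != i]
    return sorted(misses, key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    print("accuracy", round(accuracy(TABLE), 3))
    for label, score in f1_scores(TABLE, LABELS).items():
        print(label, round(score, 3))
    print(top_confusions(TABLE, LABELS, "I"))
```

For true "I" articles, the most frequent wrong predictions are "ga" (47), "II" (46), and "sa" (39), which is consistent with "I" sitting near "ga" and "II", though "sa" absorbs a comparable share.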