Research talk:Automated classification of edit quality/Work log/2017-07-19

Wednesday, July 19, 2017

Today, I am going to tell you the story of how I decided to change the max_features parameter of the GradientBoostingClassifier (GBC) to improve the accuracy of the editquality model and what came out of it.

Problem

One of the issues with the current editquality model is its bias: it leans against non-registered and new editors. To decrease this bias, it could be helpful to increase the model's variance by engaging as many features as reasonable. See Bias-Variance Tradeoff for some details.

Hypothesis

So, I hypothesized that an additional potential source of bias could be max_features. What does the scikit-learn library tell us about this parameter? Here you go: "choosing max_features < n_features leads to a reduction of variance and an increase in bias." We currently use max_features="log2", which is less than n_features: with ~10 features, "log2" leaves us with ~3 randomly selected ones at each split. What if we set max_features back to its default, None, so that all features are engaged in the calculation? It promises to be a safe experiment, since cross-validation (CV) would expose any overfitting in the scores. Let's do this for ruwiki only.
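
To make the comparison concrete, here is a minimal sketch of the experiment in scikit-learn. It is not the actual revscoring tuning pipeline: the synthetic data and sample counts are stand-ins for the real ruwiki feature matrix and damaging labels, and the hyperparameters mirror the top "log2" configuration from the report below.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real experiment scores features
# extracted from ruwiki edits.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Compare "log2" feature subsampling against the default None (all features).
for max_features in ("log2", None):
    gbc = GradientBoostingClassifier(max_depth=7, n_estimators=700,
                                     learning_rate=0.01,
                                     max_features=max_features,
                                     random_state=0)
    scores = cross_val_score(gbc, X, y, cv=5, scoring="roc_auc")
    print(max_features, round(scores.mean(), 3), round(scores.std(), 3))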

Results

My hypothesis proved wrong (at least for ruwiki): the ROC-AUC score with max_features=null for the damaging model is 0.934, while the score for the model with max_features="log2" was higher: 0.936. The results are similar for the goodfaith model (0.932 vs. 0.935) and the reverted model (0.886 vs. 0.891).

Apparently, with all features engaged, the variance increases too much. A common practice with GBC is to consider only about 30-40% of the features at each split, which is roughly what "log2" does here, and what "sqrt", the most commonly recommended value for max_features in GBC, does as well.
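
For intuition, here is the arithmetic (scikit-learn floors both values to an integer; the feature count of 10 is the rough figure used above):

import math

# Number of features considered at each split for n = 10 total features.
n = 10
print(max(1, int(math.log2(n))))   # "log2" -> 3 features, ~30% of 10
print(max(1, int(math.sqrt(n))))   # "sqrt" -> 3 features, ~32% of 10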

Below are excerpts from the ruwiki tuning reports: the "log2" version vs. the None (null) version.

1. DAMAGING

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.936 0.006 max_depth=7, n_estimators=700, learning_rate=0.01, max_features="log2"
GradientBoostingClassifier 0.936 0.006 max_depth=3, n_estimators=300, learning_rate=0.1, max_features="log2"
GradientBoostingClassifier 0.935 0.007 max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2"

vs.

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.934 0.006 n_estimators=700, learning_rate=0.1, max_depth=1, max_features=null
GradientBoostingClassifier 0.934 0.006 n_estimators=300, learning_rate=0.1, max_depth=3, max_features=null
GradientBoostingClassifier 0.934 0.006 n_estimators=500, learning_rate=0.1, max_depth=1, max_features=null

2. GOODFAITH

RandomForestClassifier (RFC) actually tops the list here with 0.935, but GBC with "log2" at least matches that score in its top configuration.

GradientBoostingClassifier

mean(scores) std(scores) params
0.935 0.008 max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01
0.934 0.006 max_features="log2", max_depth=7, n_estimators=500, learning_rate=0.01
0.934 0.007 max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01

vs.

GradientBoostingClassifier

mean(scores) std(scores) params
0.932 0.007 learning_rate=0.01, max_depth=5, max_features=null, n_estimators=500
0.932 0.006 learning_rate=0.01, max_depth=7, max_features=null, n_estimators=500
0.932 0.007 learning_rate=0.01, max_depth=5, max_features=null, n_estimators=300

3. REVERTED

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.891 0.008 learning_rate=0.01, max_depth=7, n_estimators=500, max_features="log2"
GradientBoostingClassifier 0.891 0.007 learning_rate=0.01, max_depth=7, n_estimators=700, max_features="log2"
RandomForestClassifier 0.89 0.011 criterion="entropy", max_features="log2", n_estimators=320, min_samples_leaf=5

vs.

[GBC shows up way below Random Forest here, not even in the top 10]

GradientBoostingClassifier

mean(scores) std(scores) params
0.886 0.005 learning_rate=0.01, n_estimators=700, max_depth=5, max_features=null
0.884 0.004 learning_rate=0.01, n_estimators=500, max_depth=5, max_features=null
0.884 0.007 learning_rate=0.01, n_estimators=500, max_depth=7, max_features=null


Sources of inspiration:

   * http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
   * https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/