Research talk:Automated classification of edit quality/Work log/2017-07-19
Wednesday, July 19, 2017
Today, I am going to tell you the story of how I decided to change the max_features parameter of the GradientBoostingClassifier (GBC) to improve the accuracy of the editquality model and what came out of it.
Problem
One of the issues with the current editquality model is its bias (it leans towards, or rather against, non-registered and new editors). To decrease this bias, it could help to increase the model's variance by engaging as many features as is reasonable. See Bias-Variance Tradeoff for some details.
Hypothesis
So, I hypothesized that an additional potential source of bias could be max_features. What does the scikit-learn documentation tell us about this parameter? Here you go: "choosing max_features < n_features leads to a reduction of variance and an increase in bias." We currently use max_features="log2", and log2(n_features) is smaller than n_features: with ~10 features, "log2" leaves us with ~3 randomly selected ones per split. What if we set max_features back to its default, None, so that all features are engaged in the calculation? It promises to be a safe experiment, because overfitting is unlikely to be a problem thanks to cross-validation (CV). Let's do this for ruwiki only.
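To make the comparison concrete, here is a minimal sketch of the experiment with scikit-learn. The dataset and the specific hyperparameters below are illustrative stand-ins, not the actual editquality feature set or tuning code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an editquality-style feature matrix with ~10 features.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

for max_features in ("log2", None):
    # "log2" considers ~log2(n_features) randomly selected features per split;
    # None (the default) considers all n_features at every split.
    gbc = GradientBoostingClassifier(
        learning_rate=0.01,
        n_estimators=700,
        max_depth=7,
        max_features=max_features,
        random_state=0,
    )
    scores = cross_val_score(gbc, X, y, cv=5, scoring="roc_auc")
    print(max_features, round(scores.mean(), 3), round(scores.std(), 3))
```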
Results
My hypothesis proved wrong (at least for ruwiki): the ROC-AUC score with max_features=null for the damaging model is 0.934, while the score for the model with max_features="log2" was higher, 0.936. Similar results hold for the goodfaith model (0.932 vs. 0.935) and the reverted model (0.886 vs. 0.891).
Apparently, with all features engaged, the variance increases too much. A common practice with GBC is to consider up to 30-40% of the features at each split, which is essentially what "log2" does here; the same goes for "sqrt", which is the most commonly recommended setting for max_features in GBC.
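As a quick back-of-the-envelope check (a toy calculation, not taken from the tuning reports), both heuristics land at roughly a third of the features per split for the ~10-feature example above:

```python
import math

n_features = 10
# Both string settings resolve to roughly the floor of the corresponding value.
print(math.floor(math.log2(n_features)))   # 3 features, ~30% of 10
print(math.floor(math.sqrt(n_features)))   # 3 features, ~30% of 10
```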
Below are excerpts from the ruwiki tuning reports, the max_features="log2" version vs. the max_features=None (null) version.
1. DAMAGING
Top scoring configurations
model | mean(scores) | std(scores) | params |
---|---|---|---|
GradientBoostingClassifier | 0.936 | 0.006 | max_depth=7, n_estimators=700, learning_rate=0.01, max_features="log2" |
GradientBoostingClassifier | 0.936 | 0.006 | max_depth=3, n_estimators=300, learning_rate=0.1, max_features="log2" |
GradientBoostingClassifier | 0.935 | 0.007 | max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2" |
vs.
Top scoring configurations
model | mean(scores) | std(scores) | params |
---|---|---|---|
GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=700, learning_rate=0.1, max_depth=1, max_features=null |
GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=300, learning_rate=0.1, max_depth=3, max_features=null |
GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=500, learning_rate=0.1, max_depth=1, max_features=null |
2. GOODFAITH
RandomForestClassifier actually tops the list here with 0.935, but GBC with "log2" at least matches that score in some configurations.
GradientBoostingClassifier
mean(scores) | std(scores) | params |
---|---|---|
0.935 | 0.008 | max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01 |
0.934 | 0.006 | max_features="log2", max_depth=7, n_estimators=500, learning_rate=0.01 |
0.934 | 0.007 | max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01 |
vs.
GradientBoostingClassifier
mean(scores) | std(scores) | params |
---|---|---|
0.932 | 0.007 | learning_rate=0.01, max_depth=5, max_features=null, n_estimators=500 |
0.932 | 0.006 | learning_rate=0.01, max_depth=7, max_features=null, n_estimators=500 |
0.932 | 0.007 | learning_rate=0.01, max_depth=5, max_features=null, n_estimators=300 |
3. REVERTED
Top scoring configurations
model | mean(scores) | std(scores) | params |
---|---|---|---|
GradientBoostingClassifier | 0.891 | 0.008 | learning_rate=0.01, max_depth=7, n_estimators=500, max_features="log2" |
GradientBoostingClassifier | 0.891 | 0.007 | learning_rate=0.01, max_depth=7, n_estimators=700, max_features="log2" |
RandomForestClassifier | 0.89 | 0.011 | criterion="entropy", max_features="log2", n_estimators=320, min_samples_leaf=5 |
vs.
[GBC shows up well below RandomForestClassifier here, not even in the top 10]
GradientBoostingClassifier
mean(scores) | std(scores) | params |
---|---|---|
0.886 | 0.005 | learning_rate=0.01, n_estimators=700, max_depth=5, max_features=null |
0.884 | 0.004 | learning_rate=0.01, n_estimators=500, max_depth=5, max_features=null |
0.884 | 0.007 | learning_rate=0.01, n_estimators=500, max_depth=7, max_features=null |
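For reference, here is a rough sketch of how a "top scoring configurations" table like the ones above could be reproduced with a plain scikit-learn grid search. The real reports come from the editquality/revscoring tuning utilities, so the dataset and the (reduced) grid below are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# A reduced version of the hyperparameter grid seen in the reports above.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [300, 700],
    "max_depth": [3, 7],
    "max_features": ["log2", None],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

# Rank configurations by mean cross-validated ROC-AUC, as in the tables above.
ranked = sorted(
    zip(
        search.cv_results_["mean_test_score"],
        search.cv_results_["std_test_score"],
        search.cv_results_["params"],
    ),
    key=lambda row: row[0],
    reverse=True,
)
for mean, std, params in ranked[:3]:
    print(round(mean, 3), round(std, 3), params)
```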
Sources of inspiration:
* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
* https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/