Research talk:Automated classification of article importance/Work log/2017-06-29
Thursday, June 29, 2017
Today I'll work on a gap analysis for our WikiProjects, and look into whether I can do a global one as well.
A global dataset can be generated using wikiclass (see extract_scores.py). I'm not sure how long it'll take for enwiki, but it shouldn't be too long.
WikiProjects
WikiProject Africa
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 1,813 | 240 | 166 | 43 |
High | 4 | 1,245 | 0 | 6 |
Mid | 475 | 686 | 2,226 | 688 |
Low | 1,368 | 3,840 | 5,943 | 14,972 |
WikiProject China
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 411 | 1 | 0 | 0 |
High | 67 | 1,439 | 97 | 83 |
Mid | 301 | 2,036 | 4,638 | 2,649 |
Low | 204 | 2,001 | 3,759 | 7,686 |
WikiProject Judaism
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 233 | 0 | 0 | 1 |
High | 8 | 473 | 8 | 8 |
Mid | 101 | 285 | 901 | 311 |
Low | 96 | 311 | 942 | 2,647 |
WikiProject Medicine
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 90 | 2 | 0 | 0 |
High | 26 | 862 | 45 | 35 |
Mid | 106 | 1,917 | 4,114 | 2,620 |
Low | 21 | 1,075 | 3,810 | 8,348 |
Note that WikiProject Medicine defines many categories of articles as "Low-importance" (e.g. all individuals). We have identified 6,619 such articles, and they are not part of this table because their rating is never predicted.
WikiProject National Football League
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 345 | 11 | 1 | 2 |
High | 0 | 519 | 2 | 0 |
Mid | 46 | 262 | 2,444 | 283 |
Low | 12 | 229 | 450 | 3,789 |
Note that WikiProject National Football League specifies that some categories of articles should have specific importance ratings but, unlike what we did for WikiProject Medicine, these articles have not been excluded from the table. This is because the organization of the Wikidata entities related to these articles is inconsistent, making it impossible for us to correctly identify them.
WikiProject Politics
Rated \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 111 | 0 | 0 | 0 |
High | 2 | 1,100 | 6 | 7 |
Mid | 206 | 923 | 2,441 | 528 |
Low | 196 | 1,997 | 3,524 | 13,205 |
Trends
One clear trend in these confusion matrices is that WikiProjects consistently do not label articles as "Top-importance" even though those articles have characteristics similar to other articles that do have this label. Our models are highly accurate when it comes to labelling that class of articles, which means that articles from other classes in the same column are prime candidates for having their rating re-examined. One might be concerned about overfitting on this class, given that our training data includes almost all of those articles. This should not be a problem, because we use oversampling in the model training (except for WikiProject Africa, because of its larger size), and that oversampling is based on a k-nearest neighbors approach where the neighbors can come from any of the classes.
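To make the "same column" argument concrete, here is a minimal sketch that summarizes the Top column of one matrix, using the WikiProject China counts from the table above. It assumes that rows are the ratings currently assigned by the WikiProject and columns are the model's predictions; the function name `top_gap` and the dictionary keys are made up for this illustration and are not part of wikiclass.

```python
# Assumed orientation: rows = current WikiProject rating, columns = predicted rating.
LABELS = ["Top", "High", "Mid", "Low"]

# WikiProject China counts, copied from the table in this log.
CHINA = [
    [411, 1, 0, 0],
    [67, 1439, 97, 83],
    [301, 2036, 4638, 2649],
    [204, 2001, 3759, 7686],
]


def top_gap(matrix, labels=LABELS):
    """Summarize the Top-importance column of a confusion matrix (hypothetical helper)."""
    top = labels.index("Top")
    rated_top = sum(matrix[top])                     # articles currently rated Top
    predicted_top = sum(row[top] for row in matrix)  # articles predicted Top
    agreed = matrix[top][top]                        # rated Top and predicted Top
    # Articles rated lower than Top that the model places in the Top column.
    candidates = {
        labels[i]: row[top] for i, row in enumerate(matrix) if i != top
    }
    return {
        "share_of_rated_top_predicted_top": agreed / rated_top,
        "candidates": candidates,
        "total_candidates": predicted_top - agreed,
    }


summary = top_gap(CHINA)
print(summary["candidates"])        # per-class counts in the Top column
print(summary["total_candidates"])  # 572
```

For WikiProject China this gives 572 articles that are rated High, Mid, or Low but predicted Top, which would be the list to hand over for re-examination under the assumptions above.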
The other trend in these matrices is that it is more difficult to determine the boundaries between the High-, Mid-, and Low-importance classes. We have seen this previously as well: in some projects it is not clear whether an article should have a Mid- or Low-importance rating. Because we chose WikiProjects that tend to define importance through article views, and we have views as a predictor in our models, this suggests that these WikiProjects might want to re-examine their definitions and how they apply their ratings. That could lead both to ratings that are more clearly aligned with the definition and to ratings that are applied more consistently.
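The blurrier boundaries can be quantified from the same tables. The sketch below uses the WikiProject Medicine counts and again assumes that rows are current ratings and columns are predictions. It computes, for each rated class, the share of articles for which the prediction agrees with the rating, and counts how many articles are swapped between two classes. The helper names `row_agreement` and `swapped` are hypothetical.

```python
# Assumed orientation: rows = current WikiProject rating, columns = predicted rating.
LABELS = ["Top", "High", "Mid", "Low"]

# WikiProject Medicine counts, copied from the table in this log.
MEDICINE = [
    [90, 2, 0, 0],
    [26, 862, 45, 35],
    [106, 1917, 4114, 2620],
    [21, 1075, 3810, 8348],
]


def row_agreement(matrix, labels=LABELS):
    """Share of each rated class whose prediction matches the rating."""
    return {
        labels[i]: row[i] / sum(row) for i, row in enumerate(matrix)
    }


def swapped(matrix, a, b, labels=LABELS):
    """Number of articles rated `a` but predicted `b`, plus the reverse."""
    i, j = labels.index(a), labels.index(b)
    return matrix[i][j] + matrix[j][i]


agreement = row_agreement(MEDICINE)
for label in LABELS:
    print(f"{label}: {agreement[label]:.3f}")

mid_low = swapped(MEDICINE, "Mid", "Low")
print(mid_low)  # 6430
```

In this matrix the agreement for the Mid and Low rows is clearly lower than for the Top row, and 6,430 articles sit on the "wrong" side of the Mid/Low boundary in one direction or the other, which is the pattern the paragraph above describes.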