Research talk:Automated classification of article importance/Work log/2017-04-25
Tuesday, April 25, 2017
Today I'll wrap up the WikiProject candidate selection by getting a list prepared for discussion. Then I'll move over to figuring out how to get view rate data.
Organic inlinks vs all inlinks
I decided to revisit my feasibility study to check whether our earlier conclusion still holds: that using the database to grab inlinks gives us a better signal than using only the links present in the wikitext (meaning links coming in through infoboxes, navboxes, etc. get ignored). Previously, I had found that using all links gave better results, but that might have changed now that we use rank percentiles for views and inlinks as predictors. To keep things simple, I used the exact same dataset as before and simply calculated the rank percentiles to use as predictors. Neither the SVM nor the GBM models were practical for this, being either slow to compute (SVM) or using too much memory (GBM), so I went with a limited-size Random Forest model instead. Like before, I used 10-fold cross-validation to decide on forest size and terminal node size, choosing the combination with the highest overall accuracy. Based on the training "out of bag" estimate, there is no difference in overall accuracy between the two approaches when predicting importance across all of English Wikipedia (using our feasibility study dataset). There are some changes for individual classes, which might suggest that the two approaches make slightly different decisions for individual articles. Either way, once we have a way to compute organic inlink counts, we'll want to study those again.
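
For reference, here is a minimal sketch of what turning raw counts into rank percentiles could look like. The log does not say how ties are handled or which exact percentile definition was used, so the average-rank treatment of ties and the function name `rank_percentiles` are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right


def rank_percentiles(values):
    """Map each raw count (e.g. views or inlinks) to a rank percentile in (0, 100].

    Assumption: tied values share the average of the ranks they occupy.
    """
    ordered = sorted(values)
    n = len(ordered)
    percentiles = []
    for value in values:
        below = bisect_left(ordered, value)         # values strictly smaller
        at_or_below = bisect_right(ordered, value)  # values smaller or equal
        # Tied values occupy 1-based ranks below+1 .. at_or_below; use their mean.
        average_rank = (below + 1 + at_or_below) / 2
        percentiles.append(100.0 * average_rank / n)
    return percentiles


# Hypothetical inlink counts for four articles.
inlink_counts = [10, 200, 10, 3000]
print(rank_percentiles(inlink_counts))
```

The point of a transformation like this is that the predictor depends only on an article's position relative to the others, not on the heavily skewed raw counts.
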
WikiProject candidates
List of candidate WikiProjects with at least 100 non-bot edits in their project space (e.g. "WikiProject China" and all its related talk pages and subpages) over the past 180 days, at least 1,000 edits to their articles in the past 180 days, and at least 25% of their articles rated "unknown" importance:
Project name | No. of articles | % unknown importance | Non-bot project-space edits (180 days) |
---|---|---|---|
WikiProject Africa | 80,937 | 43.2 | 7,324 |
WikiProject Albums | 172,025 | 41.4 | 439 |
WikiProject Beauty Pageants | 5,984 | 53.3 | 127 |
WikiProject Buddhism | 4,689 | 53.6 | 137 |
WikiProject Chicago | 43,604 | 50.3 | 127 |
WikiProject China | 50,846 | 37.6 | 122 |
WikiProject Cycling | 21,210 | 50.0 | 156 |
WikiProject Dungeons & Dragons | 4,062 | 25.4 | 186 |
WikiProject Europe | 5,091 | 48.1 | 587 |
WikiProject Historic sites | 8,553 | 38.2 | 170 |
WikiProject Horror | 12,102 | 42.7 | 122 |
WikiProject Iran | 89,620 | 76.9 | 125 |
WikiProject Judaism | 11,036 | 30.2 | 126 |
WikiProject Korea | 21,635 | 39.1 | 151 |
WikiProject Malaysia | 9,046 | 27.6 | 104 |
WikiProject Motorsport | 9,615 | 31.4 | 119 |
WikiProject National Football League | 27,766 | 68.8 | 717 |
WikiProject Olympics | 108,593 | 43.9 | 114 |
WikiProject Pharmacology | 10,957 | 40.7 | 269 |
WikiProject Politics | 47,556 | 29.1 | 225 |
WikiProject Politics of the United Kingdom | 37,165 | 52.2 | 131 |
WikiProject Rock music | 15,346 | 33.7 | 155 |
WikiProject Rugby league | 14,044 | 33.4 | 191 |
WikiProject Television | 107,565 | 34.5 | 395 |
WikiProject Television Stations | 9,060 | 36.8 | 253 |
WikiProject United Nations | 5,122 | 64.2 | 226 |
WikiProject Yugoslavia | 2,722 | 32.8 | 211 |
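
The three selection criteria above can be sketched as a simple filter. This is only an illustration of the thresholds stated in this log, not the queries actually used: the `ProjectStats` type, its field names, and the example projects and numbers are all hypothetical.

```python
from dataclasses import dataclass

# Thresholds as stated in the selection criteria (180-day windows).
MIN_PROJECT_SPACE_EDITS = 100   # non-bot edits in the project space
MIN_ARTICLE_EDITS = 1000        # edits to the project's articles
MIN_PCT_UNKNOWN = 25.0          # % of articles rated "unknown" importance


@dataclass
class ProjectStats:
    """Hypothetical per-project summary; field names are assumptions."""
    name: str
    project_space_edits: int
    article_edits: int
    pct_unknown: float


def is_candidate(project):
    """Return True if the project meets all three criteria."""
    return (
        project.project_space_edits >= MIN_PROJECT_SPACE_EDITS
        and project.article_edits >= MIN_ARTICLE_EDITS
        and project.pct_unknown >= MIN_PCT_UNKNOWN
    )


def select_candidates(projects):
    """Return the names of qualifying projects, sorted alphabetically."""
    return sorted(p.name for p in projects if is_candidate(p))


# Made-up example data, not taken from the table above.
examples = [
    ProjectStats("Example A", 150, 5000, 40.0),
    ProjectStats("Example B", 90, 5000, 40.0),    # too few project-space edits
    ProjectStats("Example C", 150, 5000, 10.0),   # too few unknown ratings
]
print(select_candidates(examples))
```
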