Research talk:Automated classification of article importance/Work log/2017-03-06
Monday, March 6, 2017
Today I'll be gathering a dataset of importance-rated articles on the English Wikipedia, aiming to answer the following research questions:
- How many articles (in total and as a proportion of all articles) have at least one importance rating?
- How many articles are rated by a single project?
- How many articles are rated by multiple projects?
- How many ratings are unanimous?
- How many are rated by more than one project and unanimously rated?
- What is the overlap between ratings?
- How many have more than two ratings?
To measure the proportion of all articles that have at least one importance rating, I need the total number of articles. I wrote a simple SQL query that estimates it by counting all pages in the main namespace that are neither redirects nor disambiguation pages. At the time of writing, that count is 5,073,074.
Notes on the gathered dataset
I wrote a Python script to gather a dataset of articles with importance ratings. The code makes a few assumptions about the importance ratings that need to be spelled out:
- The importance ratings we are mainly concerned with are: Top, High, Mid, and Low. "Unknown" and "NA" ratings are also recorded in the output dataset as those are in fairly common use. Less commonly used ratings are ignored, such as "Related" (used by Wikipedia:WikiProject National Register of Historic Places) and "Bottom" (used by, for instance, Wikipedia:WikiProject Rocketry).
- The script assumes that a WikiProject follows the standard naming conventions for categories of articles by importance. Most WikiProjects appear to do so, using the typical schema of the importance rating (e.g. "Top-importance") followed by the project name (e.g. "medicine") and suffixed with "articles" (e.g. the category is named Category:Top-importance medicine articles).
- The dataset contains talk pages that do not have associated articles. For example Talk:Tribes of Montenegro/Archive 1 has many WikiProject rating templates but does not have an article with a matching title (Tribes of Montenegro/Archive 1 does not and should not exist). If the talk page title contains "/archive" (using a case-insensitive match) the associated "talk_is_archive" column is set to 1. If the associated article page does not exist, the "page_id" and "revision_id" columns should both be -1.
- The dataset contains talk pages where the associated article page is a redirect. This is a fairly common problem amongst WikiProjects: an article page is moved without the associated talk page also moving. In some cases valid redirects are set up but are also given importance ratings. Our dataset contains an "is_redirect" column that is set to 1 when the article page is a redirect.
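The assumptions above can be sketched in Python. This is a minimal illustration, not the script used to gather the dataset: the helper names `importance_category` and `talk_is_archive` (the latter borrowed from the dataset column) are hypothetical, and only the category naming schema and the case-insensitive "/archive" match come from the notes above.

```python
# Ratings recorded in the dataset, per the notes above.
RECORDED_RATINGS = ("Top", "High", "Mid", "Low", "Unknown", "NA")


def importance_category(rating: str, project: str) -> str:
    """Build a category title following the standard naming schema.

    Hypothetical helper: assumes the project uses the
    "<rating>-importance <project> articles" convention.
    """
    if rating not in RECORDED_RATINGS:
        raise ValueError(f"rating not recorded in the dataset: {rating!r}")
    return f"Category:{rating}-importance {project} articles"


def talk_is_archive(talk_title: str) -> int:
    """Return 1 if the talk page title contains "/archive" (case-insensitive)."""
    return 1 if "/archive" in talk_title.lower() else 0


print(importance_category("Top", "medicine"))
print(talk_is_archive("Talk:Tribes of Montenegro/Archive 1"))
print(talk_is_archive("Talk:Tribes of Montenegro"))
```

A project that deviates from the naming schema would simply not be matched by a lookup built this way, which is why the assumption needs to be stated.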
RQ1: Number of articles with importance ratings
In our dataset, the total number of articles with importance ratings is defined as the number of talk pages that are not an archive, and where the associated article exists but is not a redirect. This gives us the following number of articles and proportion:
> length(impdata[talk_is_archive == 0 & page_id > 0 & is_redirect == 0]$page_id);
[1] 3321150
> total_articles
[1] 5073074
> 100 * length(impdata[talk_is_archive == 0 & page_id > 0 & is_redirect == 0]$page_id)/total_articles;
[1] 65.46622
3,321,150 articles, or 65.47%, have at least one importance rating. Note that the total number of articles comes from our Quarry SQL query as described above.
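As a cross-check of the filter and the percentage, here is a rough Python equivalent. The toy rows are invented for illustration; only the column names, the filter conditions, and the two totals come from the log.

```python
# Toy rows (hypothetical), using the column names from the dataset.
rows = [
    {"page_id": 101, "talk_is_archive": 0, "is_redirect": 0},  # counted
    {"page_id": -1, "talk_is_archive": 1, "is_redirect": 0},   # archive, no article
    {"page_id": 102, "talk_is_archive": 0, "is_redirect": 1},  # redirect
    {"page_id": -1, "talk_is_archive": 0, "is_redirect": 0},   # article missing
]


def is_rated_article(row: dict) -> bool:
    """Same conditions as the R filter: not an archive, article exists, not a redirect."""
    return (
        row["talk_is_archive"] == 0
        and row["page_id"] > 0
        and row["is_redirect"] == 0
    )


rated = [row for row in rows if is_rated_article(row)]
print(len(rated))

# Proportion using the counts reported in the log.
rated_articles = 3_321_150
total_articles = 5_073_074
percentage = 100 * rated_articles / total_articles
print(f"{percentage:.2f}%")
```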
RQ2: How many articles are rated by a single project?
Our dataset contains two columns that can give us this answer: "importance_ratings" and "wikiprojects". Each lists its values using "::" or "," respectively as a separator when multiple values are present. An article rated by a single project therefore has no separator, which lets us identify such articles by the absence of one.
> length(only_articles[grep(",", only_articles$importance_ratings, invert=TRUE)]$page_id);
[1] 1123925
> 100*length(only_articles[
+ grep(",", only_articles$importance_ratings, invert=TRUE)
+ ]$page_id)/length(only_articles$page_id);
[1] 33.84144
1,123,925 articles, or 33.84%, are only rated by a single WikiProject.
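The same separator test can be written in Python. This sketch uses invented values and assumes "," as the separator, mirroring the pattern passed to `grep` in the R calls above; the names `count_ratings` and `SEPARATOR` are hypothetical.

```python
SEPARATOR = ","  # assumption: mirrors the pattern passed to grep() above

# Hypothetical importance_ratings values for four articles.
importance_ratings = ["Low", "Mid,Low", "Top,High,High", "Unknown"]


def count_ratings(value: str, separator: str = SEPARATOR) -> int:
    """Number of separator-delimited ratings in one column value."""
    return len(value.split(separator))


# No separator means a single rating project; a separator means several.
single = [v for v in importance_ratings if SEPARATOR not in v]
multiple = [v for v in importance_ratings if SEPARATOR in v]

print(len(single), len(multiple))
print([count_ratings(v) for v in importance_ratings])
```

Counting the split parts rather than only testing for the separator also gives the number of ratings per article, which is useful for the later question about articles with more than two ratings.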
RQ3: How many articles are rated by multiple projects?
This is the complement of RQ2: all we need to do is remove the "invert=TRUE" parameter from the "grep" call:
> length(only_articles[grep(",", only_articles$importance_ratings)]$page_id);
[1] 2197225
> 100*length(only_articles[
+ grep(",", only_articles$importance_ratings)
+ ]$page_id)/length(only_articles$page_id);
[1] 66.15856
2,197,225 articles, or 66.16%, are rated by multiple WikiProjects.
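Because every rated article is rated by either one project or several, the RQ2 and RQ3 counts should partition the RQ1 total. A quick check in Python, using only the numbers reported above:

```python
single_project = 1_123_925     # RQ2
multiple_projects = 2_197_225  # RQ3
rated_articles = 3_321_150     # RQ1

# The two groups are complementary, so their sizes must add up to the total.
partition_holds = single_project + multiple_projects == rated_articles
print(partition_holds)

share_single = 100 * single_project / rated_articles
share_multiple = 100 * multiple_projects / rated_articles
print(f"{share_single:.2f}% + {share_multiple:.2f}%")
```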