User:Halfak (WMF)/New page creations, deletions, and drafts
One of the new projects that I'm working on for the Growth team is an exploration of the nature of article creation for newcomers.
There are a few questions that I'd like to explore, but the first thing to do is really to just get a sense for how many articles are being created, who is creating them and what the rate of deletion is.
Datasets
[edit]In order to preserve consistency between datasets, I'll limit my analysis to all revisions, pages and log events that occurred before November 5th, 2013. The following subsections describe datasets that can be pulled from the database.
Pages
[edit]SQL
|
---|
SELECT
ar_page_id AS page_id,
ar_namespace AS page_namespace,
ar_title AS page_title,
COUNT(*) AS revisions,
MIN(ar_timestamp) AS first_revision,
MAX(ar_timestamp) AS last_revision,
TRUE AS archived,
MIN(ar_rev_id) AS first_rev_id
FROM archive
WHERE ar_timestamp < "20131105000000"
GROUP BY 1,2,3;
UNION
SELECT
page_id,
page_namespace,
page_title,
COUNT(*) AS revisions,
MIN(rev_timestamp) AS first_edit,
MAX(rev_timestamp) AS last_edit,
False AS archived,
MIN(rev_id) AS first_rev_id
FROM revision
INNER JOIN page ON page_id = rev_page
WHERE rev_timestamp < "20131105000000"
GROUP BY 1,2,3;
|
- A page was created on the date of its first revision (first_edit == creation).
- Each page with a row in the
logging
table for a page, it was_deleted. - If an
ar_page_id
appears in thearchive
table, then the page is_currently_deleted.
Events
[edit]Page creation
[edit]SQL
|
---|
SELECT
IFNULL(rev_id, ar_rev_id) AS rev_id,
IFNULL(rev_page, ar_page_id) AS page_id,
IFNULL(rev_comment, ar_comment) AS rev_comment,
IFNULL(rev_user, ar_user) AS user_id,
IFNULL(rev_user_text, ar_user_text) AS user_text,
IFNULL(rev_timestamp, ar_timestamp) AS rev_timestamp,
IFNULL(rev_deleted, ar_deleted) AS rev_deleted,
IFNULL(rev_len, ar_len) AS rev_len
FROM halfak.nov13_page
INNER JOIN revision ON first_rev_id = rev_id
INNER JOIN archive ON first_rev_id = ar_rev_id
|
- The
revision
(orarchive
) with the lowestrev_id
was the creating revision - There should not be duplication between
revision
andarchive
, but if there is, preferrevision
Page deletion
[edit]SQL
|
---|
SELECT
log_id,
log_type,
log_user AS user_id,
log_user_text AS user_text,
log_page AS page_id,
log_namespace AS page_namespace,
log_title AS page_title,
log_timestamp,
log_comment
FROM logging
WHERE
log_action = 'delete' AND
log_type = 'delete' AND
log_timestamp < "20131105000000"
|
- Each
log_action="delete" AND log_type="delete"
row inlogging
corresponds to a deletion event. - Each deletion event has a "reason" in the
log_comment
field. This reason should be classifiable. log_page IS NULL
orlog_page = 0
for all records. This might be a bug (#26122) and will make analysis difficult.
Page restorations
[edit]SQL
|
---|
SELECT
log_id,
log_type,
log_user AS user_id,
log_user_text AS user_text,
log_page AS page_id,
log_namespace AS page_namespace,
log_title AS page_title,
log_timestamp,
log_comment
FROM logging
WHERE
log_action = 'restore' AND
log_type = 'delete' AND
log_timestamp < "20131105000000";
|
- Each
log_action="restore" AND log_type="delete"
row inlogging
corresponds to a restoration event. - Each restoration event has a "reason" in the
log_comment
field. This reason should be classifiable. - When a page is restored, it's page_id changes.
log_page
stores this new page_id.
Page curation actions
[edit]SQL
|
---|
SELECT
log_id,
log_action,
log_page AS page_id,
log_comment
FROM logging
WHERE
log_type = 'pagetriage-curation' AND
log_timestamp < "20131105000000";
|
- Each
log_type = "pagetriage-curation"
row inlogging
correspends to a page curation event. - The
log_action
field represents the type of action that was taken"reviewed"
-- marks the page as "reviewed" in the Page curation system."tag"
-- marks the addition of cleanup templates like "referenced" and "uncategorized""delete"
-- adds a deletion nomination template to the article."unreviewed"
-- marks the page as "unreviewed" in the Page curation system.
Timeseries
[edit]Article status
[edit]In these plots, "now-archived" means that, when the query was run on 20131105, the page appeared in the archive
table. "Now-visible" represents pages that had survived deletion. In the case of these plots, a page is considered to be an "article" if page_namespace=0
in the page
table or ar_namespace=0
in the archive
table.
Page actions
[edit]This plot captures the rate of three relevant page actions:
- creation: The first revision is saved to particular page
- deletion: A page is deleted
- restoration: A page is undeleted
A log scale is used on the y axis to make restorations visible since they are highly uncommon.
It looks like logging didn't exist before 2005.
Page lifetime
[edit]Edit diff approximation
[edit]The following plot shows the density of time between the first and last edit to a page (approximation of the page's lifetime).
Note the spike at zero lifetime which represents the pages that were deleted after only one revision. Note that these pages couldn't have even been tagged before they were deleted.
Note also the gentle, but steady lengthening of time between creation and deletion (last edit) between 2007 and 2013. This suggests that the lifetime of deleted pages has been extending over time.
Articles for Creation
[edit]AfC was originally created in 200?[citation needed] when direct article creation for IP editors was disabled in order to allow IPs to still contribute new articles. A strange artifact of AfC is how it was designed around the creation of pre-article pages in the Wikipedia_talk namespace. This is due to the fact that when article creation was disabled for IP editors, the right to create talk pages remained.
Recently, newly registered editors have been increasingly directed to AfC with the intention that these newcomers would benefit from mentoring along with the review process built into AfC.
Historical notes
[edit]AFC change over time
[edit]AFC functioned as a place to request articles as of 2006: [1]
Submissions reviewed as of Feb 2007: [2]
Anon backstory
[edit]See 2005 [3] RFC Nov 2007: [4]
Relevant moments
[edit]WikiProject created June 2007: [5]
Article Wizard page appeared August 2009 [6]
Hypotheses
[edit]- AfC is too slow.
- Submissions don't get reviewed fast enough -- possibly due to lack of reviewers
- AfC reviews are ineffective at ensuring quality.
- Submissions are accepted, but later deleted.
- AfC doesn't improve new editor retention
- Because of the two above effects, AfC causes new editors to leave just as often as if they create their article in mainspace and risk speedy deletion
- AFC is a low-quality dump
- A significant number of drafts are unlikely to ever cross our Notability threshold
- AFC has poor reviewing
- AFC might be biased towards declining submissions, and have incorrect reviewing on the same
- AFC might hinder good faith work, eg editathons
Workflow
[edit]There are 4 main states that an AfC page can occupy:
- Draft: Pages that have not yet been submitted for review.
- Pending: Pages that are waiting for review by AfC volunteers
- Accepted: Pages that were accepted and moved to NS=0
- Declined: Pages that were declined
Any non-accepted submissions that remain idle for more than 60 days[citation needed] can be flagged with CSD criteria G13 "Abandoned Articles for creation submissions". Pages that are currently occupying these states can be discovered via categories.
- Draft: Draft AfC submissions
- Pending: Pending AfC submissions
- Accepted: Accepted AfC submissions
- Declined: Subcategories of Declined AfC submissions -- e.g., AfC submissions declined as an advertisement
- Note also that stagnant drafts can be Speedy Deleted under criteria en:WP:G13. These submissions should be considered declined.
Historical data extraction
[edit]Regular expression
|
---|
/\{\{AfC( |_)submission\|([^\|]*)|/g
|
Sadly, MediaWiki does not keep a history of categorizations that we can simply query in order to find out when an article moved from category to category. The best way we can do so is by performing text analysis on the content of AfC pages and looking for instances of {{AfC submission}}
. The template's first positional parameter denotes the state of the submission:
- Draft: "t"
- Pending: "" (empty) or "r" for "review in progress"
- Accepted: "a"
- Declined: "d"
Part of the AfC process involves appending duplicate templates to the bottom of the article (and then manually removing an other version). In this case, the template with the greatest precedence should be taken. "t" < "" < "r" < ("a", "d"). Since "accepted" and "declined" share precedence we might have a problem determining what status the article occupies if both are present. I'll arbitrarily decide that "a" < "d" and plan for analysis to see how often this occurs.
Strategy
[edit]For all articles that appear in a relevant AfC category (see above), process revision content from the first revision forward in time.
- If page is in ns=0, process corresponding talk page revisions in parallel to content page revisions.
- For each revision content, extract all instances of the AfC submission template using the Regular expression above.
- Extract first group match looking for {"t", "", "r", "a", "d"}
- Identify the most advanced (based on precedence) status
- If most advanced status is different from previous revision, record status change at revision timestamp