Research talk:Autoconfirmed article creation trial/Work log/2017-09-04

From Meta, a Wikimedia project coordination wiki

Monday, September 4, 2017

Today I'll build upon the past few days' analysis of article creation and survival by identifying why our article creation dataset does not appear to include deleted articles, and by checking a few dozen random accounts to see if there's an easily identifiable difference between those who get their articles deleted and those who do not.

Article creation dataset

We saw in the Sept 1 work log that our article creation dataset does not appear to include any deleted articles prior to mid-2014. There is also a suspicious dip between 2012 and 2014 that I want to investigate.

I plan to get at this by first identifying some speedy-deleted articles from that dip, then finding their first revisions, and then looking at what properties those revisions have in the Data Lake to see if there's a pattern that we can use to catch those page creation events as well.

I looked at MusikAnimal's code for the article creation reports to see how speedy deletion events are logged. Searching for "WP:CSD" in the log comment does the job, so I used the following SQL query to get deleted articles that are roughly in the middle of the "dip" in our dataset:

SELECT *
FROM logging
WHERE log_type='delete'
AND log_action='delete'
AND log_namespace=0
AND log_timestamp >= '20130701000000'
AND log_timestamp < '20130702000000'
AND log_comment LIKE '%WP:CSD%'
LIMIT 50;

This gave me a list of 50 articles, which I then looked up in the archive table to find their first edits. From those 50, I identified 33 revisions that created an article during the timespan we are interested in. Looking them up in the Data Lake, I found that we miss them because revision_parent_id is not NULL; instead, it is correctly set to 0. I also see that all of these revisions have page_revision_count set to 1. I'll run a new query where I move the check for page_revision_count out to see if I can trust it, and where revision_parent_id can be either NULL or 0.
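To illustrate why the old filter missed these creations, here is a minimal sketch using Python's built-in sqlite3 module on a toy table. The table name revisions_toy, the revision IDs, and the rows are all hypothetical; only the field names revision_parent_id and page_revision_count come from the notes above. It is not the actual Data Lake query, just a small demonstration that an IS NULL check never matches a parent ID of 0.

```python
import sqlite3

# Toy stand-in for the Data Lake history data. The table name and the rows
# are hypothetical; only revision_parent_id and page_revision_count are
# field names taken from the work log.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE revisions_toy (
        revision_id INTEGER PRIMARY KEY,
        revision_parent_id INTEGER,
        page_revision_count INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO revisions_toy VALUES (?, ?, ?)",
    [
        (101, None, 1),  # page creation recorded with a NULL parent
        (102, 0, 1),     # page creation recorded with parent 0
        (103, 101, 2),   # ordinary follow-up edit
    ],
)


def creation_ids(where_clause):
    """Return revision IDs matching the given WHERE clause, in ID order."""
    query = (
        "SELECT revision_id FROM revisions_toy "
        f"WHERE {where_clause} ORDER BY revision_id"
    )
    return [row[0] for row in conn.execute(query)]


# Old filter: only NULL parents are treated as page creations.
null_only = creation_ids("revision_parent_id IS NULL")

# New filter: a parent of either NULL or 0 marks a page creation.
null_or_zero = creation_ids(
    "revision_parent_id IS NULL OR revision_parent_id = 0"
)

# page_revision_count is selected rather than filtered on, so its values
# can be inspected for the matched rows.
counts = [
    row[0]
    for row in conn.execute(
        "SELECT page_revision_count FROM revisions_toy "
        "WHERE revision_parent_id IS NULL OR revision_parent_id = 0 "
        "ORDER BY revision_id"
    )
]

print(null_only)
print(null_or_zero)
print(counts)
```

In this toy data the NULL-only filter finds one creation while the combined filter finds both, and selecting page_revision_count instead of filtering on it makes it possible to check whether it is consistently 1 for creation events.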

The plot above shows the number of articles created per day in the new dataset. This looks much better than what we had before. There are peaks and dips, but there do not appear to be any clear plateau shifts in the dataset. The slowly decreasing trend in the number of articles created also seems to have some easily identifiable patterns, e.g. activity is higher in the spring and fall than in the summer.