Research talk:Newly registered user/Work log/2014-05-19
Add topicMonday, May 19th
[edit]On April 30th, QChris asked.
COUNT(*)
-ing the lines emitted by the two Sample queries for “local”, I obtain a difference of ~0.7% (checked with enwiki, dewiki, elwiki) between the queries. Especially, since enwiki has a count of ~150K, it looks the queries are measuring different things. What is causing this difference?
I strongly suspect that the only difference is data loss from EventLogging, but it's a good idea to check on it.
SET @start_date = '20131101';
SET @end_date = '20131102';
SELECT COUNT(*)
FROM
enwiki.logging
WHERE
`log_type` = 'newusers' /* exclude proxy registrations */
AND
`log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
`log_timestamp` BETWEEN @start_date AND @end_date;
+----------+ | COUNT(*) | +----------+ | 4310 | +----------+ 1 row in set (0.06 sec)
SELECT COUNT(*)
FROM
log.ServerSideAccountCreation_5487345 /* schema revision reference */
WHERE
`event_isSelfMade` = 1 /* exclude proxy registrations */
AND
`wiki` = 'enwiki' /* specify the project. Omit for global registrations */
AND
`timestamp` >= @start_date /* specify a datetime range */
AND
`timestamp` < @end_date; /* specify a datetime range */
+----------+ | COUNT(*) | +----------+ | 4310 | +----------+ 1 row in set (0.21 sec)
Hmm. Same counts there. Let's try a larger window.
SET @start_date = '20131101';
SET @end_date = '20131201';
SELECT COUNT(*)
FROM
enwiki.logging
WHERE
`log_type` = 'newusers' /* exclude proxy registrations */
AND
`log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
`log_timestamp` BETWEEN @start_date AND @end_date;
+----------+ | COUNT(*) | +----------+ | 150308 | +----------+ 1 row in set (1.31 sec)
SELECT COUNT(*)
FROM
log.ServerSideAccountCreation_5487345 /* schema revision reference */
WHERE
`event_isSelfMade` = 1 /* exclude proxy registrations */
AND
`wiki` = 'enwiki' /* specify the project. Omit for global registrations */
AND
`timestamp` >= @start_date /* specify a datetime range */
AND
`timestamp` < @end_date; /* specify a datetime range */
+----------+ | COUNT(*) | +----------+ | 150159 | +----------+ 1 row in set (2.69 sec)
There's our missing data. Now, let's look at those records and see when they happened.
SELECT
COUNT(*),
SUM(event_userId IS NOT NULL)
FROM
enwiki.logging
LEFT JOIN log.ServerSideAccountCreation_5487345 ON
wiki = 'enwiki' AND event_userId = log_user
WHERE
`log_type` = 'newusers' /* exclude proxy registrations */
AND
`log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
`log_timestamp` BETWEEN @start_date AND @end_date;
+----------+-------------------------------+ | COUNT(*) | SUM(event_userId IS NOT NULL) | +----------+-------------------------------+ | 150308 | 150158 | +----------+-------------------------------+ 1 row in set (3.30 sec)
There's our missing values. It looks like they just got dumped (as opposed to flagged as not isSelfMade). Let's see when they happened.
SELECT
LEFT(log_timestamp, 8) AS day,
SUM(event_userId IS NULL)
FROM
enwiki.logging
LEFT JOIN log.ServerSideAccountCreation_5487345 ON
wiki = 'enwiki' AND event_userId = log_user
WHERE
`log_type` = 'newusers' /* exclude proxy registrations */
AND
`log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
`log_timestamp` BETWEEN @start_date AND @end_date
GROUP BY 1;
+----------+---------------------------+ | day | SUM(event_userId IS NULL) | +----------+---------------------------+ | 20131101 | 0 | | 20131102 | 0 | | 20131103 | 0 | | 20131104 | 0 | | 20131105 | 0 | | 20131106 | 0 | | 20131107 | 0 | | 20131108 | 0 | | 20131109 | 31 | | 20131110 | 46 | | 20131111 | 0 | | 20131112 | 0 | | 20131113 | 0 | | 20131114 | 0 | | 20131115 | 7 | | 20131116 | 22 | | 20131117 | 0 | | 20131118 | 0 | | 20131119 | 9 | | 20131120 | 4 | | 20131121 | 30 | | 20131122 | 1 | | 20131123 | 0 | | 20131124 | 0 | | 20131125 | 0 | | 20131126 | 0 | | 20131127 | 0 | | 20131128 | 0 | | 20131129 | 0 | | 20131130 | 0 | +----------+---------------------------+ 30 rows in set (3.81 sec)
Looks like most days we don't lose data, but we *are* missing a few records scattered throughout a few days in November. --Halfak (WMF) (talk) 19:40, 19 May 2014 (UTC)