Jump to content

Research talk:Newly registered user/Work log/2014-05-19

Add topic
From Meta, a Wikimedia project coordination wiki

Monday, May 19th

[edit]

On April 30th, QChris asked.

COUNT(*)-ing the lines emitted by the two Sample queries for “local”, I obtain a difference of ~0.7% (checked with enwiki, dewiki, elwiki) between the queries. Especially, since enwiki has a count of ~150K, it looks the queries are measuring different things. What is causing this difference?

I strongly suspect that the only difference is data loss from EventLogging, but it's a good idea to check on it.

SET @start_date = '20131101';
SET @end_date = '20131102';
SELECT COUNT(*)
FROM
  enwiki.logging
WHERE
  `log_type` = 'newusers' /* exclude proxy registrations */
AND
  `log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
  `log_timestamp` BETWEEN @start_date AND @end_date;
+----------+
| COUNT(*) |
+----------+
|     4310 |
+----------+
1 row in set (0.06 sec)
SELECT COUNT(*)
FROM
  log.ServerSideAccountCreation_5487345 /* schema revision reference */
WHERE
  `event_isSelfMade` = 1 /* exclude proxy registrations */
AND
  `wiki` = 'enwiki' /* specify the project. Omit for global registrations */
AND
  `timestamp` >= @start_date /* specify a datetime range */
AND
  `timestamp` < @end_date; /* specify a datetime range */
+----------+
| COUNT(*) |
+----------+
|     4310 |
+----------+
1 row in set (0.21 sec)

Hmm. Same counts there. Let's try a larger window.

SET @start_date = '20131101';
SET @end_date = '20131201';
SELECT COUNT(*)
FROM
  enwiki.logging
WHERE
  `log_type` = 'newusers' /* exclude proxy registrations */
AND
  `log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
  `log_timestamp` BETWEEN @start_date AND @end_date;
+----------+
| COUNT(*) |
+----------+
|   150308 |
+----------+
1 row in set (1.31 sec)
SELECT COUNT(*)
FROM
  log.ServerSideAccountCreation_5487345 /* schema revision reference */
WHERE
  `event_isSelfMade` = 1 /* exclude proxy registrations */
AND
  `wiki` = 'enwiki' /* specify the project. Omit for global registrations */
AND
  `timestamp` >= @start_date /* specify a datetime range */
AND
  `timestamp` < @end_date; /* specify a datetime range */
+----------+
| COUNT(*) |
+----------+
|   150159 |
+----------+
1 row in set (2.69 sec)

There's our missing data. Now, let's look at those records and see when they happened.

SELECT
  COUNT(*),
  SUM(event_userId IS NOT NULL)
FROM
  enwiki.logging
LEFT JOIN log.ServerSideAccountCreation_5487345 ON
  wiki = 'enwiki' AND event_userId = log_user
WHERE
  `log_type` = 'newusers' /* exclude proxy registrations */
AND
  `log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
  `log_timestamp` BETWEEN @start_date AND @end_date;
+----------+-------------------------------+
| COUNT(*) | SUM(event_userId IS NOT NULL) |
+----------+-------------------------------+
|   150308 |                        150158 |
+----------+-------------------------------+
1 row in set (3.30 sec)

There's our missing values. It looks like they just got dumped (as opposed to flagged as not isSelfMade). Let's see when they happened.

SELECT
  LEFT(log_timestamp, 8) AS day,
  SUM(event_userId IS NULL)
FROM
  enwiki.logging
LEFT JOIN log.ServerSideAccountCreation_5487345 ON
  wiki = 'enwiki' AND event_userId = log_user
WHERE
  `log_type` = 'newusers' /* exclude proxy registrations */
AND
  `log_action` = 'create' /* only include self-created users, exclude attached and proxy-registered users */
AND
  `log_timestamp` BETWEEN @start_date AND @end_date
GROUP BY 1;
+----------+---------------------------+
| day      | SUM(event_userId IS NULL) |
+----------+---------------------------+
| 20131101 |                         0 |
| 20131102 |                         0 |
| 20131103 |                         0 |
| 20131104 |                         0 |
| 20131105 |                         0 |
| 20131106 |                         0 |
| 20131107 |                         0 |
| 20131108 |                         0 |
| 20131109 |                        31 |
| 20131110 |                        46 |
| 20131111 |                         0 |
| 20131112 |                         0 |
| 20131113 |                         0 |
| 20131114 |                         0 |
| 20131115 |                         7 |
| 20131116 |                        22 |
| 20131117 |                         0 |
| 20131118 |                         0 |
| 20131119 |                         9 |
| 20131120 |                         4 |
| 20131121 |                        30 |
| 20131122 |                         1 |
| 20131123 |                         0 |
| 20131124 |                         0 |
| 20131125 |                         0 |
| 20131126 |                         0 |
| 20131127 |                         0 |
| 20131128 |                         0 |
| 20131129 |                         0 |
| 20131130 |                         0 |
+----------+---------------------------+
30 rows in set (3.81 sec)

Looks like most days we don't lose data, but we *are* missing a few records scattered throughout a few days in November. --Halfak (WMF) (talk) 19:40, 19 May 2014 (UTC)Reply