Jump to content

Research talk:Daily unique media creators

Add topic
From Meta, a Wikimedia project coordination wiki
Latest comment: 10 years ago by Halfak (WMF) in topic Wednesday, September 10th

Work log

[edit]
archive

Wednesday, September 10th

[edit]

Today, I'm reverse engineering "media creators". To me, this means people who originally uploaded a file.

> select distinct log_action from logging where log_type = "upload";
+------------+
| log_action |
+------------+
| upload     |
| overwrite  |
+------------+
2 rows in set (16.71 sec)

Looks like we have two actions. I'm guessing that "upload" is the original and "overwrite" is a re-upload for the same filename. --Halfak (WMF) (talk) 16:01, 10 September 2014 (UTC)Reply


> SELECT 
    ->     log_namespace, 
    ->     log_title, 
    ->     SUM(log_action = "upload") AS upload,
    ->     SUM(log_action = "overwrite") AS overwrite
    -> FROM logging 
    -> INNER JOIN (
    ->     SELECT 
    ->         page_namespace AS log_namespace,
    ->         page_title AS log_title
    ->     FROM page
    ->     WHERE page_namespace = 6
    ->     ORDER BY RAND()
    ->     LIMIT 100
    -> ) AS sampled_media USING (log_namespace, log_title) 
    -> WHERE 
    ->     log_type = "upload"
    -> GROUP BY 1,2;
+---------------+-----------------------------------------------------------------+--------+-----------+
| log_namespace | log_title                                                       | upload | overwrite |
+---------------+-----------------------------------------------------------------+--------+-----------+
|             6 | "Jodi,_1999".jpg                                                |      1 |         0 |
|             6 | A1_in_London_1923.jpg                                           |      1 |         0 |
|             6 | AWMphotoA02730.jpg                                              |      1 |         0 |
|             6 | Annarth_poster.jpg                                              |      1 |         0 |
|             6 | Arcoftriiamph.jpg                                               |      1 |         0 |
|             6 | Artemis_Fowl_-_The_Atlantis_Complex.jpg                         |      1 |         1 |
|             6 | B_McCulloch_CSA_ACW.jpg                                         |      1 |         0 |
|             6 | Backyard_Football_'09_Coverart.png                              |      1 |         0 |
|             6 | Batman_Contagion_TPB_cover.jpg                                  |      1 |         4 |
|             6 | Bbbbb.jpg                                                       |      2 |         0 |
|             6 | Benningsen_Beavers.jpg                                          |      1 |         0 |
|             6 | Berliner_Zeitung_Logo.svg                                       |      1 |         0 |
|             6 | BestWhorehouse.jpg                                              |      1 |         0 |
|             6 | Bin_Quraya_Logo_(2013).svg                                      |      1 |         0 |
|             6 | Black_cobra.gif                                                 |      1 |         0 |
|             6 | Bleriot_monoplane.jpg                                           |      1 |         0 |
|             6 | BoyleFarmIsland.jpg                                             |      1 |         0 |
|             6 | Brigitte_Mira.jpg                                               |      1 |         0 |
|             6 | Cheap_Trick_It's_Only_Love_Single_1986.png                      |      1 |         0 |
|             6 | Chrisye_-_Lilin-Lilin_Kecil_(1977).ogg                          |      1 |         0 |
|             6 | CleanAirCampaignLogo.jpg                                        |      1 |         0 |
|             6 | Community_Emergency_Response_Team_(US)_Logo.jpg                 |      1 |         0 |
|             6 | CraigInteview09_01.jpg                                          |      1 |         1 |
|             6 | Cut_image_of_Rarden_Image-Warrior_Infantry_Fighting_Vehicle.jpg |      1 |         0 |
|             6 | Cyclonscreenshot001.jpg                                         |      1 |         0 |
|             6 | Department_of_Stomatology_at_Martin_University_Hospital.jpg     |      1 |         0 |
|             6 | Devon_Cornwall_Police_logo.jpg                                  |      1 |         0 |
|             6 | Dickey-Deliverance.jpg                                          |      1 |         0 |
|             6 | Distribution_wealth_population_europe_2000.gif                  |      1 |         0 |
|             6 | Efaktura_logo.png                                               |      1 |         0 |
|             6 | Elin_Lanto_Rockbjornen.jpg                                      |      1 |         0 |
|             6 | Eliseu_Visconti_-_Maternity_-_Google_Art_Project.jpg            |      1 |         0 |
|             6 | Elvis_Presley-The_Sun_Sessions_(album_cover).jpg                |      1 |         0 |
|             6 | Embroidered_foliage.jpg                                         |      1 |         0 |
|             6 | Empress_Catherine_I_-c.1724_-3.jpg                              |      1 |         0 |
|             6 | Factor_X.png                                                    |      1 |         1 |
|             6 | FortunateFall.jpg                                               |      1 |         0 |
|             6 | GCRiptide.PNG                                                   |      1 |         0 |
|             6 | GED_GameEditor.jpg                                              |      1 |         0 |
|             6 | Gatecrasher4.jpg                                                |      1 |         1 |
|             6 | Georgia-Tbilisi.png                                             |      1 |         0 |
|             6 | Geroge_Benson_Big_Boss_Band_cover.jpg                           |      1 |         0 |
|             6 | Girls_Action_Foundation_logo.png                                |      1 |         0 |
|             6 | Grease_Super_Macro.JPG                                          |      2 |         0 |
|             6 | Hapinas_logo.jpg                                                |      1 |         1 |
|             6 | Imperial_College_Healthcare_logo.jpg                            |      1 |         0 |
|             6 | Inkscape-Arc.png                                                |      1 |         0 |
|             6 | JohnSteinbeck_ThePearl_title.jpg                                |      1 |         0 |
|             6 | KDAF_AntennaTV.png                                              |      1 |         0 |
|             6 | KSNA_(FM)_logo.jpg                                              |      1 |         0 |
|             6 | Kenny_Rogers_-_Dont_Fall_single.jpg                             |      1 |         0 |
|             6 | Krtologosmall.jpg                                               |      1 |         5 |
|             6 | Large_Arch_by_Henry_Moore_(1971)_in_April_2014.jpg              |      1 |         0 |
|             6 | Lastpost.jpg                                                    |      1 |         0 |
|             6 | Laurent_Nivalle_portrait.jpg                                    |      1 |         0 |
|             6 | Len_Harvey_1.jpg                                                |      1 |         0 |
|             6 | MANXYZ_permease_Step_7.jpg                                      |      1 |         0 |
|             6 | MDB_drum_majors.jpg                                             |      1 |         0 |
|             6 | Maa_Music_logo.jpg                                              |      1 |         0 |
|             6 | Mahout2_crop.jpg                                                |      1 |         0 |
|             6 | Marvel-super-hero-squad-cover.jpg                               |      1 |         0 |
|             6 | Mosaic_Stadium_Logo.png                                         |      1 |         0 |
|             6 | My-Cousin-Vinny-Poster.jpg                                      |      1 |         0 |
|             6 | Nama_Mia.png                                                    |      1 |         0 |
|             6 | Nasserist_Socialist_Vanguard_Party_symbol.jpg                   |      1 |         0 |
|             6 | New_Borinque_Pro_Wrestling_Logo.jpg                             |      1 |         0 |
|             6 | OK-CLEET-logo.jpg                                               |      1 |         0 |
|             6 | Once-upon-a-time-in-the-north.jpg                               |      1 |         2 |
|             6 | One.Way.Ticket.to.Hell...And.Back_The.Darkness.back-front.jpg   |      1 |         0 |
|             6 | Owlz.PNG                                                        |      1 |         1 |
|             6 | Peyronie's-disease-illustration.jpg                             |      1 |         0 |
|             6 | Pockbot.gif                                                     |      1 |         0 |
|             6 | Prof_Pyg.jpg                                                    |      1 |         1 |
|             6 | Robertlang.JPG                                                  |      1 |         0 |
|             6 | Sailorneptune.jpg                                               |      7 |         0 |
|             6 | Scream-In_promo_card_(WXIX)_circa_1970.jpg                      |      1 |         0 |
|             6 | Software-Freedom-Conservancy-Logo.svg                           |      1 |         1 |
|             6 | Soroban_6_50px.png                                              |      1 |         0 |
|             6 | St._lourdes_church.jpg                                          |      1 |         0 |
|             6 | Suburban_Mayhem_poster.jpg                                      |      1 |         0 |
|             6 | Sugarloafoverlook.jpg                                           |      1 |         0 |
|             6 | Swarm_(comics).png                                              |      1 |         0 |
|             6 | Thanksgiving_(2004_film)_(movie_poster).jpg                     |      1 |         0 |
|             6 | The_Sign_next_to_the_Rochdal_Canal.JPG                          |      1 |         0 |
|             6 | Theodore-rex-DVD.jpg                                            |      1 |         1 |
|             6 | Timothyvanlaar.jpg                                              |      1 |         0 |
|             6 | Tru-blu-company-logo.png                                        |      1 |         0 |
|             6 | University_of_Denver_campus_pics_043.jpg                        |      1 |         0 |
|             6 | Usvls-seal.png                                                  |      1 |         0 |
|             6 | Virginia_392.png                                                |      1 |         0 |
|             6 | WDIC-AM_FM.png                                                  |      1 |         0 |
|             6 | WNTP_logo.gif                                                   |      1 |         0 |
|             6 | Ängby_IF.png                                                    |      1 |         0 |
|             6 | Το_τελευταίο_ψέμα,.jpg                                          |      1 |         0 |
+---------------+-----------------------------------------------------------------+--------+-----------+
94 rows in set (6.77 sec)
<pre>

It looks like this mostly works.  There are some files that have more than one "upload" event.  Let's see what's up with them. 

<pre>
> select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Sailorneptune.jpg";
+----------+----------------+----------+------------+----------------------------------------------------+
| log_id   | log_timestamp  | log_type | log_action | LEFT(log_comment, 50)                              |
+----------+----------------+----------+------------+----------------------------------------------------+
|  2699690 | 20060510035725 | upload   | upload     | Image of cartoon character "Sailor Neptune"<br>
Ap |
|  3218202 | 20060620174530 | upload   | upload     | http://moonprincess.com/galleries/details.php?imag |
|  3233294 | 20060621192957 | upload   | upload     | Reverted to earlier revision                       |
|  3233318 | 20060621193122 | upload   | upload     | Reverted to earlier revision                       |
|  3233329 | 20060621193153 | upload   | upload     | Reverted to earlier revision                       |
|  6879519 | 20070206084304 | upload   | upload     | Reverted to earlier revision                       |
|  8904429 | 20070522020803 | upload   | upload     |                                                    |
| 38925761 | 20111021004413 | delete   | delete     | Deleted old revision 20060620174530!Sailorneptune. |
| 38925762 | 20111021004416 | delete   | delete     | Deleted old revision 20070522020803!Sailorneptune. |
| 38925765 | 20111021004418 | delete   | delete     | Deleted old revision 20070206084304!Sailorneptune. |
| 38925768 | 20111021004420 | delete   | delete     | Deleted old revision 20060621193152!Sailorneptune. |
| 38925770 | 20111021004422 | delete   | delete     | Deleted old revision 20060621193122!Sailorneptune. |
| 38925772 | 20111021004424 | delete   | delete     | Deleted old revision 20060621192956!Sailorneptune. |
+----------+----------------+----------+------------+----------------------------------------------------+
13 rows in set (0.02 sec)

Well... that's weird. It looks like there was a set of uploads in 2006,2007, but then a set of deletes in one day in 2011. Those "reverted to earlier versions" have me worried that log_action = "upload" used to be used for overwrites. Let's look at another example.

> select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Grease_Super_Macro.JPG";
+---------+----------------+----------+------------+------------------------------------------------+
| log_id  | log_timestamp  | log_type | log_action | LEFT(log_comment, 50)                          |
+---------+----------------+----------+------------+------------------------------------------------+
| 9884320 | 20070718174156 | upload   | upload     | Picture of grease taken with Super Macro Mode. |
| 9884440 | 20070718175047 | upload   | upload     |                                                |
+---------+----------------+----------+------------+------------------------------------------------+
2 rows in set (0.06 sec)

Yeah. That one is weird too. It's also in 2007. Let's try the last query with the first event date included.

> SELECT 
    ->     log_namespace, 
    ->     log_title, 
    ->     SUM(log_action = "upload") AS uploads,
    ->     SUM(log_action = "overwrite") AS overwrites,
    ->     MIN(log_timestamp) AS first_upload
    -> FROM logging 
    -> INNER JOIN (
    ->     SELECT 
    ->         page_namespace AS log_namespace,
    ->         page_title AS log_title
    ->     FROM page
    ->     WHERE page_namespace = 6
    ->     ORDER BY RAND()
    ->     LIMIT 1000
    -> ) AS sampled_media USING (log_namespace, log_title) 
    -> WHERE 
    ->     log_type = "upload"
    -> GROUP BY 1,2
    -> HAVING SUM(log_action = "upload") > 1
    -> ;
+---------------+------------------------------------------------+---------+------------+----------------+
| log_namespace | log_title                                      | uploads | overwrites | first_upload   |
+---------------+------------------------------------------------+---------+------------+----------------+
|             6 | Bags.jpg                                       |       2 |          0 | 20061018113801 |
|             6 | CAC_logo.jpg                                   |       4 |          0 | 20070413081748 |
|             6 | CHR_WA_A01.jpg                                 |       3 |          3 | 20090605112708 |
|             6 | Campbellwalker.jpg                             |       3 |          0 | 20060731191012 |
|             6 | CannibalHolocaustimpale.jpg                    |       5 |          2 | 20070112005709 |
|             6 | Charlielost.PNG                                |       3 |          3 | 20070219042047 |
|             6 | Dr_pepper_ballpark_right_field_toward_home.jpg |       2 |          0 | 20060805095416 |
|             6 | Ed_Chynoweth_Cup.JPG                           |       3 |          0 | 20070530002001 |
|             6 | EveOfTheWarEP.jpg                              |       3 |          0 | 20070521012020 |
|             6 | GREATWESTstates.PNG                            |       2 |          3 | 20061129185017 |
|             6 | Gupt.jpg                                       |       2 |          0 | 20060226025609 |
|             6 | Head_over_heels_amstrad_1.png                  |       2 |          1 | 20070316185350 |
|             6 | IndianEducationSystem.JPG                      |       2 |          0 | 20070101091319 |
|             6 | Jonathan_Rhys_Meyers.jpg                       |       2 |          1 | 20071002191953 |
|             6 | KazanRailTerminalWest.jpg                      |       2 |          0 | 20060614100347 |
|             6 | Mardi_Gras_logo_box_white.png                  |       2 |          0 | 20070403040509 |
|             6 | Monty.GIF                                      |       6 |          0 | 20050723222006 |
|             6 | News.jpg                                       |       7 |          0 | 20070417204330 |
|             6 | NoSirNihilismIsNotPractical.jpg                |       2 |          0 | 20060126182642 |
|             6 | North_Street_1957-1.jpg                        |       2 |          0 | 20090513061928 |
|             6 | Rabbanim_Supporting_the_Manhattan.jpg          |       3 |          0 | 20070712212608 |
|             6 | Rasathanthram.jpg                              |       2 |          0 | 20070514103051 |
|             6 | Right_Place,_Wrong_Time_DVD.jpg                |       2 |          1 | 20080808025356 |
|             6 | RiserDiagram.png                               |       2 |          0 | 20070605215455 |
|             6 | SHU_main_gate.jpg                              |       2 |          0 | 20080624101518 |
|             6 | Sequence.jpg                                   |       2 |          3 | 20080312002548 |
|             6 | Sourceradio.jpg                                |       2 |          0 | 20051208192739 |
|             6 | Spitfire2xs.jpg                                |       2 |          1 | 20060506014856 |
|             6 | Star_Ocean_SFC.jpg                             |       2 |          0 | 20060707224941 |
|             6 | Teaser2.jpg                                    |       2 |          0 | 20060917172304 |
|             6 | The_Changeling.jpg                             |       2 |          1 | 20060913164513 |
|             6 | Tom_and_jerry_mgm_parody.jpg                   |       2 |          0 | 20060614232540 |
|             6 | Tyranos.jpg                                    |       4 |          0 | 20060603033435 |
|             6 | What-Has-Befallen-Us,-Barbad.jpg               |       2 |          0 | 20070629101709 |
|             6 | Working_memory_model.PNG                       |       2 |          0 | 20070312124212 |
+---------------+------------------------------------------------+---------+------------+----------------+
35 rows in set (47.13 sec)

Look at all those pre-2008 entries. I wonder if there's a clear time cutoff when the log starts to be sane.

> SELECT
    ->     LEFT(first_upload, 6) AS month,
    ->     SUM(uploads > 1)/COUNT(*) AS multi_upload_prop
    -> FROM (
    ->     SELECT
    ->         log_namespace,
    ->         log_title,
    ->         COUNT(*) AS uploads,
    ->         MIN(log_timestamp) AS first_upload
    ->     FROM logging
    ->     WHERE
    ->         log_namespace = 6 AND
    ->         log_action = "upload" AND
    ->         log_type = "upload"
    ->     GROUP BY 1,2
    -> ) AS file_stats
    -> GROUP BY 1;
+--------+-------------------+
| month  | multi_upload_prop |
+--------+-------------------+
| 200412 |            0.1723 |
| 200501 |            0.1655 |

<...snip...>

| 200704 |            0.1072 |
| 200705 |            0.1114 |
| 200706 |            0.1405 |
| 200707 |            0.1360 |
| 200708 |            0.1174 | <--,
| 200709 |            0.0379 | <--'
| 200710 |            0.0350 |
| 200711 |            0.0281 |
| 200712 |            0.0252 |
| 200801 |            0.0257 |

<...snip...>

| 201406 |            0.0105 |
| 201407 |            0.0139 |
| 201408 |            0.0047 |
| 201409 |            0.0069 |
+--------+-------------------+
118 rows in set (2 min 59.80 sec)

Looks like somewhere in Sept. 2007, the upload log gets sane.

> SELECT
    ->     LEFT(log_timestamp, 6) AS month,
    ->     SUM(log_action = "upload") AS overwrites,
    ->     SUM(log_action = "overwrite") AS overwrites
    -> FROM logging
    -> WHERE
    ->     log_namespace = 6 AND
    ->     log_type = "upload"
    -> GROUP BY 1
    -> ;
+--------+------------+------------+
| month  | overwrites | overwrites |
+--------+------------+------------+
| 200412 |       3706 |          0 |
| 200501 |      13547 |          0 |
| 200502 |      13089 |          0 |

<...snip...>

| 200705 |      78580 |          0 |
| 200706 |      76389 |          0 |
| 200707 |      78138 |          0 |
| 200708 |      71218 |       2914 |
| 200709 |      54373 |      11625 |
| 200710 |      55511 |      13791 |

<...snip...>

| 201407 |       8005 |       2205 |
| 201408 |       9073 |       2357 |
| 201409 |       2677 |        738 |
+--------+------------+------------+
118 rows in set (1 min 15.14 sec)

So... is there something else I can look for in these events to know which ones are the creation?

> select * from logging where log_namespace = 6 and log_title = "GREATWESTstates.PNG" and log_action = "upload";
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
| log_id  | log_type | log_action | log_timestamp  | log_user | log_namespace | log_title           | log_comment                                                                    | log_params | log_deleted | log_user_text | log_page |
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
| 5711684 | upload   | upload     | 20061129185017 |   455767 |             6 | GREATWESTstates.PNG | Derived from [http://en.wikipedia.org/wiki/Image:BlankMap-USA-states.PNG here] |            |           0 |               |     NULL |
| 7814423 | upload   | upload     | 20070325024149 |  1279118 |             6 | GREATWESTstates.PNG | official conference colors                                                     |            |           0 |               |     NULL |
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
2 rows in set (0.00 sec)

Nope. Hmm... I wonder if I could try a different strategy. I'll look for the first upload with an exclusive outer join.


Got it!


SELECT
    COUNT(DISTINCT upload.log_user)
FROM (
    SELECT *
    FROM logging
    WHERE
        log_type = "upload" AND
        log_action = "upload" AND
        log_timestamp BETWEEN "20140101" AND "20140102"
) AS upload
LEFT JOIN logging old_upload ON
    old_upload.log_type = "upload" AND
    old_upload.log_action = "upload" AND
    old_upload.log_timestamp < "20140101" AND
    upload.log_namespace = old_upload.log_namespace AND
    upload.log_title = old_upload.log_title
WHERE old_upload.log_id IS NULL;

It's pretty quick for a day's worth of uploads too. Woot! --Halfak (WMF) (talk) 20:44, 10 September 2014 (UTC)Reply

Discussion

[edit]