Research talk:Daily unique media creators
Add topicWork log
[edit]archive
|
---|
|
Wednesday, September 10th
[edit]Today, I'm reverse engineering "media creators". To me, this means people who originally uploaded a file.
> select distinct log_action from logging where log_type = "upload"; +------------+ | log_action | +------------+ | upload | | overwrite | +------------+ 2 rows in set (16.71 sec)
Looks like we have two actions. I'm guessing that "upload" is the original and "overwrite" is a re-upload for the same filename. --Halfak (WMF) (talk) 16:01, 10 September 2014 (UTC)
> SELECT -> log_namespace, -> log_title, -> SUM(log_action = "upload") AS upload, -> SUM(log_action = "overwrite") AS overwrite -> FROM logging -> INNER JOIN ( -> SELECT -> page_namespace AS log_namespace, -> page_title AS log_title -> FROM page -> WHERE page_namespace = 6 -> ORDER BY RAND() -> LIMIT 100 -> ) AS sampled_media USING (log_namespace, log_title) -> WHERE -> log_type = "upload" -> GROUP BY 1,2; +---------------+-----------------------------------------------------------------+--------+-----------+ | log_namespace | log_title | upload | overwrite | +---------------+-----------------------------------------------------------------+--------+-----------+ | 6 | "Jodi,_1999".jpg | 1 | 0 | | 6 | A1_in_London_1923.jpg | 1 | 0 | | 6 | AWMphotoA02730.jpg | 1 | 0 | | 6 | Annarth_poster.jpg | 1 | 0 | | 6 | Arcoftriiamph.jpg | 1 | 0 | | 6 | Artemis_Fowl_-_The_Atlantis_Complex.jpg | 1 | 1 | | 6 | B_McCulloch_CSA_ACW.jpg | 1 | 0 | | 6 | Backyard_Football_'09_Coverart.png | 1 | 0 | | 6 | Batman_Contagion_TPB_cover.jpg | 1 | 4 | | 6 | Bbbbb.jpg | 2 | 0 | | 6 | Benningsen_Beavers.jpg | 1 | 0 | | 6 | Berliner_Zeitung_Logo.svg | 1 | 0 | | 6 | BestWhorehouse.jpg | 1 | 0 | | 6 | Bin_Quraya_Logo_(2013).svg | 1 | 0 | | 6 | Black_cobra.gif | 1 | 0 | | 6 | Bleriot_monoplane.jpg | 1 | 0 | | 6 | BoyleFarmIsland.jpg | 1 | 0 | | 6 | Brigitte_Mira.jpg | 1 | 0 | | 6 | Cheap_Trick_It's_Only_Love_Single_1986.png | 1 | 0 | | 6 | Chrisye_-_Lilin-Lilin_Kecil_(1977).ogg | 1 | 0 | | 6 | CleanAirCampaignLogo.jpg | 1 | 0 | | 6 | Community_Emergency_Response_Team_(US)_Logo.jpg | 1 | 0 | | 6 | CraigInteview09_01.jpg | 1 | 1 | | 6 | Cut_image_of_Rarden_Image-Warrior_Infantry_Fighting_Vehicle.jpg | 1 | 0 | | 6 | Cyclonscreenshot001.jpg | 1 | 0 | | 6 | Department_of_Stomatology_at_Martin_University_Hospital.jpg | 1 | 0 | | 6 | Devon_Cornwall_Police_logo.jpg | 1 | 0 | | 6 | Dickey-Deliverance.jpg | 1 | 0 | | 6 | Distribution_wealth_population_europe_2000.gif | 1 | 0 | | 6 | Efaktura_logo.png | 1 | 0 | | 6 | Elin_Lanto_Rockbjornen.jpg | 1 | 0 | | 6 | Eliseu_Visconti_-_Maternity_-_Google_Art_Project.jpg | 1 | 0 | | 6 | Elvis_Presley-The_Sun_Sessions_(album_cover).jpg | 1 | 0 | | 6 | Embroidered_foliage.jpg | 1 | 0 | | 6 | Empress_Catherine_I_-c.1724_-3.jpg | 1 | 0 | | 6 | Factor_X.png | 1 | 1 | | 6 | FortunateFall.jpg | 1 | 0 | | 6 | GCRiptide.PNG | 1 | 0 | | 6 | GED_GameEditor.jpg | 1 | 0 | | 6 | Gatecrasher4.jpg | 1 | 1 | | 6 | Georgia-Tbilisi.png | 1 | 0 | | 6 | Geroge_Benson_Big_Boss_Band_cover.jpg | 1 | 0 | | 6 | Girls_Action_Foundation_logo.png | 1 | 0 | | 6 | Grease_Super_Macro.JPG | 2 | 0 | | 6 | Hapinas_logo.jpg | 1 | 1 | | 6 | Imperial_College_Healthcare_logo.jpg | 1 | 0 | | 6 | Inkscape-Arc.png | 1 | 0 | | 6 | JohnSteinbeck_ThePearl_title.jpg | 1 | 0 | | 6 | KDAF_AntennaTV.png | 1 | 0 | | 6 | KSNA_(FM)_logo.jpg | 1 | 0 | | 6 | Kenny_Rogers_-_Dont_Fall_single.jpg | 1 | 0 | | 6 | Krtologosmall.jpg | 1 | 5 | | 6 | Large_Arch_by_Henry_Moore_(1971)_in_April_2014.jpg | 1 | 0 | | 6 | Lastpost.jpg | 1 | 0 | | 6 | Laurent_Nivalle_portrait.jpg | 1 | 0 | | 6 | Len_Harvey_1.jpg | 1 | 0 | | 6 | MANXYZ_permease_Step_7.jpg | 1 | 0 | | 6 | MDB_drum_majors.jpg | 1 | 0 | | 6 | Maa_Music_logo.jpg | 1 | 0 | | 6 | Mahout2_crop.jpg | 1 | 0 | | 6 | Marvel-super-hero-squad-cover.jpg | 1 | 0 | | 6 | Mosaic_Stadium_Logo.png | 1 | 0 | | 6 | My-Cousin-Vinny-Poster.jpg | 1 | 0 | | 6 | Nama_Mia.png | 1 | 0 | | 6 | Nasserist_Socialist_Vanguard_Party_symbol.jpg | 1 | 0 | | 6 | New_Borinque_Pro_Wrestling_Logo.jpg | 1 | 0 | | 6 | OK-CLEET-logo.jpg | 1 | 0 | | 6 | Once-upon-a-time-in-the-north.jpg | 1 | 2 | | 6 | One.Way.Ticket.to.Hell...And.Back_The.Darkness.back-front.jpg | 1 | 0 | | 6 | Owlz.PNG | 1 | 1 | | 6 | Peyronie's-disease-illustration.jpg | 1 | 0 | | 6 | Pockbot.gif | 1 | 0 | | 6 | Prof_Pyg.jpg | 1 | 1 | | 6 | Robertlang.JPG | 1 | 0 | | 6 | Sailorneptune.jpg | 7 | 0 | | 6 | Scream-In_promo_card_(WXIX)_circa_1970.jpg | 1 | 0 | | 6 | Software-Freedom-Conservancy-Logo.svg | 1 | 1 | | 6 | Soroban_6_50px.png | 1 | 0 | | 6 | St._lourdes_church.jpg | 1 | 0 | | 6 | Suburban_Mayhem_poster.jpg | 1 | 0 | | 6 | Sugarloafoverlook.jpg | 1 | 0 | | 6 | Swarm_(comics).png | 1 | 0 | | 6 | Thanksgiving_(2004_film)_(movie_poster).jpg | 1 | 0 | | 6 | The_Sign_next_to_the_Rochdal_Canal.JPG | 1 | 0 | | 6 | Theodore-rex-DVD.jpg | 1 | 1 | | 6 | Timothyvanlaar.jpg | 1 | 0 | | 6 | Tru-blu-company-logo.png | 1 | 0 | | 6 | University_of_Denver_campus_pics_043.jpg | 1 | 0 | | 6 | Usvls-seal.png | 1 | 0 | | 6 | Virginia_392.png | 1 | 0 | | 6 | WDIC-AM_FM.png | 1 | 0 | | 6 | WNTP_logo.gif | 1 | 0 | | 6 | Ängby_IF.png | 1 | 0 | | 6 | Το_τελευταίο_ψέμα,.jpg | 1 | 0 | +---------------+-----------------------------------------------------------------+--------+-----------+ 94 rows in set (6.77 sec) <pre> It looks like this mostly works. There are some files that have more than one "upload" event. Let's see what's up with them. <pre> > select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Sailorneptune.jpg"; +----------+----------------+----------+------------+----------------------------------------------------+ | log_id | log_timestamp | log_type | log_action | LEFT(log_comment, 50) | +----------+----------------+----------+------------+----------------------------------------------------+ | 2699690 | 20060510035725 | upload | upload | Image of cartoon character "Sailor Neptune"<br> Ap | | 3218202 | 20060620174530 | upload | upload | http://moonprincess.com/galleries/details.php?imag | | 3233294 | 20060621192957 | upload | upload | Reverted to earlier revision | | 3233318 | 20060621193122 | upload | upload | Reverted to earlier revision | | 3233329 | 20060621193153 | upload | upload | Reverted to earlier revision | | 6879519 | 20070206084304 | upload | upload | Reverted to earlier revision | | 8904429 | 20070522020803 | upload | upload | | | 38925761 | 20111021004413 | delete | delete | Deleted old revision 20060620174530!Sailorneptune. | | 38925762 | 20111021004416 | delete | delete | Deleted old revision 20070522020803!Sailorneptune. | | 38925765 | 20111021004418 | delete | delete | Deleted old revision 20070206084304!Sailorneptune. | | 38925768 | 20111021004420 | delete | delete | Deleted old revision 20060621193152!Sailorneptune. | | 38925770 | 20111021004422 | delete | delete | Deleted old revision 20060621193122!Sailorneptune. | | 38925772 | 20111021004424 | delete | delete | Deleted old revision 20060621192956!Sailorneptune. | +----------+----------------+----------+------------+----------------------------------------------------+ 13 rows in set (0.02 sec)
Well... that's weird. It looks like there was a set of uploads in 2006,2007, but then a set of deletes in one day in 2011. Those "reverted to earlier versions" have me worried that log_action = "upload" used to be used for overwrites. Let's look at another example.
> select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Grease_Super_Macro.JPG"; +---------+----------------+----------+------------+------------------------------------------------+ | log_id | log_timestamp | log_type | log_action | LEFT(log_comment, 50) | +---------+----------------+----------+------------+------------------------------------------------+ | 9884320 | 20070718174156 | upload | upload | Picture of grease taken with Super Macro Mode. | | 9884440 | 20070718175047 | upload | upload | | +---------+----------------+----------+------------+------------------------------------------------+ 2 rows in set (0.06 sec)
Yeah. That one is weird too. It's also in 2007. Let's try the last query with the first event date included.
> SELECT -> log_namespace, -> log_title, -> SUM(log_action = "upload") AS uploads, -> SUM(log_action = "overwrite") AS overwrites, -> MIN(log_timestamp) AS first_upload -> FROM logging -> INNER JOIN ( -> SELECT -> page_namespace AS log_namespace, -> page_title AS log_title -> FROM page -> WHERE page_namespace = 6 -> ORDER BY RAND() -> LIMIT 1000 -> ) AS sampled_media USING (log_namespace, log_title) -> WHERE -> log_type = "upload" -> GROUP BY 1,2 -> HAVING SUM(log_action = "upload") > 1 -> ; +---------------+------------------------------------------------+---------+------------+----------------+ | log_namespace | log_title | uploads | overwrites | first_upload | +---------------+------------------------------------------------+---------+------------+----------------+ | 6 | Bags.jpg | 2 | 0 | 20061018113801 | | 6 | CAC_logo.jpg | 4 | 0 | 20070413081748 | | 6 | CHR_WA_A01.jpg | 3 | 3 | 20090605112708 | | 6 | Campbellwalker.jpg | 3 | 0 | 20060731191012 | | 6 | CannibalHolocaustimpale.jpg | 5 | 2 | 20070112005709 | | 6 | Charlielost.PNG | 3 | 3 | 20070219042047 | | 6 | Dr_pepper_ballpark_right_field_toward_home.jpg | 2 | 0 | 20060805095416 | | 6 | Ed_Chynoweth_Cup.JPG | 3 | 0 | 20070530002001 | | 6 | EveOfTheWarEP.jpg | 3 | 0 | 20070521012020 | | 6 | GREATWESTstates.PNG | 2 | 3 | 20061129185017 | | 6 | Gupt.jpg | 2 | 0 | 20060226025609 | | 6 | Head_over_heels_amstrad_1.png | 2 | 1 | 20070316185350 | | 6 | IndianEducationSystem.JPG | 2 | 0 | 20070101091319 | | 6 | Jonathan_Rhys_Meyers.jpg | 2 | 1 | 20071002191953 | | 6 | KazanRailTerminalWest.jpg | 2 | 0 | 20060614100347 | | 6 | Mardi_Gras_logo_box_white.png | 2 | 0 | 20070403040509 | | 6 | Monty.GIF | 6 | 0 | 20050723222006 | | 6 | News.jpg | 7 | 0 | 20070417204330 | | 6 | NoSirNihilismIsNotPractical.jpg | 2 | 0 | 20060126182642 | | 6 | North_Street_1957-1.jpg | 2 | 0 | 20090513061928 | | 6 | Rabbanim_Supporting_the_Manhattan.jpg | 3 | 0 | 20070712212608 | | 6 | Rasathanthram.jpg | 2 | 0 | 20070514103051 | | 6 | Right_Place,_Wrong_Time_DVD.jpg | 2 | 1 | 20080808025356 | | 6 | RiserDiagram.png | 2 | 0 | 20070605215455 | | 6 | SHU_main_gate.jpg | 2 | 0 | 20080624101518 | | 6 | Sequence.jpg | 2 | 3 | 20080312002548 | | 6 | Sourceradio.jpg | 2 | 0 | 20051208192739 | | 6 | Spitfire2xs.jpg | 2 | 1 | 20060506014856 | | 6 | Star_Ocean_SFC.jpg | 2 | 0 | 20060707224941 | | 6 | Teaser2.jpg | 2 | 0 | 20060917172304 | | 6 | The_Changeling.jpg | 2 | 1 | 20060913164513 | | 6 | Tom_and_jerry_mgm_parody.jpg | 2 | 0 | 20060614232540 | | 6 | Tyranos.jpg | 4 | 0 | 20060603033435 | | 6 | What-Has-Befallen-Us,-Barbad.jpg | 2 | 0 | 20070629101709 | | 6 | Working_memory_model.PNG | 2 | 0 | 20070312124212 | +---------------+------------------------------------------------+---------+------------+----------------+ 35 rows in set (47.13 sec)
Look at all those pre-2008 entries. I wonder if there's a clear time cutoff when the log starts to be sane.
> SELECT -> LEFT(first_upload, 6) AS month, -> SUM(uploads > 1)/COUNT(*) AS multi_upload_prop -> FROM ( -> SELECT -> log_namespace, -> log_title, -> COUNT(*) AS uploads, -> MIN(log_timestamp) AS first_upload -> FROM logging -> WHERE -> log_namespace = 6 AND -> log_action = "upload" AND -> log_type = "upload" -> GROUP BY 1,2 -> ) AS file_stats -> GROUP BY 1; +--------+-------------------+ | month | multi_upload_prop | +--------+-------------------+ | 200412 | 0.1723 | | 200501 | 0.1655 | <...snip...> | 200704 | 0.1072 | | 200705 | 0.1114 | | 200706 | 0.1405 | | 200707 | 0.1360 | | 200708 | 0.1174 | <--, | 200709 | 0.0379 | <--' | 200710 | 0.0350 | | 200711 | 0.0281 | | 200712 | 0.0252 | | 200801 | 0.0257 | <...snip...> | 201406 | 0.0105 | | 201407 | 0.0139 | | 201408 | 0.0047 | | 201409 | 0.0069 | +--------+-------------------+ 118 rows in set (2 min 59.80 sec)
Looks like somewhere in Sept. 2007, the upload log gets sane.
> SELECT -> LEFT(log_timestamp, 6) AS month, -> SUM(log_action = "upload") AS overwrites, -> SUM(log_action = "overwrite") AS overwrites -> FROM logging -> WHERE -> log_namespace = 6 AND -> log_type = "upload" -> GROUP BY 1 -> ; +--------+------------+------------+ | month | overwrites | overwrites | +--------+------------+------------+ | 200412 | 3706 | 0 | | 200501 | 13547 | 0 | | 200502 | 13089 | 0 | <...snip...> | 200705 | 78580 | 0 | | 200706 | 76389 | 0 | | 200707 | 78138 | 0 | | 200708 | 71218 | 2914 | | 200709 | 54373 | 11625 | | 200710 | 55511 | 13791 | <...snip...> | 201407 | 8005 | 2205 | | 201408 | 9073 | 2357 | | 201409 | 2677 | 738 | +--------+------------+------------+ 118 rows in set (1 min 15.14 sec)
So... is there something else I can look for in these events to know which ones are the creation?
> select * from logging where log_namespace = 6 and log_title = "GREATWESTstates.PNG" and log_action = "upload"; +---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+ | log_id | log_type | log_action | log_timestamp | log_user | log_namespace | log_title | log_comment | log_params | log_deleted | log_user_text | log_page | +---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+ | 5711684 | upload | upload | 20061129185017 | 455767 | 6 | GREATWESTstates.PNG | Derived from [http://en.wikipedia.org/wiki/Image:BlankMap-USA-states.PNG here] | | 0 | | NULL | | 7814423 | upload | upload | 20070325024149 | 1279118 | 6 | GREATWESTstates.PNG | official conference colors | | 0 | | NULL | +---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+ 2 rows in set (0.00 sec)
Nope. Hmm... I wonder if I could try a different strategy. I'll look for the first upload with an exclusive outer join.
Got it!
SELECT COUNT(DISTINCT upload.log_user) FROM ( SELECT * FROM logging WHERE log_type = "upload" AND log_action = "upload" AND log_timestamp BETWEEN "20140101" AND "20140102" ) AS upload LEFT JOIN logging old_upload ON old_upload.log_type = "upload" AND old_upload.log_action = "upload" AND old_upload.log_timestamp < "20140101" AND upload.log_namespace = old_upload.log_namespace AND upload.log_title = old_upload.log_title WHERE old_upload.log_id IS NULL;
It's pretty quick for a day's worth of uploads too. Woot! --Halfak (WMF) (talk) 20:44, 10 September 2014 (UTC)