Research talk:New page reviewer impact analysis/Work log/2017-06-11
Add topicAppearance
Latest comment: 7 years ago by EpochFail in topic Sunday, June 11, 2017
Sunday, June 11, 2017
[edit]Looking at the mediawiki page history table.
hive (wmf)> describe mediawiki_page_history; OK col_name data_type comment wiki_db string enwiki, dewiki, eswiktionary, etc. page_id bigint Id of the page, as in the page table. page_id_artificial string Generated Id for deleted pages without real Id. page_creation_timestamp string Timestamp of the page's first revision. page_title string Historical page title. page_title_latest string Page title as of today. page_namespace int Historical namespace. page_namespace_is_content boolean Whether the historical namespace is categorized as content page_namespace_latest int Namespace as of today. page_namespace_is_content_latest boolean Whether the current namespace is categorized as content page_is_redirect_latest boolean In revision/page events: whether the page is currently a redirect start_timestamp string Timestamp from where this state applies (inclusive). end_timestamp string Timestamp to where this state applies (exclusive). caused_by_event_type string Event that caused this state (create, move, delete or restore). caused_by_user_id bigint ID from the user that caused this state. inferred_from string If non-NULL, some fields have been inferred from an inconsistency in the source data. snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) # Partition Information # col_name data_type comment snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) Time taken: 0.129 seconds, Fetched: 22 row(s)
hive (wmf)> SELECT page_id, page_creation_timestamp, page_namespace AS page_creation_namespace, caused_by_user_id AS creation_user_id FROM mediawiki_page_history WHERE wiki_db = 'enwiki' AND caused_by_event_type = 'create' AND snapshot = "2017-05" AND page_namespace_latest = 0 limit 10; page_id page_creation_timestamp page_creation_namespace creation_user_id 609284 20040421024403 0 60719 2653101 20050910134608 0 36424 39050015 20130408122524 0 18801839 25399871 20091211213244 0 3010696 47053441 20150622221319 0 20318817 41954592 20140215234257 0 494861 35278895 20120331003832 0 4649642 15487338 20080129204923 0 231599 257932 20030704030557 0 6120 26284416 20100221181306 0 11571269 Time taken: 34.195 seconds, Fetched: 10 row(s)
Cool so it looks like we can split up the page creations by the original namespace and we can also get the creating user_id. OK my plan for the next time I sit down is to use this to generate a dataset that I'll start with. I only want page creations after 2008 because everything else is likely to be a little goofy. I think I'll then look into sampling from various time periods to know what proportion of new pages are created by newcomers. --EpochFail (talk) 19:59, 11 June 2017 (UTC)