Jump to content

Research talk:Measuring article importance/Work log/2014-10-25

Add topic
From Meta, a Wikimedia project coordination wiki

Saturday, October 25, 2014

[edit]

Woo! Data. So, the query I set up for getting inlinks ended up getting killed, so I had to do it in two queries.

proto_inlink_count
SELECT
    pl_namespace AS page_namespace,
    pl_title AS page_title,
    COUNT(pl_from) AS inlinks
FROM pagelinks
GROUP BY pl_namespace, pl_title;
inlink_count
SELECT
    page.page_id,
    IFNULL(inlink_count.inlinks, 0) AS inlinks,
    redirect.page_id AS redirect_id
FROM page
LEFT JOIN staging.proto_inlink_count inlink_count USING
    (page_namespace, page_title)
LEFT JOIN redirect redirectlink ON
    page_is_redirect AND
    redirectlink.rd_from = page_id
LEFT JOIN page redirect ON
    redirect.page_title = redirectlink.rd_title AND
    redirect.page_namespace = redirectlink.rd_namespace
WHERE page.page_namespace = 0;
resolved_inlink_count
CREATE TABLE staging.resolved_inlink_count
SELECT
    article.page_id,
    article.inlinks + SUM(redirect.inlinks) AS inlinks,
    article.inlinks AS direct_inlinks,
    SUM(redirect.inlinks) AS inlinks_from_redirects,
    COUNT(redirect.page_id) AS redirects
FROM staging.inlink_count article
LEFT JOIN staging.inlink_count redirect ON
    redirect.redirect_id = article.page_id
WHERE article.redirect_id = 0
GROUP BY article.page_id;

Now, I get to play around a bit. Let's check out the most inlinked articles:

> select page_title, links.* from resolved_inlink_count links INNER JOIN enwiki.page USING (page_id) ORDER BY inlinks DESC LIMIT 10;
+-----------------------------------------------+----------+---------+----------------+------------------------+-----------+
| page_title                                    | page_id  | inlinks | direct_inlinks | inlinks_from_redirects | redirects |
+-----------------------------------------------+----------+---------+----------------+------------------------+-----------+
| IP_address                                    |    14921 | 1434099 |        1432770 |                   1329 |        59 |
| United_States                                 |  3434750 | 1064447 |         995446 |                  69001 |       166 |
| Geographic_coordinate_system                  |    48361 |  908460 |         905848 |                   2612 |        31 |
| Fair_use                                      |    10772 |  539840 |         538743 |                   1097 |        15 |
| International_Standard_Book_Number            |    14919 |  513159 |         510836 |                   2323 |        24 |
| Copyright_infringement                        | 18948365 |  507796 |         506533 |                   1263 |        76 |
| Copyright_law_of_the_United_States_of_America | 18938663 |  493085 |             37 |                 493048 |        27 |
| Wikipedia                                     |  5043734 |  473676 |         472489 |                   1187 |       194 |
| Tilde                                         |   212390 |  444556 |         442103 |                   2453 |        22 |
| Network_address_translation                   |    53036 |  405086 |         404836 |                    250 |        41 |
+-----------------------------------------------+----------+---------+----------------+------------------------+-----------+
10 rows in set (1.62 sec)

Woah! en:IP_address? Really? I get that en:Geographic_coordinate_system is up there due to geo-tags. I bet that en:International_Standard_Book_Number comes from citations. The copyright links might be from references too. en:Tilde is another weird one. I don't event know what to think of that. I don't think that anything in this list is worth of top 10 Encyclopedic Importance.


How about we look at the articles with the most redirects:

> select page_title, links.* from resolved_inlink_count links INNER JOIN enwiki.page USING (page_id) ORDER BY redirects DESC LIMIT 10,10;
+-------------------------------------------------+----------+---------+----------------+------------------------+-----------+
| page_title                                      | page_id  | inlinks | direct_inlinks | inlinks_from_redirects | redirects |
+-------------------------------------------------+----------+---------+----------------+------------------------+-----------+
| List_of_minor_planets:_11001–12000              |   720156 |    5871 |            687 |                   5184 |       640 |
| List_of_minor_planets:_20001–21000              |   723102 |    4407 |            659 |                   3748 |       637 |
| List_of_Dragon_Ball_characters                  |  1001666 |    3706 |           1197 |                   2509 |       620 |
| List_of_Naruto_episodes                         |  2137845 |    1541 |            940 |                    601 |       587 |
| List_of_minor_planets:_4001–5000                |   719861 |    7410 |            654 |                   6756 |       578 |
| List_of_hundreds_of_Sweden                      |  7525873 |    1050 |            584 |                    466 |       529 |
| ANSI_escape_code                                |    65930 |    1458 |           1181 |                    277 |       526 |
| Afghan_detainees_at_Guantanamo_Bay              | 19287734 |    5865 |            595 |                   5270 |       508 |
| List_of_unclassified_languages_of_South_America | 37445047 |    1172 |            496 |                    676 |       479 |
| List_of_bus_routes_in_London                    |  5179207 |   98671 |            972 |                  97699 |       466 |
+-------------------------------------------------+----------+---------+----------------+------------------------+-----------+
10 rows in set (1.20 sec)

OK. What? Let's check out some examples. en:List_of_minor_planets:_11001–12000 (redirects). I smell a bot. It looks like a bit went through and created ~600 redirects -- one per "minor planet" on the page.

How about another? en:List_of_Naruto_episodes (redirects) Same deal. A redirect for every episode.

One more. This one has far more inlinks. en:List_of_bus_routes_in_London (redirects) Again. A redirect for every bus name. --Halfak (WMF) (talk) 02:39, 25 October 2014 (UTC)Reply