
Research talk:Reading time/Work log/2018-11-17

From Meta, a Wikimedia project coordination wiki

Saturday, November 17, 2018


Today I worked on model selection for the HDI variable and the interaction between HDI and mobile. I decided to work on this first because I was suspicious of the result showing a strikingly large difference in reading time between lower-HDI and higher-HDI countries that was greatly diminished on mobile. I tried out many specifications, looked at residual plots, and evaluated the models on a hold-out sample.

My conclusions are that adding a higher-order polynomial for HDI provides a modest but statistically significant improvement to model fit, while adding a higher-order interaction between HDI and mobile makes no substantive difference. This allays my primary concern, which was that mobile in the lowest-HDI contexts would be quite different from the middle-HDI and high-HDI contexts where most of our data lie. Despite the introduction of a higher-order polynomial for HDI, the substantive results from the model with only a first-order term are largely robust to this specification: readers in lower-HDI contexts have longer dwell times, readers on the non-mobile site read for longer than readers on the mobile site, and the gap between mobile and PC declines in higher-HDI contexts.

Moreover, since all observations from the same country have the same value of HDI, higher-order terms for HDI are probably picking up on information about countries that is correlated with, and potentially confounded with, HDI. This suggests that we fit a mixed model with random intercepts (maybe even slopes, or wikis) for countries. I should give some attention to other tasks, but I think that is what should be done for the next round of modeling.

Model Selection


Model 3 (HDI*mobile)


The model fit improves slightly, but significantly (as indicated by the F-tests shown below), with the addition of higher-order terms for HDI. One matter of concern is that the coefficients become quite large in the higher-order specifications, which suggests overfitting. I'll see which model has the best predictive performance on a hold-out sample.

Statistical models
| | model 2 | model 3 | model 3 with quadratic HDI | model 3 with cubic HDI | model 3 with quartic HDI | model 3 with quintic HDI |
|---|---|---|---|---|---|---|
| Intercept | 10.8247 (0.0122)*** | 11.6286 (0.0165)*** | 6.4282 (0.1259)*** | 65.8866 (1.2369)*** | 253.8476 (11.0605)*** | -2637.6658 (77.9810)*** |
| mobile | -0.2016 (0.0011)*** | -1.5927 (0.0191)*** | | | | |
| Human Development Index | -1.0219 (0.0103)*** | -1.8893 (0.0157)*** | 9.8203 (0.2826)*** | -198.1313 (4.3139)*** | -1138.6273 (52.9416)*** | 16669.4991 (473.8399)*** |
| mobile : HDI | | 1.5146 (0.0207)*** | -1.9572 (0.0228)*** | -2.5460 (0.3171)*** | 51.7714 (3.4166)*** | -76.4951 (30.0172)* |
| mobile : HDI^2 | | | 1.8852 (0.0246)*** | 3.3265 (0.7124)*** | -185.1585 (11.8494)*** | 415.2980 (143.1758)** |
| mobile : HDI^3 | | | | -0.8700 (0.3991)* | 216.3254 (13.6596)*** | -832.7491 (255.2358)** |
| mobile : HDI^4 | | | | | -83.1222 (5.2339)*** | 727.8907 (201.5475)*** |
| mobile : HDI^5 | | | | | | -234.1483 (59.4847)*** |
| HDI^2 | | | -6.5621 (0.1587)*** | 234.7857 (5.0081)*** | 1983.9482 (94.8060)*** | -41665.3318 (1149.0281)*** |
| HDI^3 | | | | -92.9699 (1.9344)*** | -1527.1808 (75.2644)*** | 51709.8261 (1389.8805)*** |
| HDI^4 | | | | | 437.7320 (22.3464)*** | -31878.6736 (838.6029)*** |
| HDI^5 | | | | | | 7812.3979 (201.9034)*** |
| Revision length (bytes) | 0.1665 (0.0005)*** | 0.1664 (0.0005)*** | 0.1665 (0.0005)*** | 0.1663 (0.0005)*** | 0.1663 (0.0005)*** | 0.1662 (0.0005)*** |
| time to first paint | -0.0154 (0.0006)*** | -0.0152 (0.0006)*** | -0.0150 (0.0006)*** | -0.0148 (0.0006)*** | -0.0149 (0.0006)*** | -0.0148 (0.0006)*** |
| time to dom interactive | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0038 (0.0009)*** |
| R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0531 |
| Adj. R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0530 |
| Num. obs. | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 |
| RMSE | 14.3860 | 14.3821 | 14.3812 | 14.3795 | 14.3791 | 14.3780 |

***p < 0.001, **p < 0.01, *p < 0.05

Anova Results

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 9873619 | 2043418459 | NA | NA | NA | NA |
| 9873618 | 2042312944 | 1 | 1105514.7 | 5347.7375 | 0 |
| 9873617 | 2042058043 | 1 | 254901.4 | 1233.0417 | 0 |
| 9873615 | 2041575118 | 2 | 482924.4 | 1168.0320 | 0 |
| 9873613 | 2041441930 | 2 | 133187.8 | 322.1365 | 0 |
| 9873611 | 2041128943 | 2 | 312987.5 | 757.0118 | 0 |
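
The F column above can be reproduced from the Res.Df and RSS columns alone. The sketch below assumes the convention in which every F statistic in a sequence of nested models is divided by the residual mean square of the largest model; whether the table was produced this way is an assumption, supported only by the fact that the numbers agree. The RSS values are the rounded ones printed above, so the results match to about two decimal places. The function name is illustrative.

```python
# Residual degrees of freedom and residual sums of squares, copied from the table above.
res_df = [9873619, 9873618, 9873617, 9873615, 9873613, 9873611]
rss = [2043418459, 2042312944, 2042058043, 2041575118, 2041441930, 2041128943]


def sequential_f(res_df, rss):
    """F statistic for each model against the previous one in the sequence.

    Assumes nested models ordered from smallest to largest. The denominator
    is the residual mean square of the largest (last) model.
    """
    scale = rss[-1] / res_df[-1]
    stats = []
    for i in range(1, len(rss)):
        df = res_df[i - 1] - res_df[i]   # number of parameters added
        sum_sq = rss[i - 1] - rss[i]     # reduction in residual sum of squares
        stats.append((sum_sq / df) / scale)
    return stats


f_stats = sequential_f(res_df, rss)
for f in f_stats:
    print(round(f, 2))
```

Because the denominator is roughly 207 and the sample has nearly 10 million rows, even the smallest reduction in RSS here yields an F statistic in the hundreds.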

Residual Analysis


I made partial residual plots to diagnose the fit for model 3. The first set of plots shows the partial residuals for a term (the model residuals plus that term's fitted contribution) on the Y axis and the variable on the X axis. If the trend departs from a straight line, that can be a reason to include higher-order terms in the model. I use a loess smoother to look for curvature.
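
As a minimal sketch of what goes into such a plot, the code below computes partial residuals for one predictor on made-up data, using only the standard library. The data, the variable names, and the tiny least-squares solver are all illustrative assumptions and not the code used for this analysis. The smoothing and plotting steps are left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (fine for tiny examples)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: the true relationship with hdi is curved, but only a linear term is fitted.
n = 41
hdi = [0.5 + 0.5 * i / (n - 1) for i in range(n)]
mobile = [i % 2 for i in range(n)]
y = [2.0 - 1.0 * h + 6.0 * (h - 0.75) ** 2 - 0.3 * m for h, m in zip(hdi, mobile)]

rows = [[1.0, h, m] for h, m in zip(hdi, mobile)]  # intercept, hdi, mobile
beta = ols(rows, y)
fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Partial residuals for hdi: ordinary residuals plus hdi's fitted linear contribution.
partial_hdi = [e + beta[1] * h for e, h in zip(resid, hdi)]

for h, p in list(zip(hdi, partial_hdi))[::10]:
    print(round(h, 3), round(p, 3))
```

Plotting `partial_hdi` against `hdi` for this synthetic example would show points that sit above the fitted line at both ends of the range and below it in the middle, which is the kind of bend a smoother is meant to reveal.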

Partial residuals plot for the interaction of HDI and mobile, from the model of Wikipedia reading times. There doesn't seem to be much evidence of curvature (the line is pretty straight), suggesting that the current specification is appropriate, although including higher-order terms does not seem totally unreasonable. However, HDI lies almost entirely in the range (0.5, 1), so the separation of the data might obscure curvature.

Partial residuals plot for the interaction of HDI (scaled) and mobile. Similar to the previous plot, except that I log-transformed and scaled HDI before fitting the model. Now there is even less appearance of curvature, suggesting that the current (first-order) specification is appropriate.
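
The exact transformation is not spelled out here. One common reading of "logging and scaling" is to take the natural log and then standardize to mean 0 and standard deviation 1, and the sketch below assumes that reading. The HDI values and the function name are made up for illustration.

```python
import math
import statistics


def log_scale(values):
    """Natural log followed by standardization (mean 0, sample SD 1)."""
    logged = [math.log(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in logged]


hdi = [0.52, 0.64, 0.71, 0.80, 0.88, 0.93]  # made-up HDI values
z = log_scale(hdi)
print([round(v, 3) for v in z])
```

After this transformation, a coefficient on HDI describes the change associated with one standard deviation of log HDI, so it is not directly comparable to a coefficient on raw HDI.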

Partial residuals plot for HDI (Not scaled). There is a clear bend between 0.8 and 0.9 on the range of HDI. To me this suggests that we can include higher-order polynomial terms for HDI in the model.

Specifications without higher-order mobile:HDI interaction, but with higher order HDI polynomial


Per the results of the residual analysis, I realized that the right specification for HDI and mobile might be a higher-order polynomial for HDI, but not for the interaction between HDI and mobile. The results below bear this out: the complex interaction is not very helpful, but the higher-order specification for HDI is.

Also, the F-test is still significant for the quadratic interaction, but the statistic is much smaller (about 27, versus several hundred or more for the other terms). With nearly 10 million observations, it will be hard to fail any test.
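
One way to put that significance in perspective is to ask what share of the remaining residual variation each added term removes. The sketch below computes this partial R² as Sum of Sq divided by the RSS of the model without the term, using values from this section's ANOVA table. The function name is illustrative, and this calculation is an addition for illustration, not part of the original analysis.

```python
def partial_r_squared(sum_sq, rss_reduced):
    """Share of the reduced model's residual variation removed by the added term(s)."""
    return sum_sq / rss_reduced


# Adding mobile:HDI^2 to the quartic-HDI model (last row of the ANOVA table).
quad_interaction = partial_r_squared(5480.562, 2041469312)

# Adding the cubic HDI term to the quadratic-HDI model, for comparison.
cubic_hdi = partial_r_squared(470927.432, 2042058100)

print(f"{quad_interaction:.2e}")
print(f"{cubic_hdi:.2e}")
```

On these numbers the quadratic interaction removes only a few millionths of the remaining variation, roughly two orders of magnitude less than the cubic HDI term.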

Statistical models
| | model 2 | model 3 | model 3 with quadratic HDI | model 3 with cubic HDI | model 3 with quartic HDI | model 3 with quadratic HDI:mobile |
|---|---|---|---|---|---|---|
| Intercept | 9.9687 (0.0077)*** | 10.0457 (0.0078)*** | 10.0487 (0.0078)*** | 10.0171 (0.0078)*** | 10.0308 (0.0079)*** | 10.0314 (0.0079)*** |
| mobile | -0.2016 (0.0011)*** | -0.3236 (0.0020)*** | -0.3137 (0.0020)*** | -0.3082 (0.0020)*** | -0.3053 (0.0020)*** | -0.3071 (0.0020)*** |
| Human Development Index | -0.0946 (0.0009)*** | -0.1746 (0.0014)*** | -0.1172 (0.0022)*** | -0.0520 (0.0026)*** | -0.0243 (0.0028)*** | -0.0147 (0.0034)*** |
| mobile : HDI | | 0.1399 (0.0019)*** | 0.1275 (0.0019)*** | 0.1196 (0.0020)*** | 0.1186 (0.0020)*** | 0.1029 (0.0036)*** |
| mobile : HDI^2 | | | | | | 0.0145 (0.0028)*** |
| HDI^2 | | | -0.0482 (0.0014)*** | 0.0199 (0.0020)*** | -0.0690 (0.0042)*** | -0.0785 (0.0046)*** |
| HDI^3 | | | | -0.0764 (0.0016)*** | -0.0868 (0.0017)*** | -0.0863 (0.0017)*** |
| HDI^4 | | | | | 0.0421 (0.0018)*** | 0.0424 (0.0018)*** |
| Revision length (bytes) | 0.1665 (0.0005)*** | 0.1664 (0.0005)*** | 0.1664 (0.0005)*** | 0.1663 (0.0005)*** | 0.1664 (0.0005)*** | 0.1664 (0.0005)*** |
| time to first paint | -0.0154 (0.0006)*** | -0.0152 (0.0006)*** | -0.0150 (0.0006)*** | -0.0148 (0.0006)*** | -0.0148 (0.0006)*** | -0.0148 (0.0006)*** |
| time to dom interactive | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0035 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** |
| R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0529 |
| Adj. R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0529 |
| Num. obs. | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 |
| RMSE | 14.3860 | 14.3821 | 14.3812 | 14.3796 | 14.3792 | 14.3791 |

***p < 0.001, **p < 0.01, *p < 0.05

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 9873619 | 2043410883 | NA | NA | NA | NA |
| 9873618 | 2042305921 | 1 | 1104961.432 | 5344.18611 | 0e+00 |
| 9873617 | 2042058100 | 1 | 247821.330 | 1198.59687 | 0e+00 |
| 9873616 | 2041587173 | 1 | 470927.432 | 2277.65764 | 0e+00 |
| 9873615 | 2041469312 | 1 | 117861.191 | 570.03993 | 0e+00 |
| 9873614 | 2041463831 | 1 | 5480.562 | 26.50694 | 3e-07 |

Moreover, the quadratic term for the interaction appears to be over-fitting. When I test the models on out-of-sample prediction, the model with the quadratic interaction has a lower R² than the quartic-HDI model without it (0.0614 vs. 0.0616), with essentially the same RMSE.

| Rmse | Rsqr | name |
|---|---|---|
| 1.677372 | 0.0542021 | model 2 |
| 1.678363 | 0.0575015 | model 3 |
| 1.679984 | 0.0562511 | model 3 with quadratic HDI |
| 1.679288 | 0.0590972 | model 3 with cubic HDI |
| 1.680671 | 0.0616457 | model 3 with quartic HDI |
| 1.680658 | 0.0613616 | model 3 with quadratic HDI:mobile |
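
The log does not say how Rsqr was computed on the hold-out sample. If all models were scored on the same hold-out rows, an R² defined as 1 - SSE/SST would have to fall whenever RMSE rises, which is not the pattern in the table, so Rsqr there may instead be something like a squared correlation between predictions and outcomes. The sketch below shows both definitions next to RMSE on made-up numbers. All names and values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """1 - SSE/SST, with SST taken around the mean of the hold-out outcomes."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


def squared_correlation(actual, predicted):
    """Squared Pearson correlation between outcomes and predictions."""
    ma = sum(actual) / len(actual)
    mp = sum(predicted) / len(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (va * vp)


holdout_y = [3.0, 5.0, 7.0, 9.0]   # made-up hold-out outcomes
pred_a = [2.5, 5.5, 6.5, 9.5]      # small unsystematic errors
pred_b = [5.0, 7.0, 9.0, 11.0]     # right shape, but biased upward by 2

for name, pred in [("a", pred_a), ("b", pred_b)]:
    print(name, rmse(holdout_y, pred), r_squared(holdout_y, pred),
          squared_correlation(holdout_y, pred))
```

The biased predictions have the larger RMSE and the lower 1 - SSE/SST, yet a higher squared correlation, which is one way the two columns of a table like the one above can disagree about which model is best.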

This chart is used to interpret a regression model predicting the amount of time a Wikipedia page is visible in readers' browsers. It is a marginal effects plot showing how the model-predicted relationship between the HDI of the country in which the reader is located and the amount of time they spend reading depends on whether they visit the mobile or desktop site. Readers read longer on desktop than on mobile and longer in lower-HDI contexts, and the gap between desktop and mobile is greater in lower-HDI contexts.

I wonder if the little increase in the mid range of HDI on mobile is robust?