Research talk:Reading time/Work log/2018-10-07
Sunday, October 7, 2018
Model Selection
The table below illustrates the model selection process. We fit each of 5 distributions to a sample of views from each wiki and compute goodness-of-fit criteria. Each model is fit on a 75% subsample (training set), and the goodness-of-fit criteria are computed on the remaining 25% (test set).
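As a minimal sketch of the fit-then-score procedure described above, the following assumes the model names in the table refer to the `scipy.stats` distributions of the same names. The function name `heldout_loglik`, the random shuffle, and fixing the location parameter at zero (`floc=0`) are assumptions made for illustration, not details taken from this log.

```python
import numpy as np
from scipy import stats

# Candidate models, named as in scipy.stats (and in the table's model column).
MODELS = ("weibull_min", "exponweib", "lognorm", "gamma", "expon")


def heldout_loglik(samples, models=MODELS, train_frac=0.75, seed=0):
    """Fit each model on a training split; return its log-likelihood on the test split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(samples, dtype=float))
    n_train = int(train_frac * len(shuffled))
    train, test = shuffled[:n_train], shuffled[n_train:]

    scores = {}
    for name in models:
        dist = getattr(stats, name)
        # Assumption: reading times are positive, so the location is fixed at 0.
        params = dist.fit(train, floc=0)
        # Sum of log-densities of the held-out points under the fitted model.
        scores[name] = float(np.sum(dist.logpdf(test, *params)))
    return scores
```

Calling `heldout_loglik(reading_times)` on an array of positive reading times (a hypothetical variable) returns one held-out log-likelihood per model, where higher is better.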
AIC is the Akaike information criterion and BIC is the Bayesian information criterion. The two criteria are similar (lower is better): both attempt to quantify the amount of information lost by the model while penalizing the number of fitted parameters. The main difference is that the BIC penalty also takes sample size into account. KS is the Kolmogorov-Smirnov statistic, which may not be very useful for large sample sizes, because with enough data the test rejects the null hypothesis that the model is the "true model" even when the deviation is small. loglik is the log-likelihood.
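For reference, the standard textbook definitions are AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of fitted parameters, n is the sample size, and ln L is the log-likelihood. The sketch below implements these definitions. The log does not show the exact code used to produce the table, so the helper names `aic` and `bic` and the choice k = 2 (shape and scale, with location fixed) are assumptions.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * loglik


# loglik from the dewiki weibull_min row of the table; k = 2 is an assumption.
print(round(aic(-5.754643e+03, 2), 2))  # 11513.29
```

Under that assumption the result agrees with the AIC reported for dewiki and weibull_min (1.151329e+04). Computing BIC this way additionally requires the test-set size n, which the log does not state.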
The table shows results for a handful of selected wikis. The lognormal model (lognorm) appears to be a good fit: it ranks first or second by BIC on every wiki shown and always outperforms the Weibull model (weibull_min), which ranks third throughout. The exponentiated Weibull model (exponweib) performs comparably, but may be difficult to interpret. Next we will fit the models on all the wikis to better evaluate which models are a good fit.
| | wiki | model | AIC | BIC | ks | loglik | rank (BIC) |
---|---|---|---|---|---|---|---|
0 | dewiki | weibull_min | 1.151329e+04 | 4.004192e+04 | 1.000000 | -5.754643e+03 | 3.0 |
1 | dewiki | exponweib | 1.109909e+04 | 3.859606e+04 | 1.000000 | -5.546544e+03 | 1.0 |
2 | dewiki | lognorm | 1.110096e+04 | 3.860755e+04 | 1.000000 | -5.548482e+03 | 2.0 |
3 | dewiki | gamma | 1.232521e+04 | 4.286638e+04 | 1.000000 | -6.160603e+03 | 4.0 |
4 | dewiki | expon | 1.496725e+04 | 5.206236e+04 | 1.000000 | -7.482627e+03 | 5.0 |
5 | enwiki | weibull_min | 1.712008e+04 | 5.923140e+04 | 1.000000 | -8.558042e+03 | 3.0 |
6 | enwiki | exponweib | 1.052603e+04 | 3.640883e+04 | 1.000000 | -5.260014e+03 | 1.0 |
7 | enwiki | lognorm | 1.054560e+04 | 3.648146e+04 | 1.000000 | -5.270798e+03 | 2.0 |
8 | enwiki | gamma | 2.454977e+06 | 8.495036e+06 | 1.000000 | -1.227487e+06 | 4.0 |
9 | enwiki | expon | 9.267797e+06 | 3.206969e+07 | 1.000000 | -4.633898e+06 | 5.0 |
10 | pawiki | weibull_min | 7.399301e+03 | 2.424868e+04 | 0.988747 | -3.697651e+03 | 3.0 |
11 | pawiki | exponweib | 6.876295e+03 | 2.252950e+04 | 0.999188 | -3.435147e+03 | 2.0 |
12 | pawiki | lognorm | 6.801278e+03 | 2.228812e+04 | 0.999755 | -3.398639e+03 | 1.0 |
13 | pawiki | gamma | 9.181350e+03 | 3.009093e+04 | 0.730970 | -4.588675e+03 | 4.0 |
14 | pawiki | expon | 1.895372e+04 | 6.213312e+04 | 0.999999 | -9.475861e+03 | 5.0 |
15 | nlwiki | weibull_min | 1.103810e+04 | 3.835726e+04 | 1.000000 | -5.517048e+03 | 3.0 |
16 | nlwiki | exponweib | 1.064092e+04 | 3.697176e+04 | 0.999966 | -5.317459e+03 | 1.0 |
17 | nlwiki | lognorm | 1.066475e+04 | 3.705955e+04 | 0.999993 | -5.330375e+03 | 2.0 |
18 | nlwiki | gamma | 1.166566e+04 | 4.053862e+04 | 1.000000 | -5.830832e+03 | 4.0 |
19 | nlwiki | expon | 1.341853e+04 | 4.663632e+04 | 1.000000 | -6.708263e+03 | 5.0 |
20 | eswiki | weibull_min | 1.245377e+04 | 4.339014e+04 | 0.999097 | -6.224885e+03 | 3.0 |
21 | eswiki | exponweib | 1.176727e+04 | 4.099280e+04 | 0.999977 | -5.880635e+03 | 2.0 |
22 | eswiki | lognorm | 1.176492e+04 | 4.098958e+04 | 0.999897 | -5.880461e+03 | 1.0 |
23 | eswiki | gamma | 1.498942e+04 | 5.222664e+04 | 0.749022 | -7.492712e+03 | 4.0 |
24 | eswiki | expon | 2.794001e+04 | 9.736306e+04 | 0.999999 | -1.396901e+04 | 5.0 |
25 | hiwiki | weibull_min | 1.018607e+04 | 3.496911e+04 | 1.000000 | -5.091036e+03 | 3.0 |
26 | hiwiki | exponweib | 1.001290e+04 | 3.436956e+04 | 0.999519 | -5.003449e+03 | 2.0 |
27 | hiwiki | lognorm | 1.000967e+04 | 3.436335e+04 | 0.999625 | -5.002835e+03 | 1.0 |
28 | hiwiki | gamma | 1.032541e+04 | 3.544760e+04 | 1.000000 | -5.160707e+03 | 4.0 |
29 | hiwiki | expon | 1.049624e+04 | 3.603908e+04 | 1.000000 | -5.247119e+03 | 5.0 |
30 | arwiki | weibull_min | 2.439485e+04 | 8.414907e+04 | 1.000000 | -1.219543e+04 | 3.0 |
31 | arwiki | exponweib | 1.083212e+04 | 3.735461e+04 | 1.000000 | -5.413059e+03 | 2.0 |
32 | arwiki | lognorm | 1.075584e+04 | 3.709637e+04 | 1.000000 | -5.375922e+03 | 1.0 |
33 | arwiki | gamma | 2.492751e+06 | 8.599635e+06 | 1.000000 | -1.246373e+06 | 4.0 |
34 | arwiki | expon | 8.934334e+06 | 3.082221e+07 | 1.000000 | -4.467166e+06 | 5.0 |