User:MPopov (WMF)/Notes/JS support
For T251464: judging by the traffic to English Wikipedia homepage, the numbers – while not definitive – suggest ratios of 2:1 on desktop and 3:1 on mobile web for JavaScript supported & enabled vs. disabled or unsupported.
Setup
[edit]This code (transferred from a Jupyter notebook running on stat1008 in the analytics cluster / SWAP) relies on the Presto Python client, because the equivalent query takes 5-10 minutes in Hive but only minute or two with Presto. Install via:
pip3 install presto-python-client['kerberos']
See wikitech:Analytics/Systems/Presto for more info.
import prestodb
import os
server = {
'host': 'an-coord1001.eqiad.wmnet',
'port': 8281,
'ca_bundle': '/etc/presto/ca.crt.pem',
}
conn = prestodb.dbapi.connect(
http_scheme='https',
host=server['host'],
port=server['port'],
user=os.environ['USER'],
catalog='analytics_hive',
auth=prestodb.auth.KerberosAuthentication(
config='/etc/krb5.conf',
service_name='presto',
principal='{}@WIKIMEDIA'.format(os.environ['USER']),
ca_bundle=server['ca_bundle']
)
)
cursor = conn.cursor()
from jinja2 import Template
import numpy as np
import pandas as pd
from datetime import datetime
import time
from pathlib import Path
Methodology
[edit]To count how many users have JS, we will be using a few proxies. First, here we will focus on the Main Page (the homepage you see when you visit en.wikipedia.org, en.m.wikipedia.org). Then, we will look at various resources being requested by visitors. Essentially, we want to count how many "clients" (identified by their User-Agent + IP address combination) request certain images, CSS styles, and JS scripts as part of their visit to the Main Page. Due to our privacy policy, we cannot identify devices/users uniquely.
The UA+IP methodology is flawed in that, same environments on the same network sharing a public IP address (e.g. computers in a school computer lab) will be seen as the same client. At best we have these proxies and approximations.
What makes it tricky is that Wikipedia users can select different skins and some skins are defaults for some scenarios but not others, such as:
- Modern (default on mobile): the skin used when visiting mobile web version (e.g. en.m.wikipedia.org), can also be enabled on desktop
- Monobook: the "classic" look that was seen on most wikis from 2005 to 2010
- Vector (default on desktop): A skin based on Monobook; as well as a new look, it adds a few new features to the wiki interface, particularly in the editing process.
As of April 2013, the Chick, Classic, MySkin, Nostalgia, and Simple skins are no longer available
All page visits request:
- the HTML for the page itself, and those HTTP requests are labeled as pageviews
- the favicon, located at
/static/favicon/wikipedia.ico
, even if they have images disabled (tested on Chrome 81) - CSS styles, which are loaded via ResourceLoader, for example:
<link rel="stylesheet" href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector">
If the user does not have JavaScript disabled, a request is made (via ResourceLoader) for the page's JS – e.g.
<script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector"></script>
If the user does not have images disabled AND they're using on, we can check for requests to certain images such as:
- the "Powered by MediaWiki" image in the footer of Vector and Monobook (e.g.
/static/images/poweredby_mediawiki_132x47.png
) - the Wikipedia wordmark (
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
) in the Modern skin
We can estimate how many visits have JS and images enabled by counting how many clients (not users) with pageviews request CSS & favicon, but not JS or images:
query = Template("""WITH file_requests AS (
SELECT
access_method, concat(user_agent, client_ip) AS pseudo_client_id,
SUM(IF(regexp_like(content_type, 'text/html'), 1, 0)) > 0 AS html, -- aka page view
SUM(IF(content_type IN('image/png', 'image/svg'), 1, 0)) > 0 AS image, -- "powered by" or wordmark
SUM(IF(content_type = 'image/vnd.microsoft.icon', 1, 0)) > 0 AS favicon, -- requested even if images are blocked
SUM(IF(regexp_like(content_type, 'text/css'), 1, 0)) > 0 AS css, -- site.styles loaded by ResourceLoader
SUM(IF(regexp_like(content_type, 'text/javascript'), 1, 0)) > 0 AS js -- startup loaded by ResourceLoader
FROM wmf.webrequest
WHERE year = {{ year }} AND month = {{ month }} AND day = {{ day }} AND hour = {{ hour }}
AND webrequest_source = 'text'
AND access_method IN('desktop', 'mobile web')
AND agent_type = 'user'
AND (
-- Either a enwiki:Main Page pageview:
(is_pageview AND namespace_id = 0 AND page_id = 15580374)
-- or a resource requested on visit to that page:
OR (
http_status IN('200', '304')
AND uri_host IN('en.wikipedia.org', 'en.m.wikipedia.org')
AND referer IN('https://en.wikipedia.org/wiki/Main_Page', 'https://en.m.wikipedia.org/wiki/Main_Page') -- resources requested from visiting front page
AND regexp_like(content_type, '(image|text)/(vnd.microsoft.icon|png|svg|css|javascript)')
-- specifically, different types of resources:
AND (
uri_path IN('/static/favicon/wikipedia.ico')
-- "powered by mediawiki" at the bottom of the vector, timeless, monobook skins or logo at the top of minerva skin:
OR regexp_like(uri_path, '(poweredby|wikipedia-wordmark)')
OR (
uri_path = '/w/load.php' -- ResourceLoader
AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'styles'
-- check if site.styles is one of the STYLES modules being loaded:
AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'site.styles')
)
OR (
uri_path = '/w/load.php' -- ResourceLoader
AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'scripts'
-- check if startup is one of the SCRIPTS modules being loaded:
AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'startup')
)
)
)
)
GROUP BY access_method, concat(user_agent, client_ip)
)
SELECT access_method, html, image, favicon, css, js, COUNT(1) AS n_clients
FROM file_requests
GROUP BY access_method, html, image, favicon, css, js""")
def run_query(**kwargs):
cursor.execute(query.render(**kwargs))
rows = cursor.fetchall()
column_descriptions = cursor.description
df = pd.DataFrame(rows)
df.columns = [c[0] for c in column_descriptions]
return df.sort_values(by = ['access_method', 'html', 'image', 'favicon', 'css', 'js'])
Test run
[edit]result = run_query(year = 2020, month = 4, day = 20, hour = 12)
result
access_method | html | image | favicon | css | js | n_clients | |
---|---|---|---|---|---|---|---|
1 | desktop | False | False | False | False | False | 17422 |
20 | desktop | False | False | False | False | True | 89 |
16 | desktop | False | False | False | True | False | 110 |
34 | desktop | False | False | False | True | True | 399 |
15 | desktop | False | False | True | False | False | 14639 |
9 | desktop | False | False | True | False | True | 2 |
8 | desktop | False | False | True | True | False | 2 |
23 | desktop | False | False | True | True | True | 3 |
21 | desktop | False | True | False | False | False | 18 |
6 | desktop | False | True | False | False | True | 2 |
33 | desktop | False | True | False | True | False | 9 |
5 | desktop | False | True | False | True | True | 27 |
30 | desktop | False | True | True | False | False | 1 |
13 | desktop | False | True | True | True | True | 1 |
18 | desktop | True | False | False | False | False | 28289 |
32 | desktop | True | False | False | False | True | 767 |
28 | desktop | True | False | False | True | False | 900 |
27 | desktop | True | False | False | True | True | 25732 |
29 | desktop | True | False | True | False | False | 66 |
22 | desktop | True | False | True | False | True | 7 |
3 | desktop | True | False | True | True | False | 5 |
10 | desktop | True | False | True | True | True | 1355 |
36 | desktop | True | True | False | False | False | 56 |
25 | desktop | True | True | False | False | True | 30 |
4 | desktop | True | True | False | True | False | 196 |
12 | desktop | True | True | False | True | True | 15335 |
26 | desktop | True | True | True | False | False | 3 |
11 | desktop | True | True | True | False | True | 3 |
0 | desktop | True | True | True | True | False | 8 |
14 | desktop | True | True | True | True | True | 4977 |
17 | mobile web | False | False | False | False | True | 2693 |
2 | mobile web | False | False | False | True | False | 7 |
31 | mobile web | False | False | False | True | True | 26 |
19 | mobile web | True | False | False | False | False | 21395 |
35 | mobile web | True | False | False | False | True | 67398 |
7 | mobile web | True | False | False | True | False | 2 |
24 | mobile web | True | False | False | True | True | 51 |
Okay, so it's not entirely clear what it means to have clients request resources like images, favicon, CSS, JS, but not the HTML for the page (which is what would have pointed the browser at those resources). But this is a start...
(
result.query('html')
.groupby(['access_method', 'js'])
.agg({'n_clients': 'sum'})
.groupby(level = 0)
.transform(lambda x: 100 * x / x.sum())
)
n_clients | ||
---|---|---|
access_method | js | |
desktop | False | 37.981963 |
True | 62.018037 | |
mobile web | False | 24.083245 |
True | 75.916755 |
Some interesting findings from just one hour of data from 12PM-1PM on April 20th:
- on desktop, about 60% requested JS
- on mobile web, about 75% requested JS
Full run
[edit]Let's look at full 24 hours for a week of traffic:
- Thursday, April 23
- Friday, April 24
- Saturday, April 25
- Sunday, April 26
- Monday, April 27
- ...
- Thursday, April 30
if Path('js_requests.csv').exists():
results = pd.read_csv('js_requests.csv')
else:
results = []
dates = pd.date_range(start = "2020-04-23", end = "2020-04-30").to_pydatetime()
for d in dates:
yyyy, mm, dd = d.year, d.month, d.day
print('processing %s' % (d.date().isoformat()))
start, end = time.time(), time.time()
dfs = []
for hh in range(24):
df = run_query(year = yyyy, month = mm, day = dd, hour = hh)
df['date'] = pd.Series([d.date().isoformat() for x in range(df.shape[0])])
df['dt'] = pd.Series([d.isoformat() for x in range(df.shape[0])])
dfs.append(df)
prev, end = end, time.time()
print('processed hour %d of 24 in %f minutes' % (hh + 1, (end - prev) / 60))
dfs = pd.concat(dfs)
dfs = dfs.groupby(['date', 'access_method', 'html', 'image', 'favicon', 'css', 'js'], as_index = False).aggregate({ "n_clients": "sum"})
results.append(dfs)
print('processed %s in %f hours' % (d.date().isoformat(), (end - start) / 3600))
results = pd.concat(results)
# save all the data:
results.to_csv('js_requests.csv', index = False)
Note: even using the blazing-fast Presto, the querying took 14.6 hours total with the following times:
Data date | Average time per hour of data | Median time per hour of data | Total time to query 24 hours of data |
---|---|---|---|
2020-04-23 | 2.7 minutes | 2.7 minutes | 1.1 hours |
2020-04-24 | 2.9 minutes | 2.9 minutes | 1.2 hours |
2020-04-25 | 4.0 minutes | 3.8 minutes | 1.6 hours |
2020-04-26 | 4.9 minutes | 5.0 minutes | 2.0 hours |
2020-04-27 | 7.4 minutes | 7.3 minutes | 3.0 hours |
2020-04-28 | 13.2 minutes | 11.7 minutes | 5.3 hours |
2020-04-29 | 0.6 minutes | 0.6 minutes | 0.3 hours |
2020-04-30 | 0.8 minutes | 0.6 minutes | 0.3 hours |
Results
[edit]Note: Full results
dataset can be downloaded from T251464#6104955
proportions = (
results.query('html')
.groupby(['access_method', 'date', 'js'])
.agg({'n_clients': 'sum'})
.groupby(level = [0, 1])
.transform(lambda x: 100 * x / x.sum()) # yields % w/ JS and w/o JS on each day, within each access_method
)
proportions['n_clients'] = proportions['n_clients'].map('{:,.2f}%'.format)
proportions.unstack()
n_clients | |||
---|---|---|---|
js | False | True | |
access_method | date | ||
desktop | 2020-04-23 | 37.26% | 62.74% |
2020-04-24 | 36.81% | 63.19% | |
2020-04-25 | 39.37% | 60.63% | |
2020-04-26 | 38.62% | 61.38% | |
2020-04-27 | 36.97% | 63.03% | |
2020-04-28 | 37.15% | 62.85% | |
2020-04-29 | 37.45% | 62.55% | |
2020-04-30 | 36.93% | 63.07% | |
mobile web | 2020-04-23 | 24.88% | 75.12% |
2020-04-24 | 24.78% | 75.22% | |
2020-04-25 | 24.84% | 75.16% | |
2020-04-26 | 24.87% | 75.13% | |
2020-04-27 | 24.70% | 75.30% | |
2020-04-28 | 24.47% | 75.53% | |
2020-04-29 | 24.52% | 75.48% | |
2020-04-30 | 24.66% | 75.34% |