User:MPopov (WMF)/Notes/JS support

For T251464: judging by the traffic to English Wikipedia homepage, the numbers – while not definitive – suggest ratios of 2:1 on desktop and 3:1 on mobile web for JavaScript supported & enabled vs. disabled or unsupported.

Setup

This code (transferred from a Jupyter notebook running on stat1008 in the analytics cluster / SWAP) relies on the Presto Python client, because the equivalent query takes 5-10 minutes in Hive but only minute or two with Presto. Install via:

pip3 install presto-python-client['kerberos']

See wikitech:Analytics/Systems/Presto for more info.

import prestodb
import os

server = {
    'host': 'an-coord1001.eqiad.wmnet',
    'port': 8281,
    'ca_bundle':  '/etc/presto/ca.crt.pem',
}

conn = prestodb.dbapi.connect(
    http_scheme='https',
    host=server['host'],
    port=server['port'],
    user=os.environ['USER'],
    catalog='analytics_hive',
    auth=prestodb.auth.KerberosAuthentication(
        config='/etc/krb5.conf',
        service_name='presto',
        principal='{}@WIKIMEDIA'.format(os.environ['USER']),
        ca_bundle=server['ca_bundle']
    )
)
cursor = conn.cursor()

from jinja2 import Template
import numpy as np
import pandas as pd
from datetime import datetime
import time
from pathlib import Path

Methodology

To count how many users have JS, we will be using a few proxies. First, here we will focus on the Main Page (the homepage you see when you visit en.wikipedia.org, en.m.wikipedia.org). Then, we will look at various resources being requested by visitors. Essentially, we want to count how many "clients" (identified by their User-Agent + IP address combination) request certain images, CSS styles, and JS scripts as part of their visit to the Main Page. Due to our privacy policy, we cannot identify devices/users uniquely.

The UA+IP methodology is flawed in that, same environments on the same network sharing a public IP address (e.g. computers in a school computer lab) will be seen as the same client. At best we have these proxies and approximations.

What makes it tricky is that Wikipedia users can select different skins and some skins are defaults for some scenarios but not others, such as:

Modern (default on mobile): the skin used when visiting mobile web version (e.g. en.m.wikipedia.org), can also be enabled on desktop
Monobook: the "classic" look that was seen on most wikis from 2005 to 2010
Vector (default on desktop): A skin based on Monobook; as well as a new look, it adds a few new features to the wiki interface, particularly in the editing process.

As of April 2013, the Chick, Classic, MySkin, Nostalgia, and Simple skins are no longer available

All page visits request:

the HTML for the page itself, and those HTTP requests are labeled as pageviews
the favicon, located at /static/favicon/wikipedia.ico, even if they have images disabled (tested on Chrome 81)
CSS styles, which are loaded via ResourceLoader, for example:

<link rel="stylesheet" href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector">

If the user does not have JavaScript disabled, a request is made (via ResourceLoader) for the page's JS – e.g.

<script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector"></script>

If the user does not have images disabled AND they're using on, we can check for requests to certain images such as:

the "Powered by MediaWiki" image in the footer of Vector and Monobook (e.g. /static/images/poweredby_mediawiki_132x47.png)
the Wikipedia wordmark (/static/images/mobile/copyright/wikipedia-wordmark-en.svg) in the Modern skin

We can estimate how many visits have JS and images enabled by counting how many clients (not users) with pageviews request CSS & favicon, but not JS or images:

query = Template("""WITH file_requests AS (
    SELECT
      access_method, concat(user_agent, client_ip) AS pseudo_client_id,
      SUM(IF(regexp_like(content_type, 'text/html'), 1, 0)) > 0 AS html, -- aka page view
      SUM(IF(content_type IN('image/png', 'image/svg'), 1, 0)) > 0 AS image, -- "powered by" or wordmark
      SUM(IF(content_type = 'image/vnd.microsoft.icon', 1, 0)) > 0 AS favicon, -- requested even if images are blocked
      SUM(IF(regexp_like(content_type, 'text/css'), 1, 0)) > 0 AS css, -- site.styles loaded by ResourceLoader
      SUM(IF(regexp_like(content_type, 'text/javascript'), 1, 0)) > 0 AS js -- startup loaded by ResourceLoader
    FROM wmf.webrequest
    WHERE year = {{ year }} AND month = {{ month }} AND day = {{ day }} AND hour = {{ hour }}
      AND webrequest_source = 'text'
      AND access_method IN('desktop', 'mobile web')
      AND agent_type = 'user'
      AND (
        -- Either a enwiki:Main Page pageview:
        (is_pageview AND namespace_id = 0 AND page_id = 15580374)
        -- or a resource requested on visit to that page:
        OR (
          http_status IN('200', '304')
          AND uri_host IN('en.wikipedia.org', 'en.m.wikipedia.org')
          AND referer IN('https://en.wikipedia.org/wiki/Main_Page', 'https://en.m.wikipedia.org/wiki/Main_Page') -- resources requested from visiting front page
          AND regexp_like(content_type, '(image|text)/(vnd.microsoft.icon|png|svg|css|javascript)')
          -- specifically, different types of resources:
          AND (
            uri_path IN('/static/favicon/wikipedia.ico')
            -- "powered by mediawiki" at the bottom of the vector, timeless, monobook skins or logo at the top of minerva skin:
            OR regexp_like(uri_path, '(poweredby|wikipedia-wordmark)') 
            OR (
              uri_path = '/w/load.php' -- ResourceLoader
              AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'styles'
              -- check if site.styles is one of the STYLES modules being loaded:
              AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'site.styles')
            )
            OR (
              uri_path = '/w/load.php' -- ResourceLoader
              AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'scripts'
              -- check if startup is one of the SCRIPTS modules being loaded:
              AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'startup')
            )
          )
        )
      )
      GROUP BY access_method, concat(user_agent, client_ip)
)
SELECT access_method, html, image, favicon, css, js, COUNT(1) AS n_clients
FROM file_requests
GROUP BY access_method, html, image, favicon, css, js""")

def run_query(**kwargs):
    cursor.execute(query.render(**kwargs))
    rows = cursor.fetchall()
    column_descriptions = cursor.description
    df = pd.DataFrame(rows)
    df.columns = [c[0] for c in column_descriptions]
    return df.sort_values(by = ['access_method', 'html', 'image', 'favicon', 'css', 'js'])

Test run

result = run_query(year = 2020, month = 4, day = 20, hour = 12)
result

	access_method	html	image	favicon	css	js	n_clients
1	desktop	False	False	False	False	False	17422
20	desktop	False	False	False	False	True	89
16	desktop	False	False	False	True	False	110
34	desktop	False	False	False	True	True	399
15	desktop	False	False	True	False	False	14639
9	desktop	False	False	True	False	True	2
8	desktop	False	False	True	True	False	2
23	desktop	False	False	True	True	True	3
21	desktop	False	True	False	False	False	18
6	desktop	False	True	False	False	True	2
33	desktop	False	True	False	True	False	9
5	desktop	False	True	False	True	True	27
30	desktop	False	True	True	False	False	1
13	desktop	False	True	True	True	True	1
18	desktop	True	False	False	False	False	28289
32	desktop	True	False	False	False	True	767
28	desktop	True	False	False	True	False	900
27	desktop	True	False	False	True	True	25732
29	desktop	True	False	True	False	False	66
22	desktop	True	False	True	False	True	7
3	desktop	True	False	True	True	False	5
10	desktop	True	False	True	True	True	1355
36	desktop	True	True	False	False	False	56
25	desktop	True	True	False	False	True	30
4	desktop	True	True	False	True	False	196
12	desktop	True	True	False	True	True	15335
26	desktop	True	True	True	False	False	3
11	desktop	True	True	True	False	True	3
0	desktop	True	True	True	True	False	8
14	desktop	True	True	True	True	True	4977
17	mobile web	False	False	False	False	True	2693
2	mobile web	False	False	False	True	False	7
31	mobile web	False	False	False	True	True	26
19	mobile web	True	False	False	False	False	21395
35	mobile web	True	False	False	False	True	67398
7	mobile web	True	False	False	True	False	2
24	mobile web	True	False	False	True	True	51

Okay, so it's not entirely clear what it means to have clients request resources like images, favicon, CSS, JS, but not the HTML for the page (which is what would have pointed the browser at those resources). But this is a start...

(
    result.query('html')
    .groupby(['access_method', 'js'])
    .agg({'n_clients': 'sum'})
    .groupby(level = 0)
    .transform(lambda x: 100 * x / x.sum())
)

		n_clients
access_method	js
desktop	False	37.981963
desktop	True	62.018037
mobile web	False	24.083245
mobile web	True	75.916755

Some interesting findings from just one hour of data from 12PM-1PM on April 20th:

on desktop, about 60% requested JS
on mobile web, about 75% requested JS

Full run

Let's look at full 24 hours for a week of traffic:

Thursday, April 23
Friday, April 24
Saturday, April 25
Sunday, April 26
Monday, April 27
...
Thursday, April 30

if Path('js_requests.csv').exists():
    results = pd.read_csv('js_requests.csv')
else:
    results = []
    dates = pd.date_range(start = "2020-04-23", end = "2020-04-30").to_pydatetime()
    for d in dates:
        yyyy, mm, dd = d.year, d.month, d.day
        print('processing %s' % (d.date().isoformat()))
        start, end = time.time(), time.time()
        dfs = []
        for hh in range(24):
            df = run_query(year = yyyy, month = mm, day = dd, hour = hh)
            df['date'] = pd.Series([d.date().isoformat() for x in range(df.shape[0])])
            df['dt'] = pd.Series([d.isoformat() for x in range(df.shape[0])])
            dfs.append(df)
            prev, end = end, time.time()
            print('processed hour %d of 24 in %f minutes' % (hh + 1, (end - prev) / 60))
        dfs = pd.concat(dfs)
        dfs = dfs.groupby(['date', 'access_method', 'html', 'image', 'favicon', 'css', 'js'], as_index = False).aggregate({ "n_clients": "sum"})
        results.append(dfs)
        print('processed %s in %f hours' % (d.date().isoformat(), (end - start) / 3600))

    results = pd.concat(results)
    # save all the data:
    results.to_csv('js_requests.csv', index = False)

Note: even using the blazing-fast Presto, the querying took 14.6 hours total with the following times:

Data date	Average time per hour of data	Median time per hour of data	Total time to query 24 hours of data
2020-04-23	2.7 minutes	2.7 minutes	1.1 hours
2020-04-24	2.9 minutes	2.9 minutes	1.2 hours
2020-04-25	4.0 minutes	3.8 minutes	1.6 hours
2020-04-26	4.9 minutes	5.0 minutes	2.0 hours
2020-04-27	7.4 minutes	7.3 minutes	3.0 hours
2020-04-28	13.2 minutes	11.7 minutes	5.3 hours
2020-04-29	0.6 minutes	0.6 minutes	0.3 hours
2020-04-30	0.8 minutes	0.6 minutes	0.3 hours

Results

Note: Full results dataset can be downloaded from T251464#6104955

proportions = (
    results.query('html')
    .groupby(['access_method', 'date', 'js'])
    .agg({'n_clients': 'sum'})
    .groupby(level = [0, 1])
    .transform(lambda x: 100 * x / x.sum()) # yields % w/ JS and w/o JS on each day, within each access_method
)
proportions['n_clients'] = proportions['n_clients'].map('{:,.2f}%'.format)
proportions.unstack()

		n_clients
	js	False	True
access_method	date
desktop	2020-04-23	37.26%	62.74%
	2020-04-24	36.81%	63.19%
	2020-04-25	39.37%	60.63%
	2020-04-26	38.62%	61.38%
	2020-04-27	36.97%	63.03%
	2020-04-28	37.15%	62.85%
	2020-04-29	37.45%	62.55%
	2020-04-30	36.93%	63.07%
mobile web	2020-04-23	24.88%	75.12%
	2020-04-24	24.78%	75.22%
	2020-04-25	24.84%	75.16%
	2020-04-26	24.87%	75.13%
	2020-04-27	24.70%	75.30%
	2020-04-28	24.47%	75.53%
	2020-04-29	24.52%	75.48%
	2020-04-30	24.66%	75.34%