User:MPopov (WMF)/Notes/JS support

For T251464: judging by the traffic to the English Wikipedia homepage, the numbers – while not definitive – suggest ratios of 2:1 on desktop and 3:1 on mobile web for JavaScript supported & enabled vs. disabled or unsupported.

Setup

This code (transferred from a Jupyter notebook running on stat1008 in the analytics cluster / SWAP) relies on the Presto Python client, because the equivalent query takes 5-10 minutes in Hive but only a minute or two with Presto. Install via:

pip3 install presto-python-client['kerberos']

See wikitech:Analytics/Systems/Presto for more info.

import prestodb
import os

server = {
    'host': 'an-coord1001.eqiad.wmnet',
    'port': 8281,
    'ca_bundle':  '/etc/presto/ca.crt.pem',
}

conn = prestodb.dbapi.connect(
    http_scheme='https',
    host=server['host'],
    port=server['port'],
    user=os.environ['USER'],
    catalog='analytics_hive',
    auth=prestodb.auth.KerberosAuthentication(
        config='/etc/krb5.conf',
        service_name='presto',
        principal='{}@WIKIMEDIA'.format(os.environ['USER']),
        ca_bundle=server['ca_bundle']
    )
)
cursor = conn.cursor()
from jinja2 import Template
import numpy as np
import pandas as pd
from datetime import datetime
import time
from pathlib import Path

Methodology

To count how many users have JS, we will be using a few proxies. First, we will focus on the Main Page (the homepage you see when you visit en.wikipedia.org or en.m.wikipedia.org). Then, we will look at the various resources requested by visitors. Essentially, we want to count how many "clients" (identified by their User-Agent + IP address combination) request certain images, CSS styles, and JS scripts as part of their visit to the Main Page. Due to our privacy policy, we cannot identify devices/users uniquely.

The UA+IP methodology is flawed in that identical environments on the same network sharing a public IP address (e.g. computers in a school computer lab) will be counted as a single client. At best we have these proxies and approximations.
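
As a toy illustration of this pseudo-client grouping (the rows below are invented for illustration; the real data lives in wmf.webrequest):

```python
import pandas as pd

# Invented request rows: three devices behind one shared IP plus one other device.
requests = pd.DataFrame({
    'user_agent': ['UA-1', 'UA-1', 'UA-1', 'UA-2'],
    'client_ip':  ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.1'],
})

# Identical UA behind the same shared IP collapses into one pseudo-client,
# which is exactly the computer-lab limitation noted above.
requests['pseudo_client_id'] = requests['user_agent'] + requests['client_ip']
n_clients = requests['pseudo_client_id'].nunique()
print(n_clients)  # 3 distinct UA+IP combinations, even though 4 rows were logged
```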

What makes it tricky is that Wikipedia users can select different skins and some skins are defaults for some scenarios but not others, such as:

  • Minerva (default on mobile): the skin used when visiting the mobile web version (e.g. en.m.wikipedia.org); can also be enabled on desktop
  • Monobook: the "classic" look that was seen on most wikis from 2005 to 2010
  • Vector (default on desktop): a skin based on Monobook; in addition to an updated look, it adds a few new features to the wiki interface, particularly in the editing process.

As of April 2013, the Chick, Classic, MySkin, Nostalgia, and Simple skins are no longer available.

All page visits request:

  • the HTML for the page itself, and those HTTP requests are labeled as pageviews
  • the favicon, located at /static/favicon/wikipedia.ico, even if they have images disabled (tested on Chrome 81)
  • CSS styles, which are loaded via ResourceLoader, for example:
<link rel="stylesheet" href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector">

If the user does not have JavaScript disabled, a request is made (via ResourceLoader) for the page's JS – e.g.

<script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector"></script>

If the user does not have images disabled AND they're using one of the skins above, we can check for requests to certain images such as:

  • the "Powered by MediaWiki" image in the footer of Vector and Monobook (e.g. /static/images/poweredby_mediawiki_132x47.png)
  • the Wikipedia wordmark (/static/images/mobile/copyright/wikipedia-wordmark-en.svg) in the Minerva skin

We can estimate how many visits have JS and images enabled by counting, among clients (not users) with pageviews, how many also request the JS and images versus those that request only the CSS & favicon:
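
The per-client flag logic in the query below – the `SUM(IF(condition, 1, 0)) > 0` pattern – can be sketched in pandas on a toy request log (the rows here are invented for illustration):

```python
import pandas as pd

# Hypothetical per-request rows; the content_type values mirror those checked in the query.
reqs = pd.DataFrame({
    'pseudo_client_id': ['a', 'a', 'a', 'b', 'b'],
    'content_type': ['text/html', 'text/css', 'text/javascript', 'text/html', 'text/css'],
})

# A client gets a True flag for a resource type if it requested that type at least once,
# analogous to SUM(IF(..., 1, 0)) > 0 per group in the SQL.
flags = (
    reqs.assign(
        html=reqs['content_type'].eq('text/html'),
        css=reqs['content_type'].eq('text/css'),
        js=reqs['content_type'].eq('text/javascript'),
    )
    .groupby('pseudo_client_id')[['html', 'css', 'js']]
    .any()
)
print(flags)
# Client 'a' has html/css/js all True; client 'b' never requested JS, so its js flag is False.
```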

query = Template("""WITH file_requests AS (
    SELECT
      access_method, concat(user_agent, client_ip) AS pseudo_client_id,
      SUM(IF(regexp_like(content_type, 'text/html'), 1, 0)) > 0 AS html, -- aka page view
      SUM(IF(content_type IN('image/png', 'image/svg+xml'), 1, 0)) > 0 AS image, -- "powered by" (png) or wordmark (svg; the standard MIME type is image/svg+xml)
      SUM(IF(content_type = 'image/vnd.microsoft.icon', 1, 0)) > 0 AS favicon, -- requested even if images are blocked
      SUM(IF(regexp_like(content_type, 'text/css'), 1, 0)) > 0 AS css, -- site.styles loaded by ResourceLoader
      SUM(IF(regexp_like(content_type, 'text/javascript'), 1, 0)) > 0 AS js -- startup loaded by ResourceLoader
    FROM wmf.webrequest
    WHERE year = {{ year }} AND month = {{ month }} AND day = {{ day }} AND hour = {{ hour }}
      AND webrequest_source = 'text'
      AND access_method IN('desktop', 'mobile web')
      AND agent_type = 'user'
      AND (
        -- Either an enwiki:Main Page pageview:
        (is_pageview AND namespace_id = 0 AND page_id = 15580374)
        -- or a resource requested on visit to that page:
        OR (
          http_status IN('200', '304')
          AND uri_host IN('en.wikipedia.org', 'en.m.wikipedia.org')
          AND referer IN('https://en.wikipedia.org/wiki/Main_Page', 'https://en.m.wikipedia.org/wiki/Main_Page') -- resources requested from visiting front page
          AND regexp_like(content_type, '(image|text)/(vnd.microsoft.icon|png|svg|css|javascript)')
          -- specifically, different types of resources:
          AND (
            uri_path IN('/static/favicon/wikipedia.ico')
            -- "powered by mediawiki" at the bottom of the vector, timeless, monobook skins or logo at the top of minerva skin:
            OR regexp_like(uri_path, '(poweredby|wikipedia-wordmark)') 
            OR (
              uri_path = '/w/load.php' -- ResourceLoader
              AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'styles'
              -- check if site.styles is one of the STYLES modules being loaded:
              AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'site.styles')
            )
            OR (
              uri_path = '/w/load.php' -- ResourceLoader
              AND url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'only') = 'scripts'
              -- check if startup is one of the SCRIPTS modules being loaded:
              AND regexp_like(url_extract_parameter(concat('https://', uri_host, uri_path, uri_query), 'modules'), 'startup')
            )
          )
        )
      )
      GROUP BY access_method, concat(user_agent, client_ip)
)
SELECT access_method, html, image, favicon, css, js, COUNT(1) AS n_clients
FROM file_requests
GROUP BY access_method, html, image, favicon, css, js""")

def run_query(**kwargs):
    cursor.execute(query.render(**kwargs))
    rows = cursor.fetchall()
    column_descriptions = cursor.description
    df = pd.DataFrame(rows)
    df.columns = [c[0] for c in column_descriptions]
    return df.sort_values(by = ['access_method', 'html', 'image', 'favicon', 'css', 'js'])

Test run

result = run_query(year = 2020, month = 4, day = 20, hour = 12)
result
access_method html image favicon css js n_clients
1 desktop False False False False False 17422
20 desktop False False False False True 89
16 desktop False False False True False 110
34 desktop False False False True True 399
15 desktop False False True False False 14639
9 desktop False False True False True 2
8 desktop False False True True False 2
23 desktop False False True True True 3
21 desktop False True False False False 18
6 desktop False True False False True 2
33 desktop False True False True False 9
5 desktop False True False True True 27
30 desktop False True True False False 1
13 desktop False True True True True 1
18 desktop True False False False False 28289
32 desktop True False False False True 767
28 desktop True False False True False 900
27 desktop True False False True True 25732
29 desktop True False True False False 66
22 desktop True False True False True 7
3 desktop True False True True False 5
10 desktop True False True True True 1355
36 desktop True True False False False 56
25 desktop True True False False True 30
4 desktop True True False True False 196
12 desktop True True False True True 15335
26 desktop True True True False False 3
11 desktop True True True False True 3
0 desktop True True True True False 8
14 desktop True True True True True 4977
17 mobile web False False False False True 2693
2 mobile web False False False True False 7
31 mobile web False False False True True 26
19 mobile web True False False False False 21395
35 mobile web True False False False True 67398
7 mobile web True False False True False 2
24 mobile web True False False True True 51

Okay, so it's not entirely clear what it means for clients to request resources like images, the favicon, CSS, and JS but not the HTML for the page (which is what would have pointed the browser at those resources); one possibility is that the HTML was served from the browser's cache while the resources were re-requested. But this is a start...

(
    result.query('html')
    .groupby(['access_method', 'js'])
    .agg({'n_clients': 'sum'})
    .groupby(level = 0)
    .transform(lambda x: 100 * x / x.sum())
)
n_clients
access_method js
desktop False 37.981963
True 62.018037
mobile web False 24.083245
True 75.916755

Some interesting findings from just one hour of data from 12PM-1PM on April 20th:

  • on desktop, about 62% of clients with pageviews requested JS
  • on mobile web, about 76% requested JS

Full run

Let's look at the full 24 hours for each day in a week of traffic:

  • Thursday, April 23
  • Friday, April 24
  • Saturday, April 25
  • Sunday, April 26
  • Monday, April 27
  • ...
  • Thursday, April 30
if Path('js_requests.csv').exists():
    results = pd.read_csv('js_requests.csv')
else:
    results = []
    dates = pd.date_range(start = "2020-04-23", end = "2020-04-30").to_pydatetime()
    for d in dates:
        yyyy, mm, dd = d.year, d.month, d.day
        print('processing %s' % (d.date().isoformat()))
        start, end = time.time(), time.time()
        dfs = []
        for hh in range(24):
            df = run_query(year = yyyy, month = mm, day = dd, hour = hh)
            df['date'] = d.date().isoformat()
            df['dt'] = d.isoformat()
            dfs.append(df)
            prev, end = end, time.time()
            print('processed hour %d of 24 in %f minutes' % (hh + 1, (end - prev) / 60))
        dfs = pd.concat(dfs)
        dfs = dfs.groupby(['date', 'access_method', 'html', 'image', 'favicon', 'css', 'js'], as_index = False).aggregate({ "n_clients": "sum"})
        results.append(dfs)
        print('processed %s in %f hours' % (d.date().isoformat(), (end - start) / 3600))

    results = pd.concat(results)
    # save all the data:
    results.to_csv('js_requests.csv', index = False)
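
Since a single day can take hours to query, a per-day cache – a hypothetical variant of the all-or-nothing js_requests.csv check above, with made-up filenames – would let the loop resume after an interruption:

```python
from pathlib import Path
import pandas as pd

def cached_day(date_str, fetch):
    """Load one day's results from a per-day CSV if present; otherwise
    compute them via fetch() and cache. `fetch` is a zero-argument callable
    returning a DataFrame (e.g. the 24-hour inner loop above for one day)."""
    path = Path('js_requests_%s.csv' % date_str)  # hypothetical per-day cache file
    if path.exists():
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df
```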

Note: even using the blazing-fast Presto, the querying took 14.6 hours total with the following times:

Data date  | Average time per hour of data | Median time per hour of data | Total time to query 24 hours of data
2020-04-23 | 2.7 minutes                   | 2.7 minutes                  | 1.1 hours
2020-04-24 | 2.9 minutes                   | 2.9 minutes                  | 1.2 hours
2020-04-25 | 4.0 minutes                   | 3.8 minutes                  | 1.6 hours
2020-04-26 | 4.9 minutes                   | 5.0 minutes                  | 2.0 hours
2020-04-27 | 7.4 minutes                   | 7.3 minutes                  | 3.0 hours
2020-04-28 | 13.2 minutes                  | 11.7 minutes                 | 5.3 hours
2020-04-29 | 0.6 minutes                   | 0.6 minutes                  | 0.3 hours
2020-04-30 | 0.8 minutes                   | 0.6 minutes                  | 0.3 hours
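
The summary statistics in the table come from per-hour timings like those printed by the loop above; this sketch uses made-up hourly durations to show the arithmetic:

```python
import numpy as np

# Hypothetical per-hour query durations in minutes for one day of data (24 values).
per_hour_minutes = np.array([2.5, 2.9, 2.7] * 8)

avg = per_hour_minutes.mean()           # average minutes per hour of data
med = np.median(per_hour_minutes)       # median minutes per hour of data
total_hours = per_hour_minutes.sum() / 60  # total time to query all 24 hours

print('%.1f min avg, %.1f min median, %.1f hours total' % (avg, med, total_hours))
```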

Results

Note: Full results dataset can be downloaded from T251464#6104955

proportions = (
    results.query('html')
    .groupby(['access_method', 'date', 'js'])
    .agg({'n_clients': 'sum'})
    .groupby(level = [0, 1])
    .transform(lambda x: 100 * x / x.sum()) # yields % w/ JS and w/o JS on each day, within each access_method
)
proportions['n_clients'] = proportions['n_clients'].map('{:,.2f}%'.format)
proportions.unstack()
n_clients
js False True
access_method date
desktop 2020-04-23 37.26% 62.74%
2020-04-24 36.81% 63.19%
2020-04-25 39.37% 60.63%
2020-04-26 38.62% 61.38%
2020-04-27 36.97% 63.03%
2020-04-28 37.15% 62.85%
2020-04-29 37.45% 62.55%
2020-04-30 36.93% 63.07%
mobile web 2020-04-23 24.88% 75.12%
2020-04-24 24.78% 75.22%
2020-04-25 24.84% 75.16%
2020-04-26 24.87% 75.13%
2020-04-27 24.70% 75.30%
2020-04-28 24.47% 75.53%
2020-04-29 24.52% 75.48%
2020-04-30 24.66% 75.34%
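
As a sanity check against the rough 2:1 and 3:1 ratios quoted at the top of the page, the percentages from the first day of the table above convert to JS:no-JS odds like so:

```python
# Representative percentages taken from the 2020-04-23 rows of the table above.
desktop_js, desktop_no_js = 62.74, 37.26   # desktop
mobile_js, mobile_no_js = 75.12, 24.88     # mobile web

print(round(desktop_js / desktop_no_js, 1))  # ~1.7, i.e. roughly 2:1 on desktop
print(round(mobile_js / mobile_no_js, 1))    # ~3.0, i.e. 3:1 on mobile web
```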