Research and Decision Science/Data glossary/Clickthrough Rate

This document was created as part of Signals and Data Services Objective 2 Key Result (KR) 2 from the Wikimedia Foundation's 2024-2025 Annual Plan. The KR focused on developing a set of product and experimentation-focused essential metrics. This document defines three (3) ways to measure clickthrough rate (CTR), each with its own scenario/use-case where it is better suited than the others for measuring click-based user engagement with a feature or element of the user interface (e.g. link or button): overall CTR, average CTR, and unique CTR. This document also provides example queries (written in Spark SQL and Presto) and specifies some requirements and recommendations regarding the implementation (measurement) of the metrics, such as taking element visibility into consideration. Finally, this document also recommends methodologies for analyzing the CTRs in the context of measuring baselines with confidence intervals and evaluating experiments (A/B tests).

Glossary

Unit
A single source of related interactions such as a session, user, or install.
Impression
When an element (such as a link or a button) is loaded and shown to the user.
Click
When the user clicks on the shown element, including tapping on a touch screen.
Engagement
When the user interacts with content in ways that include clicking but are not exclusively clicking, such as typing text, scrolling, or hovering over elements.
CTR
Clickthrough rate, usually represented as a percentage (rather than a value between 0 and 1).

Metric definitions


We use three different definitions/notions of a clickthrough rate (CTR) for interface elements and features.

There are other contexts where we are interested in more specific CTRs, such as when the user is searching for content and is shown search results or recommendations that they can click on or is presented with additional search features that they can interact with.

Overall clickthrough rate


Or just "clickthrough rate":

\[ \mathrm{CTR}_{\text{overall}} = \frac{\text{total number of clicks}}{\text{total number of impressions}} \]

This is the simplest one to track and calculate, and it requires no unit identifier (e.g. session ID, app install ID, user ID). You simply count the total number of click events recorded and divide it by the count of impression events recorded:

SELECT
  SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS ctr
FROM user_events -- or whichever table the instrument's events land in
WHERE action IN ('click', 'impression')

This metric is very sensitive to the presence of automated agents (bots) in the data. An inflation of impressions, or an inflation of both impressions and clicks, will substantially affect the estimates. When it is possible to include additional information about the unit (such as a session ID or install/user ID), the average clickthrough rate is preferred.

Average clickthrough rate


In some cases we may want to measure a CTR that is more robust to potential automated agent (bot) activity, which may:

  • generate substantially more impressions than click-throughs (e.g. none), which would cause us to underestimate the actual CTR if using the overall method
  • generate a perfect, one-to-one ratio of impressions to click-throughs, which would cause us to overestimate the actual CTR if using the overall method

We can account for this potential behavior by calculating the CTR at the unit level (such as session-by-session or user-by-user) and then calculating the average of those per-unit CTRs. This makes for a substantially more resilient/robust measurement when the data contains bots.

\[ \mathrm{CTR}_{\text{average}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{clicks}_i}{\text{impressions}_i} \]

where \(n\) is the number of units and \(\text{clicks}_i\) and \(\text{impressions}_i\) are the counts for unit \(i\).

-- Average across sessions:
WITH per_session_ctrs AS (
  SELECT
    session_id,
    SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS ctr
  FROM user_events -- or whichever table the instrument's events land in
  WHERE action IN ('click', 'impression')
  GROUP BY session_id
)

SELECT
  AVG(ctr) AS average_ctr
FROM per_session_ctrs

The proportion of automated agents to non-automated agents will impact how influential their individual CTRs are in the calculation of the group CTR, and the hope is that the volume of non-automated agents vastly outweighs the volume of automated ones.

Unique clickthrough rate


In some cases we may not care about the total number of times content was seen and then clicked on, but rather how many sessions or users saw the content and then what proportion of them clicked on it after seeing it, like the first stage of a multi-stage funnel.

It's especially useful in a conversion scenario where the user can convert only once during the unit's lifetime, such as registering a new account or signing up to participate in an organized event. So we may, for example, be interested in capturing the notion of "how many users demonstrate intent to sign up for an account" by measuring unique CTR on a Create Account link/button.

This is fundamentally different from both the overall CTR and the average CTR.

\[ \mathrm{CTR}_{\text{unique}} = \frac{\text{number of units with at least one click}}{\text{number of units with at least one impression}} \]

WITH sessions_that_clicked AS (
  SELECT DISTINCT
    session_id,
    true AS clicked
  FROM user_events -- or whichever table the instrument's events land in
  WHERE action = 'click'
), sessions_that_saw AS (
  SELECT DISTINCT session_id
  FROM user_events
  WHERE action = 'impression'
), sessions AS (
  SELECT
    i.session_id,
    COALESCE(clicked, false) AS click_through
  FROM sessions_that_saw i
  LEFT JOIN sessions_that_clicked c
    ON i.session_id = c.session_id
)
SELECT
  SUM(IF(click_through, 1, 0)) / COUNT(1) AS unique_ctr
FROM sessions

Measuring


Data collection (instrumentation)


Data contract


In addition to action, the instrument needs to record element_friendly_name as we are interested in calculating clickthrough rate (whether overall, average, or unique) for specific elements (such as a particular link or button) shown to the user.

If the user is enrolled in any experiments (A/B tests), the instrument needs to record that information in experiments (cf. T368326).
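As an illustration of how these fields feed into the metrics, a minimal Spark SQL sketch is shown below; action and element_friendly_name are the fields described above, while the table name events is a placeholder, not an actual dataset name:

-- Overall CTR per UI element (sketch); `events` is a hypothetical table name
SELECT
  element_friendly_name,
  SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS overall_ctr
FROM events
WHERE action IN ('click', 'impression')
GROUP BY element_friendly_name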

Element visibility


Whenever possible, "impression" should correspond to content actually shown to the user (e.g. the DOM element is visible in the browser viewport; the View is visible on Android). This matters for buttons/links that are below the fold and thus may be clicked substantially less often simply because they are seen less often.

An "impression" event should not automatically fire upon the instrument's initialization (which should emit an "init" event instead) but rather only if the attached UI element is actually visible.

Example: If there are 10 elements on a page with CTR instrumentation attached and 8 are visible right away and 2 will be visible if the user scrolls down, then there would be 8 impression events right away and 2 impression events if the user scrolls down.

If we do not implement this requirement, the accuracy and reliability of CTR measurements suffer, and it becomes impossible to compare CTR for elements above the fold with CTR for elements below the fold. However, in an A/B test setting this requirement matters less: if the instrumentation is flawed (in that it does not account for element visibility when emitting an impression event) but is consistently and universally flawed across all experiment subjects/groups, we can still measure differences in CTR between experiment groups.

Standalone instrumentation


Clicks and impressions must come from the same instrument. We should not combine clicks from one instrument and impressions from another instrument.

If we have JavaScript code that sends a "click" event when a link or a button on a page is clicked, we should not use the Webrequest-based count of pageviews for that page as our count of impressions. Instead, the JavaScript code should send an "impression" event. This is particularly important because server-side instrumentation of impressions is immune to No-JS (JavaScript disabled in the browser) and to ad/tracker-blockers, which can and do block intake-analytics.wikimedia.org, whereas client-side click events are not; pairing server-side impressions with client-side clicks would therefore deflate the measured CTR.

We also should not rely on multiple instruments because those instruments can be loaded and behave differently from one another. If we have a modular, instrumented feature and each module is instrumented separately, each module's instrument should still record impressions even if the overall feature's instrument already records impressions.

If there are multiple elements on the page (or screen) that we attach a clickthrough tracking instrument to, we would send an impression event for each of those elements. We would not "re-use" impression events.

No-JS support


It should be noted that for technical reasons we are only able to instrument click-throughs client-side, not server-side. This means that on the web we would not be able to measure CTR among users who have JavaScript disabled, or to robustly measure CTR for links/buttons when an ad/tracker-blocker is present.

Note: we already have plenty of features that require JS (e.g. if they're using Codex for UI), so it is reasonable to focus on insights about users who have JS enabled and exclude a very small proportion of users who don't. Refer to No-JavaScript notes for more information.

Example queries


This section provides queries written in Spark SQL and Presto dialects/implementations of SQL. The queries use a demo table called user_events with simulated data that looks like:

Example event data
| user_id (string)                 | assigned (string) | session_id (string) | event_id (string)      | action (string) |
|----------------------------------|-------------------|---------------------|------------------------|-----------------|
| 9d607a663f3e9b0a90c3c8d4426640dc | a                 | 02a8effc4e091311    | 0616f417fa837c00067ca0 | impression      |
| 9d607a663f3e9b0a90c3c8d4426640dc | a                 | 02a8effc4e091311    | 8ee22e2c2f64ca97275c3  | click           |
| 894f782a148b33af1e39a0efed952d69 | b                 | 9a8fc1bc918b49ce    | 213c9131ba603cc645663  | impression      |

Where assigned is a value that would be in the experiments.assigned array in actual data collected with a Metrics Platform-based instrument.

Overall CTR

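A minimal Spark SQL sketch against the simulated user_events table above, grouping by the assigned variant so we get one overall CTR per experiment group (the Presto version is analogous, though in Presto one operand would need to be cast to DOUBLE to avoid integer division):

-- Overall CTR per experiment variant (Spark SQL)
SELECT
  assigned,
  SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS overall_ctr
FROM user_events
WHERE action IN ('click', 'impression')
GROUP BY assigned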

Average CTR


These example queries use session as the analysis unit, but other units such as install or user can be used if needed and if appropriate identifiers/tokens are available.
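A minimal Spark SQL sketch against the simulated user_events table above, averaging per-session CTRs within each experiment variant (the Presto version is analogous):

-- Average per-session CTR per experiment variant (Spark SQL)
WITH per_session_ctrs AS (
  SELECT
    assigned,
    session_id,
    SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS ctr
  FROM user_events
  WHERE action IN ('click', 'impression')
  GROUP BY assigned, session_id
)
SELECT
  assigned,
  AVG(ctr) AS average_ctr
FROM per_session_ctrs
GROUP BY assigned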

Unique CTR


These example queries use session as the analysis unit, but other units such as install or user can be used if needed and if appropriate identifiers/tokens are available.
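A minimal Spark SQL sketch against the simulated user_events table above, computing the proportion of sessions with at least one impression that also recorded at least one click, within each experiment variant (the Presto version is analogous):

-- Unique per-session CTR per experiment variant (Spark SQL)
WITH sessions_that_saw AS (
  SELECT DISTINCT assigned, session_id
  FROM user_events
  WHERE action = 'impression'
), sessions_that_clicked AS (
  SELECT DISTINCT session_id
  FROM user_events
  WHERE action = 'click'
)
SELECT
  i.assigned,
  SUM(IF(c.session_id IS NOT NULL, 1, 0)) / COUNT(1) AS unique_ctr
FROM sessions_that_saw i
LEFT JOIN sessions_that_clicked c
  ON i.session_id = c.session_id
GROUP BY i.assigned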

Analysis recommendations


Baseline estimation


Confidence intervals (CIs) for both overall CTR and unique CTR can be calculated using the binomial proportion confidence interval method:

\[ \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \]

where \(\hat{p}\) is the sample proportion (the CTR) and \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of a standard normal distribution. For example, for a 95% CI (\(\alpha = 0.05\)), \(z_{1-\alpha/2}\) would be 1.96. In the case of overall CTR, \(n\) is the total number of impressions. In the case of unique CTR, \(n\) is the total number of units where at least 1 impression was recorded.
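As a hypothetical worked example (the counts are illustrative, not real data): with 1,000 clicks out of 20,000 impressions, the overall CTR is \(\hat{p} = 0.05\) and the 95% CI is

\[ 0.05 \pm 1.96 \sqrt{\frac{0.05 \times 0.95}{20000}} \approx 0.05 \pm 0.003, \]

i.e. roughly 4.7% to 5.3%.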

The CI for average CTR can be calculated via:

\[ \bar{x} \pm z_{1-\alpha/2} \frac{s}{\sqrt{n}} \]

where \(\bar{x}\) is the sample mean (average CTR), \(s\) is the sample standard deviation (calculated from per-unit CTRs, just like the sample mean), \(n\) is the total number of units (e.g. sessions) used in the calculation, and \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of a standard normal distribution (same as above).
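Continuing with hypothetical numbers: if the average CTR across 10,000 sessions is \(\bar{x} = 0.058\) with sample standard deviation \(s = 0.048\) (the same baseline values used in the planning example below), the 95% CI is

\[ 0.058 \pm 1.96 \times \frac{0.048}{\sqrt{10000}} \approx 0.058 \pm 0.0009, \]

i.e. roughly 5.71% to 5.89%.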

Planning experiments


Since unique CTR, like overall CTR, is a proportion, the power analysis would be the same. In both cases we would use the standard effect size for proportions – Cohen's h.
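For reference, Cohen's h between two proportions \(p_1\) and \(p_2\) is

\[ h = \left| 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} \right| \]

so, for example, \(p_1 = 0.11\) and \(p_2 = 0.10\) give \(h \approx 0.0326\), the value used below.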

For a change in overall CTR from 10% to 11%, Cohen's h would be 0.0326. To estimate the sample size we need to reach 80% power (with 5% significance level):

library(pwr) # power analysis functions: pwr.2p.test(), pwr.t.test()

pwr.2p.test(
    h = 0.0326,
    power = 0.8,
    alternative = "greater"
)

We would need at least 11,635 impressions in total (across both groups) to be able to detect a minimum improvement of 1pp.

Tip: In practice we would be dealing with very, very small clickthrough rates and correspondingly very small improvements – and thus, very small effect sizes. For example, the effect size – as measured by Cohen's h – between a 0.1% CTR and a 0.11% CTR (a 10% relative improvement, or a 0.01pp increase) is 0.0031. The number of impressions required in this case would be 1.30 M.

Average CTR, on the other hand, uses Cohen's d as the effect size. Suppose the baseline average CTR is 5.8% (with a 4.8% standard deviation). To detect a 0.1pp increase (a 1.72% relative increase), and assuming the same standard deviation in the treatment group (the one receiving a different experience), we would want to detect a Cohen's d of 0.02083.
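As a reminder, Cohen's d here is the difference in means divided by the (assumed common) standard deviation:

\[ d = \frac{\bar{x}_{\text{treatment}} - \bar{x}_{\text{control}}}{s} = \frac{0.059 - 0.058}{0.048} \approx 0.0208 \]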

pwr.t.test(
  d = ((0.059 - 0.058) / 0.048),
  power = 0.95,
  alternative = "greater"
)

To detect that effect size with 95% power we would need about 25K sessions per group (roughly 50K in total) in the experiment.

Evaluating experiments


In these examples we are analyzing 3 groups:

A
First variant whose true, latent overall CTR is 15% (a 5pp increase over the baseline).
B
Second variant whose true, latent overall CTR is 11% (a 1pp increase over the baseline).
C
Control group whose true, latent overall CTR is 10%.

Analyzing proportions


Since unique CTR, like overall CTR, is a proportion, the statistical analysis would be the same.

First, we can test whether there is at least one group that is different from the others:

\[ H_0\colon p_A = p_B = p_C \qquad \text{vs.} \qquad H_1\colon \text{at least one } p \text{ differs} \]

# clicks and impressions are vectors of length 3
prop.test(
    x = clicks,
    n = impressions
)
    3-sample test for equality of proportions without continuity correction

data:  clicks out of impressions
X-squared = 111721, df = 2, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
    prop 1     prop 2     prop 3 
0.10641417 0.15366222 0.09478002 

We reject \(H_0\). Then we compare A vs C, B vs C, and A vs B:

Important: Since we are making multiple comparisons (including the one we already did), we should use Bonferroni correction, meaning that whatever \(\alpha\) we decide to use we would divide by \(m = 4\) hypotheses. So \(\alpha = 0.05\) gives us a new \(\alpha = 0.05 / 4 = 0.0125\) to use.

# clicks and impressions are named vectors of length 3
pairwise.prop.test(
  x = clicks,
  n = impressions,
  p.adjust.method = "bonferroni"
)
    Pairwise comparisons using Pairwise comparison of proportions 

data:  clicks out of impressions 

  b      a     
a <2e-16 -     
c <2e-16 <2e-16

Analyzing averages


Using ANOVA to test whether all three group means are equal to each other:

library(dplyr) # for mutate(), filter(), pull()

# session_ctrs is a data frame of per-session CTRs
session_ctrs <- session_ctrs |>
    mutate(assigned = factor(assigned, c("a", "b", "c"))) |>
    mutate(assigned = relevel(assigned, ref = "c"))

fit <- lm(ctr ~ assigned, data = session_ctrs)

anova(fit)
Analysis of Variance Table

Response: ctr
             Df Sum Sq Mean Sq F value    Pr(>F)    
assigned      2 13.691  6.8456  2383.2 < 2.2e-16 ***
Residuals 26476 76.049  0.0029                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject \(H_0\). Then we compare A vs C, B vs C, and A vs B. For demonstration purposes, only the first pairwise comparison (A vs C) is included:

# Is the average CTR in variant A better than the average CTR in the control group?
t.test(
    x = session_ctrs |>
        filter(assigned == "a") |>
        pull(ctr),
    y = session_ctrs |>
        filter(assigned == "c") |>
        pull(ctr),
    alternative = "greater"
)
    Welch Two Sample t-test

data:  pull(filter(session_ctrs, assigned == "a"), ctr) and pull(filter(session_ctrs, assigned == "c"), ctr)
t = 65.094, df = 16565, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.0525283       Inf
sample estimates:
 mean of x  mean of y 
0.11149323 0.05760311 

Note: alternative = "greater" is the alternative that x has a larger mean than y, which is why we are setting y as the control group’s clickthrough rates.