Research talk:Anonymous editor acquisition/Signup CTA experiment/Work log/2014-06-12
Thursday, June 12th
editTime to gather some data for scoping out in R.
SELECT
token_info.*,
experimental_revisions
FROM token_info
LEFT JOIN (
SELECT
wiki,
token,
COUNT(rev_id) AS experimental_revisions
FROM staging.token_revision
WHERE timestamp BETWEEN "20140519180800" AND "20140526180800"
GROUP BY wiki, token
) AS token_revision_count USING (wiki, token)
WHERE (first_user_id IS NULL OR first_user_registration > "20140519180800")
AND edit_link_clicks > 0
This is going to be a lot of data.
> SELECT -> COUNT(*) -> FROM token_info -> WHERE (first_user_id IS NULL OR first_user_registration > "20140519180800") -> AND first_event BETWEEN "20140519180800" AND "20140526180800" -> AND edit_link_clicks > 0 -> ; +----------+ | COUNT(*) | +----------+ | 690362 | +----------+ 1 row in set (1.34 sec)
But I should be able to handle it. --Halfak (WMF) (talk) 20:01, 12 June 2014 (UTC)
As expected, this data is very highly skewed. Sadly, I think this means that a geometric mean is not the best approximation that we could use. I'll need to use Mann–Whitney U test to make sure that the differences are actually significant.
> token_activity.stats wiki bucket revs.geo_mean revs.geo_se.upper revs.geo_se.lower n 1: dewiki control 0.10285178 0.10481355 0.10089349 42441 2: dewiki post-edit 0.09841916 0.10031747 0.09652413 38593 3: dewiki pre-edit 0.08392584 0.08576373 0.08209105 39331 4: enwiki pre-edit 0.11599710 0.11697464 0.11502041 198936 5: enwiki control 0.14224717 0.14329707 0.14119824 208205 6: enwiki post-edit 0.14108991 0.14216908 0.14001176 189530 7: frwiki pre-edit 0.10927743 0.11171140 0.10684879 30115 8: frwiki post-edit 0.13261317 0.13529830 0.12993438 28649 9: frwiki control 0.12999916 0.13250303 0.12750083 31714 10: itwiki pre-edit 0.12519958 0.12854587 0.12186321 20310 11: itwiki control 0.15190987 0.15557025 0.14826109 20297 12: itwiki post-edit 0.15375049 0.15764942 0.14986470 1835
Uhoh. It looks like I made a mistake -- and I think I know where it happened. The number of revisions performed by anons in the pre-edit condition is actually lower, not higher. I think that when I ran the SQL last time, I ended up generating stats on a column that had NULLs instead of zeros. I'm going to re-run an old query quick after converting NULLs to zeros and I expect to get a very different result.
SELECT
wiki,
bucket,
ROUND(EXP(AVG(LOG(IFNULL(experimental_revisions, 0)+1)))-1, 3) AS geom_mean_revisions,
SUM(experimental_revisions > 0) AS editing_clients,
SUM(experimental_revisions > 0)/COUNT(*) AS editing_prop,
COUNT(*) AS relevant_tokened_clients
FROM token_info
LEFT JOIN (
SELECT
wiki,
token,
COUNT(rev_id) AS experimental_revisions
FROM staging.token_revision
WHERE TIMESTAMP BETWEEN "20140519180800" AND "20140526180800"
GROUP BY wiki, token
) AS token_revision_count USING (wiki, token)
WHERE (first_user_id IS NULL OR first_user_registration > "20140519180800")
AND link_clicks > 0
GROUP BY wiki, bucket;
+--------+-----------+---------------------+-----------------+--------------+--------------------------+ | wiki | bucket | geom_mean_revisions | editing_clients | editing_prop | relevant_tokened_clients | +--------+-----------+---------------------+-----------------+--------------+--------------------------+ | dewiki | control | 0.103 | 3972 | 0.0936 | 42441 | | dewiki | post-edit | 0.098 | 3663 | 0.0949 | 38593 | | dewiki | pre-edit | 0.084 | 3005 | 0.0764 | 39331 | | enwiki | control | 0.142 | 25751 | 0.1237 | 208205 | | enwiki | post-edit | 0.141 | 23602 | 0.1245 | 189530 | | enwiki | pre-edit | 0.116 | 19673 | 0.0989 | 198936 | | frwiki | control | 0.130 | 3750 | 0.1182 | 31714 | | frwiki | post-edit | 0.133 | 3448 | 0.1204 | 28649 | | frwiki | pre-edit | 0.109 | 2881 | 0.0957 | 30115 | | itwiki | control | 0.152 | 2562 | 0.1262 | 20297 | | itwiki | post-edit | 0.154 | 2325 | 0.1266 | 18359 | | itwiki | pre-edit | 0.125 | 2063 | 0.1016 | 20310 | +--------+-----------+---------------------+-----------------+--------------+--------------------------+ 12 rows in set (1 min 28.65 sec)
There we do. Well... This is *bad* news I guess. Better to be right then wrongfully ignorant. --Halfak (WMF) (talk) 21:35, 12 June 2014 (UTC)
That shows the situation pretty clearly. The "productivity" as measured by edit count goes down for editors in the pre-edit condition. --Halfak (WMF) (talk) 22:25, 12 June 2014 (UTC)
And now for a quick change of pace. Steven (WMF) asked in IRC how many impressions of the CTA users generally saw.
Looks like we got a ton more impressions for the pre-edit click -- which makes perfect sense. Getting a second impression was pretty unlikely. --Halfak (WMF) (talk) 22:59, 12 June 2014 (UTC)