Notes from the Quarterly Review meeting with the Wikimedia Foundation's Editing department, July 8, 2015, 9:30 am - 11:00 am PDT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material instead.

Present (in the office): James Forrester, Terry Gilbey, Kevin LeDuc, Trevor Parscal, Roan Kattouw, Luis Villa, Geoff Brigham, Garfield Byrd, Stephen LaPorte, Neil Quinn, Danny Horn, Guillaume Paumier, Kim Gilbey, Katherine Maher, Tilman Bayer (taking minutes), Lila Tretikov, Nirzar Pangarkar, Zhou Zhou, Joel Aufrecht, Toby Negrin; participating remotely: Sherry Snyder, Niklas Laxstrom, Runa Bhattacharjee, Amir Aharoni, Mark Holmquist, Wes Moran, Matthew Flaschen, Arthur Richards, Subramanya Sastry, Pau Giner, Arlo Breault, Ed Sanders, Prateek Saxena, C Scott Ananian (joining after 10:00)

Presentation slides from the meeting

Terry: Welcome
Reminder: binary approach to goals, but misses are not necessarily a bad thing; we're very interested in learnings instead
James will lead the meeting

James:
there are 5 teams in the Editing dept, presenting separately, then a piece on KPIs

Collaboration


[slide 3]

Danny:
Last quarter (Q3) we were very ambitious, had 5 goals, didn't complete any
So this quarter we focused on doing one thing right: a full-namespace rollout on Catalan Wikipedia
First time we did this, shows that we can do this technically and culturally
Terry: Why Catalan?
Danny: They came to us first, furthest along
Roan: It's a theme with other teams and products too; the Catalan community is innovation-friendly
Lila: Measures of success post-deployment?
Danny: Functional. We have a small sample, and it's difficult to compare.
Lila: From user perspective? How is the end user's life better, and how do we measure that?
Danny: Not yet
Lila: How do you plan to measure?
Danny: Number of unique users, although it's not a great measure -- as we deploy in more places, that number goes up
Lila: ...
Danny: Average thread length could work, but it's a small sample and a couple of long threads skew the data
Lila: As general rule, have clear success measure.
Terry: Think "If this was your money, at what point would you bolt?" (disinvest)
Danny: Get user engagement in discussions.
--> Lila: Should never deploy without clear success measure.
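To make the skew problem Danny mentions concrete, a minimal sketch (with made-up thread lengths, not Flow's actual data) of how a couple of long threads pull the mean up while the median stays robust:

```python
# Sketch: why a few long threads skew "average thread length".
# thread_lengths is hypothetical data, not Flow's actual schema.
from statistics import mean, median

thread_lengths = [2, 3, 2, 4, 3, 2, 48, 2, 3, 61]  # two outlier threads

print(f"mean:   {mean(thread_lengths):.1f}")    # pulled up by the outliers
print(f"median: {median(thread_lengths):.1f}")  # robust to them
```

A median (or trimmed mean) would be one way to report thread length on a small sample without a couple of outliers dominating the number.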

 

[slide 36]

Trevor: 4 pieces of data in Appendix (slide 36ff)
number of messages posted, etc.
--> Lila: Can you mark deployment times in these charts (Appendix slides 36ff)?
--> Lila: success criteria need to be agreed on by community
Danny:

 

[slide 4]

Now we really get to dive in with experienced users, work on the most difficult and interesting features
Lila: (in published version of slides) can we include screenshots of UX elements when we talk about them, like Readership did in their quarterly review?

 

[slide 5]

Misses
Danny: LQT conversion
Lila: How many pages are converted already on mediawiki.org?
Matthew: A bit over 250.
Danny: Not too many wikis used LQT
JamesF: A dozen or so actively
Lila: LQT can be activated per page, right?
yes (Flow too)
Lila: What about mobile?
Danny: There's a version that works, and it's been tested.
Lila: Has User Research tested it? No.
Lila: Should prioritize this, how we get this to work.
Lila: can we have an example link?
Roan: https://ca.wikipedia.org/wiki/Viquip%C3%A8dia:La_taverna/Propostes

Language team


[slide 8]

Amir: Globally distributed team; most important product: CX (Content Translation).
Have a specially designed statistics page.
Want to move CX from beta to stable.
Had discussions with Catalan wiki community.
Plan this for next quarter.

 

[slide 9]

Lila: So it's now on all Wikipedias as a beta feature? Yes, as of this week
Terry: Do we have stats on whether these articles actually get read?
Amir: No, plan to do in future
--> Lila: Can you show reading stats for CX-generated articles in next quarterly review?
This is really interesting work!
Trevor: Team does keep track of deletion rate.
Lila: Higher or lower deletion rate? Much lower.
Amir: ...
Lila: Know that research team is also looking at gap analysis, which would be a great way to guide translators.
Amir: Research team's French WP experiment actually included article ranking already.

 

[slide 10]

Misses
Lila: Enable/usage percentage?
JamesF: Don't have that; because of the BetaFeatures auto-enable option, it wouldn't be very useful.
Pau: For frwiki, about 55% of the people who have it enabled have tried it out.
Amir: Team has to do its own analytics work.
Small and stretched team.
Maintenance of non-CX projects suffers.
Terry: So we are creating technical debt right now?
Amir: Yes…
Runa: Last quarter we tried to identify areas which needed support: ULS, Translate extension
Prepared a plan for this.
Added these tasks to our sprints, but couldn't keep up entirely
Trevor:
--> Terry: Can we get firm numbers on whether technical debt is accumulating or not?
Lila: Mobile aspect of CX?
Amir: Not targeting it, it's a desktop product.
--> Lila: Need to spend some cycles to think about mobile strategy for CX, even if it won't be executed for the next few quarters.
--> Readership team is thinking about splitting up articles into basic units; connect with them on this

 

[slide 11]

Lila: Increasing creation speed means?
Amir, JamesF: how many new articles are created via CX in that time period
Lila: So it's "rate" not "speed".
--> Lila: So creation rate of CX articles is key metric - we should show that key metric then.
looks good

Multimedia


[slide 13]

James: Paying down a lot of technical debt, making the infrastructure more usable.
Terry (explains): This team created a goal for two quarters.
James: The team was created a few weeks before the end of the quarter

 

[slide 14]

Successes
UW (UploadWizard) has evolved to try to support lots of bespoke community initiatives like WLM (Wiki Loves Monuments)
Pushing things into MediaWiki itself, making upload functionality available via an API
This allows WLM and similar initiatives to be done much more easily, via community/WMF gadgets/etc.
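As a rough illustration of that API-driven flow, a minimal sketch of an upload through the standard MediaWiki action API (action=upload); the endpoint, file names, and the omitted authentication step are assumptions, and a real gadget or bot would add error handling:

```python
# Sketch: uploading a file via the MediaWiki action API.
# Assumes an already-authenticated requests session (login/OAuth omitted).
import requests

API = "https://commons.wikimedia.org/w/api.php"
session = requests.Session()

# 1. Fetch a CSRF token, required for any write action.
token = session.get(API, params={
    "action": "query", "meta": "tokens", "type": "csrf", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# 2. POST the file with action=upload.
with open("monument.jpg", "rb") as f:
    result = session.post(API, data={
        "action": "upload",
        "filename": "Monument.jpg",
        "comment": "Wiki Loves Monuments upload",
        "token": token,
        "format": "json",
    }, files={"file": f}).json()
print(result)
```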
Lila: Mobile strategy? Now that phone cameras are getting really good.
Anything we do, we need to think about impact on editor first.
Lila: PM for this is?
James: Me
Lila: Sync up with Ryan from Creative Commons on their List app.
Trevor: I already talked with them about this.
They are going to help us with other things too, regarding educating users about licensing.
Terry: You mentioned the "Selfiepocalypse".
--> Terry: come up with projection for storage needs due to increased uploads
James: 3 issues to consider before enabling anything in this area: curation effort (taking editors' effort they could use on other things), cannibalizing users' time (that they could use for other contributions), and Wikimedia cost/effort (is this the best thing for us to be spending funds on, in hardware and development time?)
Lila: Could upload to private holding space, where community pulls from (instead of push)
Guillaume: This idea has been discussed for 6-7 years
Tech and policy challenges
Matt Flaschen (on hangout chat, paraphrased): Pending changes and Article feedback show that pre-publication/pre-edit backlogs can sometimes become unmanageable.
Mark Holmquist (on hangout chat, paraphrased): To maintain quality (e.g. on Commons), post-publication triage is necessary anyway.
Matt Flaschen (on hangout chat, paraphrased): There's a difference between the two, when we consider the "good images" (the ones that would pass triage). It's important whether they're stuck in pre-publication backlog, or post-publication (people can see them even if they're not fully triaged).
--> Lila: Let's pin this, schedule some time (to talk about "upload to holding space" ideas), include someone from Legal - I want to learn more about this, but we definitely need to think about other workflows
JamesF: E.g. suggestions about integrating services that scan for illegal content could also be implemented here, to reduce community patrolling effort
Terry: ...
Lila: Key success indicator?
James: Being able to upload from within the editor (instead of on Commons). Longer term, the number of articles which are illustrated with media files.

 

[slide 15]

Misses

Parsing


[slide 17]

Subbu:
roundtripping for semantic diffs
performance: edge cases - pages with large transclusions
Roan: see Appendix, linked announcement email

 

[slide 18]

Successes
Luis: so 99.82% means something like 6000 errors/month on enwiki before this quarter? (Subbu: Besides the fact that this doesn't mean that, where does 99.82% come from? We are reporting 99.95% semantic accuracy in rt-testing; the wikitech-l thread has full details.)
James: No. This is about rt-testing; there are other systems in play for VE which catch/hide all these errors in actual editing
Subbu: besides roundtripping, we also simulate trivial edits in round-trip testing, and 99.986% of pages have 0 dirty diffs in these tests (during discussion, Roan, Trevor and others clarified that the remaining 0.014% were mostly newline diffs, which are unproblematic)
--> Lila: make sure we report absolute values for errors next time
make clear that these are percentages of [test cases]
Subbu: these test cases are actual pages from production wikipedias, not synthetic tests (158K pages in all)
Trevor: so you want to know how many dirty diffs are produced on enwiki? Yes
with live edits, we don't have a systematic way to determine whether the editor made a change on purpose (from Google Hangout discussion: you would need to ask expert users what is 'right' for each edit to determine the dirty-diff rate in production, and what counts as an acceptable dirty diff)
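A minimal sketch of the round-trip testing idea being discussed, with trivial stand-ins for Parsoid's converters (parse() and serialize() here are placeholders, not Parsoid's actual API), including the whitespace-insensitive comparison that separates trivial diffs from real dirty diffs:

```python
# Sketch: round-trip testing over a corpus of pages.
# parse() and serialize() are stand-ins for Parsoid's
# wikitext -> HTML and HTML -> wikitext conversions.
from collections import Counter

def parse(wikitext: str) -> str:        # stand-in: wikitext -> HTML
    return wikitext

def serialize(html: str) -> str:        # stand-in: HTML -> wikitext;
    return html + "\n"                  # simulates a trailing-newline diff

def classify(original: str) -> str:
    roundtripped = serialize(parse(original))
    if roundtripped == original:
        return "clean"
    # A whitespace-insensitive comparison filters out the trivial
    # newline diffs mentioned above.
    if roundtripped.split() == original.split():
        return "trivial (whitespace only)"
    return "dirty"

corpus = ["== Heading ==\nSome text.", "[[Link]] and more text."]
print(Counter(classify(page) for page in corpus))
```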
Guillaume did a manual assessment (https://phabricator.wikimedia.org/T94767)
Guillaume: for 3 months (March to June) I did a manual review of edits made with VE (100 edits per week per wiki) on three wikis (English, French and Italian Wikipedia), and classified them by handcoding.
discovered several bugs that hadn't been reported
the last time I did this (end of June) I didn't discover any new bugs, and saw very few remaining bugs
Lila: sounds good, but need signal from end user
"report a problem" button within VE
leave you with that, happy that numbers are low
Roan, Trevor: keep in mind that some of these numbers include trivial issues (e.g. whitespace)
Lila: BTW, the red goal was a stretch goal - this is how it should look; teams should push themselves

 

[slide 19]

Misses
Subbu:
Terry: are you making progress on tech debt?
Subbu: yes

 

[slide 20]

Core workflows and metrics
Lila: glad that we are so vigilant on errors

VisualEditor


[slide 22]

James: per feedback, one objective, assessed over four measures

 

[slide 24]

Most important is focus on quality; everything else comes after that
Lila: Citoid was killer feature
James: yes, very positive response

 

[slide 24]

Misses
discovered some usability misses, too late (they should have been found before the features were finished) but in time to postpone the A/B test (which would have been marred by them)
I think we made the right decision to postpone
Lila: agree
--> Lila: when thinking about next phases, encourage the user research team to explore freely
Get them to teach some of those research processes to designers; that will free the team up

Performance indicators, Editing (discussion, not pertaining to Q4 goals)


Neil: started to think about KPI for editing

 

[slide 26]

 

[slide 27]

 

[slide 28]

 

[slide 29]

 

[slide 30]

 

[slide 31]

Lila: normalized for productive edits?
Neil: Aaron et al. did some work on that; they found it doesn't make a lot of difference between the metrics we've been looking at
Lila: Yes, but when tracking changes, you need to know that you're not making a difference to this (in the % of bad edits). The VE team did that; it was really important

 

[slide 32]

Neil: diversity - e.g. does a change just increase edits by 1M in one large Wikipedia, or by 10K in each of 100 smaller Wikipedias?
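One way to make that diversity point concrete (an illustration only, not a metric the team adopted) is an entropy-style measure over where edits land:

```python
# Sketch: Shannon entropy as one possible diversity measure for edits.
# 0 bits = all edits in one wiki; higher = spread across many wikis.
import math

def edit_diversity(edits_per_wiki: list[int]) -> float:
    total = sum(edits_per_wiki)
    return -sum((e / total) * math.log2(e / total)
                for e in edits_per_wiki if e > 0)

print(edit_diversity([1_000_000]))     # 0.0: 1M edits in one large wiki
print(edit_diversity([10_000] * 100))  # ~6.6: 10K in each of 100 wikis
```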
Lila: and now we have revision quality scoring
James: for enwiki only sadly
Lila: yes, but needed to start somewhere
Toby: need retention measures too
Lila: exactly
how quickly can we get an editor from first edit to #5 or such
James: even if they make 5 edits, a very large proportion of new editors (>95%) don't come back after the first month
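A minimal sketch of the retention measure under discussion, assuming a hypothetical per-editor list of edit dates (did a new editor come back after their first month?):

```python
# Sketch: one-month retention for new editors.
# Data shape (editor -> sorted edit dates) is hypothetical.
from datetime import date, timedelta

editors = {
    "A": [date(2015, 4, 1), date(2015, 4, 2), date(2015, 6, 10)],
    "B": [date(2015, 4, 5), date(2015, 4, 5)],
}

def retained(edit_dates, window=timedelta(days=30)):
    first = edit_dates[0]
    return any(d > first + window for d in edit_dates)

rate = sum(retained(dates) for dates in editors.values()) / len(editors)
print(f"retained after first month: {rate:.0%}")  # per James, <5% in practice
```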

 

[slide 33]

Neil: this is what we decided on (for the short term): global edit save rate
--> Lila: (in published version of these slides) label the axes, and call out on the slide the insight you just mentioned verbally (twice as many people save...)
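A minimal sketch of how a global edit save rate could be computed from edit-funnel events; the event shape and action names here are assumptions, not the actual EventLogging schema:

```python
# Sketch: global edit save rate = successful saves / editing sessions
# started, aggregated across all wikis. Event records are hypothetical.
events = [
    {"wiki": "enwiki", "action": "init"},
    {"wiki": "enwiki", "action": "saveSuccess"},
    {"wiki": "frwiki", "action": "init"},
    {"wiki": "cawiki", "action": "init"},
    {"wiki": "cawiki", "action": "saveSuccess"},
]

inits = sum(e["action"] == "init" for e in events)
saves = sum(e["action"] == "saveSuccess" for e in events)
print(f"global edit save rate: {saves / inits:.0%}")  # 67% on this toy data
```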

 
slide 34

[slide 34]

Neil: depth metric for a project (created in ~2006/2007 out of concern that #articles was too shallow a measure)
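For reference, the commonly cited form of that depth metric (as documented on Meta-Wiki; the numbers below are illustrative only):

```python
# Sketch: the "depth" metric as commonly defined:
# depth = (edits/articles) * (non_articles/articles) * (1 - stub_ratio)
def depth(edits, articles, non_articles, stub_ratio):
    return (edits / articles) * (non_articles / articles) * (1 - stub_ratio)

# Illustrative, made-up numbers:
print(depth(edits=800_000_000, articles=5_000_000,
            non_articles=32_000_000, stub_ratio=0.5))  # 512.0
```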

Wrapup discussion


Lila: this is so much better this quarter; I appreciate what we're getting done, and the thinking about the mission and how much knowledge is created
Garfield: what role does anon editing play currently? is it adding value?
Neil: about 33% on the English Wikipedia
there are long-term experienced anon editors, it's not just one-off users
Toby: we talked about segmentation for readers (in the Readership team's review), I think it's also relevant here
James: segmenting by logged in vs. IP editor is probably wrong approach; we should focus on what they want to do, not how they log in
Lila: (question about VE rollout timeline)
James: Right now, we're working with Ops to investigate server quality issues in production
Lila: roadmap for all wikis?
James: probably not the same approach as for enwiki for all of them; should adjust to what works for each wiki
right now the wikis without VE are the 4 big ones that said no (enwiki, dewiki, nlwiki, eswiki), and those with language support issues (e.g. kowiki, fawiki, ...)