Research talk:Automated classification of edit quality/Work log/2017-05-04

Thursday, May 4, 2017

Today, I'm exploring an issue reported by the Collab team. Apparently, there's very little overlap between edits that score as "goodfaith" and as "damaging" on English Wikipedia, but other wikis have enough overlap to target goodfaith newcomers who are running into trouble.

In order to examine this problem, I gathered a random sample of 10k edits from recentchanges in enwiki. I filtered out bot edits because those are uninteresting for recentchanges patrolling. Here's my query: https://quarry.wmflabs.org/query/18386

Now I'm working on a script that uses ores.api.Session to query the live ORES service and score the sample of edits. For now, the script lives in my little analysis repo, but we should probably add it to ORES as a utility soon.

"""
Scores a set of revisions

Usage:
    score_revisions (-h|--help)
    score_revisions <ores-host> <context> <model>...
                    [--debug]
                    [--verbose]

Options:
    -h --help    Prints this documentation
    <ores-host>  The host name for an ORES instance to use in scoring
    <context>    The name of the wiki to execute model(s) for
    <model>      The name of a model to use in scoring
"""
import json
import logging
import sys

import docopt
from ores import api

logger = logging.getLogger(__name__)


def main():
    args = docopt.docopt(__doc__)

    logging.basicConfig(
        level=logging.INFO if not args['--debug'] else logging.DEBUG,
        format='%(asctime)s %(levelname)s:%(name)s -- %(message)s'
    )

    ores_host = args['<ores-host>']
    context = args['<context>']
    model_names = args['<model>']
    verbose = args['--verbose']

    # Read newline-delimited JSON revision documents from stdin
    rev_docs = [json.loads(l) for l in sys.stdin]

    run(ores_host, context, model_names, rev_docs, verbose)


def run(ores_host, context, model_names, rev_docs, verbose):
    session = api.Session(ores_host, user_agent="ahalfaker@wikimedia.org")

    # Request scores for all of the revisions from the ORES API
    rev_ids = [d['rev_id'] for d in rev_docs]
    scores = session.score(context, model_names, rev_ids)

    # Attach each score to its revision doc and write it out as a JSON line
    for rev_doc, score_doc in zip(rev_docs, scores):
        rev_doc['score'] = score_doc
        json.dump(rev_doc, sys.stdout)
        sys.stdout.write("\n")
        if verbose:
            sys.stderr.write(".")
            sys.stderr.flush()


if __name__ == "__main__":
    main()

I ran it on my 10k sample and only got 9 errors.
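
For reference, the invocation looked roughly like this (the input file name and ORES host are assumptions; the script reads the sampled revisions from stdin and writes scored JSON lines to stdout):

$ cat enwiki.revision_sample.nonbot_10k.json | \
  python score_revisions.py https://ores.wikimedia.org enwiki damaging goodfaith \
  > enwiki.scored_revision_sample.nonbot_10k.json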

$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | wc
      9     197    2342
$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | json2tsv score.damaging.error.type
TextDeleted
TextDeleted
TimeoutError
TextDeleted
TextDeleted
TextDeleted
TextDeleted
RevisionNotFound
TimeoutError

Looks like a handful of deleted revisions and a couple of timeouts. The data looks good, so I'll be working with it.

OK next step is to extract the fields I want into a TSV so that I can load them into R for some analysis.

$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep -v error | json2tsv rev_id score.damaging.score.probability.true score.goodfaith.score.probability.true --header | head
rev_id	score.damaging.score.probability.true	score.goodfaith.score.probability.true
778068153	0.0341070039332118	0.9631482216073363
778323385	0.06079271012144102	0.9183819888507275
774264535	0.018699456923994003	0.9848181505213502
774896131	0.32644924496861927	0.5472383417030015
775918221	0.12748914158045266	0.8296519735326966
775977649	0.05609497811177157	0.8352973506092333
775539875	0.01176361409844698	0.9837210953518821
777263348	0.5899814608767912	0.5644538254856134
776059314	0.02054486212356617	0.9772033930188049

OK that looks good. Time for some analysis. --EpochFail (talk) 17:38, 4 May 2017 (UTC)

Analysis

OK! I've got the gist of what's going on. See my code here: https://github.com/halfak/damaging-goodfaith-overlap

[Figure] Density of predictions. Damaging and goodfaith ORES score densities are plotted for a random sample of edits from English Wikipedia.

[Figure] Prediction pairs scatter-plot. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia.

We can see from these plots that, while each model's scores often reach the extremes, there's little overlap where both models score extremely high (or extremely low) at the same time.

[Figure] High probability pairs. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia where damaging >= 0.879 and goodfaith >= 0.86 (both very high probability). No points means no overlap.

[Figure] Moderate probability pairs. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia where damaging >= 0.398 and goodfaith >= 0.601 (both moderate probability).

#High probability pairs makes the issue plain: there's simply no overlap at the confidence levels the Collab team told me they expect (damaging min_precision=0.6, goodfaith min_precision=0.99). However, if I relax the damaging threshold to a more moderate rule (damaging min_recall=0.75, goodfaith min_precision=0.99), I get some results, as can be seen in #Moderate probability pairs.
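
To make the overlap concrete, here's a minimal sketch (in Python/pandas rather than the R code used for the plots) of how the counts at the two threshold pairs could be checked against the extracted TSV. The TSV file name is an assumption; the probability cutoffs are the ones from the plot captions above.

import pandas as pd

# Load the per-revision scores extracted with json2tsv (assumed saved to a file)
scores = pd.read_csv("enwiki.scored_revision_sample.nonbot_10k.tsv", sep="\t")
damaging = scores["score.damaging.score.probability.true"]
goodfaith = scores["score.goodfaith.score.probability.true"]

# Strict cutoffs (from the "High probability pairs" plot): essentially no overlap
strict_overlap = ((damaging >= 0.879) & (goodfaith >= 0.86)).sum()

# Moderate cutoffs (from the "Moderate probability pairs" plot): some overlap
moderate_overlap = ((damaging >= 0.398) & (goodfaith >= 0.601)).sum()

print("strict:", strict_overlap, "moderate:", moderate_overlap)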

OK, but is the moderate cross-section useful for anything? Let's check! The following table is a random sample of edits that meet the moderate pair thresholds, with my annotations (a sketch of how such a sample could be drawn follows the table):

{| class="wikitable"
! revision !! damaging proba !! goodfaith proba !! notes
|-
| en:Special:Diff/776491504 || 0.4011175 || 0.6523189 || maybe damaging, goodfaith (newcomer, mobile edit)
|-
| en:Special:Diff/776561939 || 0.5577317 || 0.6381191 || maybe damaging, goodfaith (anon)
|-
| en:Special:Diff/773901225 || 0.4808844 || 0.6326436 || not damaging, goodfaith (anon)
|-
| en:Special:Diff/776192598 || 0.5090065 || 0.7602717 || not damaging, goodfaith (anon)
|-
| en:Special:Diff/775184319 || 0.5168659 || 0.6679756 || not damaging, goodfaith (anon)
|-
| en:Special:Diff/776909321 || 0.4109281 || 0.8508490 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/773839838 || 0.4705899 || 0.6161455 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/775681846 || 0.3980012 || 0.8870231 || not damaging, goodfaith (anon)
|-
| en:Special:Diff/777385056 || 0.4906228 || 0.6944950 || damaging, goodfaith (anon)
|-
| en:Special:Diff/775954857 || 0.4083657 || 0.7240080 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/778629261 || 0.4156775 || 0.7470698 || not damaging, goodfaith (anon)
|-
| en:Special:Diff/777972078 || 0.4976089 || 0.6170718 || not damaging, goodfaith (newcomer)
|-
| en:Special:Diff/776171391 || 0.5123592 || 0.8396888 || not damaging, goodfaith (anon, counter-vandalism)
|-
| en:Special:Diff/775954413 || 0.3981722 || 0.6712455 || damaging, goodfaith (anon)
|-
| en:Special:Diff/774703855 || 0.4264561 || 0.7632287 || not damaging, goodfaith (anon, adding category)
|-
| en:Special:Diff/777069077 || 0.4241885 || 0.6990100 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/777864924 || 0.4098085 || 0.6073056 || not damaging, goodfaith (anon, counter-vandalism)
|-
| en:Special:Diff/774911971 || 0.4021984 || 0.6594416 || damaging, goodfaith (anon, misplaced talk post)
|-
| en:Special:Diff/775082597 || 0.6174247 || 0.6371081 || damaging, goodfaith (anon, misplaced talk post)
|-
| en:Special:Diff/778161116 || 0.4311144 || 0.6327798 || not damaging, goodfaith (newcomer)
|-
| en:Special:Diff/776781184 || 0.4929796 || 0.6192534 || damaging, goodfaith (newcomer, BLP)
|-
| en:Special:Diff/774472865 || 0.4664499 || 0.6066368 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/774799454 || 0.4839814 || 0.7210619 || damaging, goodfaith (anon)
|-
| en:Special:Diff/775569040 || 0.5607529 || 0.6193204 || damaging, goodfaith (newcomer)
|-
| en:Special:Diff/775292667 || 0.4404379 || 0.8778261 || damaging, goodfaith (anon, failing to fix table)
|-
| en:Special:Diff/775535192 || 0.4850735 || 0.6673567 || damaging, goodfaith (anon)
|-
| en:Special:Diff/775352387 || 0.4932909 || 0.6775150 || damaging, goodfaith (anon)
|-
| en:Special:Diff/776968902 || 0.4367727 || 0.6644402 || not damaging, goodfaith (anon, mobile)
|-
| en:Special:Diff/776072339 || 0.5684984 || 0.6742460 || damaging, maybe badfaith (anon)
|-
| en:Special:Diff/776084132 || 0.4516739 || 0.8753995 || damaging, goodfaith (newcomer-ish)
|}
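
A sketch of how a sample like the one above could be drawn from the scored data (the actual sample was drawn as part of the analysis linked above, so this is an illustration only; the TSV file name and random seed are assumptions):

import pandas as pd

# Load the TSV extracted with json2tsv above (assumed to have been saved to a file)
scores = pd.read_csv("enwiki.scored_revision_sample.nonbot_10k.tsv", sep="\t")

# Keep only edits that meet the moderate pair thresholds
moderate = scores[
    (scores["score.damaging.score.probability.true"] >= 0.398) &
    (scores["score.goodfaith.score.probability.true"] >= 0.601)
]

# Draw 30 edits to hand-annotate and print their diff links
for rev_id in moderate["rev_id"].sample(30, random_state=0):
    print("en:Special:Diff/{0}".format(rev_id))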

So that looks pretty useful. My recommendation: Don't set such strict thresholds. Models will still be useful at lower levels of confidence. --EpochFail (talk) 18:48, 4 May 2017 (UTC)
