User:Adamw/Draft/fiwiki.flaggedrevs work log
Experiment 1a: Training using Flagged Revisions as a proxy for damaging
In T166235, we did an experiment to see if a model trained on edits accepted through the flagged revisions interface would do any better at finding damaging edits than a model trained on the Wiki Labels damaging data. The results were not promising, with ROC-AUC falling from 0.954 to 0.900.
Hypothesis
Is data from the flagged revisions system of higher quality, and more relevant to the task of finding damaging edits, than the data keyed through Wiki Labels? If so, training on the flagged revisions data should give us a fitness boost.
Methodology
Zache (talk · contribs) gave us a Quarry script to find revisions approved through the Flagged Revisions system. A simplified query was eventually used to generate a list of all approved revisions, about 50,000 rows. We labeled these as good-faith and not damaging, and gave them an approved=1 label for good measure. These labeled revisions were union merged (see below) with the 15,000 Wiki Labels observations that had been reserved as a training set, and the merged file became our training data. The remaining 5,000 Wiki Labels observations were used for testing model health. No cross-validation was performed.
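Concretely, the labeling step just attaches the same three labels to every approved revision before the merge. A minimal sketch of the idea, assuming the Quarry export is a CSV with a rev_id column and that observations are stored as JSON lines; the file and column names here are hypothetical:

 import csv
 import json
 
 # Hypothetical Quarry export: one approved revision per row.
 with open("fiwiki_approved_revisions.csv") as infile, \
         open("flaggedrevs_approved.labels.json", "w") as outfile:
     for row in csv.DictReader(infile):
         observation = {
             "rev_id": int(row["rev_id"]),
             "damaging": False,   # approved, so assumed not damaging
             "goodfaith": True,   # and assumed good-faith
             "approved": True,    # the extra label, "for good measure"
         }
         outfile.write(json.dumps(observation) + "\n")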
In hindsight, these flaggedrevs-approved revisions were not quite right, because each may have been only the final edit in a chain of edits under review. This was an omission; if we repeat an experiment like this, we should query for only those final revisions whose parent revision equals the starting revision of the reviewed chain.
A model was trained using a Makefile[1] tweaked to build a second fiwiki.flaggedrevs.damaging model with the same parameters as the production fiwiki.damaging model, except that it was fed the merged labels, including the flaggedrevs-approved changes, as its source of true classifications. Here are the test results for the two models:
Current champion damaging model:

 revscoring model_info models/fiwiki.damaging.gradient_boosting.model
 ScikitLearnClassifier
  - type: GradientBoosting
  - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
  - version: 0.3.0
  - trained: 2017-06-26T03:59:29.167423
 
 Table:
          ~False    ~True
  -----  --------  -------
  False     16727     2231
  True        113      904
 
 Accuracy: 0.883
 Precision:
  -----  -----
  False  0.993
  True   0.289
  -----  -----
 Recall:
  -----  -----
  False  0.882
  True   0.89
  -----  -----
 PR-AUC:
  -----  -----
  False  0.993
  True   0.548
  -----  -----
 ROC-AUC:
  -----  -----
  False  0.95
  True   0.954
  -----  -----

Model trained on approved Flagged Revisions:

 revscoring model_info models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
 ScikitLearnClassifier
  - type: GradientBoosting
  - params: random_state=null, verbose=0, init=null, learning_rate=0.01, min_samples_split=2, subsample=1.0, warm_start=false, center=true, min_samples_leaf=1, scale=true, loss="deviance", presort="auto", min_weight_fraction_leaf=0.0, balanced_sample=false, n_estimators=700, balanced_sample_weight=true, max_features="log2", max_leaf_nodes=null, max_depth=5
  - version: 0.0.1
  - trained: 2017-07-25T20:50:13.806134
 
 Table:
          ~False    ~True
  -----  --------  -------
  False      4589      138
  True        137      121
 
 Accuracy: 0.945
 Precision:
  -----  -----
  False  0.971
  True   0.467
  -----  -----
 Recall:
  -----  -----
  False  0.971
  True   0.469
  -----  -----
 PR-AUC:
  -----  -----
  False  0.993
  True   0.437
  -----  -----
 ROC-AUC:
  -----  -----
  False  0.9
  True   0.9
  -----  -----
Two new utilities were introduced to facilitate this work:
union_merge_observations takes multiple observation files and computes a set union of any observations of the same record. For revision observations, this merges all labels applied to each revision. The tool is now available in the revscoring repo.[2]
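The idea is just to key observations by revision and take the union of their fields. A minimal sketch of the behavior, not the actual revscoring implementation, assuming JSON-lines observation files keyed by rev_id:

 import json
 import sys
 
 # Union-merge JSON-lines observation files given as arguments:
 # observations sharing a rev_id are combined into one record.
 merged = {}
 for path in sys.argv[1:]:
     with open(path) as f:
         for line in f:
             observation = json.loads(line)
             # dict.update gives union semantics; a later file can add new
             # labels (e.g. approved=true) to a revision seen earlier.
             merged.setdefault(observation["rev_id"], {}).update(observation)
 
 for observation in merged.values():
     print(json.dumps(observation))

Used as, say, "python union_merge.py wikilabels.json flaggedrevs.json > merged.json", this produces one merged observation per revision.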
normalize_column_types casts values to an expected type. In this case it was needed because Quarry outputs integer 0/1 for boolean values, while our tools expect a true JSON boolean. We threw away this version of the tool because it wasn't worth the work to canonicalize it; if we need it again one day, we may want to combine it with a data validation step.
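The cast itself is trivial; the tool's only job was applying it consistently across a file. A throwaway sketch of the idea (the column list is hypothetical):

 import json
 import sys
 
 # Columns that Quarry exports as integer 0/1 but that downstream tools
 # read as JSON booleans (hypothetical list).
 BOOLEAN_COLUMNS = ("damaging", "goodfaith", "approved")
 
 for line in sys.stdin:
     observation = json.loads(line)
     for column in BOOLEAN_COLUMNS:
         if column in observation:
             observation[column] = bool(observation[column])
     print(json.dumps(observation))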
Experiment 1b: Refine data, omit multi-revision approvals, reverted edits, and some bots
TODO in a future iteration:
- Omit all bots.
- Include approvals that are part of a multi-revision chain, if all changes are by the same author. Perhaps all revisions in the chain should be included in our data set.
- If we can break out of scoring pure revisions, the diff between the start and end of a chain is a high-confidence good edit.
Methodology
Filter to single-revision approvals
Zache pointed out that Flagged Revs is often used to merge more than one edit at a time (about 1/3 of approvals).[3] We can't be confident that all, or any, of these individual revisions are good-faith or non-damaging, only that the end product is an improvement. For example, a bad edit and its rollback might be included, and the reviewer would still approve the final article state.
I used a simple condition: the beginning revision is the parent of the end revision. See the TODO above for how to correct some nuances that I missed; specifically, multiple edits by a single user stand a reasonable chance of being desirable edits, and we should try harder to include them.
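In other words, an approval is kept only when the reviewed chain is exactly one edit long. A sketch of the check, assuming we have each approval's start and end revision IDs and a lookup of each revision's parent (the function and field names here are hypothetical):

 def is_single_revision_approval(approval, parent_of):
     # approval: dict with "start_rev_id" (the last previously-stable
     # revision) and "end_rev_id" (the revision that was approved).
     # parent_of: mapping from rev_id to its parent rev_id.
     # A one-edit chain means the approved revision was made directly on
     # top of the starting revision.
     return parent_of[approval["end_rev_id"]] == approval["start_rev_id"]
 
 # Example with hypothetical data:
 parent_of = {1002: 1001, 1003: 1002}
 approvals = [
     {"start_rev_id": 1001, "end_rev_id": 1002},  # single edit: kept
     {"start_rev_id": 1001, "end_rev_id": 1003},  # two-edit chain: dropped
 ]
 kept = [a for a in approvals if is_single_revision_approval(a, parent_of)]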
Filter out some bots
Any approvals by Zache and SeulojaBot are omitted from our set. I'm not totally clear on the reasoning, but I think these are bots reviewing other bots, and as such are edits we want to avoid.
Filter out later-reverted edits
We ran the "autolabel" script on our approved revisions and threw out anything with the "review reason" of "reverted edit". (TODO: link to an explanation of how that script works.)
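The filter is then a one-line predicate over the autolabelled output. A sketch, assuming autolabel emits JSON-lines observations carrying a review_reason field; the field name and value are taken from the description above, but the exact output format is an assumption:

 import json
 import sys
 
 # Keep only approved revisions that autolabel did not flag as reverted.
 for line in sys.stdin:
     observation = json.loads(line)
     if observation.get("review_reason") != "reverted edit":
         print(json.dumps(observation))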
Prepare for intermediate database tables
I split this query into pieces to make it easier to follow, and created a temporary table to store intermediate results. This is a bit annoying in Quarry and I ended up cheating, but the basic steps to replicate this approach are:
Create a user database to allow for intermediate tables.
editssh tools-login.wmflabs.org mysql --defaults-file=replica.my.cnf -h fiwiki.labsdb create database u4974__ores_tmp_p;
Building the results purely through Quarry might have been possible, but it would have required some extra work to allow write access to our temporary table, so I took a shortcut and ran the bulk of the queries from the console, only using Quarry to perform the fetch step.[4][5]
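If the console steps above ever need to be scripted, the same flow can be driven from Python. A rough sketch using pymysql, with connection details taken from the commands above; the query body is a placeholder, not the real approval-finding SQL (which lives in the linked fiwiki_flaggedrevs_approvals.sql):

 import pymysql
 
 # Connect to the fiwiki replica with the standard Tool Labs credentials
 # file, defaulting into the user database created above.
 connection = pymysql.connect(
     host="fiwiki.labsdb",
     read_default_file="replica.my.cnf",
     database="u4974__ores_tmp_p",
 )
 
 with connection.cursor() as cursor:
     # Placeholder intermediate step: materialize FlaggedRevs review log
     # rows into the user database for later queries to build on.
     cursor.execute("""
         CREATE TABLE IF NOT EXISTS review_log
         SELECT log_id, log_timestamp, log_params
         FROM fiwiki_p.logging
         WHERE log_type = 'review'
     """)
 connection.commit()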
We discover a data iceberg
In experiment 1a, I had missed that we were only parsing the newest approvals, those created since December 2016. Older approvals used a legacy log_params format, which wasn't picked up by our query condition. Once we relaxed the condition to include the legacy format, we gained 160,000 more approvals for our data set. The new query also filters out multi-revision approvals and some bot approvals (those by Zache and SeulojaBot, as above). Finally, we filtered out anything that was later reverted, according to the autolabel script.
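For reference, the shape of the fix: the query (or its post-processing) has to accept both log_params encodings. A sketch of a tolerant parser, under the loud assumption that newer FlaggedRevs review entries are PHP-serialized arrays while legacy entries are newline-delimited values with the revision ID first; the key name and both format details are illustrative guesses, not verified against the schema:

 import phpserialize
 
 def approved_rev_id(log_params: bytes) -> int:
     # Assumption: entries since ~December 2016 are PHP-serialized arrays
     # (the "revid" key name is a guess); legacy entries are assumed to be
     # newline-delimited with the approved revision ID first.
     try:
         params = phpserialize.loads(log_params)
         return int(params[b"revid"])
     except ValueError:
         return int(log_params.split(b"\n")[0])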
Results
Current champion damaging model:

 ScikitLearnClassifier
  - type: GradientBoosting
  - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
  - version: 0.3.0
  - trained: 2017-06-26T03:59:29.167423
 
 Table:
          ~False    ~True
  -----  --------  -------
  False     16727     2231
  True        113      904
 
 Accuracy: 0.883
 Precision:
  -----  -----
  False  0.993
  True   0.289
  -----  -----
 Recall:
  -----  -----
  False  0.882
  True   0.89
  -----  -----
 PR-AUC:
  -----  -----
  False  0.993
  True   0.548
  -----  -----
 ROC-AUC:
  -----  -----
  False  0.95
  True   0.954
  -----  -----

Model trained on approved Flagged Revisions (2nd iteration):

 ScikitLearnClassifier
  - type: GradientBoosting
  - params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
  - version: 0.0.1
  - trained: 2017-08-02T04:43:42.045973
 
 Table:
          ~False    ~True
  -----  --------  -------
  False      4588      139
  True        138      120
 
 Accuracy: 0.944
 Precision:
  -----  -----
  False  0.971
  True   0.463
  -----  -----
 Recall:
  -----  -----
  False  0.971
  True   0.465
  -----  -----
 PR-AUC:
  -----  -----
  False  0.991
  True   0.401
  -----  -----
 ROC-AUC:
  -----  -----
  False  0.878
  True   0.878
  -----  -----
References
edit- ↑ "Makefile for ORES Flagged Revisions experiment". Gist. Retrieved 2017-07-27.
- ↑ "Data utils by adamwight · Pull Request #338 · wiki-ai/revscoring". GitHub. Retrieved 2017-07-27.
- ↑ "⚓ T166235 Flagged revs approve model to fiwiki". phabricator.wikimedia.org. Retrieved 2017-08-03.
- ↑ fiwiki_flaggedrevs_approvals.sql, 2017-08-01, retrieved 2017-08-02
- ↑ "Fiwiki good diffs - Quarry". quarry.wmflabs.org. Retrieved 2017-08-02.