Research talk:Revision scoring as a service/2015
Revision handcoding (mockups)
Hey folks. I got some mockups together of a hand-coding gadget interface. What do you guys think?
--EpochFail (talk) 19:45, 18 October 2014 (UTC)
- I made some updates to the mocks and worked on a generalizable configuration strategy. I propose something like this for a campaign:
YAML campaign configuration
name: Revision Coding -- English Wikipedia 2014 10k sample
source: enwiki 2014 revisions -- 10k random sample
author:
  name: Aaron Halfaker
  email: aaron.halfaker@gmail.com
coder:
  class: revcoding.coders.RevisionDiff
  form:
    fields:
      - damaging
      - good-faith
fields:
  damaging:
    class: revcoding.ui.RadioButtons
    label: Damaging?
    help: Did this edit cause damage to the article?
    options:
      -
        label: "yes"
        value: "yes"
        tooltip: Yes, this edit is damaging and should be reverted.
      -
        label: "no"
        value: "no"
        tooltip: >
          No, this edit is not damaging and should not be
          reverted.
      -
        label: unsure
        value: unsure
        tooltip: >
          It's not clear whether this edit damages the article or
          not.
  good-faith:
    class: revcoding.ui.RadioButtons
    label: Good faith?
    help: >
      Does it appear as though the author of this edit was
      trying to contribute productively?
    options:
      -
        label: "yes"
        value: "yes"
        tooltip: Yes, this edit appears to have been made in good-faith.
      -
        label: "no"
        value: "no"
        tooltip: No, this edit appears to have been made in bad-faith.
      -
        label: unsure
        value: unsure
        tooltip: >
          It's not clear whether or not this edit was made in
          good-faith.
- A server running in WMF Labs that would make sources of rev_ids available. The above configuration describes a campaign. The gadget running in the user's browser will have a hard-coded campaign list page (e.g. en:User:EpochFail/Revcoding/CampiagnList.js). The campaigns listed there will appear in gadget users' Special:UserContribs page. The WMF Labs server will be responsible for (1) delivering the campaign definition (described above) and (2) tracking, delivering and accepting submissions from work sets. --EpochFail (talk) 17:48, 19 October 2014 (UTC)
- Decided to hack together a quick diagram.
- --EpochFail (talk) 18:01, 19 October 2014 (UTC)
- I've realized a problem. When a user requests or submits a coding, how does the server know who they are? I wonder if we can get an oauth handshake in here somehow. If we open a popup window to the server that performs the oauth handshake and sets up a session with the user's browser, then subsequent requests will be identifiable. So... that means that a logged-in Wikipedia editor could be a logged-out Revcoder. Here's what that might look like:
- --EpochFail (talk) 18:16, 19 October 2014 (UTC)
- @EpochFail: newbie question: how/where do we use a YAML file like this? Helder 22:34, 20 October 2014 (UTC)
- Good Q. In the past, I have designed configuration strategies that build forms. See this one I use in en:WP:Snuggle: [1] (look for "user_actions:", it corresponds to Media:Snuggle.UI.user_menu.png). We'll have to write a form interpreter ourselves, but that's not too difficult. --EpochFail (talk) 23:14, 20 October 2014 (UTC)
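A minimal sketch of what such a form interpreter could look like, written in Python for illustration only (the gadget itself would run in the browser). The `CAMPAIGN` dict stands in for the parsed YAML (for example the result of PyYAML's `yaml.safe_load`), trimmed to one field. The nesting of `form` under `coder`, the function name `interpret_form` and the shape of its output are all assumptions, not an agreed design.

```python
# Hypothetical stand-in for the parsed YAML campaign configuration,
# trimmed to a single field for brevity.
CAMPAIGN = {
    "name": "Revision Coding -- English Wikipedia 2014 10k sample",
    "coder": {
        "class": "revcoding.coders.RevisionDiff",
        "form": {"fields": ["damaging"]},
    },
    "fields": {
        "damaging": {
            "class": "revcoding.ui.RadioButtons",
            "label": "Damaging?",
            "options": [
                {"label": "yes", "value": "yes"},
                {"label": "no", "value": "no"},
                {"label": "unsure", "value": "unsure"},
            ],
        }
    },
}


def interpret_form(config):
    """Resolve each field named by the form into a renderable description."""
    widgets = []
    for field_name in config["coder"]["form"]["fields"]:
        if field_name not in config["fields"]:
            raise KeyError("form refers to undefined field: " + field_name)
        spec = config["fields"][field_name]
        widgets.append({
            "name": field_name,
            # Keep only the final component of the dotted class path.
            "widget": spec["class"].rsplit(".", 1)[-1],
            "label": spec["label"],
            "values": [option["value"] for option in spec["options"]],
        })
    return widgets


for widget in interpret_form(CAMPAIGN):
    print(widget)
```

Keeping the interpreter as a pure function from configuration to widget descriptions would let the same campaign file drive both the UI and server-side validation of submitted values.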
- A few ideas:
- It might be worth allowing users to add a note about a specific revision when reviewing it, mainly when the user is "unsure" about the correct label.
- Maybe we could save a click for each review by not having a submit button? Then, when the user clicks to select the second label, the review is also submitted to the system.
- In this case, an undo button could be necessary/useful (see also a similar request for the Wikidata Game)
- Keyboard bindings/shortcuts
- Shortcuts :) let's put some emacs C-J-W-1-2-3 bindings --Jonas AGX (talk) 00:44, 17 November 2014 (UTC)
- Done Use colors in the vertical bars to indicate whether the revision is damaging or not, good-faith or not, etc. (the bottom half of the bar could be used for one feature and the upper half for the other)
- This was implemented in the gadget by splitting each vertical bar in blocks (one for each field).
- Not done Move the "unsure" button to the middle (yes, unsure, no), so the ordering of the "scale" is more intuitive (+1, 0, -1)
- This does not scale to non-binary things (e.g. article quality class). However moving the unsure option into a separate checkbox makes sense, as it requires the user to make his best guess while still informing that there are doubts. Helder 18:27, 30 January 2015 (UTC)
- Helder 18:52, 16 November 2014 (UTC)
- Here is a screenshot of the interface as implemented in the gadget. Some notes:
- Each vertical rectangle corresponds to a revision, and it is split into boxes, where the boxes in the first row correspond to "Damaging" and those in the second row correspond to "good-faith".
- If new fields are added to the spec:
- New boxes are added below the two current boxes of each revision
- New styles (e.g. colors) need to be added manually to the CSS file.
- Maybe it is a good idea to provide a default set of 10 colors which would be used for, say, the 10 first options of a field.
- I'm treating "unsure" as being different from "not evaluated yet", and I assume "unsure" would be stored as an actual value in the database
- The workset wraps automatically when the browser window is too small.
- I exemplified how to get a dataset of revids from recent changes
- The buttons do not do anything yet, but they would use CORS to make API calls to e.g. Danilo.mac's prototype on Labs, to store the values provided by a user.
- Update: The submit button updates the progress bar with the values for the current diff (selected using the other buttons) and in the future it would use CORS to make API calls to e.g. Danilo's prototype on Labs, to store the values provided by a user. Helder 01:05, 14 January 2015 (UTC)
- Update 2: The submit button updates the progress bar with the values for the current diff (selected using the other buttons) and uses JSONP to make API calls to Danilo's prototype on Labs, to store the values provided by the user. Helder 18:20, 22 January 2015 (UTC)
- This looks very good to me. I appreciate that you are thinking about how the visualization will extend for more fields. I'm worried about entering into some crazy visuals if someone adds a lot of fields to the form, but then again, they probably shouldn't have that many fields. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
- @EpochFail: Yeah, if there are too many fields, the users will also have a hard time evaluating each revision. Helder 18:22, 22 January 2015 (UTC)
- Is it safe to openly publish the quality ratings we are going to collect? I mean, while we learn how vandals work, those vandals will learn (by reading such datasets) how we learn from them and be able to find new ways to hide in the ocean of approved revisions. OK, it's much more a sci-fi question than a practical issue. --Jonas AGX (talk) 00:44, 17 November 2014 (UTC)
- @Jonas AGX: I think yes. It is common knowledge what vandalism is, and I don't think vandals will change anything in their behaviour just because they know that we consider their actions disruptive. Helder 18:28, 22 January 2015 (UTC)
@Jonas AGX, EpochFail, とある白い猫: I made a first draft of a gadget based on these mockups. It is the first one available on testwiki:Special:Preferences#mw-prefsection-gadgets and its result can be seen in a diff page such as testwiki:Special:Diff/219084. Helder 20:47, 24 November 2014 (UTC)
- Is anybody developing something for the database? I can try to make an API in toollabs:ptwikis to collect the data sent by the gadget in a database. I have already made a tool to register data; it was for voting on the last WLE photos, and the votes are registered in a database, but it doesn't use OAuth. I still have to learn how to use OAuth, and as ptwikis is a tool for Portuguese projects I will initially make this only for ptwiki, OK? (sorry for the bad English) Danilo.mac talk 02:47, 26 November 2014 (UTC)
- halfak and gwicke were talking about something like this (I think) yesterday / today on #wikimedia-research. Maybe they have something to add here? Helder 16:17, 26 November 2014 (UTC)
- Just to explain better, I'm not trying to make something definitive; it is just for tests. I have made this tool that shows data saved using this API. Danilo.mac talk 16:53, 27 November 2014 (UTC)
- I don't think we want to use RestBASE (The stuff I was discussing with Gwicke) to store our training set data. We'll probably want to maintain our own system and the testing that Danilo.mac is doing is helping us get that up and running. Do you guys have any design docs put together yet? I have some mockups that I'd like to share. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
- @EpochFail: Nope. Feel free to share your mockups. Helder 18:30, 22 January 2015 (UTC)
- Shouldn't we use radio buttons (see the WMF living styleguide) instead of button groups? With radio buttons, we could just use the class "mw-ui-radio" and the selected option in each group would be styled automatically, while with buttons we would need to define some new class with styles for the selected buttons. Helder 19:54, 12 January 2015 (UTC)
- Or, depending on the kinds of fields that we will have, checkboxes instead of radio buttons, to allow for multiple values (tags) for a single revision... Helder 23:03, 12 January 2015 (UTC)
- +1 for following whatever standards exist in MediaWiki. Otherwise, I'd like to optimize for usability. Radio buttons are small and hard to click. We can also surround the radio with clickable space. Something like this: [( ) Label ] vs [(*) Label ]. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
- I don't see the difference in the example [( ) Label ] vs [(*) Label ], but I usually make the label of radio buttons clickable to make it easier to select an option.
- Looks like there are two living style guides, the other one being about OOjs UI. This one has support for Button selects and options. Helder 12:33, 3 February 2015 (UTC)
- I ask this having in mind the mockup you have on your Google Drive (Revision scoring › coding › Revision handcoder), showing edit types. How would that be stored in the database? What if someone decide to add new options to a field like these? Helder 18:40, 22 January 2015 (UTC)
- @EpochFail: ^. Helder 18:34, 30 January 2015 (UTC)
- It's this pattern that makes me want to use JSON for the form field. For example, we could have a form field that stores a list of values:
{"edit_type": ["copy", "refactor", "addition"]}
- Or we could have a collection of booleans:
{"copy": true, "add_citation": false, "addition": true, "refactor": true, "removal": false}
- As I mentioned in the related trello card [2], Postgres' JSONB type supports this as well as querying and indexing of JSON elements -- though I suspect that we won't really want to index form data directly. --EpochFail (talk) 19:25, 30 January 2015 (UTC)
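To make the two JSON shapes above concrete, here is a small Python sketch that converts between the list-of-values form and the collection-of-booleans form. The `EDIT_TYPES` list and both function names are hypothetical and only illustrate the idea; nothing here touches a database.

```python
import json

# Hypothetical list of options currently offered by the "edit type" field.
EDIT_TYPES = ["copy", "add_citation", "addition", "refactor", "removal"]


def list_to_flags(form_data, known_types=EDIT_TYPES):
    """Expand {"edit_type": [...]} into one boolean per known option."""
    selected = set(form_data["edit_type"])
    return {edit_type: edit_type in selected for edit_type in known_types}


def flags_to_list(flags):
    """Collapse a collection of booleans back into a list of values."""
    return {"edit_type": [name for name, is_set in flags.items() if is_set]}


stored = json.loads('{"edit_type": ["copy", "refactor", "addition"]}')
flags = list_to_flags(stored)
print(json.dumps(flags, sort_keys=True))
print(json.dumps(flags_to_list(flags)))
```

With the list form, adding a new option later only changes `EDIT_TYPES`; rows stored earlier remain valid and simply never contain the new value.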
Handcoder home
Hey folks, I created a new mockup for a home for revcoding work. Such an interface could give our volunteer handcoders a window into the system's labeled data needs and may provide easy access to the revision handcoder to add new data.
- [propose] buttons would take users to the bug tracker to file a bug.
- The "training data" histograms on the right would visually present the recency of available data for training classifiers. More recent data is more better.
- [add data] button would load up a random sample of recent revisions into the handcoder for the user to process
I imagine that we'd have a suite of admin tools that would allow us to train/test/deploy new classifiers from the web interface. --EpochFail (talk) 17:32, 20 January 2015 (UTC)
- Cool! There could be a link to report bugs/make requests for the existing stemmers too (e.g. https://github.com/nltk/nltk/issues). Helder 18:54, 22 January 2015 (UTC)
- I do like this a lot. I just have a minor point with the color blue in the screenshot. It's a tad bit too bright. A tad bit darker shade would be better. Like the color of the lines in the graph. Is this a possibility? -- とある白い猫 chi? 21:03, 31 January 2015 (UTC)
- I should just make my mockups be black and white. :P But seriously, I wouldn't mind pulling in someone with some visual design experience. In the meantime, I'll make sure you have access to the google drawing to make changes on your own. --Halfak (WMF) (talk) 17:14, 1 February 2015 (UTC)
Halfak's approach
I'd like to throw in my two cents on the matter. After Halfak's explanation yesterday, I find his approach quite sound. I think there is great benefit in having a gadget with several campaigns, each divided into tasks that expire if people are sitting on them. I think this approach suits our crowd-sourcing culture at Wikimedia projects better. After all, crowd-sourcing itself is a divide-and-conquer strategy to begin with. -- とある白い猫 chi? 09:04, 21 February 2015 (UTC)
- I think that the thread you are looking for is Research talk:Revision scoring as a service/Coder. :P --Halfak (WMF) (talk) 16:37, 21 February 2015 (UTC)
- I think you are probably right. -- とある白い猫 chi? 18:19, 21 February 2015 (UTC)
(Bad)Words as features
One thing I saw in the Machine Learning course was an application of Linear SVM to SPAM detection, where we took a list of ~2000 English words (the ones which appeared more than 100 times in a subset of the SpamAssassin Public Corpus) and used the presence/absence of each of their stems in an e-mail as a (binary) feature. So, given a word list in the form (foo, bar, baz, quux, ...) and an e-mail whose text is "Get a discount for bar now!", we would represent the e-mail by a vector (0, 1, 0, 0, ...) whose dimension is the number of words in our list, and which contains ones in the entries corresponding to each word found in the given e-mail. Then we used a set of 4000 examples to train an SVM and tested it on 1000 other examples, getting ~98% accuracy. After that, we also sorted our vocabulary by the weights learned by the model to get a list of the top predictors of SPAM.
In the context of vandalism detection, top predictors could be used to improve the lists of badwords used by abuse filters, Salebot and similar tools which do not use machine learning. In the specific case of the Salebot list, we could even use the learned weights to fine tune the weights used by the bot.
This approach differs from the one currently in use on Revision-Scoring, where we just count the number (and the proportion, etc) of badwords added in a revision. It is as if all words in the list had the same weight, which doesn't look quite right. Helder 13:10, 6 January 2015 (UTC)
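A small sketch of the binary word-presence representation described above, in plain Python. It skips stemming and model training; the function names are hypothetical and the weights are made-up numbers standing in for whatever a trained linear model would learn.

```python
import re


def presence_vector(vocabulary, text):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]


def top_predictors(vocabulary, weights, n=10):
    """Sort vocabulary words by weight, largest first, and keep the top n."""
    ranked = sorted(zip(vocabulary, weights),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


vocabulary = ["foo", "bar", "baz", "quux"]
print(presence_vector(vocabulary, "Get a discount for bar now!"))  # [0, 1, 0, 0]

# Made-up weights standing in for a trained linear model's coefficients.
weights = [0.1, 2.3, -0.4, 0.0]
print(top_predictors(vocabulary, weights, n=2))
```

In a real experiment the vectors would be fed to a linear classifier and `weights` would come from the fitted model rather than being written by hand.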
Papers
These are a few papers on vandalism detection which might be of interest to us:
- Khoi-Nguyen Tran & Peter Christen. Cross Language Learning from Bots and Users to detect Vandalism on Wikipedia. 2014.
- Santiago M. Mola Velasco. Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals. 2012.
- Jeffrey M. Rzeszotarski & Aniket Kittur. Learning from history: predicting reverted work at the word level in Wikipedia. 2012.
- Andrew G. West & Insup Lee. Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence. 2011.
- Kelly Y. Itakura & Charles LA Clarke. Using Dynamic Markov Compression to Detect Vandalism in the Wikipedia. 2009.
- F. Gediz Aksit. Wikipedia Vandalism Detection using VandalSense 2.0. 2011.
- Sara Javanmardi & David W. McDonald & Cristina V. Lopes. Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso. 2011.
- B. Thomas Adler & Luca de Alfaro & Santiago Mola-Velasco & Paolo Rosso & Andrew G. West. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. 2011.
- Koen Smets & Bart Goethals & Brigitte Verdonk. Automatic vandalism detection in Wikipedia: Towards a machine learning approach. 2008.
- Martin Potthast & Benno Stein & Robert Gerling. Automatic Vandalism Detection in Wikipedia. 2008.
- Jacobi Carter. ClueBot and Vandalism on Wikipedia. 2007.
There is also this one, about PR and ROC curves:
- Jesse Davis & Mark Goadrich. The relationship between Precision-Recall and ROC curves. 2006.
Other sources for similar articles:
- wikipapers:List of research areas#Vandalism and spam
- w:en:User:Emijrp/Anti-vandalism bot census#References
- w:en:Wikipedia:Counter-Vandalism Unit/Vandalism studies#Wikipedia vandalism studies outside of this project
- Search for wiki vandalism detection on Google Scholar
@Aaron: this should be a good start... Helder 18:30, 9 January 2015 (UTC)
Progress report: 2015-01-11
Hey folks,
We've been doing a lot of hacking over the holiday season. Since the last update, we:
- Cleaned up the structure of the Feature Extractor so that it is easier to use. [3]
- We chose a name for the system that will serve revscores -- The Objective Revision Evaluation System -- or ORES. [4]
- We built an ipython notebook demonstrating the use of our LinearSVC scorer. [5]
- We implemented file writing and reading for the scorer models so that storing and sharing models will be easy [6]
- Danilo tested a set of classifiers for detecting reverts in PTwiki and showed some success. We'll likely need to refine a bit.
- Part of that refining will be using modifiers on our input data. For that, we've implemented "modifiers" as Pseudo-Features that can be used in training classifiers. See [7].
- We've read User:West.andrew.g's research on building machine learning models of damage and gathered a list of related work to pick through -- see #Papers. [8]
- We've also completed a discussion in trwiki and azwiki about this project. See Research:Revision_scoring_as_a_service/Engagement_with_Turkish_community and Research:Revision_scoring_as_a_service/Engagement_with_Azerbaijani_community.
User:とある白い猫 and User:He7d3r, please check out the bits above to see if I missed anything. --Halfak (WMF) (talk) 18:33, 11 January 2015 (UTC)
- Performed a minor correction. Looks good aside from that. :) -- とある白い猫 chi? 04:55, 15 January 2015 (UTC)
Installing and checking requirements.txt
I tested an installation of the revscoring system and dependencies on Ubuntu 14.04 with Python 3.4.0. See my work log below.
# First thing I would like to do is set up a python virtual environment. I like to store my virtual
# environment in a "venv" folder, so I'll make one in my home directory.
$ mkdir ~/venv
$ cd ~/venv
# Regretfully, pip is broken in the current version of venv, so we have to install it manually.
~/venv $ pyvenv-3.4 3.4 --without-pip
# Before we try to install pip, we'll need to activate the virtualenv
~/venv $ source 3.4/bin/activate
# Then use the installer script to install the most recent version
(3.4) ~/venv $ wget -O - https://bootstrap.pypa.io/get-pip.py | python
# Now we have pip in our venv so we can install our dependencies.
(3.4) ~/venv $ pip install deltas
(3.4) ~/venv $ pip install mediawiki-utilities
(3.4) ~/venv $ pip install nltk
(3.4) ~/venv $ pip install numpy
# Numpy failed because we are missing some Python headers, so we install them and try again
(3.4) ~/venv $ sudo apt-get install python3-dev
(3.4) ~/venv $ pip install numpy
# OK back to the list
(3.4) ~/venv $ pip install pytz
(3.4) ~/venv $ pip install scikit-learn
(3.4) ~/venv $ pip install scipy
# Installing scipy fails due to missing libraries
(3.4) ~/venv $ sudo apt-get install gfortran libopenblas-dev liblapack-dev
# And now we try again
(3.4) ~/venv $ pip install scipy
# Now, before we get going, we should download the nltk data we need.
(3.4) ~/venv $ python
>>> import nltk
>>> nltk.download()
Downloader> d
Identifier> wordnet
Downloader> d
Identifier> omw
Downloader> q
>>> ^D
# OK now it is time to set up the revscoring project. I like to pull in all my projects -- whether library
# or analysis -- into a "projects" directory.
(3.4) ~/venv $ mkdir ~/projects/
(3.4) ~/venv $ cd ~/projects
(3.4) ~/projects $ git clone https://github.com/halfak/Revision-Scoring revscoring
(3.4) ~/projects $ cd revscoring
(3.4) ~/projects/revscoring $ python demonstrate_extractor.py
# And it works!
--EpochFail (talk) 17:30, 15 January 2015 (UTC)
- On Linux Mint 17.1 (64 bits):
- pip install numpy fails with RuntimeError: Broken toolchain: cannot link a simple C program, but sudo apt-get install python3-dev fixes it.
- pip install scikit-learn fails with "sh: 1: x86_64-linux-gnu-g++: not found", so sudo apt-get install g++ needs to be executed before it "Successfully installed scikit-learn-0.15.2"
- Before running pip install scipy, it was necessary to run sudo apt-get install liblapack-dev (due to numpy.distutils.system_info.NotFoundError: no lapack/blas resources found) and sudo apt-get install gfortran (due to error: library dfftpack has Fortran sources but no Fortran compiler found). The package libopenblas-dev was not necessary.
- Before cloning the repository, I had to install git too: sudo apt-get install git
- Helder 19:31, 1 February 2015 (UTC)
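After an installation like the ones logged above, a quick check that the dependencies are importable can save some guessing. This is a minimal sketch, not part of the revscoring repository. The module names in `MODULES` are assumptions about how each package is imported (for example, mediawiki-utilities is assumed to import as `mw` and scikit-learn as `sklearn`).

```python
import importlib.util

# Importable module names assumed for the packages installed above.
MODULES = ["deltas", "mw", "nltk", "numpy", "pytz", "sklearn", "scipy"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULES)
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All dependencies are importable.")
```

Running it inside the activated virtual environment reports which `pip install` steps still need attention.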
Logo design
- Author: Ekips39. Alternate version of Philosopher's stone glyph
- Author: Danilo.mac
- Author: Helder. Alternative version of Danilo's proposal
- Author: Danilo.mac (animated)
So there are three designs so far. -- とある白い猫 chi? 22:14, 18 January 2015 (UTC)
- @Danilo.mac: I think your suggestion would be perfect if the blue circle were in the center of the white circle, and the whole gear at the bottom were the same color (red/brown). Helder 18:53, 23 January 2015 (UTC)
- I uploaded a new version with these changes. Helder 21:20, 24 January 2015 (UTC)
- @He7d3r: You might want to upload your version as a new file rather than overwriting. It helps us show and see different versions to compare. whym (talk) 03:52, 26 January 2015 (UTC)
- @whym: Done. Helder 08:56, 26 January 2015 (UTC)
- I particularly like the animated version. I think different aspects of our project can have different logos. For instance this animation can be used as the "loading" animation since I imagine some queries would take time. Or perhaps it could be the logo of revscores itself. -- とある白い猫 chi? 11:57, 26 January 2015 (UTC)
- @Helder: Thanks for fixing that! I'm not sure if my !vote counts, but I personally like Danilo.mac's (the 3rd one) because it apparently symbolizes the tool by a robot eyeball monitoring products. whym (talk) 11:05, 29 January 2015 (UTC)
Symbolism behind logo design
So we agreed on this as the logo for ORES (Objective Revision Evaluation Service) for the time being. I want to explain the symbolism/story behind it.
First off, the abbreviation of our system's name forms an acronym that matches the word "ores", the plural of "ore". Because we datamine raw data (data ore, if you will), we felt this name fit our system best.
Gold ranks among the most valuable ores. In the 17th century, the Philosopher's stone was a legendary substance believed to be capable of turning inexpensive metals into gold, and it was represented by an alchemical glyph. Our logo is inspired by this glyph, since the service we intend to provide will convert otherwise worthless ore data (raw data) to gold ore data. Mind that it is still ore for others to process, as this service's main goal is to enable other, more powerful tools.
The idea behind the logo came from User:Mareklug and User:Ekips39 was kind enough to draft two versions of the logo.
Progress report: 2015-01-16
Hey folks, we got another week of work in, which means it is time for a status update.
- We welcomed User:Jonas_AGX to the project team and he submitted a pull request to improve our feature set. [9]
- We completed a readthrough of AGWest's work [10]
- We wrote up installation notes for Ubuntu, Mint and Windows 7. [11]
- We've worked through the train/test/classify structure so that the whole team is familiar with it [12]
- We've done some substantial work testing and refining our classifiers on real world data [13]
- In a somewhat contrived environment, I've been able to demonstrate 0.85 AUC on English Wikipedia reverts -- which puts us on par with STiki. --16:32, 19 January 2015 (UTC)
- We attended the IEG office hour [14]
- We presented on revscoring at the January Metrics Meeting (video, slides) [15]
- We started a new repo for the Revision Handcoder [16]
- We fixed some issues with model file reading/writing that will make the system easier to generalize [17]
And I wrote a bunch of documentation to mark off our first month. Ping User:He7d3r, User:とある白い猫 and User:Jonas_AGX--Halfak (WMF) (talk) 16:32, 19 January 2015 (UTC)
Progress report: 2015-01-23
Hey folks. Again, I'm late with this report. You can blame the mw:MediaWiki Developer Summit 2015. I talked to a lot of people about the potential of this project there and learned a bit about operational concerns related to service proliferation. Anyway, last week:
- We discussed the design of revcoder -- including a new mockup of a revcoder homepage. [18] [19]
- We tested sending data between the revcoder and a service running on labs and worked out some of the details of storage strategies. [20]
- We pushed the accuracy of the enwiki revert classifier to .83 AUC. See an ipython notebook demonstrating the work. [21]
- Merged a feature that generalized is_mainspace to is_content_namespace [22]
- We filed a monthly report for December on the IEG page [23]
- We also finished up a little bit of other remaining IEG info bits [24]
- We made it easier to load and share model files [25]
That's all I've got. User:とある白い猫 and User:He7d3r, please review. :) --00:04, 30 January 2015 (UTC)
- This is the list of issues and pull requests we closed during that week:
- Helder 01:07, 17 February 2015 (UTC)
Progress report: 2015-01-30 (draft)
Last week:
- We closed these issues and pull requests:
- We added members to the revscoring project on Wikimedia Labs [26][27]
Progress report: 2015-02-06 (draft)
This week:
- We closed these issues and pull requests:
- We created a test instance on the revscoring project on Wikimedia Labs
- We updated the existing code to consistently use the name "revscoring" [28][29]
- We released revscoring 0.0.3 on PyPI[30][31]
- We noticed some issues with the probabilities provided by sklearn [32][33]
- We investigated and fixed some issues in Mediawiki-Utilities
Progress report: 2015-02-20
This week we:
- Deployed an MVP API for gathering revision scores [34]
- See http://ores.wmflabs.org/scores/enwiki?revids=4567890&models=reverted
- wikis=['enwiki', 'ptwiki'] & models=['reverted']
- We also fixed a few bugs in order to make this work -- e.g. a JSON encoding bug [35] and a prediction probability bug [36]
- Wrote a monthly report for January [37]
- We dug into issues around stemming for Azerbaijani -- regretfully, it looks like we're going to need to hack something together. [38]
- We began work on the revision coding backend server in earnest. See /Coder. [39] [40]
- We completed a signpost article introducing Wikipedians to the revscoring project [41]
See our changes to the repos during this week.
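For anyone who wants to script against the MVP API mentioned above, here is a hedged Python sketch that builds request URLs in the same shape as the example. The function name is hypothetical, and joining several revision IDs or models with "|" is an assumption; only the single-value form appears in the example URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://ores.wmflabs.org/scores"


def scores_url(wiki, rev_ids, models):
    """Build a scores URL; the "|" separator for lists is an assumption."""
    query = urlencode(
        [
            ("revids", "|".join(str(rev_id) for rev_id in rev_ids)),
            ("models", "|".join(models)),
        ],
        safe="|",  # keep the separator readable instead of percent-encoding it
    )
    return "{0}/{1}?{2}".format(BASE_URL, wiki, query)


print(scores_url("enwiki", [4567890], ["reverted"]))
```

The sketch only constructs the URL; fetching it and decoding the JSON response is left out because the response format is not described here.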
Progress report: 2015-02-27
Hey folks. Progress report time. Here we go. During the last week, we:
- Filed a bug against NLTK to add a stemmer for Turkish [42]. They would like us to submit a pull request [43], so that will need to wait.
- We dug into work to create models for Turkish and Azerbaijani wikis by creating a test corpus of reverted revisions [44]
- We performed some refactoring of the scoring system so that it can handle multiple models at a time and share features across them. [45] This provides for substantial performance improvements when a request is made for scores from multiple models. We also fixed some bugs with the dependency solver to improve caching behavior. [46]
- We also refactored languages so that they are expressed as a set of language utilities. [47] This allows for languages to be partially specified, but still useful.
- We implemented advanced operator modifiers "* / min, max, log, ==, !=" [48] This allows one to express compound features in an intuitive way.
That's all for this week. --EpochFail (talk) 17:42, 7 March 2015 (UTC)
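To illustrate how operator modifiers can make compound features read naturally, here is a self-contained Python sketch. It is not the revscoring API: the `Feature` class, the helper functions and the feature names are all hypothetical, and only `*`, `/` and `log` are shown.

```python
import math


class Feature:
    """Hypothetical feature: a named function of a revision's raw values."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, revision):
        return self.func(revision)

    def __mul__(self, other):
        return _combine("*", self, other, lambda a, b: a * b)

    def __truediv__(self, other):
        return _combine("/", self, other, lambda a, b: a / b)


def _as_feature(value):
    """Wrap plain constants so they can be combined like features."""
    if isinstance(value, Feature):
        return value
    return Feature(repr(value), lambda revision: value)


def _combine(symbol, left, right, operation):
    right = _as_feature(right)
    name = "({0} {1} {2})".format(left.name, symbol, right.name)
    return Feature(
        name, lambda revision: operation(left(revision), right(revision)))


def log(feature):
    """Natural logarithm of another feature's value."""
    return Feature("log({0})".format(feature.name),
                   lambda revision: math.log(feature(revision)))


chars_added = Feature("chars_added", lambda rev: rev["chars_added"])
badwords_added = Feature("badwords_added", lambda rev: rev["badwords_added"])

proportion = badwords_added / chars_added
print(proportion.name)
print(proportion({"chars_added": 200, "badwords_added": 4}))
```

Because each combination returns another `Feature`, compound expressions can be nested freely, e.g. `log(chars_added) * 2`.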
Progress report: 2015-03-06
Hey folks. Our article ran in the signpost! This week we:
- Improved diff algorithm performance so that we can generate those features more quickly. [49]
- We implemented partial language utilities for trwiki and azwiki [50], built feature sets for classifying reverts [51], and built revert models for them, finding that we could get relatively high AUC despite the lack of language features [52]
- We translated our signpost article for the Portuguese signpost [53].
That's all for now. --EpochFail (talk) 18:03, 7 March 2015 (UTC)
Progress report: 2015-03-13
This week was packed with refactoring and translation work.
- We completed (but didn't quite merge) a major refactor of the revscoring code base that brought better structure, more features and more tests. [54]
- We completed translations of our signpost article for Turkish and Azerbaijani wikis [55]
- We also made substantial progress towards connecting the back-end for our revision coder with our gadget prototype. See Research talk:Revision scoring as a service/Coder.
Progress report: 2015-03-20
Hey folks,
This week, we finished up quite a lot of work that was in progress last week.
- We officially added Ladsgroup to the project. [56]
- reza1615 translated our coding gadget to Farsi [57] and Gediz translated the gadget to Turkish and Azerbaijani [58]
- We published our IEG midpoint report [59] and our monthly report for February [60]
- We initialized the paths for the revision coder backend server [61] and configured the coder gadget to re-write mediawiki pages [62]
- We merged a major refactoring and expansion of features into revscoring [63]
That's all folks. Stay tuned. :) --Halfak (WMF) (talk) 21:31, 27 March 2015 (UTC)
Feedback from Raylton
@Halfak: This is the conversation I had with Raylton some time ago about the revcoder mockups and other aspects of the project (it is in Portuguese, sorry):
chat log
Progress report: 2015-03-27
Hey folks,
This week we:
- Stood up a dummy revision coder server to develop the gadget against. [64] It returns standard JSON for every request, but the responses are hard-coded.
- We translated the signpost article for fawiki [65]
- We proposed and discussed a database schema for the revision coder [66]. See Research talk:Revision scoring as a service/Wiki labels#Schema proposal
We also have some substantial work in progress:
- We've been propagating a recent refactor of the revscoring library across our other projects.
- We're cleaning up annoying little bugs like this guy in the revscoring library that first appeared in the wild.
- Helder and I have settled on a configuration strategy for the coding gadget. [67]
That's all. Stay tuned. --EpochFail (talk) 15:02, 4 April 2015 (UTC)
Progress report: 2015-04-03
This week we finished off a lot of stuff. :)
- We improved error reporting in the feature extractor [68]
- We fixed a few annoying bugs in revscoring that appeared in the wild [69] & [70] & [71]
- We completed the refactoring of ORES to match the refactoring of revscoring [72]
- We deployed the approved configuration strategy [73]
- We implemented the revision coder server -- and it works in all the expected ways using a postgres database and postgres' JSON data type [74]
- We also added some documentation about the paths of the coder server [75]. See Research:Revision scoring as a service/Wiki labels#Coding service
- We added some simple API documentation to ORES [76]
That's all. Stay tuned. --EpochFail (talk) 15:08, 4 April 2015 (UTC)
Featured in a Washington Post Article.
See http://www.washingtonpost.com/news/the-intersect/wp/2015/04/15/the-great-wikipedia-hoax/
Excerpt:
Wikipedians have proposed other reforms, too. The Wikimedia Foundation is funding research into more robust bots that could score the quality of site revisions and refer bad edits to volunteers for review. Another proposed bot would crawl the site and parse suspicious passages into questions, which editors could quickly research and either reject or approve.
Cool! --Halfak (WMF) (talk) 15:28, 16 April 2015 (UTC)
- I just noticed the link back to our project. I am at a loss for words. -- とある白い猫 chi? 12:30, 26 April 2015 (UTC)
Progress report: 2015-04-10
Hey folks,
$ revscoring -h
Provides access to a set of utilities for working with revision scorer models.

Utilities
* score             Scores a set of revisions
* extract_features  Extracts a list of features for a set of revisions
* train_test        Trains and tests a MLScorerModel with extracted features.

Usage:
    revscoring (-h | --help)
    revscoring <utility> [-h|--help]
Revscoring utility documentation
This week was another productive one with a lot of tasks coming together.
- We centralized the utility scripts that support revscoring within the revscoring project and made a cute general utility to make them easy to work with [77]. We also took the opportunity to make file reading/writing easier in Windows [78].
- We deployed a form builder interface for writing new form configurations used in the revcoder.
- We implemented a means for extracting all labels from the coder server [79]
- We added a feature to revscoring that prints the dependency tree for a feature. [80] This is useful when debugging dependency issues in feature extraction.
- We added a simple revert detector script to the ORES project [81]. This in combination with the centralized revscoring utilities provides automation for training new classifiers.
- We implemented an OAuth login for the revcoder system [82]. You can test the workflow by going to http://ores-test.wmflabs.org/coder/auth/initiate.
That's all for this week. Stay tuned. --Halfak (WMF) (talk) 22:25, 17 April 2015 (UTC)
Progress report: 2015-04-17
Hello all,
I will be filing the progress report for a short while in place of Halfak.
This week we:
- We now have a function OO.ui.instantiateFromParameters() which takes some JSON configuration and constructs an OOjs UI field. This also populates a fieldMap with "name"/widget pairs that can be used later.[83] You can get a sense of it by trying our form builder.
- We refactored ORES for language specific features to reflect the changes made to Revision Scoring. We also reorganized the features list to both reuse code and to improve on performance and accuracy. [84]
- We created a Mediawiki gadget to filter the recent changes feed by reverted score. [85]
- We renamed the service Revision handcoder to Wiki-Tagger. [86].
That is this week's summary. Stay tuned. -- とある白い猫 chi? 12:54, 26 April 2015 (UTC)
Progress report: 2015-04-24
Hello all,
- We updated ORES server to the newer version of revscoring and included new models. [87]
- We renamed Wiki-Tagging to Wiki-Labels. We hope to name our hand coding campaigns with a "Wiki labels foo" format a bit like wiki loves monuments.[88] We have also defined dependencies for Wiki-Labels.[89]
- We have explored additional methods for automatically detecting badwords.[90]
- We have investigated advanced bag-of-words approaches such as TF-IDF and, by extension, Latent Semantic Analysis (LSA) [91]
That is this week's summary. Stay tuned. -- とある白い猫 chi? 13:41, 26 April 2015 (UTC)
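The TF-IDF weighting mentioned in this report can be illustrated in a few lines. This is only a minimal sketch, not the project's code: the function names `tf_idf` and `top_terms` and the toy documents are assumptions made for illustration, and it uses the plain `tf * log(N / df)` weighting without smoothing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Uses the plain weighting tf * log(N / df); real systems often smooth it.
    """
    doc_count = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(doc_count / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


def top_terms(documents, doc_index, limit):
    """Return the `limit` highest-scoring terms of one document."""
    ranked = sorted(tf_idf(documents)[doc_index].items(),
                    key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


# Toy corpus (hypothetical): document 0 stands in for text added by reverted
# edits, the others for text added by edits that were kept.
DOCS = [
    ["stupid", "stupid", "the", "page"],
    ["the", "history", "page"],
    ["the", "river", "page"],
]

if __name__ == "__main__":
    print(top_terms(DOCS, 0, 1))
```

Words that appear in every document (such as "the") get a weight of zero, which is why this kind of weighting tends to surface words that are specific to one group of edits.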
Notes on ORES performance.
So, there's been some discussion recently of ORES's performance and how it's not nearly as fast as we would like when requesting scores for a bunch of revisions. I'd like to take the opportunity to document a few things that I know are slow. I'll sign each section I create so that we can have a conversation about each point.
Looking for misspellings
Looking for misspellings is one of our most substantial bottlenecks. Right now, we're using nltk's "wordnet" in order to look for words in English and Portuguese. This is slow. On some pages, scanning for misspellings can take up to 4 seconds on my i5. That's way too much -- especially because we end up scanning at least two revisions for misspellings. So, I've been doing some digging and I think that 'pyenchant' might be able to help us out here. The system uses the dictionaries installed on your Unix system to do lookups and it is much faster. Here's a performance comparison looking for misspellings in enwiki:4083720:
$ python demonstrate_spelling_speed.py
Sending requests with default User-Agent. Set 'user_agent' on api.Session to quiet this message.
Wordnet check took 3.7539222240448 seconds
Enchant check took 0.008267879486083984 seconds
So, it looks like we can get back between 2 and 3 orders of magnitude there (roughly a 450x speedup). It looks like we can get a lot of dictionaries too. Here's apt-get's listing of myspell dictionaries:
myspell-af myspell-el-gr myspell-fo myspell-hy myspell-ns myspell-st myspell-ve myspell-bg myspell-en-au myspell-fr myspell-it myspell-pl myspell-sv-se myspell-xh myspell-ca myspell-en-gb myspell-fr-gut myspell-ku myspell-pt myspell-sw myspell-zu myspell-cs myspell-en-us myspell-ga myspell-lt myspell-pt-br myspell-th myspell-da myspell-en-za myspell-gd myspell-lv myspell-pt-pt myspell-tl myspell-de-at myspell-eo myspell-gv myspell-nb myspell-ru myspell-tn myspell-de-ch myspell-es myspell-he myspell-nl myspell-sk myspell-tools myspell-de-de myspell-et myspell-hr myspell-nn myspell-sl myspell-ts myspell-de-de-oldspell myspell-fa myspell-hu myspell-nr myspell-ss myspell-uk
No Turkish or Azerbaijani, but we can do Farsi, English and Portuguese. :) --EpochFail (talk) 16:29, 26 April 2015 (UTC)
- @EpochFail: how do you clear the "Sending requests with default User-Agent. Set 'user_agent' on api.Session to quiet this message" warning? James Salsman (talk) 08:08, 29 August 2020 (UTC)
- James Salsman, you'll need to provide a user_agent argument to the mwapi.Session() constructor. A good "user_agent" includes an email address to contact you at and a short description of what you are using the API session for. E.g. "Demonstrating spell check speed - aaron.halfaker@somedomain.com" --EpochFail (talk) 20:25, 2 September 2020 (UTC)
- @EpochFail: <3 How would you change [92] to do that? I'm not sure where the user agent string is set.
- I updated that url. James Salsman (talk) 20:59, 2 September 2020 (UTC)
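A comparison like the wordnet-versus-enchant timing above could be produced with a small harness along the following lines. This is a sketch under stated assumptions, not the `demonstrate_spelling_speed.py` script itself: `slow_check` and `fast_check` are hypothetical stand-ins (a linear scan and a set lookup) and are neither wordnet nor enchant.

```python
import time


def misspellings(words, is_known):
    """Return the words that the given checker does not recognise."""
    return [word for word in words if not is_known(word.lower())]


def timed(checker, words):
    """Run a checker over the words; return (misspellings, seconds)."""
    start = time.perf_counter()
    result = misspellings(words, checker)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the two back ends: a linear scan over a list
# versus a hash lookup in a set. Neither is wordnet or enchant.
VOCABULARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
VOCABULARY_SET = frozenset(VOCABULARY)


def slow_check(word):
    return any(word == known for known in VOCABULARY)


def fast_check(word):
    return word in VOCABULARY_SET


if __name__ == "__main__":
    text = "The quick brwon fox jumps over the lazzy dog".split()
    for name, checker in (("slow", slow_check), ("fast", fast_check)):
        found, seconds = timed(checker, text)
        print(name, found, "took", seconds, "seconds")
```

Swapping a real checker into `timed` only requires a callable that takes a word and returns whether it is known.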
API latency and the need to perform multiple requests/score
Right now, we gather data for extracting features one revision at a time. For a common 'reverted' scoring, we'll perform the following requests:
- Get the content of the revision under scrutiny,
- Get the content of the preceding revision (lookup based on parent_id)
- Get metadata from the first edit to the page (for determining the age of the page, lookup based on page_id, ordered by timestamp)
- Get metadata about the editing user (lookup based on user_text)
- Get metadata about the editing user's last edit (lookup based on user_text, ordered by timestamp)
One way that we can improve this is by batching all of the requests in advance before we provide the data to the feature extractor. So, let's say we receive a request to score 50 revisions: we would make one batch request to the API for content from those 50 revisions. Then we would make another batch request to retrieve the content of all parent revisions. I think we can also batch the requests for a first edit to a page (specifying multiple page_ids to prop=revisions with rvlimit=1). We can batch the request to list=users and list=usercontribs too. We'd then have to use the extractor's dependency injection to address these bits for each revision after the fact. For example:
features = extractor.extract(rev_id, features, cache={revision.doc: <our doc>, parent_revision.doc: <our doc>, ... })
It makes me a bit sad to do this since we don't know in advance that revision.doc and parent_revision.doc are necessary in the code. We might want to provide some functionality at the ScorerModel level to allow us to check this. E.g.
if scorer_model.requires(revision.doc):
    cache[revision.doc] = session.revisions.query(revids=...)
--EpochFail (talk) 16:29, 26 April 2015 (UTC)
- Other than the batching of requests, which seems very appropriate, would it help if the system used database access instead of API requests to extract the features? Helder 17:32, 26 April 2015 (UTC)
- +1 to Helder - 1 and 2 require the API, but steps 3-5 can all be done via the database, batched, very quickly. Yuvipanda (talk) 18:09, 26 April 2015 (UTC)
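The batching idea discussed in this section can be sketched without touching the network. The block below is a minimal illustration under assumptions: the helper names (`chunked`, `revision_batch_params`, `build_caches`) are hypothetical, the batch size of 50 is an assumed per-request limit, and the string keys merely stand in for the `revision.doc` / `parent_revision.doc` dependencies named above. The parameter names follow the MediaWiki action API (`action=query`, `prop=revisions`, `revids`).

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def revision_batch_params(rev_ids, batch_size=50):
    """Build one set of query parameters per batch of revision IDs.

    batch_size=50 is an assumed limit on values per request.
    """
    return [
        {
            "action": "query",
            "prop": "revisions",
            "revids": "|".join(str(rev_id) for rev_id in batch),
            "rvprop": "ids|content",
            "format": "json",
        }
        for batch in chunked(list(rev_ids), batch_size)
    ]


def build_caches(docs_by_rev_id, parent_ids):
    """Pair each revision's document with its parent's document.

    `docs_by_rev_id` maps rev_id -> fetched document and `parent_ids` maps
    rev_id -> parent rev_id. The string keys are placeholders for the
    extractor's real dependency objects.
    """
    return {
        rev_id: {
            "revision.doc": docs_by_rev_id[rev_id],
            "parent_revision.doc": docs_by_rev_id.get(parent_ids[rev_id]),
        }
        for rev_id in parent_ids
    }


if __name__ == "__main__":
    params = revision_batch_params(range(1000, 1120))
    print(len(params), "batched requests instead of 120 single ones")
```

Each per-revision dict returned by `build_caches` is the kind of mapping that could be handed to the extractor as its `cache` argument once the batched responses are in.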
Caching
Right now, there's no caching at all. If a score is requested, it is calculated, returned and then forgotten. This is sad because we could probably store scores for the entire history of all the wikis in ~ 50-75GB. We could also make use of a simple LRU cache in memory (e.g. https://docs.python.org/3/library/functools.html#functools.lru_cache). This would work really well for managing the load of the set of bots/tools tracking the recentchanges feed. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
- +1 for computing the scores for all revisions of all (supported) wikis and storing them in some kind of database, with associated version numbers to identify which model was used to compute the scores. As a user of the system I would like to be able to get e.g. a list of revisions whose previous score was a false positive (i.e. a constructive edit scored as vandalism) and whose score generated by a more recent model is now a true negative (similarly for other combinations of true/false positives/negatives). This would allow us to get an idea of how the system is improving over time, and to identify regressions in the quality of the scores we provide for users. Helder 17:32, 26 April 2015 (UTC)
- I'd suggest a local install of redis for caching over in-process caching. This ensures that you can restart your process willy-nilly without having to worry about losing cache. Yuvipanda (talk) 18:10, 26 April 2015 (UTC)
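The in-memory option raised in this section could look roughly like the sketch below, which uses `functools.lru_cache` from the standard library. Everything else is an assumption for illustration: `compute_score` is a hypothetical stand-in for feature extraction plus model scoring, and the model name and version strings are made up. The model version is part of the cache key so that a retrained model never serves scores computed by its predecessor.

```python
from functools import lru_cache

CALLS = []  # records every real computation, for demonstration only


def compute_score(model, rev_id):
    """Hypothetical stand-in for feature extraction plus model scoring."""
    CALLS.append((model, rev_id))
    return {"model": model, "rev_id": rev_id, "probability": 0.5}


@lru_cache(maxsize=1024)
def cached_score(model, model_version, rev_id):
    """Return a score, computing it only on a cache miss.

    model_version is part of the key, so bumping it invalidates old entries.
    """
    return compute_score(model, rev_id)


if __name__ == "__main__":
    cached_score("reverted", "0.0.1", 4083720)
    cached_score("reverted", "0.0.1", 4083720)  # served from the cache
    print(cached_score.cache_info())
```

An in-process cache like this is lost on restart, which is the trade-off against an external store such as the redis install suggested above.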
Pre-caching
Since we know that the majority of our requests are going to be for recent data, we could try to beat our users to the punch by generating scores and caching them before they are requested. Assuming caching is in place, we'd just need to listen to something like RCStream and simply submit requests to ORES for changes as they happen. If we're fast enough, we'll beat the bots/tools. If we're too slow, we might end up needing to generate a score twice. It would be nice to be able to delay a request if a score is already being generated so that we only do it once. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
- This looks like an interesting thing to do. Helder 17:32, 26 April 2015 (UTC)
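The wish to "delay a request if a score is already being generated" can be sketched with standard-library futures. This is one possible shape, not the ORES implementation: the class name `DedupingScorer` and its methods are hypothetical, and a thread pool stands in for whatever task queue would do the real work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DedupingScorer:
    """Share one in-flight computation among requests for the same rev_id."""

    def __init__(self, compute, max_workers=4):
        self._compute = compute
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._pending = {}  # rev_id -> Future for a score being generated

    def request(self, rev_id):
        """Return a Future for the score, reusing any in-flight one."""
        with self._lock:
            future = self._pending.get(rev_id)
            if future is not None:
                return future
            future = self._executor.submit(self._compute, rev_id)
            self._pending[rev_id] = future
        # Registered outside the lock: if the work already finished, the
        # callback runs immediately in this thread and needs the lock itself.
        future.add_done_callback(lambda _done: self._forget(rev_id))
        return future

    def _forget(self, rev_id):
        """Drop the finished computation so later requests start afresh."""
        with self._lock:
            self._pending.pop(rev_id, None)


if __name__ == "__main__":
    scorer = DedupingScorer(lambda rev_id: {"rev_id": rev_id, "score": 0.5})
    print(scorer.request(4083720).result())
```

Combined with a cache, the pre-caching listener and a bot asking for the same revision at the same moment would wait on one computation instead of triggering two.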
Misclassifications
Moved to Research:Revision scoring as a service/Misclassifications/Edit quality --EpochFail (talk) 20:11, 8 December 2015 (UTC)
Progress report: 2015-05-01
Hello all,
We decided that I shall carry out the weekly reports from now on.
- We had major ongoing work for Wiki-Labels. We intend to have everything up and running by 8 May, when we will have the first hand-coder input from Wiki-Labels. Once this is achieved it will be a milestone for our project.
- We held a general discussion on the landing page and also designed the page (w:Wikipedia:Labels). [93]
- We wrote general documentation for Wiki-Labels here on meta: Wiki labels. [94]
That is this week's summary. Hold on to your seats for more. -- とある白い猫 chi? 00:05, 7 May 2015 (UTC)
Badwords
@Ladsgroup: what is the condition for keeping/removing a string in the list? I believe it is so subjective, and that there are so many criteria, that I don't really know what to do with lists like these. I even kept the list which I generated as is, due to the lack of objective criteria for removing items from it.
I believe we need to have some kind of labelling effort for adding (multiple) tags for each string in the lists, so that we can have different categories. I also don't know which would be the common categories, but there are many reasons why an edit adding a given string might be considered damaging:
- It talks to the reader (e.g. "you", "go **** yourself", "<someone>, I love you!"), and this is not acceptable in an encyclopedic article
- It is related to sex, and the article isn't
- It is about a part of the human body (likely inappropriate in an article about Math)
- It is in a language other than the article's or wiki's language (if it is a "badword" in that other language, should it be in the list for this language too?)
- It is not a word ("lol", "hahaha", "kkkkkk")
- It is a personal attack/name calling
- It is an acronym only used in informal talking (e.g. on chats)
- It is a website or brand (e.g. "easyspace", "redtube")
- It discriminates against some group of people, for whatever reason (ideology, beliefs, etc...)
- It is a misspelled (bad)word
- It is an uncommon word
- It is something else
- It is many of the above things simultaneously
Helder 13:44, 8 May 2015 (UTC)
- Anyway, here is a possible split of the Portuguese list you generated: https://gist.github.com/he7d3r/6a5ecf56941a323cb568. Helder 13:47, 8 May 2015 (UTC)
- Helder: Thank you for your feedback. The list is generated automatically; it's up to us which words we use and which we don't. I'm using a more advanced technique to get better results. I will update the list; please review it and tell me whether it has improved or not. Best Amir (talk) 23:46, 8 May 2015 (UTC)
- The problem is that I don't have clear criteria for doing such a review, per above. Helder 12:46, 10 May 2015 (UTC)
Progress report: 2015-05-08
Hello all,
Checking in with the weekly report.
- We have achieved our milestone, as our hand coder Wiki-Labels is live as of 8 May in four languages: English, Persian, Portuguese, Turkish.
- We did further community outreach prior to 8 May to promote the first wiki labels campaign: quality labeling. [95]
- We updated the gadget to point to ORES server. [96]
- We deployed Wiki labels to labels.wmflabs.org and updated the docs [97]
- We loaded 20k samples for all languages. This was over-saturated by very high number of SUL notifications so we resampled from a year back. [98]
- We configured Wiki labels to run from wmflabs [99]
- We generalized paths so that wiki labels works from localhost or labels.wmflabs as expected. [100]
- Wikilabels bugfix: Worksets were shuffled when the user navigated away from the page and then back. Now worksets are sorted by task ID. [101]
- We translated the interface of Wiki labels to Turkish, Persian and Portuguese. [102] [103] [104]
- We implemented a language fallback chain in wikilabels for cases where translation isn't available. [105]
- We detailed documentation on individual revscoring features that are used by classifiers. [106]
- We implemented enchant spell checker for Revscoring. [107]
That was your weekly report. -- とある白い猫 chi? 08:32, 13 May 2015 (UTC)
- @とある白い猫, Halfak: So, if the 20k samples are from 2014, we should probably rename the campaigns, because they are saying the edits are from 2015. Helder 13:44, 13 May 2015 (UTC)
- The edits are from the year ending in 2015-04-15, so some may be from 2014. --Halfak (WMF) (talk) 22:37, 13 May 2015 (UTC)
- @Halfak: and the previous sample was from which period? Helder 09:18, 14 May 2015 (UTC)
- I'm sorry. What previous sample? Could you be thinking of the sample we trained on revert/not-reverted? That was also extracted in 2015. It turns out that our test dataset for loading Wiki labels contain campaign names that reference 2014, but that's just test data. --Halfak (WMF) (talk) 15:17, 14 May 2015 (UTC)
- @Halfak (WMF): I'm referring to this:
This was over-saturated by very high number of SUL notifications so we resampled from a year back.
Helder 15:57, 14 May 2015 (UTC)
- Ahh yes. So the last sample was from the last 30 days since that is what the recentchanges table keeps. But the new sample uses the revision table and a whole year's worth of revisions. So, a "year back" from 2015-04-15. --Halfak (WMF) (talk) 16:15, 14 May 2015 (UTC)
- Got it! Thanks for clarifying. Helder 18:17, 14 May 2015 (UTC)
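The language fallback chain mentioned in the 2015-05-08 report above can be pictured with a short sketch. This is not the Wikilabels code: the chains, the messages and the `translate` helper are all hypothetical and only illustrate the lookup order.

```python
# Hypothetical fallback chains and messages, for illustration only.
FALLBACKS = {"pt-br": ["pt", "en"], "tr": ["en"]}
MESSAGES = {
    "en": {"save": "Save", "abandon": "Abandon"},
    "pt": {"save": "Salvar"},
    "tr": {"save": "Kaydet"},
}


def translate(key, lang, messages=MESSAGES, fallbacks=FALLBACKS, default="en"):
    """Return the first available translation of `key` along the chain."""
    chain = [lang] + fallbacks.get(lang, [default])
    for code in chain:
        text = messages.get(code, {}).get(key)
        if text is not None:
            return text
    # No language in the chain has the message: show the key itself.
    return "<" + key + ">"


if __name__ == "__main__":
    print(translate("save", "pt-br"))     # found in the "pt" messages
    print(translate("abandon", "pt-br"))  # falls through to "en"
```

A language with no configured chain falls back directly to the default language in this sketch.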
Progress report: 2015-05-15
Hello all,
Checking in with the weekly report.
- Wikilabels bugfix: We had a strange bug where the full-screen button appeared twice. Upon further investigation this was because the global.js was being loaded twice. We modified our code to prevent double-loading of the UI. [108]
- We generated bad word lists for az, en, fa, pt and tr wikis. Unlike the previous lists these are generated from known bad revisions. [109]
- We have updated the translation for Portuguese. [110]
- We have conducted some maintenance and administrative work concerning Wikilabels. [111]
That was your weekly report. -- とある白い猫 chi? 17:25, 21 May 2015 (UTC)
Progress report: 2015-05-22
Hello all,
Checking in with the weekly report.
- We have conducted some maintenance and administrative work concerning Wikilabels. [112]
- We have filtered likely-non-damaging edits from tasks on en, fa, pt, tr wikis. [113]
- We posted a progress report on enwiki, fawiki, ptwiki and trwiki on the first campaign.[114][115]
- We will attend Mediawikiwiki:Wikimedia Hackathon 2015 to recruit and hack-a-thon away...[116]
That was your weekly report. -- とある白い猫 chi? 15:11, 24 May 2015 (UTC)
ORES performance improvements
Hey folks,
I've been working with Yuvipanda to work out some performance and scalability improvements for ORES. I've captured our discussions about upcoming work in a series of diagrams that describe the work we plan to do.
My plan is to work from left to right implementing improvements incrementally and testing against the server's performance. I've already been doing that actually as we have been implementing improvements along the way. Right now, the basic flow has seen substantial improvements in the misspellings look-up speed and request batching against the API.
I plan to have this figure updated as we roll out upcoming performance improvements. --EpochFail (talk) 11:03, 29 May 2015 (UTC)
Progress report: 2015-05-29
Hello all,
Checking in with the weekly report.
- We have conducted some maintenance and administrative work concerning Wikilabels. [117]
- We completed Portuguese translation of documentation on meta. [118]
- We observed a CSS conflict. User:Hedonil/XTools/XTools.js conflicts with Wikilabels and prevents the display of Wikilabels UI. [119]
- We started up a compute server on labs (testing redis/celery & generating models). [120]
- We attended Wikimedia Hackathon 2015
- Day 1: Lots of hacking, pywikipediabot meeting, first ever in person meeting. [121], [122]
- Day 2: More hacking, Hackathon hack session (T90034). [123][124]
- Day 3: Even more hacking, added French language specific utilities to revision scoring, live demo of French language of revision scoring at closing showcase. [125]
- We added French language specific utilities. Special thanks goes to fr:User:Paannd a for the French translation and review of the French bad word list.[126]
That was your weekly report. -- とある白い猫 chi? 02:44, 8 June 2015 (UTC)
Progress report: 2015-06-05
Hello all,
Checking in with the weekly report.
- We refactored revision scoring dependency management. While this did not make a difference in the interface, it has improved performance and management. A total of 46 files were modified in some capacity with 847 added lines and 703 removed. [127][128]
This was your weekly report. -- とある白い猫 chi? 02:52, 8 June 2015 (UTC)
- Also, we now have docs at pythonhosted.org/revscoring. --Halfak (WMF) (talk) 14:50, 9 June 2015 (UTC)
Fluid animation
While sorting media I noticed the above property of our animated ORES logo. :) -- とある白い猫 chi? 13:48, 14 June 2015 (UTC)
- LOL. Helder 14:31, 14 June 2015 (UTC)
- So awesome. :) --Halfak (WMF) (talk) 15:10, 14 June 2015 (UTC)
- I wonder if we can use this property for our purposes. :/ -- とある白い猫 chi? 18:01, 11 July 2015 (UTC)
Article quality scores
Hello User:Halfak (WMF) pinging you after your Wikimania talk. Super interesting stuff! I asked about direct database access to a full set of article quality scores. I'm maintaining the WikiMiniAtlas for which I need a metric to prioritize articles shown on the map at a given zoom level. I'd prefer to show high quality articles as prominently as possible (my current ranking is just based on article size). --Dschwen (talk) 15:47, 19 July 2015 (UTC)
- Hi Dschwen! We're looking into it now. If you are just sizing things in the interface, I wonder whether requesting scores from ORES would work for you in the short term. In the long term, we have a phab ticket that you can subscribe to. See Phab:T106278. We're currently working out what the initial table will contain. --Halfak (WMF) (talk) 19:55, 22 July 2015 (UTC)
- I'm not just resizing stuff (or needing a small set of data at a time), but I need to process the entire set of all geocoded Wikipedia articles at once to build a database of map labels for each zoom level. I will subscribe to the Phabricator ticket. Thanks! --Dschwen (talk) 21:14, 26 July 2015 (UTC)
New stats on ORES speed.
Not much time to type. Here's the plot. Should be self explanatory. Woot!
Perpetuating bias
This is great work, add me to the list of people who hope it opens the door to a much more permissive wiki culture!
Apologies if this discussion is already under way somewhere else... I wanted to ask about the training data for the revscoring:reverted models, and whether you plan to unpack the various motivations behind the "undo" action. In short, I think it's imperative that we present a multiple-choice field during undo, to allow the editor to categorize their reason for reverting. This will allow us to provide much higher quality predictions in the future.
To make a sloppy analogy, what we're currently doing is like training a neural network on the yes-no question, "is this a letter of the alphabet?". What we want is to have it learn which specific letter is which.
Meanwhile, to continue with the analogy, imagine the written language is evolving and new letters are being invented, old ones are changing form.
I'd like to see documentation on exactly how the training data is harvested, because I'm concerned that our revert model is actually capturing something ephemeral about on-wiki culture, which has shifted over time. Clearly you've considered this problem, the introduction to ORES/reverted suggests as much. For historical data, we would probably need to correlate reverts with debate about the revert--was this a contentious revert? Can we guess whether it was done in good faith? Was the outcome to rollback the revert? Was there an edit war? Also, was the revert destructive, was the original author offended? Engaged? Retained? This seems like a really hard problem, which is why I'd suggest we focus on recent data only, to capture current norms around acceptable article style, and also introduce a self-reporting mechanism where the editor can categorize the reason for their revert.
Adamw (talk) 05:45, 23 July 2015 (UTC)
- +1 to all that you've said.
- I think it is an interesting idea to have a multiple choice option for 'undo'. It seems like different actions should be taken given the user's reasoning. E.g. if blatantly offensive vandalism, level 4 warning & revert. If playful vandalism, level 1 (or N+1) warning & revert. If test edit (key mash, "hi there", etc.), then test edit warning & revert. If good-faith, but still does not belong, revert and post reason on talk page. If good, no undo for you! Assuming that judgements made by editors had sufficient coverage (there's reason to believe that reverts have coverage), then we could use this to train and deploy better prediction models.
- Right now, we're looking at using our Wiki labels campaign to answer the questions: "Is this damaging?" and "Is this good-faith?" so that we can (1) check the biases in our 'reverted' model and (2) train a better classifier that focuses on damage. I'd really like to be able to stand up a classifier that specializes in bad-faith damage for quality control purposes.
- Your point about recency is well received as well. One substantial concern we must manage is the periodic nature of vandalism. E.g. when school is in session in North America, we seem to get a lot more vandalism in enwiki and of a different type. Right now, we're training our models based on revisions from the entire year of 2014 because we started work in January, 2015 -- but we could also sample from the entire year before yesterday. I think we'll find it difficult to extend our wiki-labeling campaign once per year since it involves a substantial amount of effort. This might work for enwiki where we have been lucky to find many volunteer labelers, but I suspect that less active wikis will fall behind unless we integrate with mediawiki's undo/rollback. --EpochFail (talk) 18:29, 23 July 2015 (UTC)
Progress report: 2015-06-05 - 2015-08-02
We have neglected reporting lately but we have been working hard on expanding and developing our project.
The list below is in reverse chronological order, with the newest item at the top.
- Init for wb_vandalism [129]
- Fix Wikilabels DB config issues [130]
- Add full-page view to Wikilabels [131]
- Remove USB 2.0 driver to resolve an issue with ORES vagrant [132]
- Revision scoring in production discussion [133]
- Create a Debian package for python3-jsonpify [134]
- Create a python package for ‘stopit’ module [135]
- Create a Mediawiki utilities Debian package [136]
- Create Debian package for yamlconf [137]
- Setup Wikilabels infrastructure and deployment [138]
- Fix revision scoring requests version issues [139]
- Fix language cache issue in API Extractor [140]
- Implemented Regex Language generalization in revscoring [141]
- Scored Revisions to use test server [142]
- Develop Vagrant for Revision Scoring [143]
- Fix DocumentNotFound error for article/page creations [144]
- Spec out research for edit type classifier [145]
- Select imports for languages (revision scoring) [146]
- Add basic metric/ outcome goals to IEG renewal [147]
- Add promise of longer-term viability plan to renewal [148]
- Indonesian, Spanish, Vietnamese language utilities [149] [150][151]
- We made a presentation “Revision Scoring Service – Exposing Quality Wiki tools” at Wikimania 2015 [152]
- We made a presentation “Would you like some AI with that” at Wikimania 2015 [153]
- We implemented a pre-caching daemon for ORES [154]
- HACKING Tools that use revision scoring [155]
- We implemented puppet for ORES celery [156]
- We implemented distributed processing for ORES. [157]
- We got Wikilabels off of NFS that was causing some instability on labs [158]
- Community outreach at Wikimania 2015 to explain what AI can do for the communities [159]
- We proposed the renewal of the Revision Scoring IEG [160] and also posted a plan for it [161]
- We created documentation for the Wikilabels Visual Editor Experiment Campaign [162]
- We published the Revision Scoring IEG final report [163][164]
- We added TF-IDF-generated bad word lists for French, Spanish, German, and Russian [165][166]
- We expanded our TF-IDF generated badword lists to 250 words. [167][168]
- We migrated to Phabricator! Work board
That was your Revscoring report.
Wikitrust
Hi. Wikitrust has been, in my opinion, one of the most advanced tools for user and revision scoring. I'm surprised not to see an analysis of it here; it's not even in the list of tools. What is the reason for that? Regards Kelson (talk) 12:33, 3 September 2015 (UTC)
- Hi Kelson, the short answer is that WikiTrust solves a different type of problem. In this project, we score revisions. WikiTrust does not score revisions directly and it doesn't do its scoring in real time. It scores editors and applies a trustworthiness score to their contributions and applies an implicit review pattern. WikiTrust is actually just one of many algorithms that use this strategy. For my work in this space, see R:Measuring value-added. For a summary of other content persistence algorithms, see R:Content persistence. --Halfak (WMF) (talk) 13:21, 3 September 2015 (UTC)
- Also, FYI, you can do cross-wiki links like this: en:WikiTrust --Halfak (WMF) (talk) 13:21, 3 September 2015 (UTC)
- Thank you for your quick&clear answer. Kelson (talk) 13:59, 3 September 2015 (UTC)
FYI: some ORES downtime today.
See Talk:Objective Revision Evaluation Service and the post mortem. --EpochFail (talk) 17:11, 8 September 2015 (UTC)
Progress report: 2015-09-19
Hey folks. It's been about a month since you've gotten a progress report. I figured one was due.
- We minimized the rate of duplicate score generation in ores [169]. Parallel requests to score the same revision will now share the same celery AsyncResult.
- We turned pylru and redis into optional dependencies of ORES [170]. This makes deployment a little easier since we don't have to make Debian packages for libraries we don't use in production.
- We did a bunch of homework around detecting systemic bias in subjective algorithms (like our classifiers). See our notes here: [171]
- We made the color scheme in ScoredRevisions configurable. [172]
- We primed lists of stopwords by applying an en:TFiDF strategy to edits in various Wikipedias (af, ar, az, de, et, fa, he, hy, it, nl, pl, ru, uk) [173]
- We added Hebrew [174] and Vietnamese [175] language features to revscoring
- We read and discussed critiques of subjective algorithms in computer-mediated social spaces [176]
- We implemented a regex-based badwords detector that handles multiple-token badwords (important for Turkish and Persian) [177]
- We *deployed* a series of performance improvements to ores.wmflabs.org. [178]
- We implemented a diff-detecting system for Wikidata's JSON data format [179]
- We accidentally nuked our Celery Flower monitoring system for ores.wmflabs.org and then brought it back online [180]
- We built a script for extracting data from the recently-finished Wikilabels edit quality campaigns [181] and we're now working on building models with the data.
- We substantially improved the stability of ORES worker nodes and the redis backend that they use [182]
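The multiple-token badwords detector mentioned in the list above could look roughly like the following. This is a minimal sketch and not the revscoring implementation: the function names `build_badword_regex` and `find_badwords` and the placeholder entries in `BADWORDS` are made up for illustration.

```python
import re


def build_badword_regex(badwords):
    """Compile one case-insensitive pattern that matches any listed badword.

    Entries may contain spaces, in which case they are treated as
    multiple-token badwords. (Hypothetical helper, not a revscoring API.)
    """
    # Try longer entries first so a multi-token phrase wins over a
    # shorter badword that it starts with.
    ordered = sorted(badwords, key=len, reverse=True)
    alternatives = []
    for entry in ordered:
        # Escape each token and allow any run of whitespace between tokens.
        tokens = [re.escape(token) for token in entry.split()]
        alternatives.append(r"\s+".join(tokens))
    pattern = r"\b(?:" + "|".join(alternatives) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_badwords(regex, text):
    """Return every badword match found in the text, in order."""
    return [match.group(0) for match in regex.finditer(text)]


# Placeholder entries; a real list would come from the per-language lists.
BADWORDS = ["foo", "foo bar", "baz qux"]
BADWORD_REGEX = build_badword_regex(BADWORDS)
```

Building one alternation means the text is only scanned once, and a phrase like "foo bar" is reported as a single match rather than as a match on "foo" alone.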
Also worth noting, ORES has been adopted by Huggle and we've been working with them to address performance issues by suggesting they request scores in parallel. ORES can take it! --EpochFail (talk) 16:22, 19 September 2015 (UTC)
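For anyone curious how parallel requests for the same revision can share one result, here is a minimal sketch of the idea. It uses a standard-library thread pool and `Future` as a stand-in for Celery and its AsyncResult, so it is an illustration of the pattern only; `SharedScorer` and `score_func` are hypothetical names, not part of ORES.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedScorer:
    """Gives parallel requests for the same revision one shared Future.

    Stand-in for the Celery-based approach: a thread pool Future plays
    the role of the shared AsyncResult. (Hypothetical class.)
    """

    def __init__(self, score_func, max_workers=4):
        self._score_func = score_func
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._in_flight = {}  # rev_id -> Future that is still running
        self._lock = threading.Lock()

    def submit(self, rev_id):
        """Return a Future for rev_id, reusing one that is in flight."""
        with self._lock:
            future = self._in_flight.get(rev_id)
            if future is None:
                future = self._pool.submit(self._run, rev_id)
                self._in_flight[rev_id] = future
            return future

    def _run(self, rev_id):
        try:
            return self._score_func(rev_id)
        finally:
            # Forget the task once it finishes so that later requests
            # start a fresh computation (or hit a cache, if there is one).
            with self._lock:
                self._in_flight.pop(rev_id, None)
```

A client that requests scores in parallel would call `submit()` once per revision and then wait on the returned futures; duplicate requests cost nothing extra while the first one is still running.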
FYI: Code for extracting features re. edit type classification
See work here: https://bitbucket.org/diyiy11/wiki_edit
We'll need to adapt and export the feature extraction code for use inside revscoring. F-scores for each class are comparable to the state of the art. --EpochFail (talk) 18:00, 23 September 2015 (UTC)
- @EpochFail: I get this on that URL: "You do not have access to this repository. Use the links at the top to get back." Helder 21:33, 23 September 2015 (UTC)
- Gotcha. I'll have to talk to Diyi about opening it up. She may prefer that I rewrite before publishing. I'll report back when I can get to it. --EpochFail (talk) 21:51, 23 September 2015 (UTC)
Progress report: 2015-09-26
So your weekly reports should actually be weekly reports now. Sorry for the last hiccup.
- We clustered reverted edits using a k-means algorithm. T110581
- We prepared a summary of SigClust and other methods for choosing the number of clusters. T113057
- We started working on our midpoint report, which should provide a good overview of our work since the last tri-monthly report. We hope to have a draft by 1 October. T109845
- We are in the process of deploying the results of the data collected through the Wikilabels edit quality campaign. We will compare the newer model generated from this data with the older revert-based model in the upcoming weeks. T108679
- We are winding down our community outreach efforts and will focus on assisting the more responsive communities in the meantime. T107609
- We are re-generating stop words with interwiki links omitted. Interwiki links do not provide an indicative signal for edit quality. T109844
We have a lot of things going on in parallel. Stay tuned for next week!
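Omitting interwiki links before word lists are generated can be sketched as below. This is an illustration under assumptions, not the code behind T109844: it assumes an interwiki link looks like `[[xx:Title]]` with a short lowercase language code, and the names `INTERWIKI_LINK`, `strip_interwiki_links` and `count_words` are made up here.

```python
import re
from collections import Counter

# Assumption: an interwiki link is [[xx:Title]] where xx is a short
# lowercase language code, optionally with hyphenated parts (zh-min-nan).
INTERWIKI_LINK = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\]\|]+\]\]")
WORD = re.compile(r"\w+")


def strip_interwiki_links(wikitext):
    """Replace every interwiki link with a space."""
    return INTERWIKI_LINK.sub(" ", wikitext)


def count_words(wikitext):
    """Count lowercased words after interwiki links have been removed."""
    cleaned = strip_interwiki_links(wikitext)
    return Counter(word.lower() for word in WORD.findall(cleaned))
```

Ordinary internal links such as `[[Example|a link]]` are left alone, so their words still count toward the list.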
Better usage statistics in graphite
Hey folks,
We just had a good work session on our midterm report for the IEG. It became blatantly obvious that our methods for gathering metrics on requests, cache-hit/miss and activity of our precached service left a lot to be desired. So I put in a couple of marathon sessions this week and got our new metrics collection system (using graphite.wmflabs.org) up and running.
Check out the screenshot to the right. You can also find similar statistics by navigating to graphite.wmflabs.org. --EpochFail (talk) 19:52, 3 October 2015 (UTC)
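As a rough illustration of the kind of counters involved (requests, cache hits and misses), here is a minimal sketch that buffers counts and formats them as Graphite plaintext lines (`path value timestamp`). It is not the ORES metrics code: `MetricsBuffer` and the metric names are hypothetical, and actually sending the lines to a Graphite server is left out.

```python
import time
from collections import Counter


class MetricsBuffer:
    """Buffers named counters and renders them as Graphite plaintext lines.

    Hypothetical helper; the prefix and metric names are placeholders.
    """

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = Counter()

    def incr(self, name, amount=1):
        """Add to a named counter, e.g. 'cache_hit' or 'cache_miss'."""
        self.counts[name] += amount

    def flush(self, timestamp=None):
        """Return the buffered counters as lines and reset the buffer."""
        if timestamp is None:
            timestamp = int(time.time())
        lines = [
            "{0}.{1} {2} {3}".format(self.prefix, name, value, timestamp)
            for name, value in sorted(self.counts.items())
        ]
        self.counts.clear()
        return lines
```

Counting in memory and flushing periodically keeps the metrics overhead out of the request path.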
Progress report: 2015-10-03
Your weekly report.
- We made the ORES Celery workers quieter by handling known errors. Unexpected errors are now more prominent because known issues no longer drown them out. T112472[183]
- Batch feature extraction is now implemented. This will expedite model creation. T114248
- We drafted the midpoint report early to fit the schedules of the IEG and the grantees and avoid delays. T109845
- We included the model version in the ORES response structure so that scores are regenerated with an updated model instead of persisting outdated scores from the earlier model. New scores will be generated on demand. T112995
- Model testing statistics and model_info utilities were added to revscoring. T114535
- Metrics collection was added for ORES. This will help quantify ORES usage. T114301
That was your weekly report. -- とある白い猫 chi? 15:49, 11 October 2015 (UTC)
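The effect of including the model version can be sketched with a cache keyed on that version. This is a minimal illustration, not the ORES cache: `VersionedScoreCache`, its method and the shape of the returned dictionary are assumptions made for the example.

```python
class VersionedScoreCache:
    """Caches scores per (wiki, model, model version, revision).

    Hypothetical class: because the model version is part of the key,
    deploying a new version makes old entries unreachable and scores
    are regenerated on demand.
    """

    def __init__(self):
        self._scores = {}

    def get_or_score(self, wiki, model_name, model_version, rev_id, score_func):
        """Return the cached score or compute and store a new one."""
        key = (wiki, model_name, model_version, rev_id)
        if key not in self._scores:
            self._scores[key] = score_func(rev_id)
        # Placeholder response shape, only meant to show the version
        # travelling alongside the score.
        return {"version": model_version, "score": self._scores[key]}
```

Nothing has to be purged when a model is updated; stale scores are simply never looked up again.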
Progress report: 2015-10-10
Your weekly report.
- We have trained and deployed the "damaging" and "good faith" models built from the hand-coded data gathered through the now-complete Wikilabels edit quality campaigns. They have been deployed for enwiki, fawiki and ptwiki. T108679
- The Turkish Wikilabels campaign is complete and is ready for modelling, provided AUC confirms a gain.
- We are preparing many wikis for their own edit quality campaigns. The Wikilabels interface awaits translation by local communities.
- The midpoint report was approved.
That was your weekly report. -- とある白い猫 chi? 15:57, 11 October 2015 (UTC)
- See ORES for details on the new models. --EpochFail (talk) 16:17, 11 October 2015 (UTC)
Dealing with Wikidata
Per the revscoring sync meeting, here are my thoughts on the matter. First off, we have established that bots are almost never reverted on Wikidata. Secondly, I think everyone can agree that on Wikidata bots far outweigh humans in terms of number of edits. As a consequence, we are dealing with an over-fitting problem where probably all human edits will be treated as bad, because the algorithm will give too much weight to features that basically distinguish bots from humans. This is more of an intuitive assessment than an actual analysis; I could very well be wrong. I came to this assessment because on Wikidata the vast majority of good edits will come from bots, and bots will always dominate a random sample set. We can perform a more selective sampling, but honestly I do not see the benefit of it.
Based on the two assessments above, I propose a different type of modelling for Wikidata than what we use on Wikipedias. First off, we need to segregate bot edits from other edits. This does have the potential to introduce bias, but I will explain how this can be avoided. We would have two models for Wikidata, one for bots and the other for non-bots, each an independently trained classifier. The two could even use different algorithms: for instance, bot edits could be scored with Naive Bayes while human edits are scored with an SVM. This is a kind of top-level decision tree that delegates its first two branches to classifiers.
If ORES is asked to score a revision and the edit is from a bot, the bot model would generate the score. If ORES is asked to score a revision and the edit is not from a bot, the non-bot model would generate the score. The output would still be "damaging/not damaging" and "good faith/bad faith". Bias would be avoided because the ways bots and humans edit are very different. If humans make bot-like edits that aren't reverted (and vice versa), they would still be treated as good.
-- とある白い猫 chi? 13:42, 17 October 2015 (UTC)
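The two-model proposal above can be sketched as a small dispatcher. This is only an illustration of the idea as described, not an agreed design: `BotAwareScorer`, the `user_is_bot` key and the callable models are hypothetical stand-ins for two independently trained classifiers.

```python
class BotAwareScorer:
    """Routes each revision to a bot model or a non-bot model.

    Hypothetical sketch: each model is any callable that takes a
    revision and returns a score, standing in for a trained classifier.
    """

    def __init__(self, bot_model, human_model):
        self.bot_model = bot_model
        self.human_model = human_model

    def score(self, revision):
        """Score with the bot model for bot edits, else the non-bot model."""
        # Top-level branch: the only decision made here is which
        # classifier handles the edit.
        if revision["user_is_bot"]:
            return self.bot_model(revision)
        return self.human_model(revision)
```

Callers still receive a single score per revision; the branch on the bot flag is the only extra logic on the scoring path.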
- Seems like this is putting the cart before the horse to me. If we are worried about winding up in a bots vs. humans situation, then we can leave the user.is_bot flag out of the feature set. We need representative training data to train any model (your suggestion or a more straightforward approach) and that is the issue I brought up at our meeting. We don't want to have to extract features for 2m edits. That would take forever. Also, bots are *never* reverted, so in order to get *any* signal about bots, we'd need to have humans handcode ~2m edits to get a representative sample of bot damage. IMO, we should build a sample stratified on whether the edit was reverted or not and exclude the user.is_anon and user.is_bot flags. We can do this in a relatively straightforward way now that we know the rough percentage of edits that are reverted by processing the XML dumps.
- Honestly, I'm really hoping that we don't need to have a hierarchical model because that implies a substantial increase in code complexity. Right now, we don't even know that we do have a problem with bias yet. --EpochFail (talk) 15:48, 17 October 2015 (UTC)
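A sample stratified on whether the edit was reverted could be drawn along these lines. This is a minimal sketch under assumptions: the edits are plain dictionaries with a `reverted` key, and `stratified_sample` is a hypothetical helper, not part of revscoring.

```python
import random


def stratified_sample(edits, n_per_stratum, seed=0):
    """Draw up to n_per_stratum edits from each of the two strata.

    The strata are reverted and non-reverted edits. Each edit is assumed
    to be a dict with a boolean 'reverted' key. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    strata = {True: [], False: []}
    for edit in edits:
        strata[bool(edit["reverted"])].append(edit)

    sample = []
    for members in strata.values():
        # Never ask for more edits than the stratum contains.
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Because reverted edits are rare, sampling each stratum separately guarantees enough of them without extracting features for millions of edits; the known overall revert rate can then be used to relate the sample back to the full population.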
Major update to revscoring package documentation
Hey folks,
I just updated the revscoring package documentation for 0.6.7. It's got a new theme (alabaster is the new default), better examples, and simplified access patterns for basic types (e.g. from revscoring import ScorerModel vs. from revscoring.scorer_models import ScorerModel). Check it out. :) --EpochFail (talk) 13:34, 22 October 2015 (UTC)