Article validation feature

Prototype #1:

An article validation feature was implemented in CVS HEAD by Magnus Manske. The code is in the module "phase 3", mainly in the file "includes/SpecialValidate.php", with the rest spread across some other files (tab creation etc.).

The demo version ran on test.leuksman.com. It was shown in the Monobook skin and worked for both anonymous and logged-in users.

Prototype #2:

The "article validation feature" is being rewritten by Magnus as an extension; the feature is called "Review feature" from now on.

The demo version of the new "review feature" is running on www.magnusmanske.de/wikipeerdia. It is shown in the Monobook skin and works for both anon and logged-in users.

Bug reports

Possible problems people foresee should be listed on Article validation possible problems.
(Bug submitted as #1004067, seems to have been fixed by brion)
- This can be used for validation spamming, if you enter something like this in the comment field:

\'),(1,"Main_Page",20040803125415,0,3,"blabla") #
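For illustration only, a minimal Python sketch of how bound parameters keep a payload like the one above from being read as SQL. The table `validate`, its columns, and the helper `save_comment` are hypothetical; this is not the actual SpecialValidate.php code, nor a description of brion's fix.

```python
import sqlite3

# The payload from the report above, kept byte for byte.
PAYLOAD = r"""\'),(1,"Main_Page",20040803125415,0,3,"blabla") #"""


def save_comment(conn, user_id, title, timestamp, revision, value, comment):
    """Insert one rating row; every value is passed as a bound parameter."""
    # The driver sends `comment` as data, so quotes and parentheses inside it
    # cannot close the VALUES list and smuggle in a second row.
    conn.execute(
        "INSERT INTO validate (user_id, title, ts, revision, value, comment) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, title, timestamp, revision, value, comment),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE validate (user_id INTEGER, title TEXT, ts TEXT, "
    "revision INTEGER, value INTEGER, comment TEXT)"
)
save_comment(conn, 2, "Sandbox", "20040803125500", 0, 3, PAYLOAD)
rows = conn.execute("SELECT title, comment FROM validate").fetchall()
```

With string concatenation instead of placeholders, the same input could add an extra row for "Main_Page"; here it ends up as a single, odd-looking comment on "Sandbox".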

Comments for a field where "no opinion" is selected aren't saved. This is probably intended, but I think they should be saved so that you can make a comment about why you didn't rate the criterion. -- Stw 10:46, 2 Aug 2004 (UTC)

Feature requests

List of validated articles: option for showing only articles whose latest version hasn't been validated yet. -- Stw 10:51, 2 Aug 2004 (UTC)
We need a category to review the degree of NPOV. That is our primary policy, so we need to be able to assess it. ChrisG 00:07, 9 Sep 2004 (UTC)
- I second this emotion. Kaldari 22:06, 9 Jun 2005 (UTC)
- NPOV is good, but it's not. NPOV is simply getting as close to zero on as many bias axes as possible. If you insist on coarse granularity I would argue the world is better served by at least tripartition - both extremes and NPOV - but I'd much rather have an arbitrary-axis HotOrNOT deca scale rating system. (Naturally, this would recurse as to the axes themselves.)
When you view an article version you should only see the reviews for that version, with a better layout so you can read people's comments. Validation statistics or metrics give the overall view of the article as it develops. ChrisG 00:07, 9 Sep 2004 (UTC)
Being able to validate users (e.g. a specific user's edit is always safe). Jon Harald Søby (talk, contrib) 16:18, 28 May 2005 (UTC)
- Mmm. Methinks there are already too many deities in the world, and far too little divinity. How about bias rating on users, where a specific user's edit skews the article's bias rating?
  - You mean like "I think this user is a troll" so mark all their edits down/up by some factor? That seems a good idea in principle, but it must be tied to the article review itself so that good edits by otherwise bad editors can still be marked as good edits, and so that bad edits by good editors can be marked likewise. It might even mean RFC ceases to exist - troublesome editors and POV pushers will just get super negative scores.
- Actually, we need a kind of automatic meta-validation as well - there should be an automatic score for how polarised the user is - if loads of people say an editor is good but loads of people say the editor is bad, the editor should be marked down for being divisive. That way people who adhere as closely as they can to NPOV will score the highest (on this sort of scale), as very few will mark them as super bad or super good.
What about this instead of user-validation: automatic validation of users based on the score of their articles in general up to that point, so that the score of an edit will have a factor applied to it that depends on the score of all the user's previous edits. You can add the automatic meta-user-validation thing in as well. That way people are only validating edits, but editors get accredited automatically for the result of those validations. When people stand for admins and things we can look at their scores and go "well this is a good editor" and "this is a polarising editor", and make some sort of policy that people who are too bad or too polarising can't become X, Y, or Z.
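A rough Python sketch of the two automatic scores proposed above, to make the idea concrete. Everything here is an assumption made for illustration: ratings are taken to lie between 0.0 (bad) and 1.0 (good), a user's track record is the plain mean of the ratings their earlier edits received, the factor is 0.5 plus that mean, and polarisation is measured as the population standard deviation. None of this comes from the prototype.

```python
from statistics import mean, pstdev

NEUTRAL = 0.5  # assumed track record for a user with no rated edits yet


def reputation(previous_ratings):
    """Mean rating over all of a user's previously rated edits."""
    return mean(previous_ratings) if previous_ratings else NEUTRAL


def polarisation(previous_ratings):
    """Spread of the ratings: high when many say 'good' and many say 'bad'."""
    return pstdev(previous_ratings) if len(previous_ratings) > 1 else 0.0


def weighted_edit_score(raw_score, previous_ratings):
    """Apply a factor derived from the author's track record to one edit.

    The factor is 1.0 for a neutral record, up to 1.5 for a spotless one
    and down to 0.5 for a uniformly bad one. The edit's own reviews still
    set raw_score, so a good edit by a poorly rated editor can score well.
    """
    return raw_score * (0.5 + reputation(previous_ratings))
```

For example, an editor whose earlier edits were rated half 0.0 and half 1.0 gets a neutral reputation of 0.5 but the maximum polarisation of 0.5, which is the "divisive" case described above.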

A validation system requiring any sort of extra HTTP interaction will almost certainly fail. Switching to another tab entirely? Dunno about you, but I think I waited for a re-load... Twice? after HP asked me if "this page answered my question"
OTOH, an invalidation system might be worth a mouse click, and you could add multiple reasons for invalidating said article, such as "Too technical", "Too simple", "Steaming pile of...". Thus, a categorization system in disguise, which is the objective, no?
Would be nice to implement an editorial-board concept, in which the rating is displayed according to the voters' membership in one of a number of editorial boards? These could function like user groups, though per speciality so with more groups and fewer users per group. Maybe even to append an optional 'editor's commentary' page (analogous to discussion) that's protected from editing by non-editors.

Ratings need to decay over time so that articles don't get stuck. If a version of the article got a rating of 9 two years ago, it will still have that rating unless enough of those people who rated it go back and change their rating, and thus it will still be the "leading page", even though it's 2 years old and there's a more recent version with, say, a rating of 8, which is actually better. Kevin Baas (talk) 21:42, 2 April 2006 (UTC)
On the statistics, instead of colors, there should be bars. And ideally this would be on the article history page. (And diffs would be nice too.) Furthermore, it would be nice if one could filter: see only versions with a rating above X in the history. Kevin Baas (talk) 21:42, 2 April 2006 (UTC)
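One way the rating decay suggested above could work, sketched in Python. The exponential form and the one-year half-life are assumptions picked for the example, not anything agreed on this page; `decayed_weight` and `version_score` are hypothetical names.

```python
def decayed_weight(age_days, half_life_days=365.0):
    """Weight of a rating that is age_days old; halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def version_score(ratings, half_life_days=365.0):
    """Average of decayed ratings for one article version.

    ratings is a list of (value, age_days) pairs. Old ratings shrink toward
    zero, so a stale version cannot stay the "leading page" forever.
    """
    if not ratings:
        return 0.0
    decayed = [value * decayed_weight(age, half_life_days) for value, age in ratings]
    return sum(decayed) / len(decayed)


old_version = version_score([(9, 730)])  # rated 9, two years ago
new_version = version_score([(8, 0)])    # rated 8, today
```

With these assumed numbers the two-year-old 9 counts as 2.25 and the fresh 8 counts as 8.0, so the newer version takes the lead without anyone having to go back and change their rating.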

Usability

"Rate this page" - MediaZilla:4117

The option to validate an article might be better called 'Review' because it will be clearer to the average person ChrisG 00:08, 9 Sep 2004 (UTC)
- "Rate this page" - everyone will understand that - David Gerard 20:48, 2 October 2005 (UTC)
  - I agree: "Validate" is simply the wrong word; if I want to vote an article down, then I don't want to validate it. "Review" is also not quite correct, because a review is typically a long written assessment. What we're really doing is rating, so I would label the button "Rating". AxelBoldt 07:03, 24 November 2005 (UTC)
    - "Rate" is better than "Rating".

In addition to the validate button for each article you should have a direct link to the metrics about the article as a whole, i.e. Validation statistics for this article. Shouldn't have to access this information through the validate option. ChrisG 00:07, 9 Sep 2004 (UTC)
The numbered options are confusing. Is "1" best or worst? If an article has a bad POV problem, do I rate NPOV=1 or NPOV=6? Note that in Germany for instance, the lowest grade in school is a 6, the highest is a 1. At the very least, the meaning of the numbers needs to be explained; it would be better to eschew numbers altogether and use words. AxelBoldt 07:03, 24 November 2005 (UTC)
- Or letters A to F? <- that doesn't seem to fix Axel's problem (it's US-centric)

Too much work, won't be used enough

Any validation approach which needs extra work by the users won't bring a significant improvement compared to simply trusting people to improve bad articles through editing. There is more potential gain in having implicit validation facilities, as discussed on Wikipedia. --70.230.73.20 13:43, 4 April 2006 (UTC)

Agreed. On knowledgebase articles, users click on this type of functionality for less than 1/10th of 1% of page hits.

Random thoughts

Reviewed article version has quite a lot of (fairly) structured thoughts and ideas about article validation (and its problems). It should probably be merged with this article. Arnomane 16:30, 2 Jan 2005 (UTC)

w:Wikipedia:Baseline revision

Just so people know, I had an idea for what I call baseline revisions. It doesn't scale as well, but it's a bit more thorough (IMO). I've tried to make the experiment so it doesn't interrupt or disrupt Wikipedia.

If it doesn't scale well, doesn't that really mean it doesn't work? Wikipedia is fairly huge and getting rapidly bigger. 87.115.225.40 05:14, 17 December 2005 (UTC)

A starting scope for version attributes

I would think each version of each article ought to have multiple sliding (two or three binary bits for 4 or 8 levels) attributes that are available for all editors to "set" (or to opine on, or skew, though that might be anti-wiki) when they edit. And these attributes (with the 4 or 8 levels) could be used at any time to produce Wikipedia selections for any purpose. I suggest we be exceedingly generous in dabbling with attributes, and pull back later from those that don't work. There could be attributes for content that incites common sensitivities (ethnic, religious, erotic, culinary). There could be an attribute for article reliability that automatically sets itself according to the community reliability (trust) of the one who saved that version. There could be attributes for development level (language, completeness, organization, neutrality, references, etc.). The sky is the limit, and we should start with a big list that we gradually trim as we home in on the most useful attributes. By the way, is there anywhere I could test Magnus's present work? Tom Haws 03:51, 23 Mar 2005 (UTC)

I'd argue for using 4 bits to represent a 10 point scale. An 8 point scale would prioritize storage efficiency over human efficiency, and at today's hard-drive prices... Better yet, drop a whole byte on it and make it a signed percentage - puts 0 as the middle of the scale and keeps a natural feel.... sort of. I refer back to my previous edit - arbitrary (user-defined) categories (axes) without extraneous HTTP overhead filtered by the rating system itself.
- That's a bit overkill, isn't it? Who would ever go "oh, I think 93%, cos 92% wouldn't be fair, and it's not quite 94%"?
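To put numbers on the storage question above, a small Python sketch of the two encodings being argued over: a 10 point rating in 4 bits (two per byte) and a signed percentage in one signed byte. The function names are hypothetical and the 1 to 10 range is an assumption about what "10 point scale" means here.

```python
import struct


def pack_ratings(first, second):
    """Pack two ratings in 1..10 into a single byte, 4 bits each."""
    for rating in (first, second):
        if not 1 <= rating <= 10:
            raise ValueError("rating must be in 1..10")
    return (first << 4) | second


def unpack_ratings(byte):
    """Inverse of pack_ratings: high nibble first, low nibble second."""
    return byte >> 4, byte & 0x0F


def pack_percent(percent):
    """Store a signed percentage (-100..100) in one signed byte."""
    if not -100 <= percent <= 100:
        raise ValueError("percent must be in -100..100")
    return struct.pack("b", percent)


def unpack_percent(raw):
    """Inverse of pack_percent."""
    return struct.unpack("b", raw)[0]
```

Either way the cost is at most one byte per rating, so the choice between the scales is about what raters can sensibly distinguish rather than about disk space.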

Applications

What will validation data be used for? Application needs should drive software development, so we need to start thinking about them. Here's a starter list:

Selecting articles or specific versions of articles for inclusion in a distribution (e.g. a full-text DVD, a selected-text print version, or a single-topic reader).
Identifying articles or types of articles that need specific types of editor attention.
Identifying featured content candidates.
Establishing credibility with the outside world.
Experimentation - publishing aggregate data and analysis to see if any interesting patterns appear.
Source data for a user reputation system.
Establishing appropriateness of articles for a certain audience (e.g. children, students)

Are there any others of note? The last two don't seem that important, whereas the first two seem like the prime motivators. -- Beland 02:36, 25 May 2005 (UTC)

The ratings should provide a pile of data you could slice'n'dice all manner of ways, depending on the query you want to run. I'm sure the test-phase data will be of vast interest to a LOT of people, and we'll see what queries people are actually interested in so as to decide what's worth including right here on wikipedia.org - David Gerard 17:33, 25 May 2005 (UTC)

This discussion is going somewhere! Thumbs up! I can scarcely believe it. Excellent points, comrades. 216.160.222.166 17:07, 21 Jun 2005 (UTC)

The last two concern me. I do not think it would be appropriate, for Wikipedia or Wiktionary at least, to have any kind of "user reputation system". A user's revision is not necessarily a user's work. Further, many users put countless hours into grunt work; they would inevitably be at a disadvantage against prolific prose writers. The 'establishment of appropriateness... for children' seems entirely counter-NPOV. It would open us up to countless problems - some would disagree over articles with sexual content, etc. --Oldak Quill 16:35, 28 July 2005 (UTC)

"User reputation" - if this is about fields of expertise, this is problematic as it would be difficult to confine the reputation to only have an effect in the subjects about which they are experts. And then we have the problem of who decides what qualifies as expertise.
"Appropriateness" - this appears to be "censorship by the mob". Ironically, it is unlikely to work in the manner which its supporters would want - out of the areas of the world that have a large population online in a substantial English-speaking way, only the US is strongly sexually conservative, and Europe is mostly liberal about explicitness, so it would probably backfire and have George Bush censored for minors but graphic images of anal fisting on the front page.

It could be used to evaluate an author's contributions, e.g. an academic's contribution to a body of knowledge. K1v1n 02:09, 16 August 2005 (UTC)

It's always possible to abuse these things. For instance, an evil admin keeps deleting my great article Why my classmate is so ghey LOL that I spent nearly two minutes to create. Suppose I were to deal with this blatant violation of my rights under the Charter, the Magna Carta, the Bill of Rights and the United Nations by finding every one of this person's entries on special:contributions and giving each the record-low rating of negative infinity, regardless of content. What's to stop this person from retaliating by badmouthing my contributions? Retaliatory feedback is already a common problem on eBay (buy an item, don't bother to claim it or pay for it, get negative feedback from vendor, give negative feedback in return). There's no reason why it couldn't be misused here too. --user:carlb 22:16, 1 April 2006 (UTC)

There is a possible solution. What is being described above is a problem when there is participation bias. That is, highly motivated people will be more likely to rate strongly (either positive or negative), and people can strongly rate for lots of reasons, some legitimate (people with knowledge may have strong opinions), and some not (ignorant fanatics also have strong opinions). How to tell the difference? There *is* an answer, happens to be my pet project, see my user page. Briefly, if we had a Delegable Proxy network of users on Wikipedia, where users can -- voluntarily -- name a proxy, i.e., another user whom they trust to represent them when they can't represent themselves, whenever there is a vote (and ratings of articles are votes), a proxy's vote will have added to it the votes of all the users who have trusted that proxy, unless they themselves vote. It is a method of *estimating* consensus, not in a rigid and fixed way, but by, in a sense, predicting it. So a small number of votes can actually represent a large number of users. --141.154.151.61 19:07, 1 November 2007 (UTC)
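A minimal Python sketch of the counting rule as described in the comment above: everyone who rates counts for themselves, and everyone who does not is counted with the nearest person up their proxy chain who did. The function names, the handling of proxy loops (those users simply go uncounted), and the sample users are assumptions for illustration; the proposal itself does not spell these out.

```python
def tally(direct_votes, proxies):
    """Return {voter: number of users represented, including themselves}.

    direct_votes maps user -> rating; proxies maps user -> chosen proxy.
    A user who voted directly is never represented by their proxy.
    """
    weights = {voter: 0 for voter in direct_votes}
    users = set(direct_votes) | set(proxies) | set(proxies.values())
    for user in users:
        seen = set()
        current = user
        # Walk up the proxy chain until someone who actually voted is found.
        while current not in direct_votes:
            if current in seen or current not in proxies:
                current = None  # loop or dead end: nobody speaks for this user
                break
            seen.add(current)
            current = proxies[current]
        if current is not None:
            weights[current] += 1
    return weights


def estimated_consensus(direct_votes, proxies):
    """Mean rating with each direct vote weighted by the users it represents."""
    weights = tally(direct_votes, proxies)
    total = sum(weights.values())
    return sum(direct_votes[v] * w for v, w in weights.items()) / total


votes = {"alice": 8, "bob": 3}
chosen = {"carol": "alice", "dave": "carol", "bob": "alice",
          "erin": "frank", "frank": "erin"}
weights = tally(votes, chosen)
consensus = estimated_consensus(votes, chosen)
```

In this made-up example alice's rating stands for three users (herself, carol, and dave through carol), bob's counts only for himself because he voted, and erin and frank name each other and so are not counted at all.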

Delegable proxy == pretty bad idea. I think I've explained my views on that elsewhere. FT2 (Talk | email) 18:12, 3 April 2008 (UTC)