Grants:IdeaLab/Cross-account measure of uniqueness

Cross-account measure of uniqueness

Because former and current users reappear on new and throw-away accounts, often to stalk good-faith editors, it is important to detect them early and take preventive action. That is, however, not a trivial task.

Idea creator: Jeblad

Created on 14:19, 29 June 2016 (UTC)

Project idea

What is the problem you're trying to solve?

A new user, and very often an anonymous user, is more often than not a disgruntled editor who has gone rogue on some topic. It is often very difficult to say clearly that one user is in fact some other user, even if we have a rough idea of who (s)he is. Asking for a CheckUser (CU) on more or less spurious accounts can easily turn into a wild goose chase with a lot of users caught in the crossfire.

What is your solution?

Any user who tries to edit a protected page should be checked for uniqueness against all previous editors of the page (possibly within a time limit). If a hit is found, an entry should be made in a "cross-account similarity" log, and if deemed necessary a further analysis could be made.

The log entry could be set up so that a CU user can read the full report, but it should be possible for other users to at least see that something strange is going on.

A check should be done by estimating the probability over several features, including but not limited to the features currently used by the CheckUser tool. The probability estimate for each feature should rest on a proper analysis of actual occurrences, possibly adjusted using Bayes' theorem.
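One way to picture such a combination is a naive Bayes update in log-odds form. The sketch below is only an illustration, not a proposed implementation: the function name, the feature names, and all numbers are hypothetical, and it assumes the features are conditionally independent, which real features such as cookies and IP addresses are not.

```python
import math


def posterior_same_user(prior, likelihood_ratios):
    """Combine a prior probability with per-feature likelihood ratios.

    Each ratio is P(observation | same user) / P(observation | different users).
    Features are treated as conditionally independent (naive Bayes assumption).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios.values():
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert the accumulated log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers, for illustration only.
evidence = {"cookie": 200.0, "ip_range": 5.0, "word_frequency": 1.5}
p = posterior_same_user(prior=0.001, likelihood_ratios=evidence)
print(f"posterior probability of same user: {p:.3f}")
```

With a low prior, even a strong cookie match does not by itself push the posterior close to certainty, which is in line with the requirement that only very high probabilities should be flagged.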

Only edits made by users who are identified as someone else with a very high probability should be flagged, not simply any case of "a user editing as anon on a specific article".

Not even detection of one and the same cookie should be seen as final proof; it should go through the same scrutiny as any other extracted feature.

In addition to cookies, IP addresses, and browser fingerprints, features should be added for time analysis (geopositioning by triangulation), fingerprinting of the user by typing frequency, fingerprinting of the user by word frequency, and fingerprinting of the user by edits to pages over a set of contributions.

Note that some users need special protection, and it should be possible to place those users in a group (in practice, give them a right) that blocks checks of the cross-account measure of uniqueness. This right should not be given to any other group in general; it should not be used as a general free pass for admins, for example. It could perhaps be administered by OTRS volunteers.

Background

Cookies are used to identify users and are readily available. A cookie is often seen as a safe identifier of the user, but it is not: it identifies a machine on which some user has edited. It should be deleted when the user logs out and leaves the machine, but users often do share accounts. It is fairly safe to assume that the identification is good if the cookie exists, but it is not foolproof.

It is possible to estimate the uniqueness of the cookies for a specific user by observing reuse of the same IP-address for other users within a short timespan.

An IP address is assigned to the machine as it goes online. Depending on how it connects to the Internet, it will be given a more or less unique IP address. The address can be on a short-term lease, and other users can reuse the address to do vandalism. It is a hard problem to figure out whether the address is still used by the same user.

It is possible to estimate typical lease times for addresses in a subnet by observing the frequency of edits.

Browser fingerprinting is often assumed to give very high confidence, but those claims are often based on pure bit counts of the available strings in the header fields. Confidence based on fingerprinting of header fields often fluctuates heavily, especially when such fields include version numbers of extensions. When a new version is rolled out, the confidence in an identification is very high, but a short time later it drops back to the usual, much lower level.

A better way to estimate the probability would be to calculate it for the individual independent parts and to disregard version numbers. The probabilities should be those observed by Wikimedia's servers, although that could introduce an unfortunate bias.
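A minimal sketch of that idea follows, under stated assumptions: header fields are treated as independent, version-like number runs are stripped with a simple regular expression, and per-field probabilities are estimated from a small sample of observed requests with add-one smoothing. The helper names and the sample requests are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Matches version-like runs such as "/115.0" or " 10_15_7".
VERSION = re.compile(r"[/ ]v?\d+(?:[._]\d+)*")


def strip_versions(value):
    """Remove version numbers so that software updates do not change the value."""
    return VERSION.sub("", value).strip()


def field_counts(observed, fields):
    """Count normalised values per header field over the observed requests."""
    return {
        field: Counter(strip_versions(req.get(field, "")) for req in observed)
        for field in fields
    }


def match_probability(fingerprint, counts, total):
    """Probability that a random visitor shows the same normalised values."""
    p = 1.0
    for field, counter in counts.items():
        value = strip_versions(fingerprint.get(field, ""))
        # Add-one smoothing, with one extra slot reserved for unseen values.
        p *= (counter[value] + 1) / (total + len(counter) + 1)
    return p


# Hypothetical sample of requests as seen by the servers.
observed = [
    {"User-Agent": "Mozilla/5.0 Firefox/115.0", "Accept-Language": "nb-NO"},
    {"User-Agent": "Mozilla/5.0 Firefox/116.0", "Accept-Language": "en-US"},
    {"User-Agent": "Mozilla/5.0 Chrome/120.0", "Accept-Language": "en-US"},
    {"User-Agent": "Mozilla/5.0 Chrome/121.0", "Accept-Language": "en-US"},
]
counts = field_counts(observed, ["User-Agent", "Accept-Language"])
candidate = {"User-Agent": "Mozilla/5.0 Firefox/117.0", "Accept-Language": "nb-NO"}
p = match_probability(candidate, counts, total=len(observed))
print(f"probability of a chance match: {p:.4f}")
```

A low chance-match probability means the fingerprint is distinctive. Because version numbers are ignored, the estimate does not spike right after a new release and then decay.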

Fingerprinting of a user by typing frequency, or w:keystroke dynamics, is a rather new biometric method that builds a fingerprint from typing patterns. Even though this is a sort of "fingerprint", it is not unique, and the user might also fake it. The best versions also need cooperation from the user, as (s)he must type a short text. If the fingerprint is created from free typing it is usually much weaker. It is somewhat difficult to get numbers for w:precision and recall for a specific user, and that can make it difficult to calculate probabilities.

Fingerprinting of a user by word frequency is better known, and is used for such things as determining who wrote the different parts of the w:United States Declaration of Independence. For short texts it is usually only possible to use single words, and the added confidence is pretty small.
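As a rough illustration of a single-word comparison, the sketch below computes the cosine similarity between the word-count vectors of two texts. This is one common choice among many, not a method prescribed here, and the function names and sample texts are hypothetical. A similarity score is not yet a probability; it would still have to be calibrated against actual occurrences.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words, ignoring digits and punctuation."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets from two accounts.
first = word_counts("the the cat")
second = word_counts("the dog")
similarity = cosine_similarity(first, second)
print(f"cosine similarity: {similarity:.3f}")
```

With texts as short as typical edit summaries the vectors are sparse, which is why the added confidence from this feature alone tends to be small.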

Fingerprinting of a user by edits to pages over a set of contributions is based on the assumption that users tend to edit the same material even if they switch to another account. In fact, the reason they switch is often to be able to continue editing undetected. By checking how the edits coincide over a set of articles, it is possible to get a feature-specific probability that can then be used in a joint probability.
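One possible way to turn such a coincidence into a number is a hypergeometric tail probability: how likely is it that an unrelated account would share at least this many pages by chance? The sketch below assumes that an unrelated account picks its pages uniformly at random, which is a strong simplification since popular pages attract many editors. The function name, page titles, and wiki size are hypothetical.

```python
from math import comb


def overlap_p_value(total_pages, pages_a, pages_b):
    """P(overlap >= observed) if account B picked its pages uniformly at random."""
    n_a, n_b = len(pages_a), len(pages_b)
    observed = len(pages_a & pages_b)
    # Hypergeometric tail: sum over every overlap size at least as large as observed.
    tail = sum(
        comb(n_a, k) * comb(total_pages - n_a, n_b - k)
        for k in range(observed, min(n_a, n_b) + 1)
    )
    return tail / comb(total_pages, n_b)


# Hypothetical page sets on a wiki with 1000 pages.
account_a = {"A", "B", "C", "D", "E"}
account_b = {"C", "D", "E", "F"}
p_chance = overlap_p_value(1000, account_a, account_b)
print(f"probability of this overlap by chance: {p_chance:.2e}")
```

A small value says the overlap is unlikely to be a coincidence under the uniform assumption. A more realistic model would weight pages by how often they are edited.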

Goals

Get Involved

About the idea creator

I've been a contributor on Wikimedia projects for more than ten years, and have a cand.sci. in math and computer sciences.

Participants

Endorsements
