Research:Patroller use of IPs

Contact

Wikimedia Foundation

Duration: 2020-June – 2020-September

This page documents a completed research project.

Introduction

Every device that accesses the internet is given a unique IP address by its internet service provider. These addresses take the form of a string of numbers (for IPv4) or a string of numbers and letters (for IPv6). The allocation of IP addresses differs by country and company, and there is no standard practice for address allocation.
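As a minimal illustration of the two address forms described above, the sketch below uses Python's standard `ipaddress` module to tell an IPv4 address from an IPv6 one. The helper name `describe_ip` and the sample addresses (taken from the ranges reserved for documentation) are our own assumptions for illustration, not part of the study.

```python
import ipaddress


def describe_ip(text: str) -> str:
    """Return 'IPv4' or 'IPv6' for a textual IP address."""
    # ip_address() raises ValueError if the text is not a valid address.
    address = ipaddress.ip_address(text)
    return f"IPv{address.version}"


print(describe_ip("192.0.2.1"))    # numbers only: an IPv4 address
print(describe_ip("2001:db8::1"))  # numbers and letters: an IPv6 address
```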

Some information about IP addresses is stored in public databases, which can be easily queried; proprietary databases also provide their own stores of information, promising better accuracy and precision than the public registries. Because of this, IP addresses can expose some crucial pieces of information about the device to which they are registered, and consequently the user associated with that IP address. The page What can an IP address tell you? goes into more detail on this topic.

Our community feedback indicates that there are two major pieces of IP address information that are routinely used by administrators. Firstly, IP addresses can be geolocated to provide the rough geographic location of the device. Secondly, the organization with which that IP address is registered can also be found. While this is typically the name of the internet service provider, larger organizations such as companies, educational institutes, or governments are generally registered under their own name. With a combination of this information and other knowledge about editing patterns and activity, a very accurate and identifying profile can be created about a given IP address, and the editor(s) behind it.

IP addresses have long been publicly visible on MediaWiki, and thus form a foundational part of our communities’ anti-vandalism workflows. One major point of concern from the community is the potential disruptive impact of IP masking on anti-vandalism work. Patrollers, users who watch the Recent Changes feed or other lists of new edits for potential vandalism, work closely with edits coming from logged-out users. Administrators, who are in charge of dealing with major reports of vandalism or sorting out community disputes, also often interact with logged-out users.

Initial community feedback from the IP masking project page suggests that patrollers and administrators use IP address information in order to properly assess the severity of a vandalism case.

Prior studies

Previous research on patrolling on our projects has generally focused on the workload or workflow of patrollers. Most recently, the Patrolling on Wikipedia study focused on the workflows of patrollers and on identifying potential threats to current anti-vandal practices. Older studies, such as the New Page Patrol survey and the Patroller work load study, focused on English Wikipedia. They also looked solely at the workload of patrollers, and more specifically at how bot patrolling tools have affected patroller workloads.

None of these studies identified any systematic way in which IP address information was used by patrollers, focusing instead on how patrollers approach all new edits. This suggested that either there is no systematic way in which IP addresses are used, or IP information comes into the patrolling workflow after initial assessments of an edit’s quality. However, these assumptions were limited by the fact that the majority of these studies focus on a few wikis, and even those studies that cast a wider net did not target any specific wikis or look into any key qualities of study sites. That is to say, although some studies tried to incorporate the practices of multiple wikis, the selection of these wikis was not intentional.

Method

Based on this previous research, we wanted to look at a few key features for our target study wikis:

Percentage of monthly edits made by logged-out editors,
Community attitudes towards edits by logged-out editors, or of anonymous editing broadly,
Representation across different sizes and types of project,
Any unique editing conditions or constraints.

Therefore, we focused on working with patrollers from these five main target projects:

Japanese Wikipedia
Dutch Wikipedia
German Wikipedia
Chinese Wikipedia
English Wikiquote

Japanese Wikipedia has the highest proportion of logged-out edits of any of our major language Wikipedias. On average, 27% of all edits made to Japanese Wikipedia are made by logged-out editors, comprising about 97,000 edits. From previous consultations and interactions with Japanese Wikipedia, we know that edits from logged-out editors are not treated as automatically suspicious or indicative of vandalism, in contrast to some of our other large wikis such as English Wikipedia.

Based on conversations with Dutch Wikipedia administrators at Wikimania 2019, Dutch Wikipedia treats edits from logged-out editors with some level of suspicion, or at least treats a logged-out edit as a flag that signals possible vandalism. We wanted to include this community as a counterpoint to wikis with higher percentages of logged-out edits.

German Wikipedia is one of the few projects to use the Pending Changes feature for all edits from logged-out users. This means that all edits from logged-out users must first be reviewed and approved by a patroller, or other logged-in user, before the change goes live. We therefore predicted that there would be significantly more work for patrollers to do, with a different workflow from that of projects without Pending Changes enabled.

Chinese Wikipedians edit under quite challenging constraints: our projects are banned in China, and so all editing done from within mainland China must use a VPN or some other way of circumventing this ban. Thus, while it is against policy on other projects to edit using a VPN, this is not the case for Chinese Wikipedia. Editors residing outside of mainland China also edit Chinese Wikipedia without needing a VPN, but we assumed that this allowance would lead to quite different attitudes towards logged-out editing compared with the other projects in this study.

Of all of our different projects, Wikiquote receives the highest percentage of logged-out edits. English Wikiquote, our largest Wikiquote project, receives about 5,600 edits from logged-out editors a month, or 34.5% of all edits. This is a huge proportion, especially compared to English Wikipedia’s monthly logged-out edit average of 16.9%. We have also done very little research into non-Wikipedia projects in the past.

Participants were recruited via open calls on Village Pumps or the local equivalent. Where possible, we also posted on Wiki Embassy pages. Unfortunately, while we had interpretation support for the interviews themselves, we did not extend translation support to the messages, which may have accounted for low response rates. All interviews were conducted via Zoom, with a note-taker in attendance.

Results

Patrollers work by immediately classifying edits by order of priority, based on various signs or flags that are used to infer characteristics of the editor in question. Due to the lack of a centralized dashboard for new edits, only small wikis can be effectively patrolled from the Recent Changes page. If a wiki is busy enough, the Recent Changes page simply updates too quickly to be of use, and patrollers resort to watchlists that can cover thousands, or tens of thousands, of articles. The quickest and simplest form of triage is checking whether the edit was made to a controversial page, relying on the patroller’s knowledge of contentious topics. The second step is looking at the account that made the edit, and trying to assess from the edit summary whether the user is new. New users were universally understood as more likely to be responsible for vandalism, whether because they are unfamiliar with community rules or because they are a sophisticated long-term abuser trying to get around previous blocks. New user status was inferred either from the lack of a user page (called a “red link user”, since the link to the userpage would be in red) or from the edit being attributed to an IP address.
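The two triage checks described above can be sketched as a simple scoring function. This is only an illustration of the reasoning our interviewees described, not a tool any of them use: the `Edit` fields, the function names, the weights, and the sample data are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Edit:
    page: str
    editor: str          # account name, or an IP address for a logged-out edit
    has_user_page: bool  # False corresponds to a "red link user"


def is_ip_editor(editor: str) -> bool:
    """True if the edit is attributed to an IP address rather than an account."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def triage_score(edit: Edit, contentious_pages: set) -> int:
    """Higher scores are reviewed first. The weights are hypothetical."""
    score = 0
    if edit.page in contentious_pages:  # first check: controversial page
        score += 2
    if is_ip_editor(edit.editor) or not edit.has_user_page:  # second check: likely new user
        score += 1
    return score


contentious = {"Example contentious topic"}
edits = [
    Edit("Example quiet topic", "LongTimeEditor", has_user_page=True),
    Edit("Example contentious topic", "192.0.2.7", has_user_page=False),
    Edit("Example quiet topic", "BrandNewAccount", has_user_page=False),
]
# Sort so that the edits with the most flags come first.
queue = sorted(edits, key=lambda e: triage_score(e, contentious), reverse=True)
for edit in queue:
    print(triage_score(edit, contentious), edit.page, edit.editor)
```

In practice the list of contentious pages lives in the patroller's head rather than in a set, which is part of why this triage is hard to centralize.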

Supporting the findings from previous studies, we did not find a systematic or unified use of IP information. Additionally, this information was only sought out after a certain threshold of suspicion. Most further investigation of suspicious user activity begins with publicly available on-wiki information, such as checking previous local edits, Global Contributions, or looking for previous bans.

The investigation could then expand out into looking at other new accounts that had edited the same targets of vandalism in a similar manner, generating a group of suspicious accounts. If these accounts were mainly IP addresses, only then would patrollers seek out IP information. IP information was largely obtained from free services. Different patrollers dealing with users from different parts of the world had their own preferences for IP information services, based on the quality of information offered and the descriptive nature of the provided summaries. These free services limit how often users can access the tool, and our interviewees hit the limit more than once while demonstrating cases for us.
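The escalation step described above, grouping accounts by shared targets and only then asking whether the group is mainly IP addresses, could be sketched as follows. The function names and sample data are hypothetical, and reading "mainly" as a strict majority is our assumption; patrollers make this judgment informally.

```python
import ipaddress
from collections import defaultdict


def is_ip(editor: str) -> bool:
    """True if the editor name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def editors_by_target(edits):
    """Group editors by the page they edited; `edits` holds (page, editor) pairs."""
    groups = defaultdict(set)
    for page, editor in edits:
        groups[page].add(editor)
    return dict(groups)


def mainly_ips(editors) -> bool:
    """Assumption: 'mainly' means a strict majority of the group."""
    ip_count = sum(1 for editor in editors if is_ip(editor))
    return ip_count * 2 > len(editors)


recent = [
    ("Vandalised page", "198.51.100.4"),
    ("Vandalised page", "198.51.100.9"),
    ("Vandalised page", "NewAccount1"),
    ("Other page", "NewAccount2"),
]
for page, editors in editors_by_target(recent).items():
    print(page, sorted(editors), "seek IP info:", mainly_ips(editors))
```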

Precision and accuracy were less important qualities for IP information: upon seeing that one chosen IP information site returned three different results for the geographical location of the same IP address, one of our interviewees mentioned that precision in location was not as important as consistency. That is to say, so long as an IP address was consistently exposed as being from one country, it mattered less if it was correct or precise. This fits with our understanding of how IP address information is used: as a semi-unique piece of information associated with a single device or person, that is relatively hard to spoof for the average person. The accuracy or precision of the information attached to the user is less important than the fact that it is attached and difficult to change.
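To make the distinction between consistency and precision concrete, the sketch below checks whether several lookups of the same IP address agree at the country level, even when they disagree on the city. The function name, the result format, and the sample values are invented for illustration and do not come from any of the services we observed.

```python
from typing import Optional


def consistent_country(results) -> Optional[str]:
    """Return the country if every lookup agrees on it, otherwise None."""
    countries = {result["country"] for result in results}
    return countries.pop() if len(countries) == 1 else None


# Three hypothetical lookups of one IP address: the cities differ,
# but the country is the same each time.
lookups = [
    {"country": "JP", "city": "Tokyo"},
    {"country": "JP", "city": "Osaka"},
    {"country": "JP", "city": "Nagoya"},
]
print(consistent_country(lookups))
```

Under this reading, the three lookups above would be good enough for a patroller, because the country-level answer is stable even though the city-level answer is not.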

Additionally, the level of precision expected or required of a service varies depending on the place of origin of the IP. If the IP is registered to a larger country with a more spread-out population, city-level precision is satisfactory. But if it is registered to a small country or territory with only one or two major urban centers, such as Hong Kong, a service that is only precise to the city level is not useful. For these densely populated areas, our interviewees preferred to get district-level geolocation where possible.

Proxies provided an interesting challenge in three ways. Firstly, not every IP information provider gives a threat score or proxy rating score; secondly, even those that do can be inaccurate; thirdly, even with all the information provided, a human may still have to make an educated guess based on the organization name and the editing behaviour coming from that IP or IP range. One of the IP information services we observed, IPIP.net, was good at identifying proxies based in Southeast Asia, but incorrectly labelled a proxy based in Australia as a “data center” type service. A North American IP information service, WhatIsMyIPAddress, was unable to identify a China-based proxy.

Compounding this is the fact that certain proxy services purposefully have exits inside residential ranges, to allow users to better bypass automated proxy checks. We were shown one such proxy, which all the IP information services we saw labelled as a static residential IP. The address in question was identified as a proxy only because administrators noticed wildly different types of editing behavior coming from the same address.

In short, even among patrollers who understand the unreliable nature of IP address information, their preference was universally for simple, clean results that summarized the information provided, even if the summary was not always the most accurate or precise. It seems to be the persistent nature of IP addresses, and the fact that much of the IP address information such as geolocation and registered organization is hard to spoof or change quickly, that is valuable to patrolling work.

Product recommendation

Our findings highlight a few key design aspects for the IP info tool:

Provide at-a-glance conclusions over raw data
Cover key aspects of IP information:
    Geolocation (to a city or district level where possible)
    Registered organization
    Connection type (high-traffic, such as data center or mobile network versus low-traffic, such as residential broadband)
    Proxy status as binary yes or no
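The fields listed above could be held in a small record that also produces a one-line summary. This is a sketch of the shape of the data only: the class name, field names, separator, and sample values are hypothetical and do not describe an existing or planned implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPInfoSummary:
    location: str         # city or district level where possible
    organization: str     # registered organization
    connection_type: str  # e.g. "data center", "mobile network", "residential broadband"
    is_proxy: bool        # binary yes or no

    def at_a_glance(self) -> str:
        """One-line conclusion, in preference to raw lookup data."""
        proxy = "yes" if self.is_proxy else "no"
        return (
            f"{self.location} | {self.organization} | "
            f"{self.connection_type} | proxy: {proxy}"
        )


summary = IPInfoSummary(
    location="Example City",
    organization="Example ISP",
    connection_type="residential broadband",
    is_proxy=False,
)
print(summary.at_a_glance())
```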

As an ethical point, it is important to be able to explain how any conclusions are reached, and to explain the inaccuracy and imprecision inherent in pulling IP information. While this was not a major concern for the patrollers we talked to, if we are to create a tool that will be used to provide justifications for administrative action, we should be careful to make it clear what the limitations of our tools are.

Based on this and prior research, we anticipate that patrolled edits will rarely require the use of an IP information tool. However, because the cases that prompt the search for IP information tend to be more complex and take up more time and energy, we should still endeavor to make the tool as user-friendly as possible.