Talk:Learning and Evaluation/Evaluation reports/2013/On-wiki writing contests
New program evaluation released: On-wiki writing contests
Happy New Year everyone! The Program Evaluation and Design team at the Wikimedia Foundation has a new evaluation to share with you.
We have released the latest program evaluation about on-wiki writing contests.
As we have asked previously, we would love your thoughts and comments on this report, preferably on the talk page of the evaluation page. On-wiki writing contests have been shown to meet their priority goals of quality content improvement and editor retention, although, as with all of the evaluations we have produced so far, more research needs to be done and more data collected.
We hope you'll find time to review and share this report; we are excited about it:
https://meta.wikimedia.org/wiki/Programs:Evaluation_portal/Library/On-wiki_writing_contests
Also, I'll be contacting many of you about survey collection. If you have produced surveys at any time in the past, or have a survey you would like to share with us, we want it! We need to collect surveys so we can develop the high-demand surveying tools that so many of you have asked us about.[1]
Thanks everyone, and happy new year,
SarahStierch (talk) 18:43, 4 January 2014 (UTC)
Number of contests looked at
Hi Sarah. Many thanks for posting this. :-) Eight contests is a very small number from which to draw statistically robust conclusions - e.g. with the retention statistics, the difference between 83% and 79% isn't going to be at all statistically significant. (And BTW, is the 0% for program E a mistake? It doesn't seem to be covered in the text.) Which contests did you look at - perhaps there are more that could be looked at in order to obtain more robust conclusions? Thanks. Mike Peel (talk) 18:49, 4 January 2014 (UTC)
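A minimal sketch, in Python, of the kind of significance check referred to above. The cohort sizes are hypothetical (the report does not state how many editors the 83% and 79% figures are based on); the point is only that, with cohorts of this order of magnitude, a four-point difference is well within sampling noise.

 import math

 def two_proportion_z(retained_a, n_a, retained_b, n_b):
     """z-statistic for the difference between two retention proportions."""
     p_a, p_b = retained_a / n_a, retained_b / n_b
     # Pooled proportion under the null hypothesis of no real difference
     p_pool = (retained_a + retained_b) / (n_a + n_b)
     se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
     return (p_a - p_b) / se

 # Hypothetical: 83 of 100 editors retained in one group, 79 of 100 in the other
 z = two_proportion_z(83, 100, 79, 100)
 print(round(z, 2))  # about 0.72, far below the ~1.96 needed for p < 0.05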
- Hi User:Mike Peel! This is more of a question for User:JAnstee (WMF). She might not respond until Monday. However, we anonymized all of the contests (like the edit-a-thons, workshops, etc.), so we can't share which exact contests we analyzed, as that would mean going back on the data anonymization we promised program leaders in our call for data. We do discuss the data limitations here.
- We had to mine data on six additional contests because we had such a low response rate (despite efforts to reach out directly to program leaders/chapters doing contests over the past year). We acknowledge that more data needs to be collected in order to learn more, but this is a promising start - and it's the first data our team has seen that shows strong user retention. We do know more contests exist, but it's up to program leaders to share that data with us; we can't dig it all up ourselves (it took our team weeks to mine the additional data for those six contests, when the program leaders could most likely have provided it in a matter of hours), or we'd never succeed at our goal of improving the evaluation skills and abilities of program leaders in order to support them in their own self-evaluation. SarahStierch (talk) 19:15, 4 January 2014 (UTC)
- Can you say what other contests you mined data for? Ed [talk] [en] 23:25, 4 January 2014 (UTC)
- Hmm, ok, thanks. I can't understand the rationale for not saying which programs were looked at. I somewhat understand why things like retention rates might need to be kept confidential, but that doesn't help us figure out which kind of contest works best. If you haven't done so already, I'd recommend talking to User:Casliber about the enwp Core Contest competition. Thanks. Mike Peel (talk) 14:04, 5 January 2014 (UTC)
- There's also the Military history WikiProject with their long-running contest, and the oft-mentioned WikiCup, which I hope they were able to mine data for. Ed [talk] [en] 16:23, 5 January 2014 (UTC)
- Hi Mike and Ed! I think we could perhaps share a list of the writing contests we mined ourselves, as an example of the representativeness of the current data set, since those data are publicly available. However, the program leaders who self-reported were assured that their participation and reporting would be kept confidential, so those must remain so. We provided this assurance to reduce reporting fears and will continue to keep that confidence, except where we eventually find promising practices of potential model programs. It seems we may wish to ask people in future reporting whether we can disclose their participation while keeping only their reported data confidential.
- However, our biggest need at this time is for more data, and better data; comparing one contest to another is not really possible, as the data are too variable to provide enough statistical power for valid examination. Currently, we see that, overall, the contests seem to achieve what they intend to, and that we need to investigate further to find promising practices and develop resources for replication and sharing. Further investigation will require better tracking and reporting of contests by more program leaders, and identifying the promising practices from such reporting is something the Program Evaluation and Design team is tasked with and will do in good time. We see this pilot reporting as a beta version of what will eventually be a smooth and much more fruitful process of data gathering and evaluation as more program leaders participate in the process. JAnstee (WMF) (talk) 18:06, 6 January 2014 (UTC)
Why anonymize so heavily?
The amount of anonymization done here - extending to the point of not naming the languages or contests studied - seems to make the resulting data very hard to build on, or to learn anything further from. In the data spheres I work with, anonymization is pared down to the minimum, and only for specific reasons of concern.
What concerns did you have about being transparent about your sources and data? Are these concerns expressed by the organizers, or anticipated by you? The contests themselves are all carried out in public on wikis with public edit histories, so presumably it's not the editors or communities who are concerned about public sharing of granular data.
Warmly, –SJ talk 00:16, 5 January 2014 (UTC)
- +1 Ed [talk] [en] 00:57, 5 January 2014 (UTC)
- Also +1 from me. Mike Peel (talk) 14:00, 5 January 2014 (UTC)
- I think it is important to distinguish between the anonymising of the data and anonymising of the results and conclusions. I agree that anonymising the data loses information that may be valuable in the analysis, but it is appropriate to not disclose certain information in public findings because of considerations of privacy or because of association with negative findings. I agree that it may all be public data, but people do rely to some extent on the anonymity of being "one of the crowd". By identifying people in findings you deny them that anonymity. I certainly would say that we should not have findings saying something like "User ABC is an example of an editor who made high-quality edits whereas user XYZ is an example of low-quality edits" (ABC might not mind this but XYZ will probably be unhappy). Similarly I think we should be careful about negative commentary even on a higher scale of aggregation, e.g. national groups. Certainly if we create the fear among editors that they may be personally put under the microscope and talked about in public findings through the analysis of "public data", it's hardly likely to encourage people to contribute. I would note that in many countries your criminal record is expunged after a certain number of years of good behaviour -- Wikipedia is not so forgiving. Kerry Raymond (talk) 02:58, 5 January 2014 (UTC)
- Even so, those concerns can be balanced against the need to be able to examine the data for ourselves. Realistically speaking, the heavy amount of anonymization makes it nearly impossible to tease out different conclusions—and given this evaluation's roots in a voluntary survey, I don't see why the anonymizing was necessary unless the evaluators believed that the number of responses would be heavily reduced without it.
- Given that the data agreement is already in place, could the WMF at least release a list of the analyzed contests? Ed [talk] [en] 16:32, 5 January 2014 (UTC)
It's the weekend and I'm the only team member checking Meta right now. I've pinged Jaime and Frank about your queries, so I'm sure one or both of them will respond on Monday. I can't answer all the questions, as I'm not the program evaluation specialist (I'm still new to this, too!). SarahStierch (talk) 16:40, 5 January 2014 (UTC)
- Please note that the data reported at the bottom of each report page have unique "Report ID" numbers that can be matched across the last three tables, so that you can actually regenerate the full dataset except for event names and dates (see the Appendix heading "More Data" for the complete input, output, and outcome data used in the report; a small illustration of this matching follows after the list below). However, I must restate the need for caution when interpreting differences between implementations; at this early stage in the reporting, with such small numbers, we are aware that the data do not represent all programming and that the data are too variable to draw comparisons between programs statistically. As noted above, I think we could perhaps share a list of the writing contests we mined ourselves, as an example of the representativeness of the current data set, since those data are publicly available. However, the program leaders who self-reported were assured that their participation and reporting would be kept confidential, so those must remain so. We provided this assurance to reduce reporting fears and will continue to keep that confidence, except where we eventually find promising practices of potential model programs. It seems we may wish to ask people in future reporting whether we can disclose their participation while keeping only their reported data confidential.
Contests mined:
- WikiCup (EN)
- Military history (EN)
- Military history (EN)
- WikiCup (DE)
- Ibero-American Women (ES)
- SW (DE)
JAnstee (WMF) (talk) 19:39, 6 January 2014 (UTC)
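To illustrate the "Report ID" matching described above, here is a rough sketch of rebuilding a combined record from three tables keyed on the same ID. The field names and values are invented for illustration; the real rows are in the report's appendix tables.

 # Hypothetical rows from the three appendix tables, keyed by Report ID
 inputs   = {"R01": {"budget_usd": 500, "staff_hours": 40}}
 outputs  = {"R01": {"articles_improved": 120, "participants": 25}}
 outcomes = {"R01": {"retained_3_months": 0.83}}

 # Merge the three tables into one record per Report ID
 full_dataset = {
     report_id: {**inputs[report_id], **outputs[report_id], **outcomes[report_id]}
     for report_id in inputs
 }
 print(full_dataset["R01"])  # one combined record, minus event name and date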
- Thank you. I do realize that you can't violate an agreement that is already in place—I'm hopeful that it won't be quite so restrictive for future studies, which you appear to be on board with. :-) Thank you very much for your detailed responses. Ed [talk] [en] 21:30, 7 January 2014 (UTC)
Scale/Maturity of Wikipedia
Now that this list is known, we see the study is only representative, to the extent it is representative, of large and mature Wikipedias (English, German, Spanish). It decidedly doesn't tell us anything about writing competitions on smaller and/or less mature Wikipedias; in those, the numbers and ratios, as well as the methods and framing, may be quite different.
It would therefore be crucial to add data from many more writing competitions across the movement, and to make an effort to include data about small and mid-sized Wikipedias as well. Is this planned? If so, what is the approximate time line for it? Asaf Bartov (WMF Grants) talk 21:40, 6 January 2014 (UTC)
- Yes, on all counts; we are reliant on program leaders performing self-evaluation aligned to standardized metrics in order to build our understanding of the program and its different implementation contexts. For that to happen, program leaders going forward will need to capture these important implementation data, as well as get permission to track their event participants to assess project-based activity outputs and outcomes. In addition to their doing such self-evaluation, we are currently developing tracking tools and form collectors for program leaders to keep personal records of their program tracking data, as well as ongoing collectors through which program leaders can continue to voluntarily submit these data to the Program Evaluation and Design team for inclusion in future analyses and reports. We are targeting February for the roll-out of these tracking and reporting tools. JAnstee (WMF) (talk) 19:04, 13 January 2014 (UTC)
- Why only "going forward"? What prevents us from reaching out (I did say "make an effort" in my original comment :)) to small and mid-sized communities to obtain the same kinds of results and data (however under-tooled) as we got with less effort from the larger wikis?
- For example, both the Hebrew and the Arabic Wikipedias have run long-term writing contests a bit similar to the EN/DE WikiCup model for several years now, i.e. have a good deal of data to contribute. They were perhaps not reached by, or did not actively respond to, the general calls put out by the PED team, but may well agree to cooperate with PED if approached more individually. We at grantmaking are happy to make the connections! All you need to do is say you'll devote time to pursuing this. Asaf Bartov (WMF Grants) talk 02:22, 18 January 2014 (UTC)
Sister projects?
The other Wikimedia projects besides Wikipedia may also have competitions worthy of study. Clearly, the Commons-centered Wiki Loves {Monuments, Earth, Public Art, ...} competitions are already a target for the PED team (or is it only WLM that would be studied?), but perhaps there have been enough competitions on, say, Wiktionary, to merit study? (Indeed, perhaps there haven't been, but then even that fact would be interesting to ascertain.) Asaf Bartov (WMF Grants) talk 21:40, 6 January 2014 (UTC)
- Yes, we are hoping to build out tools that program leaders can adapt and use for other programs as well. We asked for reporting of other photo upload events and will be reporting on the small set of reports we received for those in the upcoming weeks. Again, we are reliant on program leaders' self-reporting, and we are doing our best to build out a common language and understanding of evaluation and standardized metrics while we develop both the metrics and the tools for such evaluations and collect pilot data to test those tools and gain an initial understanding of the varying scope and impact of these programs. We fully anticipate that the types of programs reported on will grow, and that over time we will generate enough understanding of program evaluation, metrics, and reporting mechanisms to also satisfy monitoring and evaluation needs for innovative programs in the future. JAnstee (WMF) (talk) 19:12, 13 January 2014 (UTC)
- Thanks. I do understand PED relies on program leaders' self-reporting and does not gather the data itself.
- I make a distinction, though, between program leaders who might be willing to do self-reporting if asked (and reminded, perhaps) and those proactively getting the data to PED or responding to a first mass invitation to do so. I am guessing the former group is significantly larger than the latter, and would urge you to examine whether PED can devote some time (once a new liaison is available, presumably) to reaching out to that larger former group more intensively, so as to get more people to do self-reporting. Asaf Bartov (WMF Grants) talk 02:33, 18 January 2014 (UTC)
Time is finite
I think the comparison of the level of edit activity during the contest with afterwards should be done over a much longer time frame than 3 to 6 months. We know that Wikipedia editing is seasonal, and that's a cumulative effect arising from the time commitments of many individual editors. Participation in the contest is self-selected. If editors choose to participate in a contest and presumably plan to increase their contributions during that period, then they believe they will be willing and able to do that at the expense of other things that would normally occupy that time (work, family, going to the gym, stamp collecting, whatever). People who can foresee heavy demands in other aspects of their life (e.g. family holiday, college study) during the timeframe of the contest are unlikely to participate. However, once the contest is over, there are two factors that come into play.
- Once the contest is over, there probably is some sense that time is now "owed" to other things in their life ("honey, you've done nothing but edit Wikipedia lately, let's go see a movie for a change"). So I would expect to see some slump in editing activity in the immediate post-contest period because of catching up with other things.
- Signing up to the contest implies increased expected availability to edit during the contest; it doesn't say anything about their availability to edit after the time frame of the contest (indeed, if the contest period was a "good time" to be editing more, it sort-of follows that the non-contest period may be a "less good time" to be editing more). So the contest participants may be busy with college exams or family holidays 3 or 6 months later.
So to evaluate the benefit of contests, I think you need some reasonable data on the level of activity over (say) a year before the contest to use as a baseline for measuring the impact of the contest itself and its longer-term legacy (a sketch of this kind of comparison appears after this comment).
Also, even if contests show increased activity in the immediate contest phase and/or afterwards (and I would expect contests to do this), one should be cautious about then concluding "let's do lots more contests". It may be that the pool of potential participants (those with extra time to commit) is quickly exhausted and that people's capacity to contribute at higher levels in a sustained way simply isn't there (the rest of their life hasn't gone away). While WLM clearly attracts a lot of content each time it runs, you can easily take "yet another photo of a monument" (which is largely similar to an existing photo), but you can't add "near-duplicate" content to an article. Kerry Raymond (talk) 02:35, 5 January 2014 (UTC)
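A rough sketch of the baseline comparison suggested above, assuming each participant's edit timestamps are available. The windows (12 months before, the contest itself, 6 months after) and the sample data are illustrative only, not drawn from the report.

 from datetime import datetime, timedelta

 def monthly_edit_rate(edit_timestamps, start, end):
     """Average edits per 30-day month within [start, end)."""
     edits = [t for t in edit_timestamps if start <= t < end]
     months = (end - start).days / 30.0
     return len(edits) / months if months > 0 else 0.0

 def activity_profile(edit_timestamps, contest_start, contest_end):
     """Compare edit rates across a 12-month baseline, the contest, and 6 months after."""
     return {
         "baseline": monthly_edit_rate(
             edit_timestamps, contest_start - timedelta(days=365), contest_start),
         "contest": monthly_edit_rate(edit_timestamps, contest_start, contest_end),
         "post": monthly_edit_rate(
             edit_timestamps, contest_end, contest_end + timedelta(days=180)),
     }

 # Hypothetical participant: steady low activity, a burst during the contest, little after
 contest_start, contest_end = datetime(2013, 3, 1), datetime(2013, 4, 1)
 edits = ([datetime(2012, m, 15) for m in range(3, 13)] +   # baseline: roughly one edit a month
          [datetime(2013, 3, d) for d in range(1, 29)] +    # contest burst
          [datetime(2013, 5, 15), datetime(2013, 8, 15)])   # sparse afterwards
 print(activity_profile(edits, contest_start, contest_end))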
- Thank you for your input, Kerry. We are only at the beginning of evaluating these programs. Many of the reported programs we looked at had not yet reached the six-month follow-up point, but it will be up to program leaders to choose which goals they set and assess in terms of follow-up (whether retention at 3, 6, or 12 months post-event); we will continue to help standardize the way metrics are pulled so that whatever people report we will understand consistently. We tried to look at 3- and 6-month active editor retention since those were the time periods that came out of the logic model sessions around mapping the theory of change in these programs. Still, the longer the window, the more consistent participation will be required in the evaluation; while our team will likely continue to handle basic reporting and follow-up analysis, in-depth variations will be up to program leaders' self-evaluation, molding the path in many cases. Importantly, we have not said that people should start doing more and more contests; only that they seem to meet their goals and that we need more tracking and reporting to investigate them further ... which of course depends on more program leaders engaging in the evaluation practices we are working to lay the path for. JAnstee (WMF) (talk) 18:30, 6 January 2014 (UTC)
Topic-focussed or not?
I suspect that contests focussed on specific topics are likely to result in an improvement in the quality of the articles in that topic, whereas without a tight topic focus I'd suspect that the contributions will tend to be more about quantity. There is a "contagious" aspect to Wikipedia editing (how often do you react to a change in a watched page by making another edit yourself?) which is likely to benefit from a topic-focussed contest. Of course, a specific topic focus is likely to attract a smaller group of participants (who presumably have some interest in the topic area, which in turn is likely to lead to higher quality contributions, as they are likely to be more knowledgeable and have more access to resources on that topic in their possession). Is the data able to reveal anything about the benefits (or otherwise) of having a topic focus? Kerry Raymond (talk) 02:35, 5 January 2014 (UTC)
- At this time, the number of reports is too low to make comparisons across contests in this way. JAnstee (WMF) (talk) 18:30, 6 January 2014 (UTC)
Motivations
Another aspect of contests not discussed is the motivation of the participants and how to sustain it. Possible motivations include rewards for winning and rewards for participating. If we only reward winning (whether by material prizes or the social capital of respect), then there is a high likelihood of people giving up if they perceive they are unlikely to win. This means that there are advantages to a competition where participants can't tell who the other participants are and/or can't compare their relative performance (e.g. edit counts). However, for the judges, it is less work and fewer arguments to have something easily measured and objective (e.g. edit counts) than something more subjective like "quality". Pretty obviously, a well-run competition won't be susceptible to gaming (e.g. edit counts are easily gamed using tools like AWB). The alternative is that there are also prizes simply for participating, e.g. a lottery among the participants (possibly skewed in favour of higher levels of participation, but not strictly linearly - see the sketch after this comment) to keep people participating even if they know they are unlikely to be a winner.
Note that prizes must be aspirational to have motivational value (this is particularly so with material prizes). The prize must be something that people would like to have but don't have. That's a very challenging problem for a global competition. What might motivate in the Global South (an e-reader was used in such a contest recently) isn't likely to be a motivator in the Global North. But if you offer a material prize that is attractive to the Global North, then if it is won by someone in the Global South, you have just given an expensive "toy" to someone whose life would probably have been much more improved if you had given them the equivalent cash instead. I know people may shy away from cash prizes (spectres of paid editing, and it's more difficult to get sponsors to donate them), but if we are serious about Global South engagement, random cash prizes for participation are more likely to be motivators than the opportunity to be the overall winner of an expensive toy.
Forgetting prizes, I note that a lot of WP editors have altruistic motivations, so getting them to alter their current patterns of behaviour via a contest will depend on pitching the contest to their particular altruism. In that regard, it's worth looking at the editor survey data (see section 1.5 in https://meta.wikimedia.org/wiki/Editor_Survey_2011/Editing_Activities ). In terms of those motivations, I can't think of what you can really do to appeal to someone who is purely motivated by "volunteering" or "openness" (that's present in WP with or without a contest). But clearly some people could be motivated by the topic area reflecting their own expertise, while others appear to be displaying "perfectionist" motivations (fixing errors, completing topics, removing bias) and presumably contests around improving quality may attract these folks. Kerry Raymond (talk) 02:35, 5 January 2014 (UTC)
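As one concrete reading of the "skewed in favour of higher levels of participation but not strictly linear" lottery mentioned above, a small sketch: each participant's chance of winning grows with their edit count, but only sub-linearly (square root), so heavy contributors are favoured without the prize feeling out of reach for lighter contributors. The names and numbers are made up.

 import math
 import random

 # Hypothetical participants and their contest edit counts
 edit_counts = {"Alice": 400, "Bob": 100, "Carol": 25, "Dave": 4}

 # Sub-linear weights: four times the edits only doubles the chance of winning
 names = list(edit_counts)
 weights = [math.sqrt(edit_counts[name]) for name in names]

 winner = random.choices(names, weights=weights, k=1)[0]
 print(winner)  # Alice wins about 54% of draws, Dave still about 5%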
- While a contest may not appeal to those who edit for reasons of volunteering and openness, I wonder if the TripAdvisor approach may help. A few weeks after I submit a review to TripAdvisor, I get an email saying "Thanks for your review. [some number] of people from [some list of countries] have read it", or something like that. I wonder if we should be doing something similar with WP editors: give them concrete feedback that their edit was worthwhile because so many people have read that article since their edit. I suspect it would often make the editor look at the article again and probably do a little more work on it. I also think I'd be quite impressed to receive a monthly (or quarterly or ...) email saying "The articles you have edited in the past month have since been read by [some number] of readers, over the last year by [some number] of readers, over all your editing by [some number] of readers - thank you and keep up the great work". Kerry Raymond (talk) 03:18, 5 January 2014 (UTC)
- Our next steps are to develop self-reporting tools for assessing participant experiences, motivations, and intentions. JAnstee (WMF) (talk) 18:30, 6 January 2014 (UTC)
Repeat contests and return participants
It might be worth looking at how many people participate again in repeat competitions - are the competitions attracting new people each time, or are the same people coming back again and again to enter the competitions? Thanks. Mike Peel (talk) 18:39, 5 January 2014 (UTC)
- Yes, that will be interesting to do as we gather more data and as Analytics is able to further develop the capacity of Wikimetrics to analyze cohort intersections and generated cohorts as intended, and/or we are able to develop better user tagging mechanisms. For now, it will likely be something we collect via participant self-report. JAnstee (WMF) (talk) 18:30, 6 January 2014 (UTC)
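Once participant lists are available, the overlap itself is a simple set calculation. A tiny sketch with made-up usernames; in practice the lists would come from contest sign-up pages or Wikimetrics cohorts.

 # Hypothetical participant lists for two editions of the same contest
 edition_2012 = {"Alice", "Bob", "Carol", "Dave"}
 edition_2013 = {"Bob", "Carol", "Erin", "Frank", "Grace"}

 returning = edition_2012 & edition_2013
 print(f"Returning participants: {sorted(returning)}")                                # ['Bob', 'Carol']
 print(f"Return rate among 2013 entrants: {len(returning) / len(edition_2013):.0%}")  # 40%
 print(f"New participants in 2013: {len(edition_2013 - edition_2012)}")               # 3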
Median is being confused with average
In quite a few places an "average" was given in the main text that matches the "median" in the footnote, not the mean. I've corrected the ones I noticed, but can someone check that this doesn't happen again—and that the graphs are not using supposed averages that are really medians? Tony (talk) 04:18, 10 August 2014 (UTC)
- Hi User:Tony1, Thanks for pointing this out. This was actually intentional. Some readers may not be familiar with statistics and might not know what the median is or why we use it, so we use "average" since it denotes one of the many types of "averages" out there. We use the median because the number of reports is too few and the arithmetic mean might be skewed. --EGalvez (WMF) (talk) 14:14, 10 August 2014 (UTC)
- Edward, I'm pretty sure of my solid ground here: median and mean are quite different, and this is taught in grade school. I'm not aware of anyone who thinks that average doesn't mean "mean"; I believe the terms used in the main text should reflect this. Tony (talk) 23:19, 10 August 2014 (UTC)
- Thanks, Tony. We really appreciate your feedback about this. We are intentionally not reporting the mean because the mean is very skewed when we have so few data points. So we are reporting the median. But the term "median" can be difficult for some to interpret. What does the median signify in terms of these data? It signifies the "middle of the road," which is what the word "average" represents. We also include a footnote every time to clarify that we are reporting the median. Once we have sufficient data, we will probably report the mean. Let me know if this helps clarify our reasoning. Thanks! --EGalvez (WMF) (talk) 21:03, 19 August 2014 (UTC)
- Edward—you can't call it "average"; it needs to be called "median", if that's what you decide to report in the main text (with average and SD in parentheses, sure). "Average" is "the result obtained by adding several amounts together and then dividing this total by the number of amounts". "Median" is "a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it". They are mutually exclusive, and kids are taught in grade school the difference between these two terms. Tony (talk) 00:37, 20 August 2014 (UTC)
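For readers following the thread, a small worked illustration of why the two figures can diverge sharply with so few, skewed data points; the participant counts are invented.

 import statistics

 # Six hypothetical contests; one unusually large contest dominates the mean
 participants = [12, 15, 18, 20, 25, 300]

 print("mean:  ", statistics.mean(participants))    # 65   -- pulled up by the outlier
 print("median:", statistics.median(participants))  # 19.0 -- the middle of the data

Both are legitimate summaries of the data; the disagreement above is only about which label the main text should carry.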