Grants:IdeaLab/Automated review & flagging for citations
Project idea
What is the problem you're trying to solve?
Our biggest and best articles in all languages (though I am most familiar with en.wp standards) have grown to include sometimes hundreds of citations. These vary from organization websites to news articles to scientific journals. It's difficult for a novice editor or reader to tell which citations are up to date and which might need work: a broken URL, a bad journal name, missing information in the citation, outdated information, or a source that doesn't actually support the claim it is attached to. There are template flags for some of these (e.g. broken URL), but they are clunky to add. Additionally, there is a need for better tools to build bibliographies, suggest related citations, etc.
What is your solution?
I do not have a specific technical solution in mind or the expertise to build one, so I'm throwing this out for the community to consider. There are several possible toolsets:
- having a lightweight flag to identify citations that may need work, based on formatting, broken URLs, etc. (a rough sketch of what this could look like follows this list) -- right now there are only article-level cleanup templates to identify this.
- building a tool that would focus on just editing citations, where citations identified as potentially bad (missing or broken information) could be served up in an interface, a la Magnus Manske's editing-game tools, for individual fixing. Librarians and others might have a field day with this :)
- a lightweight, dated flag to indicate that a reference has been hand-checked and that the information in the article is indeed stated in the reference. Right now, I and many other editors do the painstaking work of checking citations to make sure they contain what is claimed, but there's no way to indicate or capture this for readers. (I am thinking about something like the open access logo flag, but for "verified".)
- expanding the existing "citation needed fix-it" tool to serve up [citation needed] tags by article topic, or maybe other parameters...
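As a very rough illustration of the "lightweight flag" idea above, here is a minimal Python sketch (purely hypothetical, not an existing tool) of how a script might scan an article's wikitext for CS1-style citation templates and flag ones with missing fields or URLs that no longer respond. It assumes the mwparserfromhell and requests libraries; the template and parameter names follow English Wikipedia conventions, and the expected-field lists are just examples.

```python
# Hypothetical sketch: flag citation templates that may need attention.
# Assumes mwparserfromhell and requests are installed; the expected-field
# lists below are illustrative, not a complete CS1 specification.
import mwparserfromhell
import requests

CITE_TEMPLATES = {"cite web", "cite news", "cite journal", "cite book"}
EXPECTED = {
    "cite web": ["title", "url"],
    "cite news": ["title", "url"],
    "cite journal": ["title", "journal"],
    "cite book": ["title", "publisher"],
}

def url_looks_dead(url, timeout=10):
    """Very crude liveness check; a real tool would need retries and rate limiting."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code >= 400
    except requests.RequestException:
        return True

def flag_citations(wikitext):
    """Yield (template, problems) pairs for citations that may need work."""
    code = mwparserfromhell.parse(wikitext)
    for tpl in code.filter_templates():
        name = str(tpl.name).strip().lower()
        if name not in CITE_TEMPLATES:
            continue
        problems = []
        for param in EXPECTED[name]:
            if not tpl.has(param) or not str(tpl.get(param).value).strip():
                problems.append(f"missing |{param}=")
        if tpl.has("url") and url_looks_dead(str(tpl.get("url").value).strip()):
            problems.append("url may be dead")
        if problems:
            yield tpl, problems
```

Output like this could feed a maintenance category or a fix-it queue of the kind described in the second bullet, rather than being applied to articles automatically.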
There are also potential tie-ins with Wikidata and with cross-language citations: I sometimes find myself fixing the same citation in both English and Spanish when an article has been translated from one language to the other, and it would be nice to have a better way to do this.
For an example of the sheer amount of metadata disambiguation and messiness we find in citations, see this list of journal names; see also the link rot resources.
Please add alternative/additional solutions!!
Goals
Get Involved
About the idea creator
editScience librarian; long-time Wikipedian; citation editor when I get a chance.
Participants
Endorsements
- Huge support There is a large audience of people who could participate better in our community, such as librarians and archivists, who enjoy working on this kind of problem but probably don't want to wade through issues like the CS1 templates and wikitext. That being said, we probably need more structured citation data before this is entirely feasible. @Harej: you will probably have something to say here. Sadads (talk) 18:52, 1 March 2016 (UTC)
- I think we need better structured data for some of this idea, but not all of it. We could go just by citations that are already in one of the citation templates, for instance, since those are semi-structured. We also have the existing reference cleanup categories, at least in English (for bare URLs, etc.). I think there's a bit of a chicken-and-egg problem, indicated by the list of journal names above: although someday we will have a framework for more structured data, the actual input will still be a huge mess, and we'll have to put some effort into cleaning that up. -- phoebe | talk 20:10, 1 March 2016 (UTC)
- @Phoebe: True that. We also have some pretty substantial demonstrations from #1Lib1ref that we can call on targeted audiences for this kind of work: for example, what we learned from the campaign's use of Citation Hunt. I am going to let the developer of Citation Hunt know about this -- he might have some ideas. Sadads (talk) 23:32, 1 March 2016 (UTC)
- This is a good idea. I agree that more work and cleanup should be done. However, I believe it is exactly the kind of problem to be solved by machines. Why should humans spend precious time on maintaining and formatting citations when bots should be able to do that? Machines can detect whether a link is dead or not. They can migrate links to the Web Archive (already being done). They can find a book by its ISBN and create a nicely formatted citation. They can take a poorly formatted web citation, scrape the source page for headers/titles (especially on larger, more popular websites), and create a nicely formatted one. They can check whether the text preceding the citation was copied directly from the source, violating copyright. They can even flag suspicious citations where the information preceding the citation is not found in the source. I think this is the direction this should go. Here's a harvesting analogy: instead of helping coordinate 100 people with scythes, Wikipedia should build a harvester that humans could drive, so that 100 humans driving 100 harvesters would be a million times more productive than 100 people doing automatable work by hand. SSneg (talk) 12:18, 2 March 2016 (UTC)
- @SSneg: bots can do some things, but (without better machine learning than we have) they definitely can't do it all. I'm thinking about problems like:
- recognizing "Wiley" isn't a valid journal name (but knowing it *is* valid in the field for book publisher, so figuring out if the editor put in the wrong template or the wrong journal name for that reference).
- recognizing that, though a URL doesn't seem broken, it's actually redirecting to a different page than originally indicated; then finding the right page and adding both the new URL and an Internet Archive link.
- noting that a cited government statistical report was from 2007; figuring out if there's a more recent edition of the same report, perhaps with a slightly different title, and updating the information in the article if so.
- These are all problems that I've fixed recently on English Wikipedia. I have a hard time imagining a bot being programmed with both the fuzzy logic and the level of library knowledge needed to solve them. But you are right that there are many other types of citation problems that are easier to automate, so perhaps a dual approach makes sense. -- phoebe | talk 14:55, 3 March 2016 (UTC)
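To make the "dual approach" discussed above concrete, here is a minimal, purely illustrative Python sketch (not an existing bot) that separates the mechanical checks a bot can make from the judgment calls that need a human. The only external service it assumes is the Internet Archive's public Wayback Machine availability endpoint; the structure of the returned dictionary is made up for the example.

```python
# Illustrative sketch only: check whether a cited URL is dead and, if so,
# look for an archived copy via the Internet Archive's availability API.
# Anything that requires judgment (changed page content, missing archives)
# is only flagged for human review, not fixed automatically.
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def check_citation_url(url, timeout=10):
    """Return a dict describing what a bot could fix and what needs a human."""
    result = {"url": url, "dead": False, "archive_url": None, "needs_human": []}
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        result["dead"] = resp.status_code >= 400
        if resp.url != url:
            # Redirects are easy to detect but hard to judge automatically:
            # the new target may or may not still support the article text.
            result["needs_human"].append(f"redirects to {resp.url}")
    except requests.RequestException:
        result["dead"] = True

    if result["dead"]:
        snap = requests.get(WAYBACK_API, params={"url": url}, timeout=timeout).json()
        closest = snap.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            # Bot-fixable: a confirmed dead link with an archived snapshot.
            result["archive_url"] = closest["url"]
        else:
            result["needs_human"].append("dead link with no archived copy")
    return result
```

The needs_human items are the fuzzy cases listed in the comments above and would be served to editors for review; only the unambiguous case, a confirmed dead link with an archived copy, is something the bot could handle on its own by adding an archive URL to the citation.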
Expand your idea
Would a grant from the Wikimedia Foundation help make your idea happen? You can expand this idea into a grant proposal.