Community Wishlist Survey 2019/Bots and gadgets/Machine readable diffs
Machine readable diffs
- Problem: Diffs cannot be read without screen scraping. Even the API output requires HTML parsing to get at the wikitext changes.
- Who would benefit: Any semi-automated or fully automated consumer of diffs (e.g. bot operators, data scientists, tools, researchers).
- Proposed solution: Exactly what it says on the tin. Add a different diff format to the API that JSON/XML parsers can understand.
- Phabricator tickets: phab:T56328
- Proposer: MER-C (talk) 20:29, 30 October 2018 (UTC)
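To illustrate the problem described above, here is a minimal sketch of the kind of HTML scraping a diff consumer has to do today. The sample fragment is hand-written to approximate the two-column diff table and real output may differ; the class name `DiffScraper` and the function `scrape_diff` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Hand-written fragment approximating the two-column diff HTML (an assumption,
# not captured API output).
SAMPLE_DIFF_HTML = """
<tr>
  <td class="diff-marker">−</td>
  <td class="diff-deletedline"><div>The cat sat on the <del class="diffchange diffchange-inline">mat</del>.</div></td>
  <td class="diff-marker">+</td>
  <td class="diff-addedline"><div>The cat sat on the <ins class="diffchange diffchange-inline">sofa</ins>.</div></td>
</tr>
"""


class DiffScraper(HTMLParser):
    """Collect the text of deleted and added lines from diff table cells."""

    def __init__(self):
        super().__init__()
        self.removed = []
        self.added = []
        self._target = None  # list currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "diff-deletedline" in classes:
            self.removed.append("")
            self._target = self.removed
        elif "diff-addedline" in classes:
            self.added.append("")
            self._target = self.added

    def handle_endtag(self, tag):
        # Leaving a cell: stop collecting text.
        if tag == "td":
            self._target = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current line.
        if self._target is not None:
            self._target[-1] += data


def scrape_diff(html):
    """Return the removed and added line texts found in a diff fragment."""
    parser = DiffScraper()
    parser.feed(html)
    parser.close()
    return {"removed": parser.removed, "added": parser.added}
```

Calling `scrape_diff(SAMPLE_DIFF_HTML)` yields the removed and added line texts, but only as long as the markup keeps the same cell classes, which is the fragility the proposal wants to avoid.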
Discussion
I like this idea. Gryllida 22:17, 30 October 2018 (UTC)
- Could you provide an example of how this API's output might look? MaxSem (WMF) (talk) 23:56, 30 October 2018 (UTC)
- And examples of use cases where the lack of this format proved prohibitively expensive or blocking? —TheDJ (talk • contribs) 08:00, 31 October 2018 (UTC)
- I note there are likely four parts to this request:
- Determine the "machine readable" format that will be used. Is there an existing standard that could be used, or do we have to invent something?
- Create a DiffFormatter subclass that generates that format.
- Update the wikidiff2 extension to be able to output structured diff data, either in a format that can be handled by DiffFormatter or in the "machine readable" format directly.
- Adjust existing code to expose the structured output via the API.
- The problem I want to solve is that I want to perform analysis on the content added or removed. As for the output - I can work with something like this, although the text of the initial revision should also be returned for completeness. The refactoring required for this task will also knock out the technical debt preventing phab:T104072, phab:T117279 and phab:T38902 (moving MobileDiff into core, and making it available through the API) so there are more use cases than that given here. MER-C (talk) 19:39, 31 October 2018 (UTC)
- On benefit: Yeah, various applications would need such an interface to inspect and analyse edits.
- In rECHO/DiscussionParser.php, interpretDiff() would benefit from robust access instead of the current scraping.
- On output format:
- JSON or XML, or ideally both, since the code is on the workbench anyway.
- The contents will be dictated by the current wikidiff structure. Other layouts would be possible if the collected information were not sent directly to the output stream.
- Simply an array of difference groups, each containing the same information as shown in the two-column output today.
- Each group consisting of two objects, before and later.
- Each object with a line range, the last detected headline (as recently suggested), and an array of paragraphs.
- Each paragraph with +/- state, and single line content, if any.
- Each line content as an array of tuples, each tuple consisting of a changed/unchanged flag and a string (escaped according to the output format).
- The same as HTML today, just in different syntax according to JSON or XML.
- The diff itself needs to be wrapped by some informative data:
- Both revID involved.
- Method/structure, currently constant wikidiff2 but may be subject to changes over decades.
- The wikiID (can be derived from request URL, but for sake of completeness).
- Other information, if present anyway, like pageID or nick or timestamp or page name, but these may be derived from revIDs later.
- On special features:
- An API request might provide control information, like the number of paragraphs before and after, which are constant values for HTML special pages but already exist as the parameter value numContextLines.
- A research application might drop the surrounding paragraphs, which are helpful for human readers to identify the context.
- On implementation efforts:
- Unfortunately, the 15-year-old procedure does not build a complete diff object first and then start output formatting.
- Formatted output is collected immediately when each diff is found.
- Otherwise the entire output object could be just thrown into a serializer for JSON or XML.
- The two column output is formatted by TableDiff.cpp today.
- Two copies of this need to be made, JsonDiff.cpp and XmlDiff.cpp.
- Then appropriate atomic syntax is to be generated like HTML, with proper encoding of some " and < characters.
- The stream needs to be wrapped into output head and termination, and usual administrative business.
Greetings --PerfektesChaos (talk) 13:03, 4 November 2018 (UTC)
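The output format outlined in the comment above can be sketched as a small data model. This is only one possible reading of that outline: the class and field names (`DiffEnvelope`, `Group`, `Side`, `Paragraph`, `segments`, `wiki_id` and so on) and the sample values are assumptions made for illustration, not an agreed format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Paragraph:
    state: str  # "+" for added, "-" for removed, " " for context
    # Each segment is a (changed, text) tuple, as described in the comment.
    segments: List[Tuple[bool, str]] = field(default_factory=list)


@dataclass
class Side:
    first_line: int
    last_line: int
    headline: Optional[str]  # last detected headline, if any
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Group:
    before: Side
    later: Side


@dataclass
class DiffEnvelope:
    from_rev: int
    to_rev: int
    method: str   # e.g. "wikidiff2"
    wiki_id: str
    groups: List[Group] = field(default_factory=list)


def example_envelope():
    """Build a one-group diff with made-up revision IDs and text."""
    before = Side(12, 12, "History", [
        Paragraph("-", [(False, "The cat sat on the "), (True, "mat"), (False, ".")]),
    ])
    later = Side(12, 12, "History", [
        Paragraph("+", [(False, "The cat sat on the "), (True, "sofa"), (False, ".")]),
    ])
    return DiffEnvelope(1001, 1002, "wikidiff2", "examplewiki", [Group(before, later)])


def changed_text(decoded, side):
    """Return the changed strings on one side ("before" or "later") of a decoded diff."""
    out = []
    for group in decoded["groups"]:
        for para in group[side]["paragraphs"]:
            out.extend(text for changed, text in para["segments"] if changed)
    return out


# Serialise to JSON and read it back, as an API consumer would.
decoded = json.loads(json.dumps(asdict(example_envelope())))
```

With such a structure, a consumer interested only in what was added or removed can call `changed_text(decoded, "later")` or `changed_text(decoded, "before")` without touching any HTML.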
- Note that the diff engine should not generate JSON or XML directly. It should generate a data structure built out of PHP associative arrays and other PHP native types, which the API will then turn into JSON, XML, or serialized PHP based on the 'format' parameter as it does for every other API request. Anomie (talk) 15:54, 5 November 2018 (UTC)
- Or HTML for the front end in either desktop or mobile format. MER-C (talk) 18:58, 6 November 2018 (UTC)
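The separation suggested in the two comments above, where the diff engine returns native data and a later layer picks the output syntax, can be sketched as follows. The sketch is in Python rather than PHP and is not MediaWiki code; `build_diff_data`, `render`, the dictionary keys and the sample values are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def build_diff_data():
    """Stand-in for the diff engine: returns only native types, no markup."""
    return {
        "fromrevid": 1001,
        "torevid": 1002,
        "changes": [
            {"type": "remove", "line": 12, "text": "The cat sat on the mat."},
            {"type": "add", "line": 12, "text": "The cat sat on the sofa.<ref>Source</ref>"},
        ],
    }


def to_json(data):
    return json.dumps(data, ensure_ascii=False)


def to_xml(data):
    root = ET.Element(
        "diff",
        fromrevid=str(data["fromrevid"]),
        torevid=str(data["torevid"]),
    )
    for change in data["changes"]:
        node = ET.SubElement(root, "change", type=change["type"], line=str(change["line"]))
        node.text = change["text"]  # the serializer escapes "<" and "&" itself
    return ET.tostring(root, encoding="unicode")


# The output syntax is chosen after the diff has been computed.
FORMATTERS = {"json": to_json, "xml": to_xml}


def render(fmt):
    """Compute the diff once and serialise it in the requested format."""
    return FORMATTERS[fmt](build_diff_data())
```

Because escaping is left to each serializer, the diff code never has to encode `"` or `<` by hand, and an HTML formatter for desktop or mobile could be added as one more entry in `FORMATTERS`.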
- This sounds as if it would help Wikifundi's asynchronous wiki editing, the grant proposal is here. MarkAHershberger, comments? HLHJ (talk) 07:03, 14 November 2018 (UTC)
- Thanks for the ping. Yes, MABS does machine-readable diffs. I'm still working on this project. --☠MarkAHershberger☢(talk)☣ 15:10, 15 November 2018 (UTC)
- Does the new comparison API endpoint solve this? --EProdromou (WMF) (talk) 23:04, 16 December 2019 (UTC)
Voting
- Support MER-C (talk) 18:59, 16 November 2018 (UTC)
- Support More moral than anything else. I doubt that a proposal like this is going to be too popular, despite how useful it seems it might be to particular editors. — Insertcleverphrasehere (or here) 00:28, 17 November 2018 (UTC)
- Support Ellery (talk) 02:39, 17 November 2018 (UTC)
- Support Liuxinyu970226 (talk) 03:42, 17 November 2018 (UTC)
- Support Also, what ICPH said. Enterprisey (talk) 04:06, 17 November 2018 (UTC)
- Support Fabiorahamim (talk) 07:01, 17 November 2018 (UTC)
- Support Afernand74 (talk) 09:37, 17 November 2018 (UTC)
- Support Victor Schmidt (talk) 16:59, 17 November 2018 (UTC)
- Support Iluvatar (talk) 20:37, 17 November 2018 (UTC)
- Support Dirk Beetstra T C (en: U, T) 03:57, 18 November 2018 (UTC)
- Support Temp3600 (talk) 05:39, 18 November 2018 (UTC)
- Support NMaia (talk) 10:17, 18 November 2018 (UTC)
- Support ~ Amory (u • t • c) 11:43, 18 November 2018 (UTC)
- Support Sebastian Wallroth (talk) 13:16, 18 November 2018 (UTC)
- Support β16 - (talk) 10:18, 19 November 2018 (UTC)
- Support Benjamin (talk) 10:23, 19 November 2018 (UTC)
- Support --Frozen Hippopotamus (talk) 11:25, 19 November 2018 (UTC)
- Support Yes... Doc James (talk · contribs · email) 03:52, 20 November 2018 (UTC)
- Support This would enable many useful tools to reject junk or number-changing editors. Johnuniq (talk) 06:17, 20 November 2018 (UTC)
- Support Jamesmcmahon0 (talk) 10:29, 20 November 2018 (UTC)
- Support Gareth (talk) 11:01, 20 November 2018 (UTC)
- Support Philk84 (talk) 13:57, 20 November 2018 (UTC)
- Support Lots of potential for this idea. I don't know what, but other people would. Headbomb (talk) 15:53, 20 November 2018 (UTC)
- Support Lofhi (talk) 17:48, 20 November 2018 (UTC)
- Support Novak Watchmen (talk) 23:59, 20 November 2018 (UTC)
- Support Vulphere 07:26, 21 November 2018 (UTC)
- Support Framawiki (talk) 19:46, 21 November 2018 (UTC)
- Support Nihlus 22:17, 21 November 2018 (UTC)
- Support ElanHR (talk) 22:45, 21 November 2018 (UTC)
- Support Krinkle (talk) 01:24, 22 November 2018 (UTC)
- Support as it seems to have a lot of applications. Anything that makes the Community Tech Team's future work easier (not to mention other people's) seems like a good idea. It might promote some bad reuses, too, not sure if anything can be done about that. HLHJ (talk) 04:01, 22 November 2018 (UTC)
- Support A+ Gryllida 08:13, 23 November 2018 (UTC)
- Support MisterSynergy (talk) 10:26, 23 November 2018 (UTC)
- Support big time. Smjalageri (talk) 12:43, 23 November 2018 (UTC)
- Support ~Cybularny Speak? 15:55, 23 November 2018 (UTC)
- Support NaBUru38 (talk) 18:24, 23 November 2018 (UTC)
- Support Mbrickn (talk) 21:26, 23 November 2018 (UTC)
- Support Viztor (talk) 04:50, 24 November 2018 (UTC)
- Support Winged Blades of Godric (talk) 06:24, 24 November 2018 (UTC)
- Support Matěj Suchánek (talk) 08:45, 24 November 2018 (UTC)
- Support Hmxhmx 09:58, 24 November 2018 (UTC)
- Support Alexei Kopylov (talk) 18:22, 24 November 2018 (UTC)
- Support Tgr (talk) 04:40, 25 November 2018 (UTC)
- Support — AfroThundr (u · t · c) 01:50, 26 November 2018 (UTC)
- Support Dreamy Jazz (talk) 08:48, 26 November 2018 (UTC)
- Support Izno (talk) 01:08, 27 November 2018 (UTC)
- Support PJTraill (talk) 01:09, 27 November 2018 (UTC)
- Support Zache (talk) 03:58, 27 November 2018 (UTC)
- Support Ahm masum (talk) 21:19, 28 November 2018 (UTC)
- Support Kpjas (talk) 09:50, 29 November 2018 (UTC)
- Support GravityUp (talk) 22:46, 29 November 2018 (UTC)