Wikipedia:Version 0.5 and things that would make our life easier

Hi. As you already know, the Version 0.5 selection process is underway, but we still need some things before anything is actually published. We decided to select the articles to provide momentum to the "static movement", but there are still open questions on the back end, such as the software that will run the CD version, a publisher that wants to print a paper version, copyright/logo issues, issues relating to fair use images, etc. Going into further detail:

  • Software: Would we use a view-only version of MediaWiki, or would we publish the files as a PDF (or a free alternative) format? The CD version can have a hyperlink to point to the "current" version of Wikipedia, and some users have suggested regular automatic updates when a user looks at the CD while online. Is that a good idea? Is that even feasible?
  • Publisher: Who is going to publish it? There's no way a single user can approach a company and say that we are inquiring about the publication of the encyclopedia; we don't have the authority to do so. That is something that has to be done at the Foundation or at the Board level.
  • Copyright: Would the offline version need to contain a list of every single user who made an edit to the selected articles, with a diff of the edit? That would be extremely paper-consuming, so would a link to the edit history pages (with the offset filled to point to the printed version, like this link pointing to the 2005 Atlantic hurricane season at the time of receiving FA status) satisfy GFDL requirements? Do we need to throw out the GFDL and adopt a Wikimedia Free License based on the GFDL so we can avoid that?
  • Logo: currently, the 0.5 logo, en:Image:WP0.5 Icon.png, is a free image licensed under the GFDL. The 1.0 logo, en:Image:WP1 0 Icon.png, was originally released to the public domain. While the image page has been edited to reflect {{CopyrightByWikimedia}}, there have been subsequent recommendations to change the image to a different format, and to make it higher-quality, as the background image was originally a JPG. Would it be better for someone at the Foundation-level to make a new image, with copyright attributed since the first edit so we can replace it?
  • Fair use images: While we haven't decided whether to include them or not, would someone from juriwiki help us work out what the appropriate course of action would be? The decision would also depend on the requirements of the publisher, for example.

We would also like to know the lessons learned by WikiReaders and by the German Wikipedia-CD and Wikipedia-Distribution, and we would love to have more coordination with the SPC and others at the foundation level. Titoxd(?!?) 16:26, 1 July 2006 (UTC)

I was also asked to comment, but Tito has thoroughly covered the main points - getting official support for the project from Wikimedia Foundation, clarification of copyright issues (use of images, article histories), publisher and software. We would definitely like to learn lessons from the German release, and we are working with the SOS Children people to get their advice too. The only things I would like to add to Tito's list are longer term (beyond en Version 0.5):
  • Getting something in the software to allow users to have a non-editable, fully validated (not merely assessed) version of an article to look at if available. I realise that is a much wider issue, but (once resolved) this would help us a lot.
  • Set up a framework for collaboration between different language WP1.0 projects - for example, to compare selections of articles being used.

Thanks, Walkerma 22:35, 2 July 2006 (UTC)


I think it's possible to split the problem as follows:

  • content validation: which articles and images? (quality, legal issues, etc.)
  • extraction format: HTML? wiki? (with/without skin, with complete images/only thumbnails, etc.)
  • offline reader: what kind of software? (technology, portability, search engine...)
  • editor/distributor: which one? (what type of partnership)

You are working on the validation, and that's great, because that's the first part of the overall work. Maybe you can specify exactly what you need to extract from MediaWiki (the second question)? I personally work on a small piece of software, a kind of small browser linked to an offline search engine called Xapian. The result is not bad: it looks like MediaWiki, with good search capabilities. Kelson 21:24, 3 July 2006 (UTC)

Publisher?

The English test version (Version 0.5) is likely to have over 1000 articles selected and reviewed, with plans for a release in autumn 2006. I would appreciate help from folks here on finding a publisher to work with. I am making a few phone calls myself, but if others have contacts it would be very helpful. Walkerma 04:30, 19 August 2006 (UTC)

We plan to have a discussion on this and other related issues, probably on IRC and probably on the night of Thursday Aug 31 - more details here. Interested parties are welcome to join us for this, or for the other upcoming IRC discussion (probably on Sunday Sept 3). Walkerma 23:10, 29 August 2006 (UTC)

IRC discussion

As we mentioned at the last IRC meeting, some of us are holding another IRC discussion on Sunday September 10th at 4pm EST, 20:00 UTC, on #wikipedia-static. Please see more details and sign up at w:Wikipedia_talk:Version_1.0_Editorial_Team#Next_IRC_meeting_on_Sunday. Walkerma 18:37, 8 September 2006 (UTC)

Another IRC discussion on static content on en is tentatively planned for Saturday September 30th at 22:00 UTC. Please sign up here. Walkerma 06:24, 29 September 2006 (UTC)

We have an IRC discussion on static content planned for Saturday November 4th. At present it looks like we will mainly cover the French and English projects. Please sign up here. Walkerma 16:06, 2 November 2006 (UTC)

Static content generation

There is some discussion at en:Wikipedia_talk:Version_1.0_Editorial_Team on the CD production process. Some tools can be found at en:User:Wikiwizzy/CDTools. Wizzy 18:40, 14 October 2006 (UTC)

Relevant side projects

Unprintworthy redirects

Where is the indexing project? They should be a superset of garderers of the above category.

+sj | help with translation |+ 19:00, 2 November 2006 (UTC)

Interesting page, I hadn't seen it before! I have two questions:
  • What indexing project? Do you mean w:WP:WVWP? Perhaps this page in particular?
  • What do you mean by "a superset of garderers?"
Thanks, Walkerma 19:58, 3 November 2006 (UTC)

Update on English Version 0.5

A quick summary of where things stand at the end of November, 2006 with w:Wikipedia:Version 0.5:

  • A collection of 1960 articles has been prepared, covering a blend of major topics and Featured Articles.
  • A small French software company, Linterweb, is planning on producing the CD for us very soon. The company has developed an offline reader/search capability for use with the article collection. Contact with this company was set up by fr:utilisateur:Kelson from the French Wikipedia, who continues to work with us.
  • w:User:BozMo, who was involved in producing the 2006 Wikipedia CD Selection, is assisting with scripts, and other members of the 1.0 Team are helping deal with image files, navigational pages, etc.
  • The CD is expected to sell for around 10 Euros, with 1.50 Euros going to the Wikimedia Foundation. Version 0.5 will also be available for free download, including a Torrent version.
  • A basic software repository has been set up here on meta to allow projects to share standard scripts and tools for the future. Walkerma 06:53, 30 November 2006 (UTC)

The release should come out during this month (January 2007), as almost everything seems to be ready now. We are having an IRC discussion (to tie up the loose ends) on Saturday January 13th at 20:00 UTC, please sign up if you're interested. Walkerma 18:44, 11 January 2007 (UTC)

OK, things have drifted into February, but we are now beta testing until around Feb 15. After that, hopefully we can release the CD. Walkerma 04:31, 6 February 2007 (UTC)

Static content template

Does anyone know why the static content template {{Wikipedia 1.0 Navigation}} now shows up on the left instead of the right, and thereby messes up the formatting of all the pages on which it occurs? Neither the template nor this page has been edited in months, so I can only think it is a change in the MediaWiki software or something else "global". Any ideas? Walkerma 04:30, 6 February 2007 (UTC)

IRC discussion on Version 0.7

We will hold an IRC meeting at #wikipedia-1.0 on Monday, August 11th at 1900h UTC. Complete details are on the main 1.0 talk page, please sign up there. Walkerma 17:29, 6 August 2008 (UTC)

Curation / selection algorithm for size-limited snapshots

From a discussion at the Kiwix hackathon, June 2016. SJ talk 15:35, 20 June 2016 (UTC)

Algorithm
  1. decide on snapshot type (full, compact, topical) & languages.
  2. construct a seed [choose initial ~ranked whitelist & blacklist] & approx total size.
    -> the seed description should be small enough to store as a wiki page
  3. expand seed to a ~ranked lens (of all included articles).
    Data per article: (ORES + other rankings); avg ranking in top-5 wikis; # of interlang links; wikilabels for different 'lists'
    -> changes slowly over time
    Data per topical slice: (distance from whitelist/blacklist, via spidering; use topical wikilabels)
    -> changes per topic / audience
  4. generate the zim file, filling the file in stages to optimize depth & breadth of content.
    1. Apply mw-offliner: grabs all the items you need, fixes links, corrects js, gets cats… integrated with 3.2
    2. Feed the output of mw-offliner to the zim file writer as you go, applying a knapsack-filling algorithm
    (first get double-res images for Retina displays until you hit some size threshold; then normal-res; then very compressed; then just 1 image;
    then just text; then just lede; then just a wikidata-sentence*) [* - does this include wikilinks?]
    [if creating a topical slice, where # of articles rather than size is the limiting factor, you might instead keep breakpoints in terms of article counts, and cover each article in the seed to a certain depth before reducing the depth for the next ones.]
Example knapsack algorithm
Rank files in order.
Set N breakpoints.
Fill to 1st w/ full articles + thumbs
Fill to 2nd w/ articles + 1 thumb [Or: check for multipl. of # of images]
Fill to 3rd w/ articles [confirm: this should always allow for ~all articles]
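
The example fill above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool actually used: the `Article` record, its `rank` score, the three level names and the byte figures are all hypothetical, and the breakpoints are taken to be cumulative size limits in bytes. It is a greedy pass over the ranked list, not an optimal knapsack solver.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Article:
    # All fields are hypothetical stand-ins for whatever the ranking step produces.
    title: str
    rank: float                      # higher means more important
    text_bytes: int                  # size of the article text
    thumb_bytes: List[int] = field(default_factory=list)  # size of each thumbnail


# Detail levels, richest first, matching the three fills in the example.
LEVELS = ["full", "one_thumb", "text"]


def size_at(article: Article, level: str) -> int:
    """Bytes needed to store `article` at the given detail level."""
    if level == "full":
        return article.text_bytes + sum(article.thumb_bytes)
    if level == "one_thumb":
        first = article.thumb_bytes[0] if article.thumb_bytes else 0
        return article.text_bytes + first
    if level == "text":
        return article.text_bytes
    raise ValueError(f"unknown level: {level}")


def fill_snapshot(articles: List[Article],
                  breakpoints: List[int]) -> Tuple[List[Tuple[str, str]], int]:
    """Greedy fill: walk articles in rank order, dropping to a sparser
    level each time the current level's cumulative breakpoint is reached.

    `breakpoints` holds one cumulative byte limit per entry in LEVELS.
    Returns the (title, level) plan and the total bytes used.
    """
    if len(breakpoints) != len(LEVELS):
        raise ValueError("need exactly one breakpoint per level")

    ranked = sorted(articles, key=lambda a: a.rank, reverse=True)
    plan: List[Tuple[str, str]] = []
    used = 0
    tier = 0
    for art in ranked:
        # Move to sparser levels until the article fits under that level's limit.
        while (tier < len(LEVELS) - 1
               and used + size_at(art, LEVELS[tier]) > breakpoints[tier]):
            tier += 1
        cost = size_at(art, LEVELS[tier])
        if used + cost > breakpoints[tier]:
            continue  # does not fit even at the sparsest level; skip it
        plan.append((art.title, LEVELS[tier]))
        used += cost
    return plan, used
```

Calling `fill_snapshot(articles, [250, 400, 520])` returns the chosen level for each included article plus the bytes used. For a topical slice, the same loop would work with breakpoints expressed as article counts instead of bytes, as the note in step 4 suggests.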


Three regimes along this spectrum

Full snapshots
Almost all articles. Minimal blacklist. Compression, but no variable space saving by article. [Size: 25G+]
Compact snapshots
Most articles. Seed (white+blacklist), editorial lens, low bar for completion/quality. Ranking to save space. [Size: 0.5-5G]
Topical slices
Full coverage of a topic, sparse articles. Higher bar for inclusion. Space conserving by ranking. [Size: 10-100M]
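
One way to make the three regimes machine-readable is a small table like the sketch below. The size bounds are the ones listed above; everything else is an assumption for illustration: the names `Regime` and `regimes_that_fit`, reading "G" and "M" as decimal gigabytes and megabytes, and the rule that a regime "fits" a storage medium when its smallest listed size does.

```python
from typing import List, NamedTuple, Optional

MB = 10**6  # assumption: "M" above means decimal megabytes
GB = 10**9  # assumption: "G" above means decimal gigabytes


class Regime(NamedTuple):
    name: str
    min_bytes: int
    max_bytes: Optional[int]    # None means open-ended ("25G+")
    per_article_ranking: bool   # does ranking vary the space given per article?


REGIMES = [
    Regime("full", 25 * GB, None, False),
    Regime("compact", GB // 2, 5 * GB, True),
    Regime("topical", 10 * MB, 100 * MB, True),
]


def regimes_that_fit(capacity_bytes: int) -> List[str]:
    """Names of regimes whose smallest listed size fits in the given capacity."""
    return [r.name for r in REGIMES if r.min_bytes <= capacity_bytes]
```

For example, `regimes_that_fit(700 * MB)` (roughly a CD) leaves the compact and topical regimes as candidates, while a 32 GB card admits all three.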