Talk:List of Wikipedias by sample of articles/Source code (original)
Modified source
Suggestions:
- include .encode('cp437','replace') whenever printing to console to avoid errors
- optimize by caching English pages
- remove interwiki text for article length calculation
- weight text length
- color code score
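The first two suggestions can be illustrated with a short sketch. It is written for Python 3 (the script discussed on this page is Python 2), and the names `console_safe`, `get_cached` and the `fetch` callable are made up for illustration; they are not part of the original program.

```python
def console_safe(text, encoding="cp437"):
    """Return text with characters the console encoding lacks replaced by '?'."""
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace").decode(encoding)


_page_cache = {}  # title -> page text, kept for the lifetime of the run


def get_cached(title, fetch):
    """Fetch a page once; later requests for the same title reuse the stored text.

    `fetch` is a hypothetical callable that takes a title and returns its text.
    """
    if title not in _page_cache:
        _page_cache[title] = fetch(title)
    return _page_cache[title]


if __name__ == "__main__":
    # The u-umlaut exists in cp437, so it survives the round trip.
    print(console_safe("Volap\u00fck"))
    # Cyrillic letters are not in cp437, so each one becomes '?'.
    print(console_safe("\u0422\u044b\u0432\u0430"))
```

Caching only helps if the same English page is requested more than once in a run; whether that is the case depends on how the ranking script loops over languages.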
Modifying source
I was looking at modifying this program for my own use (namely, directing it towards a different page: for example, Vital Articles / Extended, a specific wikiproject's topic list, or a specific topic outline's list). Who would be the right person to ask about doing this? Almafeta 05:50, 1 October 2009 (UTC)
- Smeira is the original author but he has been missing for a couple of years. I could probably help. I've been working on code to create an extended article list (see below). It may need some tweaking for your needs but it can read from the lists you mentioned. --MarsRover 07:36, 1 October 2009 (UTC)
- I've been working on that (apparently my installation of Python had... issues), and finally have it working with the groups I'm interested in. Thank you. =)
- Also, it's too bad Smeira's gone... it occurs to me that the original was probably the most significant piece of code ever written in Volapük. Almafeta 16:49, 26 October 2009 (UTC)
GetExtendedArticleList.py
# -*- coding: utf_8 -*-
import sys
sys.path.append('./pywikipedia')
import wikipedia
import pagegenerators
import re
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
link_re = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')
def parseEntry(line):
    # Match one list entry such as "# [[Name]] ([[Sibling]])".
    m = entry_re.search(line)
    if m:
        return {'name':m.group(4),'sibling':m.group(7),'indent':len(m.group(1)),'span':m.span()}

def parseLink(link, wiki_name):
    # Split "wiki:Name|alias"; links without a prefix belong to wiki_name.
    m = link_re.search(link)
    if m:
        linkWiki = m.group(2) or wiki_name
        return {'wiki':linkWiki,'name':m.group(3),'alias':m.group(5)}

def findAll(text, parseFunction):
    # Apply parseFunction repeatedly, resuming after the end of each match.
    return_list = []
    pos = 0
    item = parseFunction(text)
    while item:
        pos += item['span'][1]
        item['pos'] = pos
        del item['span']
        return_list.append(item)
        item = parseFunction(text[pos:])
    return return_list

def getArticle(wiki_name, wiki_family, article_name):
    # Fetch the wikitext of one page through pywikipedia.
    print "reading %s" % (article_name)
    wiki = wikipedia.Site(wiki_name, wiki_family)
    page = wikipedia.Page(wiki, article_name)
    article_text = page.get(get_redirect=False)
    return {'text':article_text}

def getArticleList(wiki_name, wiki_family, article_name):
    # Parse every list entry on the page and resolve its link target.
    article = getArticle(wiki_name, wiki_family, article_name)['text']
    arts = findAll(article, parseEntry)
    for art in arts:
        art['link'] = parseLink(art['name'], wiki_name)
    return arts

print "working..."
lists = {}
lists[':en:Wikipedia:Vital articles/Expanded/People'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/People')
lists[':en:Wikipedia:Vital articles/Expanded/History'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/History')
lists[':en:Wikipedia:Vital articles/Expanded/Geography'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Geography')
lists[':en:Wikipedia:Vital articles/Expanded/Arts'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Arts')
lists[':en:Wikipedia:Vital articles/Expanded/Philosophy and religion'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Philosophy and religion')
lists[':en:Wikipedia:Vital articles/Expanded/Everyday life'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Everyday life')
lists[':en:Wikipedia:Vital articles/Expanded/Society and social sciences']= getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Society and social sciences')
lists[':en:Wikipedia:Vital articles/Expanded/Health and medicine'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Health and medicine')
lists[':en:Wikipedia:Vital articles/Expanded/Science'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Science')
lists[':en:Wikipedia:Vital articles/Expanded/Technology'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Technology')
lists[':en:Wikipedia:Vital articles/Expanded/Mathematics'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Mathematics')
lists[':en:Wikipedia:Vital articles/Expanded/Measurement'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Measurement')
lists[':m:List of articles every Wikipedia should have/Version 1.1'] = getArticleList('meta','meta', 'List of articles every Wikipedia should have/Version 1.1')
lists[':en:Films considered the greatest ever'] = getArticleList('en', 'wikipedia','Films considered the greatest ever')
lists[':en:Outline of biology'] = getArticleList('en', 'wikipedia','Outline of biology')
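print "merge lists..."
# Deduplicate case-insensitively, keeping the first spelling seen.
fullList = {}
for x in lists.values():
    for i in x:
        if i['link']['name'].lower() not in fullList:
            fullList[i['link']['name'].lower()] = i['link']['name']
print len(fullList)
print "sorting..."
# A lambda works for both str and unicode names; str.lower rejects unicode.
sortedFullList = sorted(fullList.values(), key=lambda name: name.lower())
for i in sortedFullList:
    print i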
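The parsing and merging steps can be tried without pywikipedia or network access. The sketch below is written for Python 3, copies the two regular expressions from the script, and runs them on a made-up snippet of wikitext. The names `parse_entries`, `parse_link`, `merge_names` and `SAMPLE` are illustrative only, and `re.finditer` stands in for the script's `findAll` loop.

```python
import re

# The two patterns are copied from GetExtendedArticleList.py.
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
link_re = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')

# Made-up wikitext standing in for a downloaded list page.
SAMPLE = (
    "* [[Albert Einstein]]\n"
    "** [[fr:Paris|Paris (France)]]\n"
    "# [[Film]] ([[Cinema]])\n"
)


def parse_entries(text):
    """Return every list entry in text with its name, sibling and indent depth."""
    return [
        {'name': m.group(4), 'sibling': m.group(7), 'indent': len(m.group(1))}
        for m in entry_re.finditer(text)
    ]


def parse_link(link, default_wiki):
    """Split 'wiki:Name|alias'; a link without a prefix belongs to default_wiki."""
    m = link_re.search(link)
    if m:
        return {'wiki': m.group(2) or default_wiki, 'name': m.group(3), 'alias': m.group(5)}
    return None


def merge_names(name_lists):
    """Merge lists of names case-insensitively, keeping the first spelling seen."""
    merged = {}
    for names in name_lists:
        for name in names:
            merged.setdefault(name.lower(), name)
    return sorted(merged.values(), key=lambda n: n.lower())


if __name__ == "__main__":
    entries = parse_entries(SAMPLE)
    for entry in entries:
        print(entry, parse_link(entry['name'], 'en'))
    names = [parse_link(e['name'], 'en')['name'] for e in entries]
    print(merge_names([names, ['film', 'Zebra']]))
```

Note that the sibling in parentheses is only captured when it directly follows the closing brackets of the main link, apart from whitespace.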
Perl version?
Has anybody implemented this in Perl? I've been working on a similar routine, just for my own amusement, looking only at articles in my home WP (= Latin), and I can't get the sizes to come out right. Grab the page, take out the inter-wiki links, take out the comments, see how many characters you've got, and multiply by the language weight -- how hard can it be? I'm wondering if I've run up against some Perl-ish oddity about Unicode (which I thought I was handling correctly), or just made some fluff-ball error. A. Mahoney 18:09, 8 November 2011 (UTC)
- I think you might be the first to try Perl. Yeah, I think you're right. It is probably related to Unicode. Make sure the length() function returns the number of characters and not the number of bytes (http://stackoverflow.com/questions/1326539/how-do-i-find-the-length-of-a-unicode-string-in-perl). --MarsRover 22:53, 8 November 2011 (UTC)
- What I ended up having to do was to trim trailing white space; Unicodeity was OK. The numbers are still a little off but close enough for planning purposes. If anybody else wants to use Perl, the MediaWiki::Bot package is the way to go; it's quite straightforward. A. Mahoney 17:36, 13 January 2012 (UTC)
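The steps described in this thread (strip the interwiki links, strip the comments, trim trailing whitespace, count characters rather than bytes, multiply by the language weight) can be sketched in Python 3 as follows. The regular expressions are simplified assumptions and may not match exactly what the ranking script strips; the function name `weighted_size` and the weight of 1.5 are hypothetical.

```python
import re

# Assumption: an interwiki link looks like [[xx:Title]] with a lowercase prefix.
# This also catches other lowercase prefixes, so it is only an approximation.
INTERWIKI_RE = re.compile(r"\[\[[a-z][a-z\-]*:[^\]\n]+\]\]\n?")
# HTML comments may span several lines, hence DOTALL.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def weighted_size(wikitext, language_weight):
    """Return the character count of the cleaned text times the language weight."""
    text = COMMENT_RE.sub("", wikitext)
    text = INTERWIKI_RE.sub("", text)
    text = text.rstrip()  # trailing whitespace would otherwise inflate the count
    # len() on a Python 3 str counts characters (code points), not bytes.
    return len(text) * language_weight


if __name__ == "__main__":
    # "R\u014dma" is "Roma" with a macron on the o: 4 characters, 5 UTF-8 bytes.
    sample = "R\u014dma est urbs.<!-- nota -->\n\n[[en:Rome]]\n[[fr:Rome]]\n  \n"
    print(weighted_size(sample, 1.5))
    print(len("R\u014dma"), len("R\u014dma".encode("utf-8")))
```

Counting bytes instead of characters would overstate the size of any article containing non-ASCII letters, which is the pitfall mentioned in the reply above.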
Tuvan language
btw, pywikipedia doesn't seem to support "tyv." yet. --MarsRover 06:33, 5 September 2013 (UTC)
Color code change
I have changed the color code in order to make it more accessible for color blind people, and easier to understand as a scale. Theklan (talk) 08:54, 5 January 2023 (UTC)