Talk:List of Wikipedias by expanded sample of articles

Stats

I see the mean and median sizes are identical for every WP here. I realize this is statistically possible, but it seems a bit implausible! :-) I also like the idea of using the alternate language weights. This will be useful. A. Mahoney (talk) 12:25, 21 August 2013 (UTC)

Oops, that was a mistake. Thanks for spotting it. Boivie (talk) 06:25, 22 August 2013 (UTC)

Hello, Boivie. How long does your bot make this sheet? :) Zemliakov (talk) 09:38, 6 September 2013 (UTC)

I don't understand exactly what you're asking for. But it takes a few hours to run the code, and I intend to run it once a month for a while. When I have found time to clean up the messy parts of the code I plan to publish it here somewhere, so it will be easy for someone else to update this page when I no longer do it. Boivie (talk) 16:57, 6 September 2013 (UTC)

Any ideas why (760*2 + 1453*3 + 7721*4) / 400 ≠ 92.30 for enwiki? Since I see the same for other wikis I have no complaints, but I am curious. --Igel B TyMaHe (talk) 19:18, 17 March 2014 (UTC)

That score should show the percentage of the maximum possible points. The formula as written assumes there are 10000 articles in the list. So it should really be 100 * (stubs*2 + articles*3 + long.articles*4) / (total.items*4). That means enwiki should get 100 * (760*2 + 1453*3 + 7721*4) / (9957*4) = 92.30. Boivie (talk) 20:37, 17 March 2014 (UTC)
16 May 2014. enwiki: 100*(755*2+1457*3+7765*4)/(9957*4) = 92.75 ≠ 92.35. (755*2+1457*3+7765*4)/400 = 92.35. Is total.items now 10000? --Igel B TyMaHe (talk) 09:18, 25 May 2014 (UTC)
Yes, it was 10000 on the 16th of May. I forgot to update the number at the top of the page. Boivie (talk) 19:43, 25 May 2014 (UTC)
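
For reference, the formula discussed in this thread can be checked with a few lines of Python. This is only a sketch: the function name `score` and its argument names are illustrative and are not taken from the bot's code.

```python
def score(stubs, articles, long_articles, total_items):
    """Percentage of the maximum possible points (4 per list item)."""
    raw = stubs * 2 + articles * 3 + long_articles * 4
    return 100 * raw / (total_items * 4)


# March 2014 figures for enwiki, with 9957 items in the list
print(round(score(760, 1453, 7721, 9957), 2))   # 92.3
# May 2014 figures, with 10000 items in the list
print(round(score(755, 1457, 7765, 10000), 2))  # 92.35
```

Dividing by 400 gives the same result only when the list holds exactly 10000 items, which is why the two May 2014 calculations above disagree.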

"Shortest"

I wonder what's the point of the "shortest articles" listing. At this scale, it only displays a 200-entry subset of missing articles anyway (except for :enwiki). Perhaps something like the Neglected article list from the List of Wikipedias by sample of articles would be more useful. — Yerpo Eh? 09:56, 31 March 2015 (UTC)

The point is to answer the question (that no one has asked): "If I want to improve my Wikipedia, where should I start?". So I suppose it's similar to the point of the Neglected page. I see some problems with using the Neglected page here. First, I see the Neglected page as a complement to the Absent Articles page. "What can I do besides creating the absent articles?" And here we don't have a page for (all) absent articles, because it would be too large. Secondly, I don't really like the edge factor. It seems to be more focused on improving scores than on improving Wikipedia. But the popularity factor is carried over to this page in a way. The absent articles are sorted with the most popular first. So you get the 200 most popular articles that are absent in each Wikipedia. Popularity is here counted by the number of languages that have the article. Boivie (talk) 12:53, 31 March 2015 (UTC)
Oh, if they are sorted by popularity, then it makes much more sense, yes. Sorry, I didn't look at it too closely, so I thought they were only selected by name or position within the expanded list of articles. — Yerpo Eh? 14:14, 31 March 2015 (UTC)
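
The ordering described above (absent articles ranked by how many languages have them, cut off at 200) could look roughly like the following. This is a sketch with made-up data; the function name and the `sitelinks` mapping are hypothetical and not taken from the bot's code.

```python
def most_popular_absent(sitelinks, lang, limit=200):
    """Return up to `limit` items missing in `lang`, most widespread first."""
    absent = [
        (item, len(langs))
        for item, langs in sitelinks.items()
        if lang not in langs
    ]
    # Sort by descending language count; break ties by item id for stable output.
    absent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in absent[:limit]]


# Made-up sample data: item id -> set of wikis that have the article
sample = {
    "Q1": {"en", "de", "fr"},
    "Q2": {"en", "sl"},
    "Q3": {"en", "de", "fr", "ru"},
    "Q4": {"en"},
}
print(most_popular_absent(sample, "sl", limit=2))  # ['Q3', 'Q1']
```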

Maithili

I suggest adding :maiwiki to the list; the pywikimedia framework has finally been updated this month, so the wiki doesn't register as missing anymore. Plus, the community seems to be quite active. — Yerpo Eh? 07:10, 16 June 2015 (UTC)

Please update

Please update the list early each month. It would be more useful. --AJITH MS (talk) 17:16, 7 September 2015 (UTC)

I've been trying to update this list on the 16th of each month. Why would it be more useful if it was updated on another date? Boivie (talk) 05:56, 8 September 2015 (UTC)

Internet access here is very limited, so we only get it early each month. I understand the situation. Sorry for my suggestion, and thanks for your information. --AJITH MS (talk) 10:11, 8 September 2015 (UTC)

Gothic Wikipedia

Why is the language column for Gothic Wikipedia shown as ðミフᄇðミフ﾿ðミヘトðミフᄚðミヘツðミフᄚðミフᄊðミフᄈðミフᄚ, and not 𐌲𐌿𐍄𐌹𐍃𐌺 as in List of Wikipedias by sample of articles? Hanif Al Husaini (talk) 13:23, 26 February 2017 (UTC)

It's a code table issue. I fix the List of Wikipedias by sample of articles by hand every month (but not the sub-pages - see e.g. List of Wikipedias by sample of articles/Stubs). — Yerpo Eh? 15:56, 26 February 2017 (UTC)

Absent Articles page

It would be helpful if the Absent Articles page (https://meta.wikimedia.org/wiki/List_of_Wikipedias_by_expanded_sample_of_articles/Shortest) could be extended to all language wikis. This could help editors to easily identify missing articles and start them; currently this page is populated for the first 40 wikis only. — The preceding unsigned comment was added by 132.183.13.69 (talk) 13:30, 5 July 2017 (UTC)

Unfortunately, such a page would be huge, so it is not practically possible. If the community is active and diverse, I encourage someone to figure out how to run the script locally and make a separate list somewhere in the project space. It can be easily modified to show all absent articles for one language. — Yerpo Eh? 16:49, 6 July 2017 (UTC)
I've done this for Latin -- I set it up as a copy of this list, but with links to the Latin pages if they exist, or a selection of other languages if they don't. See la:Vicipaedia:Paginae_quas_omnibus_Wikipediis_contineri_oportet/Expansio for the list, and see la:Usor:Amahoney/Myrias_epitome for our statistics. I'm happy to share the Perl code if it's useful. A. Mahoney (talk) 16:57, 12 July 2017 (UTC)

Weights of Chinese wikipedias

I noticed that the weights of zh.wiki and zh-classical.wiki are both 3.786. I think the weight should be higher for zh-classical.wiki, because Classical Chinese uses much shorter sentences to express the same thing. The character counts are given in parentheses.

Language | Example 1 | Example 2 | Example 3
Chinese | 走一千里路,是从迈第一步开始的。(14) | 我怎么能够将你比作夏天? / 你比夏天更美丽温婉(20) | 过氧乙酸可以通过乙醛的自氧化反应制得。(18)
Classical Chinese | 千里之行,始于足下。(8) | 卿如夏日,载欣载和。 / 西风列列,众芳独嗟。(16) | 过氧乙酸者,乙醛自氧化制之。(12)
English | A journey of a thousand li begins with a single step. | Shall I compare thee to a summer's day? / Thou art more lovely and more temperate | Peracetic acid is produced industrially by the autoxidation of acetaldehyde.

--Leiem (talk) 15:53, 8 July 2018 (UTC)

Redirects are not counted in the absent column

For example, if I click on the absent Russian wiki articles, the first one is "elephant". This article doesn't exist; it redirects to elephantine. Yanpas (talk) 21:51, 20 July 2018 (UTC)

The script completely relies on Wikidata, so if a redirect is included there, it will be counted as an article. I'm not sure what the current policy is about listing redirects in Wikidata items, but it could probably be removed. In a wider context, it's a problem of content organization. Do we describe organisms in line with the common (usually English) use of their name or in line with taxonomy? We haven't really come to a consensus about it yet. — Yerpo Eh? 06:43, 22 July 2018 (UTC)

Please help updating this

The list is supposed to be updated around 16 August, but it still has not been updated after a week. Would somebody help update it? Thank you very much. --Yaukasin (talk) 04:47, 23 August 2018 (UTC)

It seems like I don't have time to get the script working on my computer, so I won't be able to keep on updating this list monthly anymore. If someone else would like to run the script and update the list, please do! A version of the script is at List of Wikipedias by expanded sample of articles/Source code. Boivie (talk) 09:59, 23 October 2018 (UTC)
@Boivie: I have tried to run this with pywikibot, but it seems that the print statements in the code are out of date, and it requires a json module that I don't have. -Theklan (talk) 10:02, 28 October 2018 (UTC)
Yes, that kind of print statement was okay in Python 2, but not in Python 3, which is mostly used nowadays. I don't think it should be too difficult to install a module if you can control your environment. But I can't guarantee that you won't run into more problems along the way. Boivie (talk) 04:29, 31 October 2018 (UTC)
I've taken the liberty of updating the script so it doesn't return a ton of 'rvslots' notifications, and so it includes language editions that were started since the last update. Unfortunately, I cannot take responsibility for updating both sample rankings, but I can help occasionally. — Yerpo Eh? 14:33, 2 November 2018 (UTC)

Does anyone have any idea why the script would return "Q902 has no wikidata item" and then quit (also: "UnboundLocalError: local variable 'pagetext' referenced before assignment")? I stumbled upon that error with the expanded list for Djibouti (Q977), which is why I didn't update it two weeks ago, but now it's happening with the 1000 list too. I would imagine this to happen if there was a redirect linked, but I clicked through all the interwikis and didn't find any such case. — Yerpo Eh? 18:51, 6 November 2018 (UTC)
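
As general background on the second message: in Python, an UnboundLocalError of this kind appears when a local variable is assigned only inside a branch (or a try block) that was skipped, and is then read afterwards. The sketch below illustrates the pattern only; the function names are hypothetical and this is not the script's actual code, whose relevant part is not shown in this thread.

```python
def get_text_broken(found):
    if found:
        pagetext = "article text"
    # When `found` is False, `pagetext` was never assigned.
    return pagetext


def get_text_safe(found):
    pagetext = ""  # default set before any branch
    if found:
        pagetext = "article text"
    return pagetext


try:
    get_text_broken(False)
except UnboundLocalError as err:
    print("broken version:", err)

print("safe version:", repr(get_text_safe(False)))
```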

@Yerpo: @Boivie: I can't get it running. I get this error message:
Traceback (most recent call last):
  File "####\core\pwb.py", line 263, in <module>
    if not main():
  File "####\core\pwb.py", line 256, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "####\core\pwb.py", line 121, in run_python_file
    main_mod.__dict__)
  File ".\scripts\ListExpandedSample.py", line 15, in <module>
    import simplejson as json
ModuleNotFoundError: No module named 'simplejson'
<class 'ModuleNotFoundError'>
CRITICAL: Closing network session.
I don't know what to do now. -Theklan (talk) 18:58, 8 November 2018 (UTC)
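
The traceback above stops at `import simplejson as json`. Besides installing the package (for example with `pip install simplejson`), a common workaround is to fall back to the standard-library `json` module. The sketch below assumes the script only uses calls such as `json.loads` and `json.dumps` that both modules provide; this thread does not show which calls it actually makes.

```python
try:
    import simplejson as json  # used if the third-party package is installed
except ImportError:
    import json  # standard-library fallback

print(json.loads('{"items": 10000}')["items"])  # 10000
```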

Fixing

Let's fixing.--Jacek Janowski nr2 (talk) 09:08, 16 March 2019 (UTC)

@Jacek Janowski nr2: It is not clear what you are requesting. - dcljr (talk) 21:19, 10 April 2019 (UTC)

Redirects for section

How is size calculated if a redirect points to a section of an article? For example, "Railroad" (d:Q22667) in the Russian wiki redirects to a section of the "Rail transport" article (d:Q3565868). In this case, will the size of the full article be taken into account, or only that of the section? --Rg102 10:55, 12 January 2020 (UTC)

@Регион102: This is intended to be a list of representative articles. If a topic does not warrant a stand-alone article in the majority of languages, it should probably be replaced with a more significant one. Therefore, the code does not slice the text in an attempt to extract relevant content, which could be spread across multiple sections (e.g. references might reside in other sections, etc). --Dcirovic (talk) 00:51, 16 January 2020 (UTC)
See also the answer above. — Yerpo Eh? 12:31, 16 January 2020 (UTC)

About "has too much untranslated English"

I found a potential issue in how the scores are calculated. When I reviewed m:List of Wikipedias by expanded sample of articles/Shortest#zh 中文, No. 24 states: "Puppet state 0 Wrong language, zh:傀儡政權 has too much untranslated English." But when I opened zh:傀儡政權, I didn't find any text in English, except the references. I think the references shouldn't be counted, since we can't avoid using English in references. So I think the reason may be that there are a bunch of images with English names. So I think the calculation should be refined, right? -- Ma3r (talk) 09:08, 26 June 2020 (UTC)

I agree, the main reason for this is the file names of all the images. It would be quite difficult to exclude file names from the English language common words word count. But if someone can do it, I agree that it would be good. Boivie (talk) 15:05, 26 June 2020 (UTC)

I found it also happens with the Internal Link Assistant (like {{link-en|中文|English}}). I learned it from zh:政治局, which is No. 49 (Politburo 0 Wrong language, zh:政治局 has too much untranslated English) in m:List of Wikipedias by expanded sample of articles/Shortest#zh 中文. So I suggest excluding the text in various templates and template-like items (such as [[File:filename]]) when deciding whether an article "has too much untranslated English". -- Ma3r (talk) 08:15, 28 June 2020 (UTC)

That may be easier said than done, though, since the "File" keyword won't always be in English. One would probably need a list of all the ways to link to a file at Commons. A. Mahoney (talk) 15:09, 28 June 2020 (UTC)
Actually I mean, we can use different counts for calculating scores and for deciding "has too much untranslated English". For the latter, maybe we can consider ignoring all the text in {{}} and [[]]. Yeah, I know it's much more difficult to do. So, this is only a suggestion as FYI and may well be immature. Hope we can solve this issue eventually, and thanks for all the efforts. -- Ma3r (talk) 03:51, 30 June 2020 (UTC)
Can we just count "File"? If so, at least I can replace all "文件" with "File" to avoid such a fault. Doing something is better than doing nothing, right? Ma3r (talk) 06:41, 3 April 2023 (UTC)
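
To make the suggestion in this thread concrete, here is a minimal sketch of stripping templates and file links before counting English words. Everything in it is an assumption for illustration: the function names, the `file_aliases` parameter, and the tiny word list are hypothetical and not taken from the script. The simple regular expressions also miss file links whose captions contain nested [[links]].

```python
import re

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def strip_markup(wikitext, file_aliases=("File", "Image")):
    """Drop {{templates}} and [[File:...]] links from wikitext."""
    # Remove innermost templates repeatedly so nested ones disappear too.
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE.sub("", wikitext)
    # The namespace aliases are passed in because "File" is localised.
    aliases = "|".join(re.escape(a) for a in file_aliases)
    file_link = re.compile(r"\[\[\s*(?:%s)\s*:[^\[\]]*\]\]" % aliases, re.IGNORECASE)
    return file_link.sub("", wikitext)


def count_english_words(text, common_words):
    """Count tokens that appear in a set of common English words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return sum(1 for t in tokens if t in common_words)


COMMON = {"the", "of", "and"}  # stand-in for a real common-word list
sample = "傀儡政權{{link-en|中文|The state of the union}}[[File:Map of the world.png|thumb]]是"
print(count_english_words(sample, COMMON))                # 5
print(count_english_words(strip_markup(sample), COMMON))  # 0
```

Passing `file_aliases=("File", "文件")` would cover the localised keyword mentioned above, at the cost of maintaining such a list per language.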

About score values 16 Dec.2020

I manually recalculated the values in the table and have some questions. The score values for the Russian and Chinese Wikipedias are 82.46 and 82.47, respectively. Here are the formulas for the calculation:

rawscore = stubs*2 + articles*3 + long_articles*4

score = rawscore / (total_items * 0.04)

Here are my calculations:

For Russian:

stubs = 2,227

articles = 2,526

long_articles = 5,238

rawscore = 2227*2 + 2526*3 + 5238*4 = 32984

score = 32984 / 40000 * 100 = 82.46

For Chinese:

stubs = 2,136

articles = 1,857

long_articles = 5,778

rawscore = 2136*2 + 1857*3 + 5778*4 = 32955

score = 32955 / 40000 * 100 = 82.39

I see a contradiction between the actual table values and the results of my recalculation: in the table, the Chinese wiki has a higher score than the Russian one. How can that be? Maybe I misunderstand the algorithm?

P.S. I recalculated the values for the French wiki and they are right.

P.P.S. In the previous month there was the same situation with the Chinese Wikipedia. The calculations are below:

stubs = 2,151

articles = 1,858

long_articles = 5,759

rawscore = 2151*2 + 1858*3 + 5759*4 = 4302 + 5574 + 23036 = 32912

score = 32912 / 40000 * 100 = 82.28

But the actual value in the table was 82.36. Ковалевич Тимофей (talk) 18:22, 17 December 2020 (UTC)

I believe there is a bug in the code for pages with "Wrong language, ... has too much untranslated English." See List of Wikipedias by expanded sample of articles/Shortest#zh_中文. I don't remember exactly, but I think those articles are not included in the total number of articles for the calculation. So, if you have 10 "untranslated" articles, the maximum score isn't 4*10000, but 4*9990. That leads to the percentage getting higher than it should be. Boivie (talk) 10:31, 22 December 2020 (UTC)
It should be noted that the article page says the 8000 are characters, NOT bytes (...weighted size in characters...), and that the count does not include comments within the text or any interwiki text at the end of the article. These aspects would produce a difference compared with a size computed in bytes. Best regards, --Uruk (talk) 18:24, 4 April 2023 (UTC)
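
The effect of leaving "wrong language" articles out of the total can be tried out numerically. The sketch below assumes exactly 10 excluded articles, which is the figure used as an example in this thread; the real count is not confirmed here. The function and variable names are illustrative.

```python
def score(raw, total_items, excluded=0):
    """Score in percent; `excluded` items are left out of the denominator."""
    return 100 * raw / (4 * (total_items - excluded))


raw_zh = 2136 * 2 + 1857 * 3 + 5778 * 4        # 32955, December 2020 table
raw_zh_prev = 2151 * 2 + 1858 * 3 + 5759 * 4   # 32912, previous month

print(round(score(raw_zh, 10000), 2))                    # 82.39
print(round(score(raw_zh, 10000, excluded=10), 2))       # 82.47
print(round(score(raw_zh_prev, 10000, excluded=10), 2))  # 82.36
```

Under that assumption the results coincide with the table values of 82.47 and 82.36 quoted in this section.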

Update

Greetings everyone,

It's been almost a month since the last update. Has anyone ever experienced this long a wait?

Just curious if this is something that happens very often.

Боки (talk) 22:06, 10 October 2022 (UTC)

Yes, it happens every month. Boivie (talk) 14:08, 11 October 2022 (UTC)

Adding non-existing articles for the calculation of the median and the mean

Hello! Some Wikipedias with very few articles in the list get a better result for "mean article size" and "median", because non-existing articles are not counted. I think they should be counted as 0 bytes, so the statistics are fairer to languages with more articles. Imagine that a language creates every article in the list but all of them are around 5.000 bytes. Its mean and median would be far below those of languages that have only 50-60 articles, but longer ones. What do you think? Theklan (talk) 21:05, 15 October 2022 (UTC)
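
The proposal can be illustrated with a short sketch. The numbers are made up, and the function name and its `count_missing` flag are hypothetical; this is not how the script currently computes the columns.

```python
from statistics import mean, median


def size_stats(sizes, total_items, count_missing=False):
    """Mean and median article size; optionally pad missing articles with 0."""
    values = list(sizes)
    if count_missing:
        values += [0] * (total_items - len(values))
    return mean(values), median(values)


# Made-up numbers: a wiki with only 3 of 10 list articles, all of them long
existing = [30000, 40000, 50000]
print(size_stats(existing, 10))                      # mean 40000, median 40000
print(size_stats(existing, 10, count_missing=True))  # mean 12000, median 0
```

With missing articles counted as 0, the median of any wiki that has fewer than half of the list drops to 0, so the mean would probably be the more informative of the two columns under this proposal.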

Changing color code to Viridis

Dear @Dcirovic. Two months ago I proposed a change to the List of Wikipedias by sample of articles color code. You can see the discussion and code here. I think that Viridis is better because it gives better information for color blind people, and the two lists would then present their information more uniformly. What do you think about changing this in the code? Thanks. Theklan (talk) 09:40, 7 April 2023 (UTC)

@Theklan: In my opinion the Viridis colors are ugly. The color blind people could use the first column with ordinal numbers instead of colors. However, if the user community overwhelmingly prefers those unsightly colors, I will implement them. --Dcirovic (talk) 14:19, 7 April 2023 (UTC)
I think there's no "overwhelming community" in this discussion. The color scheme is more logical than the current one, where green and orange don't have the meaning they usually have. Theklan (talk) 14:49, 7 April 2023 (UTC)

November update

Hello @Dcirovic! As there wasn't an update this month, I would like to ask if you need help, or if this is something that someone else could be doing. Let me know how we can help with that. Theklan (talk) 10:00, 18 November 2024 (UTC)

Hmm, give him a few more days, maybe he's busy; but if he really needs someone else to take over, more people are surely able to help, me included. Greetings. --Manlleus (talk) 16:48, 18 November 2024 (UTC)
I'm trying to run the code myself... I'm having a couple of problems, but it would be something I could do. Theklan (talk) 17:43, 18 November 2024 (UTC)
I ran the script, and, for some unknown reason, the latest changes appear as non-existing, even though those articles exist. For example, Amaterasu appears as non-existing at some top Wikipedias, while the article exists. I have reverted my edits, but I would like to know why this is happening. Theklan (talk) 06:44, 19 November 2024 (UTC)
@Theklan: Maybe Yerpo can help you better, as he runs the 1,000 article list. --Manlleus (talk) 13:49, 19 November 2024 (UTC)
I'll try in a few hours, but the result might be the same because I can't do anything but run the same script. — Yerpo Eh? 13:51, 19 November 2024 (UTC)
I'm running it again, and solved the issue. It was an old item list in json format. I have deleted it, and I think now it's running properly. Theklan (talk) 15:50, 19 November 2024 (UTC)
I'll leave it to you, then, my run won't finish today. — Yerpo Eh? 18:03, 19 November 2024 (UTC)
Now that I know how to do it, if this is no longer possible for you, I think I could help. However, I don't know why this was done on the 16th of every month, instead of at the start of the month. Theklan (talk) 15:49, 20 November 2024 (UTC)
As I mentioned here, I have updated it on the first day of the month. I have also changed the color code, as I mentioned above some months ago. If there's any opposition, please let me know. Theklan (talk) 16:52, 1 December 2024 (UTC)
Ok, I reverted it, as I get an error when I try to update again, due to the color code I used. Yerpo, you may know where the problem is. When running I get this error:
File "ListExpandedSample.py", line 585, in GetPreviousScores
    prev_score[tokens[0]] = float(tokens[1])
                            ^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'color:#ffffff'
CRITICAL: Exiting due to uncaught exception ValueError: could not convert string to float: 'color:#ffffff'
I know that the error comes because it takes 1 sr color:#ffffff instead of the previous result. However, I couldn't find how the program is finding the result, as there's nothing in the code about this. Theklan (talk) 17:16, 1 December 2024 (UTC)
@Theklan: It looks like there's an extra backslash in your cell style output, perhaps that's causing the issue. Try using this code for the color code score:
       #color code score
       if score >= 100.00:
           color = "|style = \"background: "+'\u0023'+color10000+"\; color:#ffffff\""
       elif score >= 80.00:
           color = "|style = \"background: "+'\u0023'+color8000+"\; color:#ffffff\""
       elif score >= 60.00:
           color = "|style = \"background: "+'\u0023'+color6000+"\; color:#ffffff\""
       elif score >= 40.00:
           color = "|style = \"background: "+'\u0023'+color4000+"\; color:#ffffff\""
       elif score >= 30.00:
           color = "|style = \"background: "+'\u0023'+color3000+"\; color:#000000\""
       elif score >= 20.00:
           color = "|style = \"background: "+'\u0023'+color2000+"\; color:#000000\""
       elif score >= 10.00:
           color = "|style = \"background: "+'\u0023'+color1000+"\; color:#000000\""
       elif score >= 5.00:
           color = "|style = \"background: "+'\u0023'+color500+"\; color:#000000\""
       elif score >= 1.00:
           color = "|style = \"background: "+'\u0023'+color100+"\; color:#000000\""
       else:
           color = "|style = \"background: "+'\u0023'+color0+"\; color:#000000\""
PS: personally, I'd prefer this list to continue being updated in the middle of the month so the excitement is more spread out (the exact date doesn't really matter, as long as it's once per month), but it's your call. — Yerpo Eh? 18:32, 1 December 2024 (UTC)
I'll try this code next time, but I copied the version from your source code, so it should be the same! Theklan (talk) 19:08, 1 December 2024 (UTC)
Now I remember. The previous score retrieval code is very simple: article[score_first+31:score_last-1]. It depends on the character length of each line of the table. That's why it broke when I first changed the color scheme, so I had to add ; color:#ffffff manually and change the number to fit the exact position of the score value (it's probably different from "+31" in your GetPreviousScores.py script). Please ignore the extra backslash comment, it's the same in my output and I'll recheck if that semicolon there really has to be escaped. — Yerpo Eh? 19:36, 1 December 2024 (UTC)
That's right, there's a score_last = article.find('|', score_first+32)! Theklan (talk) 16:08, 2 December 2024 (UTC)
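
Since the fixed offsets break whenever the style attribute changes length, one alternative would be to locate the score with a regular expression. The sketch below is an assumption throughout: the cell layout in the sample rows is a guess based on the color code quoted above (the real table line is not shown in this thread), and `extract_score` is a hypothetical helper, not part of the script.

```python
import re

# Assumed cell layout: |style = "..." | 82.46
# The real layout may differ, and the pattern would need adjusting to it.
SCORE_CELL = re.compile(r'\|\s*style\s*=\s*"[^"]*"\s*\|\s*([0-9]+(?:\.[0-9]+)?)')


def extract_score(row):
    """Return the score following the style attribute, or None if absent."""
    match = SCORE_CELL.search(row)
    return float(match.group(1)) if match else None


print(extract_score('|style = "background: #440154\\; color:#ffffff" | 82.46'))  # 82.46
print(extract_score('|style = "background: #fde725" | 3.50'))                    # 3.5
print(extract_score('| no style cell here'))                                     # None
```

Because the pattern skips over whatever sits between the quotes, adding or removing `; color:#ffffff` would no longer shift the position of the extracted value.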