CIS-A2K/Indic Languages/Statistics/2011 September
After a long gap of almost 7 months, I have compiled the statistical report for the Indian Language wikipedias.
As all of you know, I recently joined the India Programs of WMF to support the Indic language wiki projects. In the past I interacted with various Indic language wikipedians on community-related and technical matters (for example, the FAQ booklet translated into various Indic languages, the typing tool integrated into several Indic wikipedias (now rechristened the Narayam extension, with the official backing of WMF), the Wiki India Newsletter, and so on). From now on I will be able to spend more time on the Indic language wiki projects.
The data for this report is taken from http://stats.wikimedia.org/. I thank Erik Zachte for providing me the support for the same.
Because of the long processing time, the reports at stats.wikimedia.org are generated about a month late. Hence this report captures the data for 2011 September (the data for 2011 August is given for comparison).
Unlike in the previous reports, from now on I will analyze data only for the languages spoken in India. So I have excluded from this report the languages of neighboring countries, such as Sinhala and Burmese (even though I am personally interested in watching the growth of these wikipedias too, since those languages are closely related to one or another language spoken in India).
The following are the languages of India I have selected for this report. The number of speakers is given against each language.
NOTE: I have used the Indian way of denoting large numbers, since that makes more sense in India. For other readers: a crore is equal to 10 million, and a lakh is 100,000.
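For readers who want to convert between the two systems, the arithmetic can be sketched in Python. The helper names below (crore_to_int, lakh_to_int, format_indian) are illustrative assumptions and are not part of any Wikimedia tool.

```python
CRORE = 10_000_000  # 1 crore = 10 million
LAKH = 100_000      # 1 lakh = 100 thousand


def crore_to_int(value):
    """Convert a figure given in crores to a plain integer."""
    return round(value * CRORE)


def lakh_to_int(value):
    """Convert a figure given in lakhs to a plain integer."""
    return round(value * LAKH)


def format_indian(n):
    """Group digits the Indian way: the last three digits, then pairs."""
    digits = str(abs(int(n)))
    if len(digits) <= 3:
        grouped = digits
    else:
        head, tail = digits[:-3], digits[-3:]
        pairs = []
        # Peel two-digit groups off the right end of the remaining digits.
        while len(head) > 2:
            pairs.insert(0, head[-2:])
            head = head[:-2]
        if head:
            pairs.insert(0, head)
        grouped = ",".join(pairs + [tail])
    return ("-" if n < 0 else "") + grouped
```

For example, format_indian(crore_to_int(25.8)) gives 25,80,00,000 (258 million, the Hindi native speaker figure below), and format_indian(100026) gives 1,00,026, the form used in the article tables.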
Indian languages having wikipedia
In India, Hindi is the language with the largest number of native speakers. A few more languages have a huge speaker base. But when it comes to the Wikimedia movement in India, the speaker base does not make much impact.
Language | Number of Speakers (in crores, unless noted) |
Hindi | 25.8 (native language speakers); 42.2 (if Bhojpuri, Awadhi, Chhattisgarhi, Rajasthani and other languages (or dialects) close to Hindi are included) |
Bengali | 23 |
Marathi | 9 |
Telugu | 8 |
Tamil | 6.6 |
Urdu | 6 |
Kannada | 4.7 |
Gujarati | 4.6 |
Sindhi | 4.1 |
Bhojpuri | 3.85 |
Malayalam | 3.7 |
Odia (Oriya) | 3.1 |
Nepali | 3 |
Punjabi | 2.9 |
Assamese | 1.3 |
Kashmiri | 50 lakh |
Nepal Bhasha/Newari | 8 lakh |
Bishnupriya Manipuri | 4.5 lakh |
Sanskrit | 50,000 |
Pali | No native language speakers |
The number of speakers of Punjabi (the language spoken in the Punjab state of India) is often misquoted, including in the WMF stats. Punjabi has two variants (Eastern Punjabi and Western Punjabi). According to various Indian statistics reports, the Punjabi language that uses the Gurumukhi script (called Eastern Punjabi on en Wikipedia) is spoken by almost 2.9 crore people. Western Punjabi (spoken by close to 6 crore people in Pakistan) has its own wikipedia (http://pnb.wikipedia.org). I assume the issue is similar to that of Hindi and Urdu, where the languages are closely related but use different scripts for various reasons. Since my interest is in the Punjabi wikipedia (http://pa.wikipedia.org) that uses the Gurumukhi script, I have used the number of speakers of that language.
I also found that the Bhojpuri wikipedia (http://bh.wikipedia.org/) still uses the wrong language code (bh), the code that represents the Bihari language family. Bihari (ISO 639-1 bh) is a language family, and Bhojpuri is just one of the languages in this family (Angika, Fiji Hindi and Maithili are a few others). Bhojpuri has the language code bho. So we need to do two things for the Bhojpuri wikipedia: 1. update the language code to bho; 2. change the language name to Bhojpuri (instead of Bihari) in Wikimedia records.
I have included in this report almost all the important parameters for which the required, updated data is available. From the next report onwards I will add a few more relevant parameters. The placement of languages in all the tables of this report is based on the number of speakers.
Article statistics
Number of Articles
Hindi, with more than 1 lakh (100,000) articles, is at the top. The Newari wikipedia comes second with 69,826 articles, and Telugu comes third with 48,803.
Wikipedia Language | 2011 August 31 | 2011 September 30 |
Hindi | 1,00,026 | 1,00,353 |
Bengali | 22,338 | 22,400 |
Marathi | 34,431 | 34,675 |
Telugu | 48,533 | 48,803 |
Tamil | 36,272 | 38,026 |
Urdu | 17,177 | 17,290 |
Kannada | 10,992 | 10,971 |
Gujarati | 20,984 | 21,302 |
Sindhi | 385 | 393 |
Bhojpuri | 2,692 | 2,694 |
Malayalam | 19,807 | 20,318 |
Odia (Oriya) | 1,523 | 1,626 |
Nepali | 15,569 | 16,004 |
Punjabi | 2,015 | 3,316 |
Assamese | 547 | 672 |
Kashmiri | 346 | 346 |
Nepal Bhasha/Newari | 69,826 | 69,826 |
Bishnupriya Manipuri | 24,754 | 24,763 |
Sanskrit | 5,199 | 6,062 |
Pali | 2,790 | 2,791 |
In the span of 9 months (from 2011 January), Hindi wikipedia has added more than 40,000 articles. But did the community size increase? See the next few parameters for more information.
The Odia and Assamese wikipedias have made much progress since my last report: both the article count and the community strength have increased. The article count of the Punjabi wikipedia is also growing fast (from 2,015 to 3,316 in one month).
Edits per article
Edits per article shows the average number of times a wikipedia article has been edited. More edits to an article generally mean that more people have contributed to it and that its neutrality is higher. For active wikipedias it is a rough indicator of quality: a wiki article gains encyclopedic value as more people read and edit it.
Wikipedia Language | 2011 August | 2011 September |
Hindi | 10 | 11.2 |
Bengali | 30.6 | 31 |
Marathi | 17.2 | 17.5 |
Telugu | 9.7 | 9.8 |
Tamil | 17.6 | 17.4 |
Urdu | 18.7 | 19 |
Kannada | 15.6 | 15.8 |
Gujarati | 6.4 | 6.5 |
Sindhi | 21.6 | 21.5 |
Bhojpuri | 27.8 | 28.5 |
Malayalam | 30.3 | 30.3 |
Odia (Oriya) | 18.2 | 18.1 |
Nepali | 6.8 | 7.4 |
Punjabi | 18.1 | 12.3 |
Assamese | 19.8 | 18.9 |
Kashmiri | 40.5 | 41.5 |
Nepal Bhasha/Newari | 4.2 | 4.2 |
Bishnupriya Manipuri | 13.3 | 13.5 |
Sanskrit | 18.8 | 17.8 |
Pali | 22.4 | 23 |
Among active wikipedias, Bengali and Malayalam have the most edits per article. This is expected, since both have very active communities. (For community strength, refer to the next few tables.)
For languages like Kashmiri and Pali, the edits per article figure is high because they have very few articles and the same articles get edited (mostly by bots) again and again.
Editor and Reader Statistics
Number of active wikipedians (at least 5 edits per month)
Wikipedia Language | 2011 August | 2011 September |
Hindi | 54 | 56 |
Bengali | 45 | 42 |
Marathi | 32 | 31 |
Telugu | 32 | 31 |
Tamil | 80 | 83 |
Urdu | 16 | 17 |
Kannada | 16 | 12 |
Gujarati | 11 | 9 |
Sindhi | 1 | 1 |
Bhojpuri | 0 | 2 |
Malayalam | 99 | 85 |
Odia (Oriya) | 6 | 9 |
Nepali | 19 | 16 |
Punjabi | 4 | 3 |
Assamese | 14 | 17 |
Kashmiri | 1 | 0 |
Nepal Bhasha/Newari | 1 | 1 |
Bishnupriya Manipuri | 0 | 2 |
Sanskrit | 14 | 14 |
Pali | 1 | 1 |
Malayalam and Tamil top the list with almost 85 active editors each. But Malayalam has lost 14 active editors since the previous month.
As said before, when it comes to the Wikimedia movement, the speaker base does not make much impact.
For example, Sanskrit, with just 50,000 speakers, is making a huge impact in the Wikimedia world. It has 14 active users, a bigger community than those of many much larger Indic languages. I am impressed by the efforts of the Sanskrit wiki projects, especially the sister wiki projects (for example Wikisource), and by their way of interacting with other Indic language wikipedias and adopting their best practices.
In the past few months I have conducted 3 wiki workshops for Sanskrit. Each time I found the community maturing in its vision for the future of the Sanskrit wiki projects. In the last Sanskrit wiki workshop the main focus was on defining the category tree for Sanskrit wikipedia.
When I published the last report, the Odia and Assamese wikipedias were inactive. Now the situation has changed: each has a community working on it.
The progress made by the Odia wiki project (http://or.wikipedia.org/) is noteworthy. Kudos to the Odia wiki community for all the online and offline initiatives to build the community and increase the article count. I was actively involved in the community building for Odia wikipedia. I still remember the day (2011 January 15) when I introduced Odia wikipedia and the Odia typing tool (developed by Junaid) to an Odia speaker, Ashuthosh Kar, during the Wiki X celebration at Bangalore. Through him we very soon got a wonderful wikipedian, Subhashish, who is now leading the efforts for Odia. Initially Subhashish and I used to meet at my home and work on the basics for Odia wiki. I remember us working on the Odia wikipedia logo, the FAQ booklet, Translate wiki, and so on. Soon we got more members on the team through the few Odia wiki workshops held at Bangalore. Along with the workshops, Odia wikipedians translated the FAQ booklet into Odia and worked to integrate the Odia typing solution developed by Junaid into Odia wikipedia. Later, with the support of Dhanada Mishra (the chairman of Human Development Foundation (http://www.hdf.org.in/)) and a young student Odia wikipedian, Srikanth Kedia, we conducted a wiki workshop at Bhubaneshwar. Odia wikipedians from Bangalore are doing an excellent job, and many of them now participate in Wikimedia India chapter activities as well.
The Odia wiki project picked up not because Odia has a huge speaker base, high literacy, access to computers, or anything else; it became active only because it received volunteers who have the passion and vision to develop a wikipedia in their mother language. We need similar volunteers for each Indic language.
The case is similar for Assamese. I had been trying to find a good volunteer for Assamese wikipedia for the past 3 years. Initially I tried to find volunteers in Bangalore, since Bangalore has a good representation of the Assamese community and it is easy for me to reach people there. But that didn't work out. Then I tried online outreach, initially through emails. Finally I got connected with Prabhakar, who is a professor at NIT in Silchar, Assam. Together we tried online outreach, first through a Google group (it didn't work out), then through Facebook. A Facebook page was created for the Assamese wiki projects, aimed at bringing together all Assamese people who are on Facebook and are interested in Assamese wiki projects. It has more than 460 members now. Prabhakar used that group and his personal contacts effectively to build a community for Assamese wikipedia. That worked. Assamese wikipedia slowly started becoming active. Later Prabhakar started another Facebook group, dedicated to NIT Silchar, for promoting Indic language wikipedias among the students (and alumni) of NIT Silchar. Because of all this, in the next few months we are going to see more wiki activity from the Assam state of India and in the Assamese wikipedia. Recently Narayam (the typing solution extension) was integrated into Assamese wikipedia. Thanks to the Assamese wikipedians Chaipu, Prabhakar and the other volunteers who actively worked to make it a reality. A major roadblock to bringing Assamese people to Assamese wikipedia has now been removed.
The Assamese wiki community is currently concentrating on online outreach, but it plans to start offline outreach activities soon.
It is interesting to note that the community size of smaller languages is equal to or even larger than that of much bigger languages. I don't fully understand this and would like to hear your opinion. One hypothesis is that a larger proportion of the speakers of smaller languages are passionate about their language and are therefore willing to put in additional effort to showcase their mother language. Each Indic language wiki project is waiting for the few users who have vision and passion for the future of that project.
The technical issues and other roadblocks are greater for smaller languages.
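The observation that community size does not track speaker base can be checked by normalizing the active editor counts by the number of speakers. The sketch below is only an illustration: the "active editors per crore speakers" ratio is an assumption introduced here, not a metric published at stats.wikimedia.org, and the figures are copied from the speaker table and the 2011 September column of the active wikipedians table in this report.

```python
# Speakers in crores (Sanskrit's 50,000 speakers = 0.005 crore).
speakers_crore = {
    "Hindi": 25.8,
    "Bengali": 23,
    "Tamil": 6.6,
    "Malayalam": 3.7,
    "Sanskrit": 0.005,
}

# Active wikipedians (at least 5 edits) in 2011 September.
active_editors = {
    "Hindi": 56,
    "Bengali": 42,
    "Tamil": 83,
    "Malayalam": 85,
    "Sanskrit": 14,
}


def editors_per_crore(speakers, editors):
    """Return active editors per crore speakers for each language."""
    return {lang: editors[lang] / speakers[lang] for lang in speakers}


ratios = editors_per_crore(speakers_crore, active_editors)

# Languages ordered from the highest ratio to the lowest.
ranking = sorted(ratios, key=ratios.get, reverse=True)

for lang in ranking:
    print(f"{lang}: {ratios[lang]:.1f} active editors per crore speakers")
```

On these figures the ordering is Sanskrit, Malayalam, Tamil, Hindi, Bengali, which is roughly the reverse of the ordering by speaker base.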
Number of highly active wikipedians (more than 100 edits per month)
Highly active wikipedians are the editors who make at least 100 edits per month. In fact, we can say that they are the people who run the respective language wiki.
Wikipedia Language | 2011 August | 2011 September |
Hindi | 17 | 14 |
Bengali | 4 | 5 |
Marathi | 7 | 10 |
Telugu | 6 | 9 |
Tamil | 23 | 22 |
Urdu | 2 | 2 |
Kannada | 2 | 2 |
Gujarati | 2 | 3 |
Sindhi | 0 | 0 |
Bhojpuri | 0 | 0 |
Malayalam | 24 | 24 |
Odia (Oriya) | 2 | 2 |
Nepali | 4 | 4 |
Punjabi | 2 | 2 |
Assamese | 3 | 4 |
Kashmiri | 0 | 0 |
Nepal Bhasha/Newari | 1 | 1 |
Bishnupriya Manipuri | 0 | 0 |
Sanskrit | 2 | 7 |
Pali | 0 | 0 |
Here too Malayalam and Tamil top the list. In fact, the more highly active editors a wiki has, the more activities (not just article creation) come out of that wiki community. That is why offline projects, photo events, article writing contests, community quizzes, collaboration with the respective state governments, photo contests, wiki workshops, and many other innovative wiki projects come out of these two wiki communities. So ideally we should convert more active editors into highly active editors to make the wiki activism in each language wikipedia more vibrant.
Registered users who edited at least 10 times since they arrived
This parameter shows how many of the registered users actually turned into wiki editors and made at least a few edits.
A related parameter, newly registered users who edited at least 10 times, is a subset of this one and has its own table further below. It shows how many of the newly registered users turned into wiki editors. The Tamil wiki community leads there. Among smaller communities, Assamese is also doing well (for the reasons given elsewhere in this report).
Wikipedia Language | 2011 August | 2011 September |
Hindi | 675 | 679 |
Bengali | 480 | 486 |
Marathi | 313 | 321 |
Telugu | 452 | 459 |
Tamil | 628 | 645 |
Urdu | 238 | 238 |
Kannada | 262 | 263 |
Gujarati | 127 | 127 |
Sindhi | 20 | 21 |
Bhojpuri | 19 | 19 |
Malayalam | 695 | 709 |
Odia (Oriya) | 26 | 27 |
Nepali | 103 | 106 |
Punjabi | 41 | 41 |
Assamese | 37 | 43 |
Kashmiri | 10 | 10 |
Nepal Bhasha/Newari | 24 | 24 |
Bishnupriya Manipuri | 27 | 29 |
Sanskrit | 68 | 72 |
Pali | 7 | 7 |
Even though many bigger languages have more registered users, Malayalam continues to be at the top. Hindi and Tamil come second and third. I wish all wiki communities could convert more registered users into active wiki editors.
Newly registered users who edited at least 10 times
Wikipedia Language | 2011 August | 2011 September |
Hindi | 17 | 6 |
Bengali | 14 | 6 |
Marathi | 5 | 7 |
Telugu | 4 | 7 |
Tamil | 9 | 17 |
Urdu | 3 | 0 |
Kannada | 0 | 1 |
Gujarati | 2 | 0 |
Sindhi | 0 | 1 |
Bhojpuri | 0 | 0 |
Malayalam | 14 | 14 |
Odia (Oriya) | 0 | 1 |
Nepali | 2 | 3 |
Punjabi | 0 | 0 |
Assamese | 4 | 6 |
Kashmiri | 0 | 0 |
Nepal Bhasha/Newari | 0 | 0 |
Bishnupriya Manipuri | 0 | 2 |
Sanskrit | 4 | 4 |
Pali | 0 | 0 |
Page Views – Non Mobile (In Lakhs)
This parameter shows how many readers are waiting for us. Are we caring for them? The following table gives the data for non-mobile access (mainly PCs). Even though Hindi lags behind in some of the wikipedia editing and editor metrics, the speakers of Hindi are not lagging behind in using Hindi wikipedia. Hindi, with 77 lakh page views, tops the list. No other Indic language comes near Hindi. This is expected considering the huge speaker base of Hindi.
Wikipedia Language | 2011 August | 2011 September |
Hindi | 88 | 77 |
Bengali | 24 | 29 |
Marathi | 43 | 46 |
Telugu | 24 | 23 |
Tamil | 39 | 41 |
Urdu | 15 | 16 |
Kannada | 15 | 13 |
Gujarati | 9.9 | 8.84 |
Sindhi | 0.82 | 0.72 |
Bhojpuri | 1.98 | 1.75 |
Malayalam | 31 | 28 |
Odia (Oriya) | 2.17 | 2.07 |
Nepali | 8.17 | 6.94 |
Punjabi | 2.5 | 2.6 |
Assamese | 1.39 | 1.28 |
Kashmiri | 0.87 | 0.72 |
Nepal Bhasha/Newari | 10 | 13 |
Bishnupriya Manipuri | 8.58 | 7.65 |
Sanskrit | 3.67 | 3.43 |
Pali | 1.25 | 1.39 |
Even inactive wikipedias like Sindhi, Kashmiri, and Pali have thousands of people accessing them every month. But do we have enough content to offer these readers? We need to build a community for each of these languages to serve our readers. In fact, we should be converting some of these readers into wikipedians of the respective language wikipedias.
Page Views – Mobile
This parameter shows how many readers access each language wikipedia from mobile devices. Unlike the non-mobile data, this data shows an upward trend for most of the Indic languages. Even though the rendering of Indic scripts is poor on most mobile devices, many users are accessing the wikipedias this way. This also shows where our future readers are going to come from.
Wikipedia Language | 2011 August | 2011 September |
Hindi | 1,94,000 | 2,53,000 |
Bengali | 28,000 | 45,000 |
Marathi | 43,000 | 64,000 |
Telugu | 9,600 | 31,000 |
Tamil | 46,000 | 65,000 |
Urdu | 35,000 | 40,000 |
Kannada | 6,900 | 15,000 |
Gujarati | 15,000 | 27,000 |
Sindhi | 265 | 240 |
Bhojpuri | 1,100 | 4,600 |
Malayalam | 15,000 | 33,000 |
Odia (Oriya) | 1,600 | 1,600 |
Nepali | 8,800 | 22,000 |
Punjabi | 980 | 836 |
Assamese | 967 | 1,500 |
Kashmiri | 415 | 457 |
Nepal Bhasha/Newari | 2,000 | 1,900 |
Bishnupriya Manipuri | 1,800 | 1,900 |
Sanskrit | 602 | 3,800 |
Pali | 132 | 1,500 |
Here again Hindi comes first. Most of the other Indic languages are increasing their mobile reader base. For some languages the growth over the previous month is more than 100%. I assume this will keep increasing, and it is our duty to welcome all these new readers and convert some of them into editors.
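The month-over-month growth behind these remarks can be worked out directly from the mobile page view table. The following is a minimal sketch; the function and variable names are illustrative assumptions, and since the source figures are rounded, the percentages are approximate.

```python
# Mobile page views: (2011 August, 2011 September), from the table above.
mobile_views = {
    "Hindi": (194000, 253000),
    "Bengali": (28000, 45000),
    "Marathi": (43000, 64000),
    "Telugu": (9600, 31000),
    "Tamil": (46000, 65000),
    "Urdu": (35000, 40000),
    "Kannada": (6900, 15000),
    "Gujarati": (15000, 27000),
    "Sindhi": (265, 240),
    "Bhojpuri": (1100, 4600),
    "Malayalam": (15000, 33000),
    "Odia (Oriya)": (1600, 1600),
    "Nepali": (8800, 22000),
    "Punjabi": (980, 836),
    "Assamese": (967, 1500),
    "Kashmiri": (415, 457),
    "Nepal Bhasha/Newari": (2000, 1900),
    "Bishnupriya Manipuri": (1800, 1900),
    "Sanskrit": (602, 3800),
    "Pali": (132, 1500),
}


def growth_percent(before, after):
    """Percentage change from one month to the next."""
    return (after - before) / before * 100


growth = {lang: growth_percent(aug, sep) for lang, (aug, sep) in mobile_views.items()}

# Languages whose mobile page views more than doubled in one month.
more_than_doubled = sorted(lang for lang, pct in growth.items() if pct > 100)

# Languages whose mobile page views fell.
declined = sorted(lang for lang, pct in growth.items() if pct < 0)

print("Growth above 100%:", ", ".join(more_than_doubled))
print("Declined:", ", ".join(declined))
```

On these figures, seven wikipedias (Bhojpuri, Kannada, Malayalam, Nepali, Pali, Sanskrit and Telugu) grew by more than 100%, while Nepal Bhasha/Newari, Punjabi and Sindhi declined slightly.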
Conclusion
In short, in terms of readers most of the Indian languages are doing well. But when it comes to editors that is not the case. One hypothesis could be that we are more knowledge consumers than knowledge creators. Another, and probably a more valid one, is that there are large gaps in basic awareness of the existence of Indic language wikipedias, relatively low use of Indic languages on the Internet, an (expected) lack of familiarity with wiki editing, technical issues with regard to Indic languages, tiny community sizes, and many other things.
As wikipedians can we change this scenario?
For Wikimedia India we have a lot of things to do: building the community, overcoming the technical challenges, creating awareness about wikipedia (and, more importantly, about Indic language wikipedias), and so on. The challenges are many, but the opportunities are massive. I look forward to working closely with the various language communities on realizing the enormous potential of their respective languages.
Once again, I welcome your views, comments and opinions on the above. Please share your views as comments here. You can also reach me at shiju@wikimedia.org in case you want to send a personal mail. Thanks for reading.