CIS-A2K/Indic Languages/Statistics/2011 September

After a long gap of almost 7 months, I have compiled the statistical report for the Indian Language wikipedias.

As all of you know, I recently joined the India Programs of WMF to support the Indic language wiki projects. In the past I interacted with Indic language wikipedians on various community and technical matters (for example, the FAQ booklet translated into various Indic languages, the typing tool integrated into several Indic wikipedias (now rechristened the Narayam extension, with the official backing of WMF), the Wiki India Newsletter, and so on). From now on I will be able to spend more time on the Indic language wiki projects.

The data for this report is taken from http://stats.wikimedia.org/. I thank Erik Zachte for his support in obtaining it.

Due to the long processing time, the report at stats.wikimedia.org is generated about a month late. Hence this report captures the data for 2011 September (data for 2011 August is also given for comparison).

Unlike the previous reports, from now on I will analyze data only for the languages spoken in India. So I have excluded from this report the languages of neighboring countries, such as Sinhala and Burmese (even though I am personally interested in watching the growth of those wikipedias too, since those languages are closely related to one or another language spoken in India).

The languages of India I have selected for this report are listed below, with the number of speakers given against each language.

NOTE: I have used the Indian way of denoting large numbers, since that makes more sense in India. For other readers: one crore is equal to 10 million, and one lakh is 100,000.
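For readers unfamiliar with these units, the conversion can be sketched in a few lines of Python. This is only an illustration: the helper names (`crore_to_number`, `lakh_to_number`, `indian_grouping`) are made up for this example and are not part of any tool used to prepare the report.

```python
# Conversion helpers for the Indian numbering units used in this report.
# All function names here are hypothetical, chosen only for illustration.
LAKH = 100_000        # 1 lakh = 100,000
CRORE = 10_000_000    # 1 crore = 10 million = 100 lakh


def crore_to_number(crores):
    """Convert a value given in crores to a plain integer."""
    return round(crores * CRORE)


def lakh_to_number(lakhs):
    """Convert a value given in lakhs to a plain integer."""
    return round(lakhs * LAKH)


def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping.

    The last three digits form one group; the remaining digits are
    grouped in pairs, as in the article counts of this report.
    """
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups) + "," + tail


# Hindi native speakers: 25.8 crore
print(crore_to_number(25.8), indian_grouping(crore_to_number(25.8)))
# Kashmiri speakers: 50 lakh
print(lakh_to_number(50), indian_grouping(lakh_to_number(50)))
```

With these helpers, 25.8 crore comes out as 258,000,000 in the international style, and an article count such as 100026 is written 1,00,026 in the Indian style used in the tables.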

Indian languages having wikipedia

In India, Hindi is the language with the largest number of native speakers. A few more languages also have a huge speaker base. But when it comes to the Wikimedia movement in India, the size of the speaker base does not have much impact.





Language Number of Speakers (in crores unless noted otherwise)
Hindi 25.8 (native language speakers); 42.2 (if Bhojpuri, Awadhi, Chhattisgarhi, Rajasthani and other languages (or dialects) close to Hindi are included)
Bengali 23
Marathi 9
Telugu 8
Tamil 6.6
Urdu 6
Kannada 4.7
Gujarati 4.6
Sindhi 4.1
Bhojpuri 3.85
Malayalam 3.7
Odia (Oriya) 3.1
Nepali 3
Punjabi 2.9
Assamese 1.3
Kashmiri 50 lakh
Nepal Bhasha/Newari 8 lakh
Bishnupriya Manipuri 4.5 lakh
Sanskrit 50,000
Pali No native language speakers

The number of speakers for Punjabi (the language spoken in the Punjab state of India) is often misquoted, including in the WMF stats. The Punjabi language has two variants (Eastern Punjabi and Western Punjabi). According to various Indian statistics reports, the Punjabi language that uses the Gurumukhi script (called Eastern Punjabi on en Wikipedia) is spoken by almost 2.9 crore people. Western Punjabi (spoken by close to 6 crore people in Pakistan) has its own wikipedia (http://pnb.wikipedia.org). I assume the issue is similar to that of Hindi and Urdu, where the languages are closely related but use different scripts for various reasons. Since my interest is in the Punjabi wikipedia (http://pa.wikipedia.org) that uses the Gurumukhi script, I have used the number of speakers for that language.

I also found that the Bhojpuri wikipedia (http://bh.wikipedia.org/) still uses the wrong language code (bh), the code that represents the Bihari language family. Bihari (ISO-639-1 bh) is a language family, and Bhojpuri is just one of the languages in this family (Angika, Fiji Hindi and Maithili are a few others). Bhojpuri has the language code bho. So we need to do two things for the Bhojpuri wikipedia: 1. update the language code to bho, 2. change the language name to Bhojpuri (instead of Bihari) in the Wikimedia records.

I have included in this report almost all the important parameters for which the required, up-to-date data is available. From the next report onwards I will add a few more relevant parameters. In all the tables of this report, the languages are ordered by number of speakers.



Article statistics

Number of Articles

Hindi with more than 1 lakh (100,000) articles is on the top. Newari wikipedia with 69,826 articles comes second, and Telugu comes third with 48,803.

Wikipedia Language 2011 August 31 2011 September 30
Hindi 1,00,026 1,00,353
Bengali 22,338 22,400
Marathi 34,431 34,675
Telugu 48,533 48,803
Tamil 36,272 38,026
Urdu 17,177 17,290
Kannada 10,992 10,971
Gujarati 20,984 21,302
Sindhi 385 393
Bhojpuri 2,692 2,694
Malayalam 19,807 20,318
Odia (Oriya) 1,523 1,626
Nepali 15,569 16,004
Punjabi 2,015 3,316
Assamese 547 672
Kashmiri 346 346
Nepal Bhasha/Newari 69,826 69,826
Bishnupriya Manipuri 24,754 24,763
Sanskrit 5,199 6,062
Pali 2,790 2,791

In the span of 9 months (from 2011 January) Hindi wikipedia has added more than 40,000 articles. But did the community size increase? See the next few parameters for more information.

The Odia and Assamese wikipedias have made much progress since my last report. Both the article count and the community strength have increased for both. The article count of the Punjabi wikipedia is also rising quickly.


Edits per article

Edits per article shows the average number of times a wikipedia article has been edited. More edits to an article generally mean that more people have contributed to it and that its neutrality is higher. For active wikipedias it is a rough indicator of quality: a wiki article has more encyclopedic value when more people see and edit it.


Wikipedia Language 2011 August 2011 September
Hindi 10 11.2
Bengali 30.6 31
Marathi 17.2 17.5
Telugu 9.7 9.8
Tamil 17.6 17.4
Urdu 18.7 19
Kannada 15.6 15.8
Gujarati 6.4 6.5
Sindhi 21.6 21.5
Bhojpuri 27.8 28.5
Malayalam 30.3 30.3
Odia (Oriya) 18.2 18.1
Nepali 6.8 7.4
Punjabi 18.1 12.3
Assamese 19.8 18.9
Kashmiri 40.5 41.5
Nepal Bhasha/Newari 4.2 4.2
Bishnupriya Manipuri 13.3 13.5
Sanskrit 18.8 17.8
Pali 22.4 23

Among active wikipedias, Bengali and Malayalam have the most edits per article. This is expected, since both have very active communities. (For the community strength, refer to the next few tables.)

For languages like Kashmiri and Pali, the edits per article figure is high because they have very few articles and the same articles are edited (mostly by bots) every time.

Editor and Reader Statistics

Number of active wikipedians (at least 5 edits per month)

Wikipedia Language 2011 August 2011 September
Hindi 54 56
Bengali 45 42
Marathi 32 31
Telugu 32 31
Tamil 80 83
Urdu 16 17
Kannada 16 12
Gujarati 11 9
Sindhi 1 1
Bhojpuri 0 2
Malayalam 99 85
Odia (Oriya) 6 9
Nepali 19 16
Punjabi 4 3
Assamese 14 17
Kashmiri 1 0
Nepal Bhasha/Newari 1 1
Bishnupriya Manipuri 0 2
Sanskrit 14 14
Pali 1 1

Malayalam and Tamil top the list with about 85 active editors each. But Malayalam has 14 fewer active editors than in the previous month.

As said before, when it comes to the Wikimedia movement, the size of the speaker base does not have much impact.

For example, Sanskrit, with just 50,000 speakers, is making a huge impact in the Wikimedia world. It has 14 active users, a bigger community than many much larger Indic languages. I am impressed by the efforts of the Sanskrit wiki projects, especially the sister projects (for example wikisource), and by the way the community interacts with other Indic language wikipedias and adopts their best practices.

In the past few months I have conducted 3 wiki workshops for Sanskrit. Each time I found the community maturing in its vision for the future of the Sanskrit wiki projects. In the last Sanskrit wiki workshop the main focus was on defining the category tree for Sanskrit wikipedia.

When I published the last report, the Odia and Assamese wikipedias were inactive. Now the situation has changed: each has a community working on it.

The progress made by the Odia wiki project (http://or.wikipedia.org/) is noteworthy. Kudos to the Odia wiki community for all its online and offline initiatives to build the community and increase the article count. I was actively involved in the community building for Odia wikipedia. I still remember the day (2011 January 15) when I introduced Odia wikipedia and the Odia typing tool (developed by Junaid) to an Odia speaker, Ashuthosh Kar, during the Wiki X celebration at Bangalore. Through him we very soon got a wonderful wikipedian, Subhashish, who is now leading the efforts for Odia. Initially Subhashish and I used to meet at my home and work on the basics for the Odia wiki. I remember us working on the Odia wikipedia logo, the FAQ booklet, Translate wiki, and so on.

Soon more members joined the team through the few Odia wiki workshops held at Bangalore. Along with the workshops, Odia wikipedians translated the FAQ booklet into Odia and worked to integrate the Odia typing solution developed by Junaid into Odia wikipedia. Later, with the support of Dhanada Mishra (the chairman of the Human Development Foundation (http://www.hdf.org.in/)) and a young student Odia wikipedian, Srikanth Kedia, we conducted a wiki workshop at Bhubaneshwar. Odia wikipedians from Bangalore are doing an excellent job, and many of them now participate in Wikimedia India chapter activities as well.

The Odia wiki project picked up not because Odia has a huge speaker base, high literacy, access to computers or anything else; it became active only because it found volunteers who have the passion and vision to develop a wikipedia in their mother language. We need similar volunteers for each Indic language.

The case is similar for Assamese. I had been trying to find a good volunteer for Assamese wikipedia for the past 3 years. Initially I looked for volunteers in Bangalore, since Bangalore has a good representation of the Assamese community and it is easy for me to reach people there. But that didn't work out. Then I tried online outreach, initially through emails. Finally I got connected with Prabhakar, who is a professor at NIT in Silchar, Assam. Together we tried online outreach, first through a Google group (it didn't work out), then through Facebook. A Facebook page was created for Assamese wikipedia projects, aimed at bringing together all Assamese people who are on Facebook and interested in Assamese wiki projects. It has more than 460 members now. Prabhakar used that group and his personal contacts effectively to build a community for Assamese wikipedia. That worked. Assamese wikipedia slowly started becoming active.

Later Prabhakar started another Facebook group, dedicated to NIT Silchar, to promote Indic language wikipedias among students (and alumni) of NIT Silchar. Because of all this, in the next few months we are going to see more wiki activity from the state of Assam and in the Assamese wikipedia. Recently Narayam (the typing solution extension) was integrated into Assamese wikipedia. Thanks to the Assamese wikipedians Chaipu, Prabhakar and the other volunteers who actively worked to make it a reality. A major roadblock to bringing Assamese people to Assamese wikipedia has now been removed.

The Assamese wiki community is currently concentrating on online outreach, but it plans to start offline outreach activities soon.

It is interesting to note that the community size of some smaller languages is equal to or even larger than that of much bigger languages. I don't fully understand this and would like to hear your opinion. One hypothesis is that a larger proportion of the speakers of smaller languages are passionate about their language and are therefore willing to put in additional effort to showcase their mother language. Each Indic language wiki project is waiting for the few users who have vision and passion for the future of that project.

The technical issues and other roadblocks are greater for smaller languages.
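The comparison between speaker base and community size made in this section can be made concrete by computing the number of active editors per crore of speakers. The sketch below uses the 2011 September figures from the tables above for a handful of languages; the metric itself and the function name `editors_per_crore` are my own illustration, not something published at stats.wikimedia.org.

```python
# Speaker counts (in crores) and active editors (2011 September),
# copied from the tables in this report. The "editors per crore"
# metric is an illustrative assumption, not an official statistic.
data = {
    "Hindi": (25.8, 56),
    "Bengali": (23, 42),
    "Tamil": (6.6, 83),
    "Malayalam": (3.7, 85),
    "Assamese": (1.3, 17),
    "Sanskrit": (0.005, 14),  # 50,000 speakers = 0.005 crore
}


def editors_per_crore(speakers_crore, active_editors):
    """Active editors for every crore (10 million) of speakers."""
    return active_editors / speakers_crore


# Sort languages from the highest to the lowest ratio.
ranked = sorted(
    data.items(),
    key=lambda item: editors_per_crore(*item[1]),
    reverse=True,
)

for language, (speakers, editors) in ranked:
    ratio = editors_per_crore(speakers, editors)
    print(f"{language}: {ratio:.1f} active editors per crore speakers")
```

On this measure Sanskrit and Malayalam come out far ahead of Hindi and Bengali, which is the same pattern described in the text above.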

Number of highly active wikipedians (more than 100 edits per month)

Highly active wikipedians are the editors who make at least 100 edits per month. In fact, we can say that they are the people who run the respective language wiki.


Wikipedia Language 2011 August 2011 September
Hindi 17 14
Bengali 4 5
Marathi 7 10
Telugu 6 9
Tamil 23 22
Urdu 2 2
Kannada 2 2
Gujarati 2 3
Sindhi 0 0
Bhojpuri 0 0
Malayalam 24 24
Odia (Oriya) 2 2
Nepali 4 4
Punjabi 2 2
Assamese 3 4
Kashmiri 0 0
Nepal Bhasha/Newari 1 1
Bishnupriya Manipuri 0 0
Sanskrit 2 7
Pali 0 0

Here too Malayalam and Tamil top the list. The more highly active editors a wiki has, the more activities (not just article creation) come out of that community. This is why offline projects, photo events and contests, article writing contests, community quizzes, collaboration with the respective state governments, wiki workshops, and many other innovative wiki projects are coming out of these two communities. So ideally we should convert more active editors into highly active editors to make the wiki activism in each language wikipedia more vibrant.

Registered users who edited at least 10 times since they arrived

This parameter shows how many of the registered users actually turned into wiki editors and made at least a few edits.

Wikipedia Language 2011 August 2011 September
Hindi 675 679
Bengali 480 486
Marathi 313 321
Telugu 452 459
Tamil 628 645
Urdu 238 238
Kannada 262 263
Gujarati 127 127
Sindhi 20 21
Bhojpuri 19 19
Malayalam 695 709
Odia (Oriya) 26 27
Nepali 103 106
Punjabi 41 41
Assamese 37 43
Kashmiri 10 10
Nepal Bhasha/Newari 24 24
Bishnupriya Manipuri 27 29
Sanskrit 68 72
Pali 7 7

Even though many bigger languages have more registered users, Malayalam continues to be on top. Hindi and Tamil come second and third. I hope all wiki communities will be able to convert more registered users into active wiki editors.

Newly registered users who edited at least 10 times

This parameter is a subset of the preceding table. It shows how many of the newly registered users turned into wiki editors. The Tamil wiki community is leading here. Among smaller communities, Assamese is also doing well (for the reasons I described elsewhere).

Wikipedia Language 2011 August 2011 September
Hindi 17 6
Bengali 14 6
Marathi 5 7
Telugu 4 7
Tamil 9 17
Urdu 3 0
Kannada 0 1
Gujarati 2 0
Sindhi 0 1
Bhojpuri 0 0
Malayalam 14 14
Odia (Oriya) 0 1
Nepali 2 3
Punjabi 0 0
Assamese 4 6
Kashmiri 0 0
Nepal Bhasha/Newari 0 0
Bishnupriya Manipuri 0 2
Sanskrit 4 4
Pali 0 0

Page Views – Non Mobile (In Lakhs)

This parameter shows how many readers are waiting for us. Are we caring for them? The following table gives the data for non-mobile access (mainly PC). Even though Hindi lags behind in some of the editing and editor metrics, Hindi speakers are not lagging behind in using Hindi wikipedia. Hindi tops the list with 77 lakh page views. No other Indic language comes near Hindi. This is expected, considering the huge speaker base of Hindi.




Wikipedia Language 2011 August 2011 September
Hindi 88 77
Bengali 24 29
Marathi 43 46
Telugu 24 23
Tamil 39 41
Urdu 15 16
Kannada 15 13
Gujarati 9.9 8.84
Sindhi 0.82 0.72
Bhojpuri 1.98 1.75
Malayalam 31 28
Odia (Oriya) 2.17 2.07
Nepali 8.17 6.94
Punjabi 2.5 2.6
Assamese 1.39 1.28
Kashmiri 0.87 0.72
Nepal Bhasha/Newari 10 13
Bishnupriya Manipuri 8.58 7.65
Sanskrit 3.67 3.43
Pali 1.25 1.39

Even inactive wikipedias like Sindhi, Kashmiri and Pali have thousands of people accessing them every month. But do we have enough content to offer these readers? We need to build a community for each of these languages to serve our readers. In fact, we should be converting some of these readers into wikipedians of the respective language wikipedias.



Page Views – Mobile

This parameter shows how many readers access each language wikipedia from mobile devices. Unlike the non-mobile data, this data shows an upward trend for most of the Indic languages. Even though the rendering of Indic scripts is poor on most mobile devices, many users are accessing the wikipedias this way. This also shows where our future readers are going to come from.

Wikipedia Language 2011 August 2011 September
Hindi 1,94,000 2,53,000
Bengali 28,000 45,000
Marathi 43,000 64,000
Telugu 9,600 31,000
Tamil 46,000 65,000
Urdu 35,000 40,000
Kannada 6,900 15,000
Gujarati 15,000 27,000
Sindhi 265 240
Bhojpuri 1,100 4,600
Malayalam 15,000 33,000
Odia (Oriya) 1,600 1,600
Nepali 8,800 22,000
Punjabi 980 836
Assamese 967 1,500
Kashmiri 415 457
Nepal Bhasha/Newari 2,000 1,900
Bishnupriya Manipuri 1,800 1,900
Sanskrit 602 3,800
Pali 132 1,500


Here again Hindi comes first. Most of the other Indic languages are increasing their mobile reader base. For some languages the growth over the previous month is more than 100%. I assume this will keep increasing, and it is our duty to welcome all these new readers and convert some of them into editors.
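The month-over-month growth mentioned above can be checked with a short calculation. The sketch below takes the mobile page views of a few languages from the table in this section and computes the percentage change from August to September; the function name `percent_change` is only an illustrative choice.

```python
# Mobile page views (2011 August, 2011 September) for a few languages,
# copied from the table in this section.
mobile_views = {
    "Hindi": (194_000, 253_000),
    "Telugu": (9_600, 31_000),
    "Bhojpuri": (1_100, 4_600),
    "Malayalam": (15_000, 33_000),
    "Punjabi": (980, 836),
    "Sanskrit": (602, 3_800),
}


def percent_change(previous, current):
    """Growth from the previous month to the current one, in percent."""
    return (current - previous) / previous * 100


for language, (august, september) in mobile_views.items():
    change = percent_change(august, september)
    print(f"{language}: {change:+.1f}%")

# Languages whose mobile page views more than doubled in one month.
doubled = [
    language
    for language, (august, september) in mobile_views.items()
    if percent_change(august, september) > 100
]
print("Growth above 100%:", ", ".join(doubled))
```

Among the languages sampled here, Telugu, Bhojpuri, Malayalam and Sanskrit grew by more than 100%, while Punjabi is one of the few whose mobile page views fell.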

Conclusion

In short, in terms of readers most of the Indian languages are doing well. But when it comes to editors that is not the case. One hypothesis is that we are more knowledge consumers than knowledge creators. Another, and probably more valid, one is that there are large gaps in basic awareness of the existence of Indic language wikipedias, relatively low use of Indic languages on the Internet, an (expected) lack of familiarity with wiki editing, technical issues with Indic languages, tiny community sizes, and many other things.

As wikipedians, can we change this scenario?

For Wikimedia India we have a lot to do: building the community, overcoming the technical challenges, creating awareness about wikipedia (and, more importantly, about Indic language wikipedias), and so on. The challenges are many, but the opportunities are massive. I look forward to working closely with the various language communities on realizing the enormous potential of their respective languages.

Once again, I welcome your views, comments and opinions on the above. Please express your views as comments here. You can also reach me at shiju@wikimedia.org in case you want to send a personal mail. Thanks for reading.