Wikisource reader app

Wikisource is a free and open digital library: an extensive online repository of copyright-free, previously published texts and media resources in many of the world's languages. Like Wikipedia, Wikimedia Commons, and other open-access Wikimedia knowledge projects, Wikisource strives to make a vast collection of public domain content accessible to the global community. Its content includes books, scholarly articles, speeches, newspapers, periodicals, and manuscripts, along with large collections of audio recordings and videos that are not subject to copyright restrictions. It serves as a valuable resource for students, educators, researchers, and anyone seeking access to public domain materials.

The collaborative nature of Wikisource encourages active participation from the global community. Anyone with an interest in preserving and sharing public domain content can contribute to the site. Contributors can add new texts, meticulously proofread existing content, or engage in translating texts into various languages. This collective effort fosters a vibrant community of volunteers dedicated to enriching and maintaining the quality of Wikisource.

Requirement of mobile reading app

English Wikisource was visited more than 138 million times in 2024, and around 40% of those visits came from the mobile web. The share is around 50% for major Indian-language Wikisource projects such as Tamil and Bangla. So, even without a dedicated reading app, a significant proportion of Wikisource visits already come from mobile users, which in itself makes a good case for developing a reading app.

Selection of books for the app

Before selecting the books, we need to understand the workflow of Wikisource. That is itself hard to pin down, because different language versions follow different workflows, in which some steps overlap and others do not.

To avoid this difficulty, a workflow for printed materials such as books, periodicals, and newspapers is outlined below as comprehensively as possible; an ideal Wikisource language community can adopt it if it wishes. For convenience, workflows for audio and video content, which are not yet well developed on Wikisource, are left out.

The common workflow that almost all Wikisource language communities adopt for transcribing content is as follows -

  1. Identification - The first step is to identify the printed materials which are within the scope of Wikisource considering the copyright status, publication status etc.
  2. Digitization - The next step is to scan these materials, either afresh or by collecting existing scans from digital and physical sources where they have already been digitised.
  3. Upload - Once the digitised materials are available, the next step is to upload them to Wikimedia Commons.
  4. Indexing - The uploaded materials are then brought into Wikisource in the Index namespace and checked for missing, duplicate, or bad scans, in order to create indexed pagelists and tables of contents.
  5. Proofreading - Wikisource community volunteers then OCR and proofread every page of the material.
  6. Validation - A second group of volunteers checks the proofread pages again and validates them.
  7. Transclusion - After proofreading and/or validation is done, the contents are transcluded into the main namespace, properly divided into chapters etc. according to the table of contents. This step makes the work ready for readers.

Note: This is the basic workflow of Wikisource, which all communities are expected to follow. Unfortunately, some communities skip critical steps, such as creating tables of contents or the entire transclusion step, for various reasons.

Apart from the basic workflow above for transcribing and creating e-books, communities differ in how they add metadata for the materials, such as names of authors, publishers, and publication years. In practice, communities adopt one of three scenarios, or a combination of them.

  1. No metadata - Volunteers sometimes add no metadata at all, or only partial metadata, on Wikisource index pages or transcluded pages. This is the worst scenario and needs to be avoided.
  2. Metadata stored locally - The majority of language communities store metadata locally: on Wikimedia Commons in the file description, and/or on Wikisource in designated fields of the Index namespace, and/or in the header of transcluded pages. This can lead to duplicated effort, a higher chance of error, data redundancy etc.
  3. Metadata stored on Wikidata - Very few Wikisource language communities store metadata centrally as Wikidata items and round-trip it back to Wikimedia Commons in the file description, and/or to Wikisource in designated fields of the Index namespace, and/or to the header of transcluded pages. This is the ideal scenario, as it makes it possible to fully leverage the power of Wikidata.

For a Wikisource mobile reading app, the actual content and its metadata are equally important, so that readers can not only read the content but also navigate and search it easily. Storing metadata in a central database like Wikidata is therefore preferable, because it can then be queried and reused easily.

Keeping steps 1 to 7 of content transcription and scenario 3 for metadata in mind, suitable criteria for selecting Wikisource content to make available to readers can be drafted.

 

The material needs to -

  1. be digitised and uploaded on Wikimedia sites
  2. have an index page
  3. be completely proofread (at least, if not validated)
  4. be completely transcluded, with division into chapters, if any
  5. have metadata stored centrally and accurately, following the Wikidata Books data model, with the following linkages on the respective Wikidata items.
    1. label in native language (mandatory)
    2. title in native language (mandatory)
    3. language of work (mandatory)
    4. author(s), editor(s) (if any), translator(s) (if any)
    5. date of publication, publisher, place of publication
    6. Wikisource index page url (mandatory)
    7. Wikisource sitelink of the transcluded page with a proofread or validated badge (mandatory)
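
The mandatory items in the checklist above can be sketched as a small validation function. This is only an illustration: the field names (`label_native`, `title_native`, `language_of_work`, `index_page_url`, `sitelink_badge`) and the badge strings are hypothetical and do not correspond to any existing schema, and the sketch assumes that either a proofread or a validated badge is acceptable.

```python
# Hypothetical field names mirroring the mandatory checklist items above.
# They are illustrative only and not taken from any real schema.
MANDATORY_FIELDS = (
    "label_native",      # label in native language
    "title_native",      # title in native language
    "language_of_work",  # language of work
    "index_page_url",    # Wikisource index page URL
    "sitelink_badge",    # badge on the Wikisource sitelink
)

# Assumption: either badge qualifies a work for the app.
ACCEPTED_BADGES = {"proofread", "validated"}


def missing_requirements(record: dict) -> list[str]:
    """Return the names of mandatory items that the record does not satisfy."""
    # A field counts as missing when it is absent or empty.
    problems = [name for name in MANDATORY_FIELDS if not record.get(name)]
    # A badge that is present but not an accepted one also fails the check.
    badge = record.get("sitelink_badge")
    if badge and badge not in ACCEPTED_BADGES:
        problems.append("sitelink_badge")
    return problems
```

A record would be eligible for the app only when `missing_requirements(record)` returns an empty list.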

Let’s get a list of such works for Bangla Wikisource with this SPARQL query - https://w.wiki/BN3z

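SELECT DISTINCT ?sitelink ?itemLabel WHERE {
  # Bangla Wikisource sitelinks and the Wikidata items they describe
  ?sitelink schema:isPartOf <https://bn.wikisource.org/>; schema:about ?item; schema:name ?ws.
  # Keep only sitelinks with a proofread (Q20748092) or validated (Q20748093) badge
  { ?sitelink wikibase:badge wd:Q20748092. }
  UNION
  { ?sitelink wikibase:badge wd:Q20748093. }
  # The item must have a Wikisource index page URL (P1957)
  ?item wdt:P1957 [].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "bn". }
}
ORDER BY (?itemLabel)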

The data will be fetched through Wikidata API.
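
As a rough sketch of how the query above could be run from a script, the following sends it to the public Wikidata Query Service SPARQL endpoint and extracts the sitelink and label from each result row. The helper names (`build_request`, `parse_bindings`, `fetch_books`) and the User-Agent string are made up for this example, and this is not necessarily how the project itself fetches the data.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from this page, unchanged.
QUERY = """
SELECT DISTINCT ?sitelink ?itemLabel WHERE {
  ?sitelink schema:isPartOf <https://bn.wikisource.org/>; schema:about ?item; schema:name ?ws.
  { ?sitelink wikibase:badge wd:Q20748092. }
  UNION
  { ?sitelink wikibase:badge wd:Q20748093. }
  ?item wdt:P1957 [].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "bn". }
}
ORDER BY (?itemLabel)
"""


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking for SPARQL results in JSON."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Illustrative User-Agent; replace with one identifying your tool.
        "User-Agent": "wikisource-reader-example/0.1 (illustrative)",
    }
    return urllib.request.Request(url, headers=headers)


def parse_bindings(payload: dict) -> list[tuple[str, str]]:
    """Turn SPARQL JSON results into (sitelink URL, label) pairs."""
    rows = []
    for binding in payload["results"]["bindings"]:
        sitelink = binding["sitelink"]["value"]
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((sitelink, label))
    return rows


def fetch_books(query: str = QUERY) -> list[tuple[str, str]]:
    """Run the query over the network and return the parsed rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_books()` needs network access; `parse_bindings` can be tried offline on a saved JSON response.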

Development

The planned components are

  1. Books meta-data API
  2. EPUB generator
  3. The client mobile app

The API

An API was developed that serves a catalogue (index) of books that follow the books data model described above. It currently contains works in English, French, and Bangla, since these already follow the data model. Support for other languages can be added.

The API was built using Django and deployed on Toolforge. It periodically runs a set of SPARQL queries to retrieve data, processes that data, and updates the database.

Link: wsindex.toolforge.org
Repo: codeberg.org/ph4ni/wsindex

Data that can be fetched from the metadata API

| Key | Description | Sourced from |
|---|---|---|
| wikidata_qid | QID of the book | Wikidata |
| title | Title in English | Wikidata label |
| title_native_language | Title in the native language | Wikidata label |
| languages | List of languages the book is in | Wikidata |
| date_of_publication | Date published | Wikidata |
| authors | List with Author label and Wikidata QID | Wikidata |
| editors | List with Editor label and Wikidata QID | Wikidata |
| translators | List with Translator label and Wikidata QID | Wikidata |
| genre | List of genres of the book | Wikidata |
| type_of_work | Form of creative work | Wikidata |
| ws_url | Link to the Wikisource page | Site link |
| thumbnail_url | Link to the thumbnail version of cover page | Commons |
| epub_url | Link to the epub file | ws-export |
| wikisource_index_url | Link to index page | Wikidata |
| view_count | Number of views of the ws_url page in last one year | Page views API |
| subjects | Subjects of the work | Wikidata |

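
The `view_count` field above is described as coming from the Page views API. A minimal sketch of how a one-year total could be computed is given below, assuming the per-article endpoint of the Wikimedia Pageviews REST API with monthly granularity; whether the tool uses exactly this endpoint, access type, or agent filter is an assumption, and the function names are hypothetical.

```python
import urllib.parse

# Per-article endpoint of the Wikimedia Pageviews REST API.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project: str, title: str, start: str, end: str) -> str:
    """Build a monthly per-article pageviews URL.

    project is e.g. "bn.wikisource"; start and end are YYYYMMDD strings.
    """
    # Titles use underscores for spaces and must be fully percent-encoded.
    article = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{article}/monthly/{start}/{end}"


def total_views(payload: dict) -> int:
    """Sum the 'views' value of every item in a pageviews response."""
    return sum(item["views"] for item in payload.get("items", []))
```

The JSON returned for the URL can be passed to `total_views` to obtain a single number comparable to `view_count`.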
Querying the API

| URL | Description |
|---|---|
| https://wsindex.toolforge.org/books/ | Base URL which returns 32 results with pagination |
| http://wsindex.toolforge.org/books/?page=2 | Example of page |
| https://wsindex.toolforge.org/books/Q51543972/ | Get book by the QID. Also works without the prefix 'Q' |
| https://wsindex.toolforge.org/books/?languages=fr | Get books by language code |
| https://wsindex.toolforge.org/books/?search=India | Search books by title and author names. |
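
The URL patterns listed above could be wrapped in a small client along the following lines. This is a sketch under assumptions: the function names are invented for the example, the shape of the JSON response is not documented here and is therefore returned as-is, and whether the query parameters can be combined in one request is not confirmed by this page.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://wsindex.toolforge.org/books/"


def list_url(page=None, languages=None, search=None) -> str:
    """Build a listing URL from the documented query parameters."""
    params = {}
    if page is not None:
        params["page"] = page
    if languages is not None:
        params["languages"] = languages
    if search is not None:
        params["search"] = search
    # Combining several parameters at once is an untested assumption.
    query = urllib.parse.urlencode(params)
    return BASE_URL + ("?" + query if query else "")


def book_url(qid: str) -> str:
    """Build the URL of a single book from its Wikidata QID."""
    return f"{BASE_URL}{qid}/"


def fetch_json(url: str):
    """Fetch a URL and decode the JSON body without assuming its structure."""
    request = urllib.request.Request(
        url,
        # Illustrative User-Agent string.
        headers={"User-Agent": "wikisource-reader-example/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `fetch_json(list_url(languages="fr"))` would request the French-language books, and `fetch_json(book_url("Q51543972"))` a single record.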

EPUB generator

EPUB files for the app are sourced using the ws-export tool.

Link: ws-export.wmcloud.org/
Repo: github.com/wikimedia/ws-export
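
A client that downloads files from the `epub_url` links may want a cheap sanity check before adding a file to the local library. The sketch below relies only on the EPUB container convention that the file is a ZIP archive containing a `mimetype` entry with the value `application/epub+zip`; the function name is hypothetical, and this is not a description of how the app actually validates downloads.

```python
import io
import zipfile

# Value that the "mimetype" entry of an EPUB container must hold.
EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with an EPUB mimetype entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # read() raises KeyError when the entry does not exist.
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP archive at all, or a ZIP without a mimetype entry.
        return False
```

This is a shallow check only; it does not verify that the book content inside the archive is complete or well-formed.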

Client app

Existing free/open-source book-reading apps were examined as a possible base for the reader app, and the Myne Android app was forked to build the reader app for Wikisource.

Screen recording of the app under development

Features currently functional:

  1. Viewing list of all books
  2. Exploring books by language
  3. Exploring books by genre
  4. Search function to search by title or author
  5. Downloading books to local device for offline access
  6. Share links to the Wikisource page of each book
  7. Manage local library - read, delete, completion %
  8. Change font size, look up words, dictionary
  9. Dark and Light mode
  10. Localisation support


The app is built with Jetpack Compose and was modified to work with the books metadata API and to support content from Wikimedia projects.