Integration of BookManagerv2 and ProofreadPage

Hi Aarti and Thomas,

I'm working on the "Improve support for book structures" GSoC project, which is one of the several Wikisource-related GSoC projects. Part of what I'm doing involves requiring the user to fill out a form with metadata and structural information pertaining to the book. As I'm sure you know, this is very similar to the Index: pages that the ProofreadPage extension creates. I spoke with my mentors (Matthew and Raylton, who I've CCed) about this today, and we agreed that it would be best if, on wikis like Wikisource that have ProofreadPage installed, we could just use one single form.

This would probably be best accomplished by BookManager injecting additional fields into the ProofreadPage's index namespace, then saving the BookManager-specific data elsewhere. Matthew says this would be made relatively simple if you are planning to use HTMLForm, because the fields could be added to the existing form array. Is this something you have been considering?
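
As a rough illustration of the idea above, here is a minimal Python sketch of one extension appending its own fields to an existing form descriptor, then splitting the submitted values so that its data can be saved elsewhere. HTMLForm itself is a PHP class in MediaWiki; the dictionary layout, the field names, and the `bm_` prefix below are all hypothetical and only stand in for the real form array.

```python
# Hypothetical form descriptor owned by ProofreadPage (field name -> options).
index_form = {
    "title": {"type": "text", "label": "Title"},
    "author": {"type": "text", "label": "Author"},
}

# Hypothetical extra fields contributed by BookManager. The "bm_" prefix is
# an assumption used to tell the two extensions' data apart when saving.
bookmanager_fields = {
    "bm_sections": {"type": "textarea", "label": "Sections"},
}


def inject_fields(form, extra):
    """Return a new descriptor with the extra fields appended."""
    clashes = set(form) & set(extra)
    if clashes:
        raise ValueError(f"field names already in use: {sorted(clashes)}")
    return {**form, **extra}


def split_submission(values, prefix="bm_"):
    """Separate submitted values into (index data, BookManager data)."""
    index_data = {k: v for k, v in values.items() if not k.startswith(prefix)}
    bm_data = {k: v for k, v in values.items() if k.startswith(prefix)}
    return index_data, bm_data


combined = inject_fields(index_form, bookmanager_fields)
index_data, bm_data = split_submission(
    {"title": "Example", "author": "Someone", "bm_sections": "Chapter 1"}
)
```

The split step is what would let the Index: page keep storing only its own fields while the BookManager-specific values go to a separate location.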

Thanks,

Molly White

P.S. If you could forward this to Zaran, that would be great. I couldn't find an email address anywhere...


Hi Molly,

We hadn't considered the integration of BookManagerV2 and ProofreadPage. It looks like a good idea. Moreover, if something is already being done in ProofreadPage, it need not be duplicated in other Wikisource-related projects.

I have forwarded this to Zaran as well.

Thanks, Aarti Dwivedi


Hi Molly!

I don't think it is a good idea to merge the main pages of books with the Index: pages, because Index: pages are about the scan of a physical book and its proofreading, while main pages are about a book as presented in Wikisource. There isn't a 1-to-1 relationship between these two entities: a work printed in several volumes will have one Index: page per volume but only one "book" as presented in Wikisource. On the other hand, a "complete works of XXX" physical volume represented by a single Index: page contains more than one work, each of which would have its own main book page.

But as the two kinds of pages (Index: pages and "book main pages") share most of their metadata, I think they should share the same data model (i.e. typing system...), and there should be a feature to pull the data from the index page into the main page of the book. This could perhaps be done by a shared library extension.

Another possible way is, instead of creating a new data model, to reuse the Wikidata statement system, which is very powerful, for these two kinds of pages. The Wikibase code is very nice, so I think it may be possible without a lot of pain. It would make it easy to share data with Wikidata and with Commons files if the proposal of the Wikidata team is implemented ( https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info ). But if we go this way, the migration from the badly structured data of Index: pages to well-structured data in Wikibase statements will be very difficult.

What do you think about it?

Tpt

PS: I have forwarded this to David Cuenda and Andrea Zanni of the "Elaborate Wikisource strategic vision" group.


Hi all!

Here are my comments:

1) Nazmul is working on the automatic generation of metadata forms from templates. If possible, his work should be reused by both the ProofreadPage and Book Manager extensions to generate the forms. Aarti and Molly, please get in touch with him (cc'ed).

2) Eventually the data will be shared using Wikidata, but we are not there yet. Until then (probably next year or next GSoC) we'll have to live with "sad unlinked data". The migration to linked data using Wikidata is going to be a challenge, especially for large Wikisources.

3) As Thomas says, it is not that easy to integrate the "Scan Index" and the "Book structure" because sometimes they are not the same. However, there are three specific cases for a scan: "depicts one work", "depicts several works", "is part of a work". The case could be chosen from the Index page (Proofread extension), and depending on the choice the Book Manager would be launched with different parameters:

- "depicts one work" is a clear 1-to-1 relationship. There can be a button next to the Title or Alternative title to "start book structure" that will launch the Book Manager with linked data fields.
- "depicts several works" is trickier. Besides the title, the user should have a "contains work" title field and be able to add or remove more of them. Each "contains work" should have its own "start book structure" button, and the Book Manager would know that each one is part of a common volume. Data would be copied but NOT linked.
- "is part of a work" would give the chance to "link to book structure" from the Index page. So the Book Manager knows that it is not using only one scan, but several for a given book structure.

4) @Molly: how do you plan to store the book structure? Some machine readable format in the main page of the book structure?
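
The three cases in point 3 can be sketched as a small dispatch on the relation chosen on the Index page. This is only an illustration in Python: the enum, the `launch_parameters` function, and the shape of the returned dictionaries are hypothetical, not an existing ProofreadPage or BookManager interface.

```python
from enum import Enum


class ScanRelation(Enum):
    DEPICTS_ONE_WORK = "depicts one work"
    DEPICTS_SEVERAL_WORKS = "depicts several works"
    IS_PART_OF_A_WORK = "is part of a work"


def launch_parameters(relation, index_title, contained_works=(), book_structure=None):
    """Return one hypothetical Book Manager request per book structure."""
    if relation is ScanRelation.DEPICTS_ONE_WORK:
        # 1-to-1: a single book structure whose fields stay linked to the scan.
        return [{"action": "start", "title": index_title,
                 "index": index_title, "linked": True}]
    if relation is ScanRelation.DEPICTS_SEVERAL_WORKS:
        # One book structure per contained work; data is copied, not linked.
        return [{"action": "start", "title": work,
                 "index": index_title, "linked": False}
                for work in contained_works]
    if relation is ScanRelation.IS_PART_OF_A_WORK:
        # The scan is attached to a book structure that spans several scans.
        if book_structure is None:
            raise ValueError("book_structure is required for this relation")
        return [{"action": "link", "title": book_structure, "index": index_title}]
    raise ValueError(f"unknown relation: {relation!r}")


requests = launch_parameters(
    ScanRelation.DEPICTS_SEVERAL_WORKS,
    "Index:Complete works.djvu",
    contained_works=["Work A", "Work B"],
)
```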

Thanks,

David


Hi! Answers to David's comments:

1) For Index: pages I think it won't be possible, as index pages have parameters that aren't used by the Proofread_index_template, such as the default values for the header and footer of book pages.

2) My proposal is to reuse the Wikibase statement system (which is currently used by Wikidata but can be used outside it) in Index: pages and "main pages of books", not to use Wikidata for storing this data. The data would be kept in Wikisource (as is proposed for Commons file metadata: https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info). For "main pages of books" we are starting from nothing, so I think it would be nice to use a good data model from the start in order to avoid migration costs in a few years.

3) As the "main pages of books" reuse Index: page content, I think it would make more sense for the "main page" to have a field that sets the main related Index: page (from which metadata will be imported; of course this metadata can be overridden) and, for each book page, a way to set whether another index should be used for that specific page. With this system we avoid metadata duplication in simple use cases (the book is related to only one Index:) and, for specific cases, we have something as powerful as needed. With this system we can, of course, have a button in index pages to start a new "book main page".
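
The import-with-override idea in point 3 can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the dictionary keys, the page titles, and the two helper functions do not correspond to any existing extension code.

```python
def resolve_metadata(index_metadata, overrides):
    """Metadata imported from the Index: page, with main-page overrides winning."""
    return {**index_metadata, **overrides}


def index_for_page(book, page_title):
    """Index used by one book page: its own setting, else the main Index:."""
    return book["page_indexes"].get(page_title, book["main_index"])


# Hypothetical metadata stored on the main related Index: page.
index_metadata = {"title": "Example", "author": "Example Author", "year": "1886"}

# Hypothetical "main page of book" record.
book = {
    "main_index": "Index:Example vol. 1.djvu",
    "overrides": {"title": "Example (first edition)"},
    "page_indexes": {"Example/Chapter 20": "Index:Example vol. 2.djvu"},
}

resolved = resolve_metadata(index_metadata, book["overrides"])
```

In the simple case (one book, one Index:) both `overrides` and `page_indexes` stay empty, so nothing is duplicated.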

Thanks for this very interesting discussion,

Thomas


Hi all, maybe you want to draw something to better explain your proposals? I started a very simple draft for Index <> main page relationships: http://www.lucidchart.com/invitations/accept/51cf503f-1cb4-46f2-9327-34770a008185 Feel free to improve it.

Aubrey


I just want to clarify to make sure we're on the same page. When I refer to the "main book page", I realize it's a bit ambiguous. I'm referring to the sort of "landing page": the one that holds the table of contents and maybe some of the actual book content. (Wikisource example, Wikibooks example). I don't really foresee that changing with the changes to ProofreadPage or the addition of the BookManagerv2 extension. I also foresee the view version of the Index page (as opposed to the editable form) staying the same, save for any changes that the ProofreadPage GSoC project may have in mind. However, the form would change a bit to include some additional metadata fields (for example, the "section" objects). Additionally, a [book name].book page would be added to store the JSON block for my project, and possibly display some of the metadata on view.
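
To make the "[book name].book page storing a JSON block" idea concrete, here is a minimal Python sketch. The thread does not specify the schema, so the keys (`title`, `authors`, `sections`, `page`) and the `book_page_title` helper are purely hypothetical.

```python
import json

# Hypothetical book structure; the real schema is not given in this thread.
book = {
    "title": "Example Book",
    "authors": ["Example Author"],
    "sections": [
        {"title": "Chapter 1", "page": "Example Book/Chapter 1"},
        {"title": "Chapter 2", "page": "Example Book/Chapter 2"},
    ],
}


def book_page_title(book_name):
    """Title of the page holding the JSON block, following "[book name].book"."""
    return f"{book_name}.book"


# The block as it might be saved on the page, and read back for display.
json_block = json.dumps(book, indent=2, ensure_ascii=False)
restored = json.loads(json_block)
```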

[Molly]


Hi everybody,

I think this discussion is paramount and spending some time finding the right solution is definitely worthwhile.

As already mentioned, Wikisource is specific in that there are two kinds of entities, namely works (a novel, a dictionary, an encyclopedia, etc.) and scans, with different possible mappings (one-to-one, many-to-one, one-to-many) between them. These two kinds of entities are represented by two kinds of pages on Wikisource: the main book page (also referred to as "ns0 page", or "landing page" in previous discussions) for works, and index pages in the Index: namespace for scans.

In addition, BookManager will create a new kind of page: [book name].book which will contain the book structure as a JSON block of data. From now on, I will refer to this new kind of page as BM pages (by the way, will they be in a separate namespace?). To understand the interaction between BookManager and ProofreadPage, I think the main question to answer is the following: what will be the relationship between BM pages and main pages or index pages? I am sorry if the answer to this question is already obvious for everybody, but I cannot find a clear answer in previous emails.

From what I understood, the currently envisioned solution implies having a one-to-one mapping between main book pages and BM pages. I think that is the most reasonable option, each section listed in the BM page specifying (among other things):

  • the page in the main namespace containing this section's text (e.g. a book chapter, a poem, etc.). This is very often a subpage of the book's landing page
  • the index page associated with this section, and page range to transclude to obtain the text of the section

It would then be possible to use this information to automatically generate the main book page with a table of contents and all the subpages with <pages index="..." from=".." to=".." /> tags and navigation headers.
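
The generation step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the section dictionaries and their keys (`title`, `page`, `index`, `from`, `to`) are hypothetical, and the output only follows the `<pages index="..." from=".." to=".." />` form quoted in this thread.

```python
def pages_tag(index, first, last):
    """Build a <pages/> tag for one section's page range."""
    if '"' in index:
        raise ValueError("index names containing double quotes are not handled")
    return f'<pages index="{index}" from="{first}" to="{last}" />'


def generate_subpages(sections):
    """Map each section's page title to its generated wikitext."""
    pages = {}
    for section in sections:
        if section.get("index") is None:
            # No scan (e.g. on Wikibooks): nothing to transclude.
            continue
        pages[section["page"]] = pages_tag(
            section["index"], section["from"], section["to"]
        )
    return pages


def table_of_contents(sections):
    """Wikitext list linking to every section's page."""
    return "\n".join(f'* [[{s["page"]}|{s["title"]}]]' for s in sections)


# Hypothetical section list as it might be read from a BM page.
sections = [
    {"title": "Chapter 1", "page": "Example/Chapter 1",
     "index": "Example.djvu", "from": 5, "to": 20},
    {"title": "Chapter 2", "page": "Example/Chapter 2",
     "index": "Example.djvu", "from": 21, "to": 40},
]

subpages = generate_subpages(sections)
toc = table_of_contents(sections)
```

Navigation headers are left out of the sketch; they would be added around the generated tag on each subpage.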

This solution is also what will make BookManager as independent as possible from ProofreadPage: in Wikibooks, for example, the section list will simply not contain the index page and page range attributes.

The next question, which I think was the original purpose of this discussion (sorry for the long introduction, I just wanted to make sure that we are all on the same page), deals with metadata, where each specific piece of data should be stored and how they should be linked and imported from one page to the other.

Ideally, the BM page would only contain metadata related to the work (author(s), title, subtitle, translator(s), etc.), while index pages would contain edition-related metadata (publisher, printer, date of printing, volume number, etc.). The main page would then import metadata from both the BM page and Index pages to build a title box and table of contents.

With this in mind, I think that David's suggestion is the most satisfying from the conceptual perspective. In particular, I like the solution for the "index depicts several works" case, where we would have a possibility of listing the works contained in a scan (and their associated BM pages). However, it seems to me that this will imply big changes to Index pages, with eventually the terrifying necessity of converting all the already existing index pages.

In contrast, Thomas' suggestion is more conservative: keeping as much metadata as possible in index pages and linking a BM page to a reference index page containing most of the metadata (with the possibility of overriding it), thus reducing the BM page to only store the section list. This would indeed allow us to keep index pages as they are now.

There is also the question of where the section list will be edited. Molly, it seems to me that you are suggesting that the section list could be edited directly from the Index page. While this could indeed be done, I guess that you also plan to have an independent editing system for projects like Wikibooks?

From now on, it would be good to try to describe as precisely as possible where the metadata will be stored and how it will be linked and imported across pages in all possible scenarios (one-to-one, many-to-one or one-to-many mappings). I agree with Andrea that drawings could help. I will try to find some time to make drawings summarizing the two solutions suggested by David and Tpt (or at least, what I understand of those).

Sorry for this rather long email, and thanks to all of you for your interesting insights,

Thibaut (Zaran)


Thank you for that email, Thibaut. I feel like you have managed to clarify quite a lot.

You are correct in that I am planning to have one "BM page" per main book page—meaning one BM page per work, including each version/edition of a work. This should allow for one work or multiple works using the same source.

Regarding listing the pages in the BM page, then using that to generate the <pages index> tags... I'm still a little unsure about how I want to approach this. It might make sense to auto-generate these pages once, then allow the user to edit them as he or she sees fit, as there are some pages where editors want to add their own formatting surrounding the pages tags (example).

I'm a little unclear on your distinction between the metadata that should go on index pages and on BM pages. If there is one BM page per work, edition information would also be relevant for the BM page. I agree that we should reduce duplication of metadata, but I'm not quite sure how we should determine what metadata goes where. For the record, I'm not opposed to the additional metadata being included through ProofreadPage for wikis that have it installed; we can always pull the metadata from there for our purposes.

I didn't have any plans of allowing the section list to be edited from the index page; I was envisioning that happening through the form that produces the BM page block.

- Molly White


Thanks for your answer.

> Regarding listing the pages in the BM page, then using that to generate the <pages index> tags... I'm still a little unsure about how I want to approach this. It might make sense to auto-generate these pages once, then allow the user to edit them as he or she sees fit, as there are some pages where editors want to add their own formatting surrounding the pages tags (example).

I completely agree: users need to have a way to edit the auto-generated pages. It does not seem clear to me whether or not the auto-generation should be provided by BookManager, maybe a service provided by a bot running on the toolserver? Is there a use case for this service beyond Wikisource?

> I didn't have any plans of allowing the section list to be edited from the index page; I was envisioning that happening through the form that produces the BM page block.

Thanks for clarifying this. We will need to find how to add links from index pages to BM pages in a user-friendly fashion, but that's ProofreadPage's business.

> I'm a little unclear on your distinction between the metadata that should go on index pages and on BM pages. If there is one BM page per work, edition information would also be relevant for the BM page. I agree that we should reduce duplication of metadata, but I'm not quite sure how we should determine what metadata goes where. For the record, I'm not opposed to the additional metadata being included through ProofreadPage for wikis that have it installed; we can always pull the metadata from there for our purposes.

That's a good point: Wikisource has one main page per edition of a work (and if different editions of the same work are available on Wikisource, we have one additional page listing all the editions). In this regard, BM pages could indeed contain most of the edition information. The only thing I can think of which doesn't fit in BM pages is when a work's source is split into several scans (e.g. a novel in several volumes): the volume number only makes sense on index pages and shouldn't appear on the BM page. To make things even more confusing, sometimes two volumes of the same edition do not have the same printing date (for example, we can have volume 1 printed in 1886 and volume 2 printed in 1887). Dealing with all the corner cases is probably going to give us a lot of headaches...
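
The multi-volume corner case can be illustrated with a small Python sketch in which volume-level data stays on the index pages and the edition only derives a summary from them. The `IndexPage` class, its fields, and the `printing_years` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class IndexPage:
    """Scan-level metadata that only makes sense per volume."""
    title: str
    volume: int
    printing_year: int


def printing_years(volumes):
    """Summarize the printing dates of an edition split across several scans."""
    years = sorted({v.printing_year for v in volumes})
    if not years:
        return ""
    if len(years) == 1:
        return str(years[0])
    return f"{years[0]}-{years[-1]}"


# Two volumes of the same edition, printed in different years.
volumes = [
    IndexPage("Index:Example vol. 1.djvu", volume=1, printing_year=1886),
    IndexPage("Index:Example vol. 2.djvu", volume=2, printing_year=1887),
]

summary = printing_years(volumes)
```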

Thibaut


Hi all, a quick reply. I don't know how many of you know LucidChart, but I find it very clear and easy to use. Could you please "draw" your personal proposal of the Book metadata flow? I drafted the single elements here: http://www.lucidchart.com/invitations/accept/51b5e553-5c84-4d76-9498-7b700a001506

You can copy this and make your own document, and then link it to us (we can edit yours, if you want it). This free version allows just 60 elements, so make sure to submit a proposal per document.

An old proposal (probably outdated) is here: http://www.lucidchart.com/invitations/accept/51d2d554-6a14-4302-83f2-36360a008c55

Few notes:

  • I added Wikidata and Wikibase extension for WS and Commons because I think book metadata will be stored there.

For example, we are now talking about book metadata without mentioning that the scan file will be on Commons, and its metadata as well. These files will also have other metadata (e.g. EXIF), but we don't care about that in Wikisource.

  • Alex Brollo (cc'ed) made some experimental tests with a Lua Module for data. It contained a lot of structural metadata (ie relation of a nsPage scan with the ns0 page)

http://it.wikisource.org/wiki/Modulo:Dati ex: http://it.wikisource.org/wiki/Modulo:Dati/Opere_di_Niccol%C3%B2_Machiavelli_VI.djvu

Aubrey


Thibaut, you raise a good point about the autogeneration of <pages index> tags. It is Wikisource-specific, so perhaps a bot would be better suited to the task.

Aubrey, the flow that I'm currently envisioning for the BookManagerv2 project is quite simple: https://www.lucidchart.com/documents/edit/477f-dbac-51d2eeb0-93ff-45600a009535

Though I agree that Wikidata should be involved in this process, in terms of my actual project, I think integration with it may be a bit out of scope. I'd love to see some down-the-road plans for Wikidata integration, though. Keeping it in mind during the development of this extension may ease later changes.

- Molly


> Aubrey, the flow that I'm currently envisioning for the BookManagerv2 project is quite simple: https://www.lucidchart.com/documents/edit/477f-dbac-51d2eeb0-93ff-45600a009535

Hi Molly, I don't see your document :-) Did you share that with us? (Share button on the top right > generate link OR add emails)

Aubrey


Oops! http://www.lucidchart.com/invitations/accept/51d2f91d-1abc-485d-b92c-69b90a0092a0

[Molly]


Hi,

A few people have expressed interest in seeing this email discussion on-wiki, so with your permission I'd love to post it. I'll remove email addresses. Is everyone okay with this going on-wiki?

- Molly


Hi Molly,

It would be better to have this discussion on-wiki. Just give us the link of that discussion. That should be good. :)

Aarti K. Dwivedi


Of course, please do :-)

Aubrey


Sure, which neutral wiki were you thinking about? Meta or the central wikisource?

Micru


I was going to put it on Meta.

[Molly]


I'm of course OK with moving the email thread to a neutral wiki (I think Meta is better suited).

I'll put the question I have about BookManager (the one I mentioned yesterday on IRC) here so that I don't forget it. I will copy it on the wiki if needed, once the thread has been moved.

It seems to me that on Wikisource, many books have a structure that cannot be properly represented by a tree. Here is an example on the French Wikisource: https://fr.wikisource.org/wiki/Les_Fleurs_du_mal/1857. As you can see, there are two "chapters" (these are actually a dedication and a foreword) at the top of the table of contents. Subsequent "chapters" (poems in fact) are then organized into sections: "Spleen et idéal", "Fleurs du mal", "Révolte", etc. It is not clear to me how this organization should be represented by a tree: the first two chapters should somehow be at the same depth as other chapters like "Bénédiction", "Le Soleil", etc., but they are not contained in any section.

Should there be a way in BookManager to indicate the "depth", or "level" of a subtext (independently of its depth in the tree) to properly account for this kind of organization?

Another side remark: I think there are two ways to think about a book section. One is seeing it as a "container" holding subtexts; the other is seeing it as a separation between subtexts. Coming back to the example given above, you can see it as a big list of chapters, separated by titles, rather than a list of sections containing chapters. Besides, you will notice that there are no subpages for the sections, only for the chapters.
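
The two views can be compared with a short Python sketch that turns the "separator" view (a flat list in which section titles sit between chapters) into the "container" view (a tree). The entry format and the `group_by_headings` and `depths` functions are hypothetical; the titles are taken from the example above, and their exact placement is illustrative only.

```python
def group_by_headings(entries):
    """Turn a flat list, where headings act as separators, into a tree."""
    root = []
    current = root  # texts before the first heading stay at the top level
    for kind, title in entries:
        node = {"title": title, "children": []}
        if kind == "heading":
            root.append(node)
            current = node["children"]
        else:
            current.append(node)
    return root


def depths(nodes, depth=1):
    """Yield (title, depth in the tree) for every node, in reading order."""
    for node in nodes:
        yield node["title"], depth
        yield from depths(node["children"], depth + 1)


# Flat "separator" view of a table of contents.
flat = [
    ("text", "Dedication"),
    ("text", "Foreword"),
    ("heading", "Spleen et idéal"),
    ("text", "Bénédiction"),
    ("text", "Le Soleil"),
]

tree = group_by_headings(flat)
tree_depths = dict(depths(tree))
```

In the resulting tree the dedication and foreword end up at depth 1 while the poems are at depth 2, which is exactly the mismatch raised above; a separate "level" attribute on each text would be one way to record that they belong at the same level.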

Any thoughts on that?

Thibaut