Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary/Final
This project is funded by a Project Grant
This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.
Part 1: The Project
Summary
The final result of this project is etytree, a tool hosted on Wikimedia Labs that visualizes etymological relationships between words as extracted from the English Wiktionary.
With etytree, users can search for any word and visualize the network of words etymologically related to it. The screenshot below shows the visualization for the English word "gorgeous".
Etytree is also interactive. When users click on a word in the graph, etytree shows the word's definitions and part of speech (adverb, preposition, verb, etc.) as well as links to the Wiktionary pages from which the etymological relationship was extracted. When users click on a circle (a node of the network), etytree shows the language of the word.
The data behind the visualization is stored in a database (a triplestore) that can be queried at the WMF Labs Virtuoso SPARQL endpoint and is updated at each new release of a Wiktionary dump. The database is created automatically by code that takes the English Wiktionary dump as input and, for every entry in every language, parses three sections:
- Section Etymology containing factual information about the way a word has entered the language and usually some sense of its semantic development;[1]
- Section Descendants containing data about descendants;
- Section Derived terms containing derived words of a given entry.
It also parses the etymtree template.
More specifically, the code applies regular expressions and a context-free grammar, and creates a network of words in which two words are linked if they are etymologically related. This network is stored as a triplestore.
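To make the regular-expression step more concrete, here is a minimal Python sketch. It is not the project's extractor (which, according to this report, is a Java framework based on DBnary). The template shape `{{der|target language|source language|source word}}`, the function name `extract_links`, and the `language:word` node labels are all assumptions chosen for illustration; real Wiktionary markup is far more varied.

```python
import re

# Assumed, simplified template shape: {{der|<target lang>|<source lang>|<source word>}}.
# This pattern is illustrative only and ignores extra template parameters.
DERIVATION = re.compile(r"\{\{der\|([^|}]+)\|([^|}]+)\|([^|}]+)\}\}")


def extract_links(word, wikitext):
    """Return (ancestor, descendant) pairs found in an Etymology section."""
    links = []
    for target_lang, source_lang, source_word in DERIVATION.findall(wikitext):
        ancestor = f"{source_lang}:{source_word}"
        descendant = f"{target_lang}:{word}"
        links.append((ancestor, descendant))
    return links


sample = "From {{der|en|it|pistacchio}}."
print(extract_links("pistachio", sample))
```

Each returned pair corresponds to one link in the network; a full pipeline would also need the grammar step to handle free-text etymologies, which this sketch does not attempt.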
Using etytree
Using the graphical etymology dictionary
A video presentation of etytree, given on April 4, 2017 at WWW2017 in Perth, Australia, is available.
To test the first release of etytree go to http://tools.wmflabs.org/etytree.
There is a button that you can click if you need help.
To test etytree, enter the word "pistachio" in the search bar and press Enter. A circle will appear containing the language code "eng" (English), with the searched word "pistachio" next to it. Click on the word or the circle to see the lexical data associated with "pistachio". To see the network of words etymologically related to "pistachio", double-click on the circle. This displays a directed graph connecting words: an arrow from word A to word B means that word A is an ancestor of word B. For example, "pistacchio" (Italian) is an ancestor of "pistachio" (English), so the graph has an arrow going from "pistacchio" to "pistachio". Clicking on "pistachio" shows the lexical data attached to the English word "pistachio":
- part of speech: "noun";
- gloss: "A deciduous tree, Pistacia vera, grown in parts of Asia for its drupaceous fruit" and "The nutlike fruit of this tree.";
- Wiktionary pages where etymological and lexical data were extracted from: English pistachio, English pistick, Middle Persian pstk', Korean 피스타치오.
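The arrow convention described above (an arrow from A to B means A is an ancestor of B) can be sketched as a small edge list with a backwards traversal. This is an illustrative Python sketch, not etytree's code: the `language:word` labels and the function name `ancestors` are assumptions, and only the first edge comes from this report; the second is a placeholder.

```python
from collections import defaultdict

# Each edge is (ancestor, descendant), matching the arrow direction in the graph.
# Only the first edge comes from the report; the second is a placeholder.
EDGES = [
    ("ita:pistacchio", "eng:pistachio"),
    ("xxx:placeholder_ancestor", "ita:pistacchio"),  # hypothetical
]


def ancestors(word, edges):
    """Collect every node reachable by following arrows backwards from word."""
    parents = defaultdict(set)
    for ancestor, descendant in edges:
        parents[descendant].add(ancestor)

    found = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in found:  # the visited set also guards against loops
                found.add(parent)
                stack.append(parent)
    return found


print(ancestors("eng:pistachio", EDGES))
```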
If you search for a word that has homographs, for example "door", you will get multiple circles. For "door" there are 5 circles because the word exists in 5 languages: Somali, Old Portuguese, English, Dutch, and Lojban (see picture below). You can explore homographs by clicking on circles and words, and choose the word you are interested in by double-clicking on its circle.
A list of interesting examples:
- English "celery" ("celery" and "parsley" are etymologically related through Greek "σέλινον"),
- Haitian "gate" is etymologically related to English "vast", "devastate", "waste",
- English "certain",
- English "wiki"
In some cases etytree returns the message "Sorry, the server cannot extract etymological relationships correctly for this word. We are working to fix this!". This happens when etytree searches the database for words etymologically related to the searched word and finds too many links: the request takes too long and the server returns a timeout error. Some words, for example affixes, have many connections. The most connected affixes in English are “-ly” (7070 connections), “non-” (6900), “un-” (6873), and “-ness” (5312). The most connected French affix is “-ment” (2573). Hungarian “-ok-” (2054), “-ek-” (1809), “-k-” (1821) and Italian “-mente” (2035), “-ità” (1670) are the most connected affixes in their respective languages. The most connected entries that are not affixes are the English lemmas “man” (353 connections), “back” (303), and “head” (290), followed by “work”, “house”, “wood”, “land”, and “line”. These highly connected nodes slow down the queries launched by the visualization tool. We are currently working on the design of more efficient queries given the available data.
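Connection counts like those above can be obtained by counting how many edges touch each node. The Python sketch below shows one way to do this on an in-memory edge list; it is an assumption about how such a ranking could be computed, not a description of how the figures in this report were produced. The function names and the toy edges are hypothetical.

```python
from collections import Counter


def connection_counts(edges):
    """Count how many edges touch each node, ignoring edge direction."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def heavy_nodes(edges, limit):
    """Return nodes with more than `limit` connections, most connected first."""
    return [node for node, n in connection_counts(edges).most_common() if n > limit]


# Toy edges: the words on the right are placeholders, not extracted data.
edges = [("eng:un-", f"eng:word{i}") for i in range(5)] + [("eng:man", "eng:word0")]
print(heavy_nodes(edges, limit=2))
```

A list of heavy nodes could then be used to treat affixes separately, for instance by not expanding them in the visualization; whether that suits etytree is an open design question.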
Querying the database
To explore the database go to http://etytree-virtuoso.wmflabs.org/.
Note that, due to a bug, in order to explore resources you have to replace the string "http://kaiko.getalp.org/" with "http://etytree-virtuoso.wmflabs.org/" in resource URIs. This bug is easy to fix, but I have put the fix on hold while the proposal is being reviewed, because updating the database would take it offline for some hours.
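For anyone scripting against the database, the manual replacement described above can be wrapped in a small helper. This is a minimal Python sketch; the function name `browsable_uri` is hypothetical, and the two base strings are the ones quoted in the note.

```python
OLD_BASE = "http://kaiko.getalp.org/"
NEW_BASE = "http://etytree-virtuoso.wmflabs.org/"


def browsable_uri(resource_uri):
    """Swap the base of a resource URI so it resolves on the Labs server."""
    if resource_uri.startswith(OLD_BASE):
        return NEW_BASE + resource_uri[len(OLD_BASE):]
    return resource_uri  # leave URIs from other namespaces untouched


print(browsable_uri("http://kaiko.getalp.org/dbnary/eng/__ee_1_butter"))
```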
To find all lexemes that contain string "butter" try:
SELECT DISTINCT ?s {
    ?s rdfs:label ?label .
    ?label bif:contains "butter" .
}
To get all ancestors of English word "butter" try:
define input:inference "etymology_ontology"
PREFIX dbetym: <http://kaiko.getalp.org/dbnaryetymology#>
PREFIX eng: <http://kaiko.getalp.org/dbnary/eng/>
SELECT DISTINCT ?o {
    eng:__ee_1_butter dbetym:etymologicallyRelatedTo{1,} ?o .
}
To get the first ancestor of English word "butter" try:
PREFIX dbetym: <http://kaiko.getalp.org/dbnaryetymology#>
PREFIX eng: <http://kaiko.getalp.org/dbnary/eng/>
SELECT DISTINCT ?o {
    eng:__ee_1_butter dbetym:etymologicallyDerivesFrom ?o .
}
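Queries like the ones above can also be sent from a script. The Python sketch below only builds the request URL and does not contact the server. Two things are assumptions rather than facts from this report: that the endpoint is reachable under the `/sparql` path, and that it accepts a `format` parameter to request JSON results. The function name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the endpoint is served under /sparql on the host named in this report.
ENDPOINT = "http://etytree-virtuoso.wmflabs.org/sparql"

FIRST_ANCESTOR_QUERY = """
PREFIX dbetym: <http://kaiko.getalp.org/dbnaryetymology#>
PREFIX eng: <http://kaiko.getalp.org/dbnary/eng/>
SELECT DISTINCT ?o {
    eng:__ee_1_butter dbetym:etymologicallyDerivesFrom ?o .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    # Assumption: the server honours a "format" parameter for the result type.
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(FIRST_ANCESTOR_QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; URL-encoding the query with `urlencode` avoids problems with the `#`, `<`, and `>` characters in the prefixes.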
Methods and activities
- Data extraction: code available on bitbucket; in the Midpoint Report I explain the logic behind data extraction.
- Database management system setup in Wikimedia-labs: wmflabs virtuoso sparql endpoint
- The database can be queried by anyone. Creating the database involved getting an account in the Wikimedia Labs, installing all resources, installing the extraction code, linking to the Wiktionary dump, running the extraction code, compressing the resulting data, setting up the Virtuoso server, uploading data onto the server, launching the server.
- Data visualization in the Tool Labs - code available on github - website http://tools.wmflabs.org/etytree
- The d3 visualization I am currently using is much simpler and probably less effective than the one initially proposed (see demo). While the demo uses trees, this release uses graphs, because the extracted data contains loops. For example, the visualization for the English word "door" has a loop between Old English "dora", English "dor", and Middle English "dorre", "dore", "dor", which cannot be represented by a tree (a branch in a tree never merges back into another branch). However, I am hopeful that once the data improves with the help of editors there will be no loops, or we will have alternative etymologies, and we will be able to use the nicer visualization of the demo.
- Dissemination/Interaction: I interacted with the Wiktionary, Wikidata, and Wikimedia Labs communities and with academics working on linguistics and natural language processing, both on IRC channels and in person.
- More precisely, I had monthly meetings with my project advisor in the Information Engineering Department at the University of Bari, and periodic interactions with the developer of the DBnary software (+ a two day visit to the Laboratoire d'Informatique de Grenoble, where I met him).
- I met with some people from the Wiktionary and Wikidata community at Wikimania, and met part of the Wiktionary community in Lyon. I regularly interact with some people from the community in private or through the IRC Wiktionary channel. I have done some dissemination on the Etymology scriptorium, the Grease pit, the Beer parlour, the Wikidata Project Chat, the Wikidata-Wiktionary project page, and the Bar of the Wikizionario.
- I have presented the first version of etytree at WWW2017 with a lightning talk and a poster. The work on etytree will soon be published in the Conference proceedings.
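The loop problem mentioned in the visualization item above (loops in the extracted data prevent a tree layout) can be checked automatically with a depth-first search. The Python sketch below is one possible check, not part of etytree. The function name `find_cycle` is hypothetical, and the edge directions in the example are invented for illustration: the report names the words involved in the "door" loop but not the direction of each link.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if there is none."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
        children.setdefault(target, [])

    state = {}  # node -> "active" while on the current path, "done" afterwards
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for child in children[node]:
            if state.get(child) == "active":
                # The child is already on the current path: a loop is closed.
                return path[path.index(child):] + [child]
            if child not in state:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in children:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical edge directions between forms named in the report.
looped = [("ang:dora", "enm:dore"), ("enm:dore", "eng:dor"), ("eng:dor", "ang:dora")]
tree = [("ang:dora", "enm:dore"), ("enm:dore", "eng:dor")]
print(find_cycle(looped))
print(find_cycle(tree))
```

A report of such cycles, together with the Wiktionary pages each link was extracted from, could help editors locate the inconsistent etymologies.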
Although I have done some dissemination, I would have liked to present my work at meetings and conferences. However, because my time was limited, I gave priority to developing a functional platform before doing intensive dissemination. For this reason I did not spend the funds allocated for dissemination ($1200), but I had to work two months longer than planned to complete the software. I am asking to reallocate the dissemination funds to compensation for the development of the platform. The extra time was necessary because installing the server on Wikimedia Labs took longer than expected, and testing could only start once the server was installed.
Since I believe dissemination is a fundamental step, I am applying for funds to attend Wikimania and to give talks at Chapter meetings. I am also very likely going to give a talk at a meeting organized in Lyon by the French Wiktionary community, and hopefully one in Paris (I am contacting people there too). I believe it is fundamental to interact with Wiktionarians, and I would like them to join the discussion on the talk page of the project (specifically in the section "Suggestions on how to make Etymology Sections easy to parse"). The interaction on the Etymology scriptorium was particularly productive. Among other things, two Wiktionary contributors explained how diacritics are used in Arabic and Old English. For example, Old English "Ceap", "cēap", and "ċēap" are the same word, as described at Wiktionary:About Old English, and Arabic "قَهْوَة" with diacritics and "قهوة" without are the same word too. Etytree currently represents these variants as different nodes in the network, which is incorrect.
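One way to merge such spelling variants into a single node is to derive a normalized key for each word. The Python sketch below folds case and removes combining marks using the standard `unicodedata` module. It is a possible approach, not what etytree does, and the function name `node_key` is hypothetical. Stripping every combining mark is only appropriate for languages where diacritics are optional, such as the two cases described above; elsewhere it would wrongly merge distinct words, so a real implementation would need per-language rules.

```python
import unicodedata


def node_key(word):
    """Fold case and drop combining marks so spelling variants share one key."""
    decomposed = unicodedata.normalize("NFD", word)
    # Category "Mn" covers nonspacing marks such as macrons and Arabic vowel signs.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped).casefold()


old_english = ["Ceap", "cēap", "ċēap"]
print({node_key(w) for w in old_english})
```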
As this work is based on the English Wiktionary I am planning to spend more energy for dissemination with the English Chapter although contributors to the English Wiktionary belong to many Chapters.
Outcomes and impact
Outcomes
What are the results of your project?
Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.
The main outcomes are the interactive tool and the database of etymological relationships.
The impact is threefold:
- On users: They have a new multilingual interactive tool to visualize etymological relationships and discover new words that are etymologically related.
- On editors: They can spot wrong etymological connections between words and check if they are due to inconsistent etymologies across different Wiktionary pages. Eventually, they could fix inconsistencies in Wiktionary thus improving the quality of Etymology/Descendants/Derived Terms sections. With this purpose in mind, in the visualization I use tooltips that contain links to the Wiktionary entries where the etymological relationship was extracted from. As a result editors can go back to Wiktionary and check why specific words have the etymological connections shown in the visualization.
- On researchers: They can query a database of etymological relationships containing data extracted from the whole English Wiktionary and investigate etymologies on a large scale.
Progress towards stated goals
Please use the table below to:
- List each of your original measures of success (your targets) from your project plan.
- List the actual outcome that was achieved.
- Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation |
Implementation of a java framework based on DBnary to extract etymological relationships as well as other relevant information from the English Wiktionary (including foreign words) | Done, available on bitbucket | |
Creation of a RDF database of etymological relationships as well as other relevant information from the English Wiktionary (including foreign words) | Done, wmflabs virtuoso sparql endpoint | |
Implementation of a visualization tool to visualize the etymological tree of any word in the English Wiktionary. At least 50 users should be testing the beta version | Done, http://tools.wmflabs.org/etytree | I am not sure how to count how many people are visiting the web page. From private communications and from the Wiktionary scriptorium, I would say at least 50 people. |
Drafting of a detailed mapping between the data structures resulting in the RDF database and Wikidata claims, statements, entities, and qualifiers using as a reference Wikidata-toolkit classes (including Wikimedia Language Codes classes). | Done, here | Ontologies are described in detail here and here. About languages, I am parsing Wiktionary codes as extracted from Wiktionary:List of languages and Wiktionary:Etymology only languages. |
Discussion on any of the Wiktionary discussion rooms about how to format etymological definitions whose format is not exportable by the etymology extraction tool (e.g: alternative etymologies of lexemes) or that cannot be parsed and would need editing - showing proof of interaction with at least 30 editors. | I have created a section on the talk page and I have interacted privately with 3 editors | I'm still waiting for feedback as this is very recent work and I just started dissemination |
Defining an ontology for etymological relationships after discussion with the Wiktionary community, both on the project wiki as well as on other wikis - showing proof of interaction with at least 30 editors. | This is not needed at this point | I am not putting tooltips on links to specify the kind of etymological relationship: there is only one type of link, "etymologically derives from". In the future I might specify the type of etymological relationship (back-formation, calque, inherited, borrowing, etc.). |
Discussing the visualization tool user experience on the project wiki and on other wikis - showing proof of interaction with at least 30 editors. | I can only count 9 public interactions on the links provided above. | I think this will change soon as I am currently doing dissemination. |
Producing a list of 1000 visualizations that have been tested and are working correctly; this should correspond to more than 50000 lexemes in Wiktionary as each tree represents 50 lexemes on average. | Done | Visualizations are produced on the fly from the database, which contains all lexemes in the English Wiktionary. |
Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?
I do feel I have achieved my goals, because I have created and published a working tool. I am still working on dissemination, though, which is a long process.
Global Metrics
We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.
- Next to each metric, list the actual numerical outcome achieved through this project.
- Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."
For more information and a sample, see Global Metrics.
Metric | Achieved outcome | Explanation |
1. Number of active editors involved | 5 | 3 people are working with me and I have one volunteer |
2. Number of new editors | 0 | I don't have data to say if I have attracted new editors. Dissemination in academia and possibly schools might help attract new editors. |
3. Number of individuals involved | 8 | The people I am collaborating with and people I contact for help. |
4. Number of new images/media added to Wikimedia articles/pages | I have added a new tool, plus a new database of etymological relationships. | |
5. Number of articles added or improved on Wikimedia projects | ~ 300 | Through visualizations I could spot etymologies that were somehow incorrect or imprecise. |
6. Absolute value of bytes added to or deleted from Wikimedia projects | 8 GB | This is the size of the RDF database uploaded to WMF Labs |
- Learning question
- Did your work increase the motivation of contributors, and how do you know?
- At this point I don't have data to answer this question.
Indicators of impact
Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are most likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.
How did you improve quality on one or more Wikimedia projects?
I think this project can promote innovation and can represent a useful resource for users, editors and researchers. In particular this project makes word origin information available as a large, machine-readable network of words in many languages.
There is ongoing discussion on the inclusion of data extracted from Wiktionary to Wikidata and some users question how this could be possible/useful. The tool created with this project shows how a database of lexicographical data could be useful to explore relationships and properties of words in innovative, interactive and fascinating ways. Watching the demo of this tool made some Wiktionarians at Wikimania realize how interesting it would be to export data into a database. With this tool, users can discover new words that derive from the same ancestral word, both in their own language and in other languages. Editors can check the consistency of etymological information across multiple Wiktionary pages through the visualization of the whole etymological tree. The research community can look at lexicographical and in particular etymological data on a large scale which can have a strong impact in a field where no such resource was available before.
The data extracted in the database can be exported to Wikidata as the format is compatible. As already mentioned here, the Primary Sources Tool could be used to export data to Wikidata, since the dataset possibly contains extraction errors, and thus needs a validation step, and since it is too large for a direct inclusion via a bot. In this case, the dataset should be serialized into the QuickStatements syntax.
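The serialization step could look roughly like the Python sketch below, which writes one statement per line with tab-separated fields (subject, property, value), the basic shape of QuickStatements version 1 commands. The identifiers are placeholders: this report does not say which Wikidata entities or properties would be used, so a real export would need a mapping from the extracted resources to actual Wikidata IDs. The function name `to_quickstatements` is hypothetical.

```python
def to_quickstatements(triples):
    """Serialize (subject, property, value) ID triples as tab-separated lines."""
    return "\n".join("\t".join(triple) for triple in triples)


# Placeholder IDs: no real Wikidata items or properties are implied.
triples = [("Q_DESCENDANT", "P_DERIVES_FROM", "Q_ANCESTOR")]
print(to_quickstatements(triples))
```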
Project resources
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
- the etytree tool @ tools.wmflabs.org
- the sparql endpoint @ wmflabs.org
- Wiktionary Etymology scriptorium
- Wiktionary Grease pit
- Wiktionary Beer parlour
- Wikidata Project Chat
- Wikidata-Wiktionary project page
- Bar of the Wikizionario
- the extraction code @bitbucket
- the visualization code and a README file with general instructions @github
- d3-js google group
- www2017, lightning talk video
Learning
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
- My learning pattern: Networking with outside experts to improve your project
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
Other recommendations
If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
- I am writing a renewal proposal, see here.
Part 2: The Grant
Finances
Actual spending
I have some funds ($1200) allocated for dissemination which I would like to reallocate to Epantaleo for software development. The reallocation is motivated as follows: we gave priority to having a fully functional platform before involving users and contributors in the discussion. As we plan to continue this project and ask for a renewal, dissemination will happen anyway, just not in this project run.
Remaining funds
Do you have any unspent funds from the grant?
Please answer yes or no. If yes, list the amount you did not use and explain why.
- I have not spent the $1200 allocated for dissemination, and I would like to reallocate it to Epantaleo as funds for software development, because the project length increased from 6 to 8 months. Priority was given to getting a fully functional platform. Dissemination will happen anyway, just not within this project.
Documentation
Did you send documentation of all expenses paid with grant funds to grantsadmin wikimedia.org, according to the guidelines here?
Please answer yes or no. If no, include an explanation.
- No, I don't have expenses.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Please answer yes or no.
- Yes
Is your project completed?
Please answer yes or no.
- Yes
Grantee reflection
It was great to have the opportunity to work for the WMF. I had the opportunity to work with highly skilled people, both users and employees, and I would love to continue this collaboration. I particularly enjoyed participating in Wikimania, which I think is a great opportunity to start or strengthen collaborations and to promote innovation.