Grants:Project/Hjfocs/soweego/Timeline


Timeline for soweego

Timeline | Date
Target databases selection | September 2018
Link validator | October 2018
Link merger | February 2019
Target databases linkers | July 2019
Identifiers datasets | July 2019
Software package | July 2019


Overview

Monthly updates

Each update will cover a 1-month time span, starting from the 9th day of the current month. For instance, July 2018 means July 9th to August 8th 2018.

July 2018: target selection & small fishes

       TL;DR:
  • Mix'n'match is the tool for small fishes. soweego will not handle them.

The very first task of this project is to select the target databases.[1] We see two directions here: either we focus on a few big and well known targets as per the project proposal, or we can try to find a technique to link a lot of small ones from the long tail, as suggested by ChristianKl[2] (thanks for the precious feedback!).

We used SQID as a starting point to get a list of people databases that are already used in Wikidata, sorted in descending order of usage.[3] This is useful to split the candidates into big and small fishes, namely the head and the (long) tail of the result list respectively. Let's start with the small fishes.

Quoting ChristianKl, it would be ideal to create a configurable tool that enables users to add links to new databases in a reasonable timeframe. Consequently, we carried out the following investigation: we considered as small fishes all the entries in SQID with an external ID datatype, used for class human (Q5), and with fewer than 15 uses in statements. We detail below a set of critical issues about this direction, as well as their possible solutions.
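
For illustration only, a minimal sketch of how such a candidate list could be computed with a SPARQL query against the Wikidata Query Service. The query is our own assumption: SQID exposes these statistics precomputed, and a live aggregation of this kind would most likely time out.

<syntaxhighlight lang="python">
import requests

# Sketch only: conceptually, we want external-ID properties used on humans
# (Q5) with fewer than 15 uses in statements. In practice this aggregation
# is too heavy for the live endpoint, hence the reliance on SQID.
QUERY = """
SELECT ?property (COUNT(?human) AS ?uses) WHERE {
  ?property wikibase:propertyType wikibase:ExternalId ;
            wikibase:directClaim ?claim .
  ?human wdt:P31 wd:Q5 ;
         ?claim ?identifier .
}
GROUP BY ?property
HAVING (COUNT(?human) < 15)
ORDER BY DESC(?uses)
"""

response = requests.get(
    'https://query.wikidata.org/sparql',
    params={'query': QUERY, 'format': 'json'},
)
for binding in response.json()['results']['bindings']:
    print(binding['property']['value'], binding['uses']['value'])
</syntaxhighlight>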

The analysis of a small fish can be broken down into a set of steps. This is also useful to translate the process into software and to make each step flexible enough for dealing with the heterogeneity of the long tail targets. The steps have been implemented into a piece of software by MaxFrax96.[4]

Retrieving the dump

This sounds pretty self-evident: if we aim at linking two databases, then we need access to all their entities. Since we focus on people, it is therefore necessary to download the appropriate dump for each small fish we consider.

Problem
In the real world, such a trivial step raises a first critical issue: not all the database websites give us the chance to download the dump.

Solutions

  1. Cheap, but not scalable: to contact the database administrator and discuss dump releases for Wikidata;
  2. expensive, but possibly scalable: to autonomously build the dump. If a valid URI exists for each entity, we can re-create the dump. However, this is not trivial to generalize: sometimes it is impossible to retrieve the list of entities, sometimes the URIs are merely HTML pages that require Web scraping. See the following examples:
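
To make option 2 concrete, here is a minimal sketch of what re-creating a dump could look like, assuming a target that serves one JSON document per entity at a predictable URI. The base URI and identifier list are hypothetical; when only HTML pages are available, a scraper would replace the JSON parsing.

<syntaxhighlight lang="python">
import json
import time

import requests

BASE_URI = 'https://example-catalog.org/people/{identifier}.json'  # hypothetical

def rebuild_dump(identifiers, output_path='dump.jsonl', delay=1.0):
    """Fetch every entity page and store it as one JSON line per entity."""
    with open(output_path, 'w', encoding='utf8') as dump:
        for identifier in identifiers:
            response = requests.get(BASE_URI.format(identifier=identifier))
            if response.status_code != 200:
                continue  # the identifier may not exist online
            dump.write(json.dumps(response.json()) + '\n')
            time.sleep(delay)  # be polite with the target server

# Usage: rebuild_dump(['id1', 'id2'])
</syntaxhighlight>
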
Handling the format

The long tail is roughly broken down as follows:

  • XML;
  • JSON;
  • RDF;
  • HTML pages with styling and whatever a Web page can contain.

Problem
Formats are heterogeneous. We focus on open data and RDF, as dealing with custom APIs is out of scope for this investigation. We also hope that the open data trend of recent years will help us. However, a manual scan of the small fishes yielded poor results: out of 16 randomly picked candidates, only YCBA agent ID (P4169) was in RDF, and it has thousands of uses in statements at the time of writing this report.

Solution
To define a way (by scripting for instance) to translate each input format into a standard project-wide one. This could be achieved during the next step, namely ontology mapping between a given small fish and Wikidata.
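
A minimal sketch of the kind of format normalization we have in mind, assuming hypothetical per-format parsers that all emit the same project-wide dictionary representation:

<syntaxhighlight lang="python">
import json
import xml.etree.ElementTree as ElementTree

def from_json(path):
    """Parse a JSON dump into the project-wide representation."""
    with open(path, encoding='utf8') as dump:
        for record in json.load(dump):
            # Hypothetical field names: every parser must emit the same keys
            yield {'id': record['id'], 'name': record['name']}

def from_xml(path):
    """Parse an XML dump into the project-wide representation."""
    for _, element in ElementTree.iterparse(path):
        if element.tag == 'person':  # hypothetical element name
            yield {'id': element.get('id'), 'name': element.findtext('name')}

PARSERS = {'json': from_json, 'xml': from_xml}  # RDF, HTML, ... would plug in here

def normalize(path, input_format):
    """Dispatch to the right parser, based on the declared input format."""
    return PARSERS[input_format](path)
</syntaxhighlight>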

Mapping to Wikidata

Linking Wikidata items to target entities requires a mapping between both metadata/schemas.

Solution
The mapping can be manually defined by the community: a piece of software will then apply it. To implement this step, we also need the common data format described above.
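
For example, a community-defined mapping could be as simple as a dictionary from target fields to Wikidata properties, which the software then applies mechanically. The target field names below are hypothetical:

<syntaxhighlight lang="python">
# Hypothetical mapping for a small fish: target field -> Wikidata property
FIELD_TO_PROPERTY = {
    'name': 'label',      # used for matching, not as a statement
    'birthDate': 'P569',  # date of birth
    'deathDate': 'P570',  # date of death
    'viaf': 'P214',       # VIAF ID, useful as a cross-catalog link
}

def to_statements(entity):
    """Translate a normalized target entity into (property, value) pairs."""
    return [
        (FIELD_TO_PROPERTY[field], value)
        for field, value in entity.items()
        if field in FIELD_TO_PROPERTY and FIELD_TO_PROPERTY[field].startswith('P')
    ]
</syntaxhighlight>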

Side note: available entity metadata
Small fishes may contain entity metadata which are likely to be useful for automatic matching. The entity linking process may dramatically improve if the system is able to mine extra property mappings. This is obvious when metadata are in different languages, but in general we cannot be sure that two different databases hold the same set of properties, or even whether they have any in common.

Conclusion

It is out of scope for the project to perform entity linking over the whole set of small fishes. On the other hand, it may make sense to build a system that lets the community plug in new small fishes with relative ease. Nevertheless, this would require a reshaping of the original proposal, which comes with its own risks:

  1. it is probably not a safe investment of resources;
  2. possible results would not come in the short term, as a lot of work would be needed to create a flexible system for everybody's needs;
  3. it is likely that the team has not yet faced all the possible extra problems of this direction in this phase.
Mix'n'match

Most importantly, a system to plug in new small fishes already exists. Mix'n'match[5] is specifically designed for the task.[6] Instead of reinventing the wheel, we will join efforts with our advisor Magnus Manske in his work on big fishes.[7]

August 2018: big fishes selection

       TL;DR:
  • the soweego team selected 4 candidate targets:
  1. BIBSYS (Q4584301). Coverage = 21%; discarded, see #September 2018
  2. Discogs (Q504063). Coverage = 33%
  3. Internet Movie Database (Q37312). Coverage = 42%
  4. MusicBrainz (Q14005). Coverage = 35%
  5. X (Q918). Coverage = 31%
  • the soweego team will join efforts with Magnus Manske's work on large catalogs.
    Motivation #1: target investigation

    The following table displays the result of our investigation on candidate big fishes. We computed the Wikidata item counts as follows.

    • Wikidata item count queries on specific classes:
      • 4,502,944 humans, [1];
      • 495,177 authors, [2];
      • 230,962 musicians, [3];
      • 74,435 bands, [4];
      • 239,137 actors, [5];
      • 42,164 directors, [6];
      • 13,588 producers, [7];
    • Wikidata link count queries: use the property for identifiers:
      • humans, e.g., 581,397 for LoC, [8];
      • authors, e.g., 168,188 for GND, [9];
      • musicians, e.g., 77,640 for Discogs, [10];
      • bands, e.g., 37,158 for MusicBrainz, [11].
    Resource # entries Reference to # entries Dump download URL Online access (e.g., SPARQL) # Wikidata items with link / without link Available metadata Links to other sources In mix'n'match TL;DR: Candidate?
    [12] 7,037,189 [13] [14] SRU: [15], OAI-PHM: [16] humans: 571,357 / 3,931,587; authors: 168,188 / 326,989 id, context, preferredName, surname, forename, describedBy, type, dateOfBirth, dateOfDeath, sameAs GND, BNF, LoC, VIAF, ISNI, English Wikipedia, Wikidata Yes (large catalogs) Already processed by Mix'n'match large catalogs, see [17]   Oppose
     [18] > 8 million names authority file [19] [20] Not found humans: 581,397 / 3,921,547; authors: 204,813 / 290,364 URI, Instance Of, Scheme Membership(s), Collection Membership(s), Fuller Name, Variants, Additional Information, Birth Date, Has Affiliation, Descriptor, Birth Place, Associated Locale, Birth Place, Gender, Associated Language, Field of Activity, Occupation, Related Terms, Exact Matching Concepts from Other Schemes, Sources Not found [21] Already well represented in Wikidata, low impact expected   Oppose
     [22] 8,738,217 [23] [24] Not found actors, directors, producers: 197,626 / 104,392 name, birth year, death year, profession, movies Not found No Metadata allow us to run easy yet effective matching strategies; the license can be used for linking, see [25]; quite well represented in Wikidata (2/3 of the relevant subset)   Support
    [26] 2,181,744 authors, found in home page [27] SPARQL: [28] humans: 356,126 / 4,146,818; authors: 148,758 / 346,419 country, language, variants of name, pages in data.bnf.fr, sources and references LoC, GND, VIAF, IdRef, Geonames, Agrovoc, Thesaurus W Yes (large catalogs) Seems well shaped; already processed by Mix'n'match large catalogs, see [29]   Oppose
    [30] about 1.5 M dataset described at [31] [32] SPARQL humans: 94,009 / 4,408,935; authors: 40,656 / 454,521 depend on the links found for the ID VIAF, GND [33] Underrepresented in Wikidata, small subset (47k entries) in Mix'n'match, of which 67% is unmatched, high impact expected   Strong support
    [34] About 500 k Search for a,b,c,d... in the search window Not found SOLR humans: 378,261 / 4,124,683; authors: 153,024 / 342,153 name, language, nationality, notes [35], [36] No Discarded: no dump available   Oppose
    [37] 417 k [38] [39] API humans: 303,235 / 4,199,709; authors: 4,966 / 490,211 PersonId, EngName, ChName, IndexYear, Gender, YearBirth, DynastyBirth, EraBirth, EraYearBirth, YearDeath, DynastyDeath, EraDeath, EraYearDeath, YearsLived, Dynasty, JunWang, Notes, PersonSources, PersonAliases, PersonAddresses, PersonEntryInfo, PersonPostings, PersonSocialStatus, PersonKinshipInfo, PersonSocialAssociation, PersonTexts list of external sources: [40] [41] The database is in a proprietary format (Microsoft Access)   Weak support
    [42] About 500 k found as per [43] [44] API, SRU humans: 274,574 / 4,228,370; authors: 92,662 / 402,515 name, birth, death, identifier, SKOS preferred label, SKOS alternative labels VIAF, GeoNames, Wikipedia, LoC Subject Headings [45] Lots of links to 3 external databases, but few metadata; seems to be the same as BNF.   Oppose
    [46] 1,393,817 [47] [48] API humans: 98,115 / 4,404,829; musicians & bands: 114,798 / 190,599 URI, Type, Gender, Born, Born in, Died, Died in, Area, IPI code, ISNI code, Rating, Wikipedia bio, Name, Discography, Annotation, Releases, Recordings, Works, Events, Relationships, Aliases, Tags, Detail [49], [50], [51], [52], [53], [54], [55], [56], VIAF, Wikidata, English Wikipedia, YouTube/Vevo, [57], [58], resource official web page, [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75] No High quality data, plenty of external links, totally open source, regular dump releases   Strong support
     [76] About 1 M People found as per [77], with restriction to people Not found API humans: 128,536 / 4,374,408; authors: 42,991 / 452,186 birth, death, nationality, language, archival resources, related resources, related external links, ark ID, SNAC ID [78], [79] [80] No dump available, 99.9% already matched in Mix'n'match   Oppose
    [81] 840,883 [82] [83] API humans: 167,663 / 4,335,281; authors: 410 / 494,767 name, country, keywords, other IDs, education, employment, works [84], [85] Yes (large catalogs) Already processed by Mix'n'match large catalogs, see [86]   Oppose
    [87] 6,921,024 [88] [89] API humans: 140,883 / 4,362,061; authors: 58,823 / 436,354 name, birth year, death year Not found [90] Only names, birth and death dates; no dedicated pages for people entries; source code: [91]   Neutral
    [92] Not found Not found Not found Not found Not found Not found Not found [93] Seems closed. The dataset providers claimed they would publish a new site, not happened so far [94]   Oppose
    [95] 336 M active [96] Not found API humans: 85,527 / 4,417,417 verified account, user name, screen name, description, location, followers, friends, tweets, listed, favorites, statuses Plenty No No official dump available, but the team has collected the dataset of verified accounts. Links stem from home page URLs, should be filtered according to a white list. Underrepresented in Wikidata, high impact expected   Strong support
    [97] Not found Not found Not found Not found Not found Not found Not found No Seems it does not contain people data, only books by author   Oppose
    [98] 20,255 [99] [100] SPARQL humans: 81,455 / 4,421,489; authors: 34,496 / 460,681 Subject line, Homepage ID, Synonym, Broader term, Hyponym, Related words, NOTE Classification symbol (NDLC), Classification symbol (NDC 9), Classification symbol (NDC 10), Reference (LCSH), Reference (BSH 4), Source (BSH 4), Source edit history, Created date, last updated VIAF, LoC No Mismatch between the actual dataset and the links in Wikidata; it extensively refers to VIAF and LoC (see [101], entry 12 of the table and [102])   Oppose
    [103] Not found Not found API humans: 97,599 / 4,405,345; authors: 28,404 / 466,773 Not found Not found Not found No dump available   Oppose
    [104] 5,736,280 [105] [106] API + Python client humans: 66,185 / 4,436,759; musicians & bands: 78,522 / 226,875 artist name, real name, short bio, aliases, releases, band membership Plenty, top-5 frequent: [107], [108], [109], [110], [111] [112] CC0 license, 92% not matched in Mix'n'match, high impact expected   Strong support
    [113] 1,269,331 [114] [115], [116] SPARQL humans: 106,839 / 4,396,105; authors: 50,251 / 444,926 name, bio [117], [118], [119], [120], [121], [122], [123], [124], [125] [126] Outdated dump (2013), low quality data, 75% not matched in Mix'n'match   Oppose

    Motivation #2: coverage estimation


    We computed coverage estimations over   Strong support and   Support candidates to assess their impact on Wikidata, as suggested by Nemo bis[8] (thanks for the valuable comment!). In a nutshell, coverage means how many existing Wikidata items could be linked.

    For each candidate, the estimation procedure is as follows.

    1. pick a representative 1% sample of Wikidata items with no identifier for the candidate. Representative means e.g., musicians for MusicBrainz (Q14005): it would not make sense to link generic people to a catalog of musical artists;
    2. implement a matching strategy:
      • perfect = perfect matches on names, sitelinks, external links;
      • similar = matches on names and external links based on tokenization and stopwords removal;
      • SocialLink, as per our approach;[9]
    3. compute the percentage of matched items with respect to the sample.
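
    A minimal sketch of this estimation procedure, assuming the sample and the target catalog are lists of dictionaries with a name and a set of external links; the stopword list is hypothetical, and the actual implementations are of course richer.

<syntaxhighlight lang="python">
import re

STOPWORDS = {'the', 'jr', 'sr', 'dr'}  # hypothetical stopword list

def tokens(name):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return frozenset(re.split(r'\W+', name.lower())) - STOPWORDS - {''}

def coverage(sample, target, similar=False):
    """Percentage of sampled Wikidata items with at least one target match."""
    target_names = {entity['name'] for entity in target}
    target_links = {link for entity in target for link in entity['links']}
    target_tokens = {tokens(entity['name']) for entity in target}
    matched = 0
    for item in sample:
        # 'perfect' strategy: exact names or shared external links
        perfect = item['name'] in target_names or item['links'] & target_links
        # 'similar' strategy: token sets after stopword removal
        fuzzy = similar and tokens(item['name']) in target_tokens
        matched += bool(perfect or fuzzy)
    return 100 * matched / len(sample)
</syntaxhighlight>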

    The table below shows the results. It is worth observing that similar coverage percentages correspond to different matching strategies: this suggests that each candidate may require different algorithms to achieve the same goal. Our hypothesis is that higher data quality entails simpler solutions: for instance, MusicBrainz (Q14005) seems to be a well-structured catalog, hence the simplest strategy is sufficient.

    Target | Sample | Matching strategy | # matches | % coverage
    BIBSYS (Q4584301) | 4,249 authors and teachers | Perfect | 899 | 21%
    Discogs (Q504063) | 1,253 musicians | Perfect & similar | 414 | 33%(1)
    Internet Movie Database (Q37312) | 1,022 actors, directors and producers | Perfect | 432 | 42%
    MusicBrainz (Q14005) | 1,100 musicians | Perfect | 388 | 35%
    X (Q918) | 15,565 living humans | SocialLink | 4,867(2) | 31%

    (1) using perfect matching strategy only: 4.6%
    (2) out of which 609 are confident matches

    September 2018


    We manually assessed small subsets of the matches obtained from the coverage estimations. Given the scores and the evaluation, we decided to discard BIBSYS (Q4584301). The main reasons beyond the mere score follow.

    • The dump is not synchronized with the online data:
      • identifiers in the dump may not exist online;
      • cross-catalog links in the dump may not be the same as online;
    • the dump suffers from inconsistency:
      • the same identifier may have multiple links, thus flawing the link-based matching strategy;
      • links from different catalogs may have different quality, e.g., one may be correct, the other not;
    • online data can also be inconsistent. A match may be correct, but the online identifier may have a wrong cross-catalog link.

    We report below a first round of evaluation that estimates the performance of already implemented matchers over the target catalogs. Note that MusicBrainz (Q14005) was evaluated more extensively thanks to MaxFrax96's thesis work.

    Target | Matching strategy | # samples | Precision
    BIBSYS (Q4584301) | Perfect links | 10 | 50%
    Discogs (Q504063) | Similar links | 10 | 90%
    Discogs (Q504063) | Similar names | 32 | 97%
    Internet Movie Database (Q37312) | Perfect names | 10 | 70%
    MusicBrainz (Q14005) | Perfect names | 38 | 84%
    MusicBrainz (Q14005) | Perfect names + dates | 32 | 100%
    MusicBrainz (Q14005) | Similar names | 24 | 71%
    MusicBrainz (Q14005) | Perfect links | 71 | 100%
    MusicBrainz (Q14005) | Similar links | 102 | 99%
    X (Q918) | SocialLink | 67 | 91%
     
    Slides of MaxFrax96's BSc thesis on soweego

    Technical

    Dissemination
    • MaxFrax96 successfully defended his bachelor thesis[10] on soweego. Congratulations!
    • Lc_fd joined the project as a volunteer developer. Welcome!

    October 2018


    During this month, the team devoted itself to software development, with tasks broken down as follows.

    Application package


    This is how the software is expected to ship. Tasks:

    • packaged soweego in 2 Docker containers:
      1. test launches a local database instance to enable work on a target catalog dump extraction and import;
      2. production feeds the shared Toolforge large catalogs database;
    • let a running container see live changes in the code.

    Validator


    This component is responsible for monitoring the divergence between Wikidata and a given target catalog. It implements bullet point 3 of the project review committee recommendations[11] and performs validation of Wikidata content based on 3 main criteria:[12]

    1. existence of target identifiers;
    2. agreement with the target on third-party links;
    3. agreement with the target on "stable" metadata.
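
    A minimal sketch of how these criteria can translate into checks, assuming both the Wikidata item and its target counterpart are available as plain dictionaries; the field names are hypothetical, not soweego's actual data model.

<syntaxhighlight lang="python">
def validate(wikidata_item, target_entity):
    """Apply the 3 validation criteria; return the list of failed ones."""
    failures = []
    # Criterion 1: the identifier must still exist in the target catalog
    if target_entity is None:
        failures.append('missing identifier')
        return failures
    # Criterion 2: third-party links should agree
    if not wikidata_item['links'] & target_entity['links']:
        failures.append('no shared third-party links')
    # Criterion 3: "stable" metadata (e.g., birth date) should agree
    for key in ('birth_date', 'death_date'):
        if key in wikidata_item and key in target_entity:
            if wikidata_item[key] != target_entity[key]:
                failures.append('%s mismatch' % key)
    return failures
</syntaxhighlight>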

    Tasks:

    • existence-based validation (criterion 1);
    • full implementation of the link-based validation (criterion 2).

    Importer


    This component extracts a given target catalog data dump, cleans it and imports it into Magnus_Manske's database on Toolforge. It follows ChristianKl's suggestion[13] and is designed as a general-purpose facility for the developer community to import new target catalogs. Tasks:

    • worked on MusicBrainz (Q14005):
      • split dump into musicians and bands;
      • extraction and import of musicians and bands;
      • extraction and import of links.
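
    A minimal sketch of the kind of general-purpose facility we have in mind, assuming each new target catalog only needs to provide a dump extractor that yields normalized rows; class and method names are hypothetical, not soweego's actual API.

<syntaxhighlight lang="python">
class BaseDumpExtractor:
    """Contract that a new target catalog must fulfil to be imported."""

    def download(self, output_dir):
        """Download the latest dump and return the local file paths."""
        raise NotImplementedError

    def extract(self, dump_path):
        """Yield cleaned-up rows, ready to be inserted into the database."""
        raise NotImplementedError

def import_catalog(extractor, database):
    """Run the full import pipeline for a given catalog extractor."""
    for dump_path in extractor.download('/tmp/dumps'):
        for row in extractor.extract(dump_path):
            database.insert(row)  # hypothetical database wrapper
    database.commit()
</syntaxhighlight>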

    Ingestor


    This component is a Wikidata bot that uploads the linker and validator output. Tasks:

    • deprecate identifier statements not passing validation;
    • handle statements to be added: if the statement already exists in Wikidata, just add a reference node.
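
    A minimal sketch of the upload decision logic just described; the helper functions are placeholders standing in for the actual Wikidata API calls.

<syntaxhighlight lang="python">
def deprecate(statement):
    """Placeholder: mark the identifier statement as deprecated in Wikidata."""
    print('DEPRECATE', statement)

def add_reference(statement, reference):
    """Placeholder: attach a reference node to an existing claim."""
    print('REFERENCE', statement, reference)

def add_statement(statement, reference):
    """Placeholder: create a brand new referenced claim."""
    print('ADD', statement, reference)

def ingest(statement, passes_validation, existing_statements, reference):
    """Decision logic for a single (item, property, value) statement."""
    if not passes_validation:
        # Identifier statements failing validation get deprecated
        return deprecate(statement)
    if statement in existing_statements:
        # Do not duplicate: just add a reference node to the existing claim
        return add_reference(statement, reference)
    return add_statement(statement, reference)
</syntaxhighlight>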

    Utilities

    • Complete URL validation: pre-processing, syntax parsing and resolution;
    • URL tokenization;
    • text tokenization;
    • match URLs known by Wikidata as external identifiers and convert them accordingly.
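
    For illustration, a minimal sketch of what the URL tokenization utility could look like; the stopword set is a hypothetical example, not the actual one.

<syntaxhighlight lang="python">
import re
from urllib.parse import urlsplit

URL_STOPWORDS = {'www', 'index', 'html', 'php', 'com', 'org', 'net'}  # hypothetical

def tokenize_url(url):
    """Split a URL into lowercased tokens, dropping the scheme and noise."""
    parts = urlsplit(url.lower())
    raw = re.split(r'[\W_]+', ' '.join((parts.netloc, parts.path, parts.query)))
    return {token for token in raw if token and token not in URL_STOPWORDS}

# Example: tokenize_url('https://www.example.org/artist/Foo_Bar?id=42')
# -> {'example', 'artist', 'foo', 'bar', 'id', '42'}
</syntaxhighlight>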

    November 2018


    The team focused on the importer and linker modules.

    Importer

    • Worked on Discogs (Q504063):
      • split dump into musicians and bands;
      • extraction and import of musicians and bands;
      • extraction, validation and import of links;
      • extraction and import of textual data.
    • major effort on building full-text indices on the Toolforge database;
    • refinements for MusicBrainz (Q14005).

    Linker

    • Added baseline strategies to the importer workflow. They now consume input from the Toolforge database;
    • adapted and improved the similar name strategy, which leverages full-text indices on the Toolforge database;
    • preparations for the baseline datasets.

    Validator

    • First working version of the metadata-based validation (criterion 3).

    Ingestor

    • Add referenced third-party external identifier statements from link-based validation;
    • add referenced described at URL (P973) statements from link-based validation;
    • add referenced statements from metadata-based validation.

    Dissemination


    December 2018


    We focused on 2 key activities:

    1. research on probabilistic record linkage;[15]
    2. packaging of the complete soweego pipeline.

    Probabilistic record linkage


    Deterministic approaches are rule-based linking strategies and represent a reasonable baseline. On the other hand, probabilistic ones leverage machine learning algorithms and are known to perform effectively.[16] Therefore, we expect our baseline to serve as the set of features for probabilistic methods.

    • First exploration and hands-on work with the recordlinkage library;[17]
    • understood how the library applies the general workflow: cleaning, indexing, comparison, classification, evaluation;
    • published a report that details the required implementation steps;[18]
    • started the first probabilistic linkage experiment, i.e., using the naïve Bayes algorithm;[19]
    • recordlinkage extensively employs DataFrame objects from the well-known pandas Python library:[20] investigation and hands-on work with it;
    • started work on the training set building:
      • gathered the Wikidata training set live from the Web API;
      • gathered the target training set from the Toolforge database;
      • converted both to suitable pandas dataframes;
    • custom implementation of the cleaning step;
    • indexing implemented as blocking on the target identifiers;
    • started work on feature extraction.
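
    A minimal sketch of the recordlinkage workflow mentioned above (cleaning, indexing, comparison, classification), with toy dataframes and hypothetical column names; it is not soweego's actual feature set.

<syntaxhighlight lang="python">
import pandas
import recordlinkage
from recordlinkage.preprocessing import clean

# Toy dataframes standing in for the Wikidata and target datasets
wikidata = pandas.DataFrame(
    {'name': ['John Doe', 'Jane Roe'], 'birth_year': ['1970', '1980']}
)
target = pandas.DataFrame(
    {'name': ['J. Doe', 'Jane Roe'], 'birth_year': ['1970', '1980']}
)

# Cleaning
wikidata['name'] = clean(wikidata['name'])
target['name'] = clean(target['name'])

# Indexing: a full index here; soweego blocks on target identifiers instead
indexer = recordlinkage.Index()
indexer.full()
candidate_pairs = indexer.index(wikidata, target)

# Comparison: each comparison becomes a feature column
comparator = recordlinkage.Compare()
comparator.string('name', 'name', method='levenshtein', threshold=0.85, label='name')
comparator.exact('birth_year', 'birth_year', label='birth_year')
features = comparator.compute(candidate_pairs, wikidata, target)

# Classification: naïve Bayes, trained on a known matching pair
training_matches = pandas.MultiIndex.from_tuples([(1, 1)])
classifier = recordlinkage.NaiveBayesClassifier()
classifier.fit(features, training_matches)
print(classifier.predict(features))
</syntaxhighlight>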

    Pipeline packaging

    • Finalized work on full-text indices on the Toolforge database;
    • adapted perfect name and similar link baseline strategies to work against the Toolforge database;
    • built a utility to retrieve mappings between Toolforge database tables and SQLAlchemy entities;
    • completed the similar link baseline strategy;
    • linking based on edit distances now works with SQLAlchemy full-text indices;
    • baseline linking can now run from the command line interface;
    • various Docker improvements:
      • set up volumes in the production instance;
      • allow custom configuration in the test instance;
      • set up the execution of all the steps for the final pipeline.
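
    A minimal sketch of what a full-text index on a target table can look like with SQLAlchemy on MySQL; the table and column names are hypothetical, not the actual Toolforge schema.

<syntaxhighlight lang="python">
from sqlalchemy import Column, Index, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class MusicianEntity(Base):  # hypothetical table
    __tablename__ = 'musician'
    internal_id = Column(Integer, primary_key=True)
    catalog_id = Column(String(50), index=True)
    name = Column(String(255))
    # FULLTEXT index, so that MATCH ... AGAINST queries become possible
    __table_args__ = (Index('ftix_name', 'name', mysql_prefix='FULLTEXT'),)

engine = create_engine('sqlite://')  # stand-in; MySQL on Toolforge in practice
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# On MySQL this compiles to MATCH (name) AGAINST ('john doe')
query = session.query(MusicianEntity).filter(MusicianEntity.name.match('john doe'))
</syntaxhighlight>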

    January 2019


    Happy new year! We are pleased to announce a new member of the team: Tupini07. Welcome on board! Tupini07 will work on the linker for Internet Movie Database (Q37312). The development activities follow.

    IMDb

    • Clustered professions related to music;
    • reached out to IMDb licensing department;
    • understood how the miscellaneous profession is used in the catalog dump.

    Probabilistic linker

    • Investigated Naïve Bayes classification in the recordlinkage Python library;
    • worked on feature extraction;
    • grasped performance evaluation in the recordlinkage Python library;
    • completed the Naïve Bayes linker experiment;
    • engineered the vector space model feature;
    • gathered Wikidata aliases for dataset building;
    • discussed how to handle feature extraction in different languages.

    Baseline linker

    • Assessed similar URLs link results;
    • piped linker output to the ingestor;
    • read input data from the Wikidata live stream;
    • worked on birth and death dates linking strategy.

    Importer

    Package
    • Installed less in Docker;
    • set up the final pipeline as a Docker container;
    • hit a segmentation fault when training in a Docker container on a specific machine;
    • improved Docker configuration.

    February 2019


    The team fully concentrated on software development, with a special focus on the probabilistic linkers.

    Probabilistic linker

    • Handled missing data from Wikidata and the target;
    • parsed dates at preprocessing time;
    • started work on blocking via queries against the target full-text index;
    • understood custom blocking logic in the recordlinkage Python library;
    • resolved object QIDs;
    • dropped data columns containing missing values only;
    • included negative samples in the training set;
    • enabled dump of evaluation predictions to CSV;
    • started work on scaling up the whole probabilistic pipeline:
      • implemented chunk processing techniques;
      • parallelized feature extraction;
      • avoided redundant input/output operations on files when gathering target datasets.
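
    A minimal sketch of the chunking and parallelization idea, assuming a hypothetical extract_features function applied over the candidate pairs; the real feature extraction is of course the expensive part.

<syntaxhighlight lang="python">
from concurrent.futures import ProcessPoolExecutor

import pandas

def extract_features(chunk):
    """Hypothetical stand-in for the (expensive) feature extraction step."""
    return pandas.DataFrame({'name_match': [1] * len(chunk)}, index=chunk.index)

def parallel_extract(candidate_pairs, chunk_size=1000, workers=4):
    """Split the candidate pairs into chunks and process them in parallel."""
    chunks = [
        candidate_pairs[start:start + chunk_size]
        for start in range(0, len(candidate_pairs), chunk_size)
    ]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return pandas.concat(pool.map(extract_features, chunks))
</syntaxhighlight>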

    Importer

    • The IMDb importer is ready;
    • handled connection issues with the target database engine;
    • made the expensive URL resolution functionality optional;
    • fixed a problem causing the MusicBrainz import to fail;
    • improved logging of the MusicBrainz dump extractor;
    • added batch insert functionality;
    • added import progress tracker;
    • extra logging for the Discogs dump extractor;
    • enabled bulk insertion via the SQLAlchemy Python library.

    Ingestor

    • Uploaded a 1% sample of the Twitter linker output to Wikidata;
    • filtered the dataset on confident links;
    • resolved Twitter UIDs against usernames.

    March 2019


    This was a crucial month. In a nutshell:

    1. the probabilistic linker workflow is in place;
    2. we successfully ran it over complete imports of the target catalogs;
    3. we uploaded samples of the linkers that performed best to Wikidata;
    4. we produced the following evaluation reports for the Naïve Bayes (NB) and Support Vector Machines (SVM) algorithms.

    Linker

    • Removed empty tokens from full-text index query;
    • prevented positive samples indices from being empty;
    • implemented a feature for dates;
    • implemented a feature based on similar names;
    • implemented a feature for occupations:
      • gathered specific statements from Wikidata;
      • ensured that occupation statements are only gathered when needed;
      • enabled comparison in the whole occupation classes tree;
    • handled missing target data;
    • avoided computing features when Wikidata or target DataFrame columns are not available;
    • built blocking via full-text query over the whole Wikidata dataset;
    • built full index of positive samples;
    • simplified probabilistic workflow;
    • checked whether relevant Wikidata or target DataFrame columns exist before adding the corresponding feature;
    • k-fold evaluation;
    • ensured to pick a model file based on the supported classifiers;
    • filtered duplicate predictions;
    • first working version of the SVM linker;
    • avoided stringifying list values.

    Importer

    • Fixed an issue that caused the failure of the IMDb import pipeline;
    • parallelized URL validation;
    • prevented the import of unnecessary occupations in IMDb;
    • occupations that are already expressed in the import table name do not get imported;
    • decompressed Discogs and MusicBrainz dumps are now deleted after a successful import;
    • avoided populating tables when a MusicBrainz entity type is unknown.

    Miscellanea

    • Optimized full-text index queries;
    • the perfect name match baseline now runs bulk queries;
    • set up the Wikidata API login with the project bot;
    • progress bars do not disappear anymore.

    April 2019


    After introducing 2 machine learning algorithms, i.e., naïve Bayes and support vector machines, this month we brought neural networks into focus.

    The major outcome is a complete run of all linkers over the whole datasets; the evaluation results are available at https://github.com/Wikidata/soweego/wiki/Linkers-evaluation.

    Linker

    • Decided not to handle QIDs with multiple positive samples;
    • added feature that captures full names;
    • added an optional post-classification rule that filters out matches with different names;
    • injected an SVM linker based on libsvm instead of liblinear: this allows the use of non-linear kernels at the cost of higher training time;
    • first implementation of a single-layer perceptron;
    • added a set of Keras callbacks;
    • ensured a training/validation set split when training neural networks;
    • incorporated early stopping at training time of neural networks;
    • implemented a rule to enforce links of MusicBrainz entities that already have a Wikidata URL;
    • built a stopword list for bands;
    • enabled cache of complete training & classification sets for faster prototyping;
    • constructed facilities for hyperparameter tuning through grid search, available at evaluation and optionally at training time;
    • experimented with multi-layer perceptron architectures.
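
    A minimal sketch of a single-layer perceptron with early stopping and a training/validation split, in the spirit of the work above; the feature matrix and hyperparameters below are illustrative, not the tuned ones.

<syntaxhighlight lang="python">
import numpy
from keras.callbacks import EarlyStopping
from keras.layers import Dense
from keras.models import Sequential

# Illustrative data: 1,000 candidate pairs described by 8 features
features = numpy.random.rand(1000, 8)
labels = numpy.random.randint(2, size=1000)

model = Sequential()
# Single layer mapping the feature vector to a match probability
model.add(Dense(1, input_dim=8, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3)
model.fit(
    features,
    labels,
    validation_split=0.2,  # training/validation split
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping],
)
print(model.predict(features[:5]))
</syntaxhighlight>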

    Importer

    • Fixed a misleading log message when importing MusicBrainz relationships.

    May 2019


    This month was pretty packed. The team's work revolved around 3 main activities:

    1. development of new linkers for musical and audiovisual works;
    2. refactoring & documentation of the code base;
    3. facility to upload medium-confidence results to the Mix'n'match tool.

    New linkers

    • imported Discogs masters;
    • imported IMDb titles;
    • imported MusicBrainz releases;
    • implemented the musical work linker;
    • implemented the audiovisual work linker.

    Refactor & document

    • Code style:
      • format with black;[21]
      • remove unused imports & variables with autoflake;[22]
      • apply relevant suggestions from pylint;[23]
    • refactored & documented the pipeline;
    • refactored & documented the importer module;
    • refactored & documented the ingestor module.

    Mix'n'match client

    • Interacted with the project advisor, who is also the maintainer of the tool;
    • added ORM entities for mix'n'match catalog and entry DB tables.

    Linker

    • Added string kernels as a feature for names;
    • completed the multi-layer perceptron;
    • handled too many concurrent SPARQL queries being sent when gathering the tree of occupation QIDs;
    • fixed parallelized full-text blocking, which made IMDb crash.

    Importer

    • Avoid populating DB tables when a Discogs entity type is unknown.

    Ingestor

    • Populate statements connecting works with people.

    Continuous integration

    • Set up Travis;[24]
    • added build badge to the README;
    • let Travis push formatted code.

    June 2019


    The final month was totally devoted to 5 major tasks:

    1. deployment of soweego in production;
    2. upload of results;
    3. documentation;
    4. code style;
    5. refactoring.

    Production deployment

    • Set up the production-ready Wikimedia Cloud VPS machine;[25]
    • dry-ran and monitored production-ready pipelines for each target catalog;
    • structured the output folder tree;
    • decided confidence score thresholds;
    • the pipeline script now backs up the output folder of the previous run;
    • avoided interactive login to the Wikidata Web API;
    • enabled the extraction of Wikidata URLs available in target catalogs;
    • set up scripts for cron jobs.

    Results upload

    • Confident links (i.e., with score above 0.8) are being uploaded to Wikidata via d:User:Soweego bot;
    • medium-confidence (i.e., with score between 0.5 and 0.8) links are being uploaded to Mix'n'match for curation by the community.
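
    For illustration, a minimal sketch of how these confidence thresholds split the linker output; the record structure is hypothetical.

<syntaxhighlight lang="python">
WIKIDATA_THRESHOLD = 0.8
MIX_N_MATCH_THRESHOLD = 0.5

def route(links):
    """Split scored links into Wikidata uploads and Mix'n'match curation."""
    to_wikidata, to_mix_n_match = [], []
    for link in links:  # link = (QID, target ID, confidence score)
        _, _, score = link
        if score > WIKIDATA_THRESHOLD:
            to_wikidata.append(link)
        elif score >= MIX_N_MATCH_THRESHOLD:
            to_mix_n_match.append(link)
    return to_wikidata, to_mix_n_match
</syntaxhighlight>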

    Documentation

    • Added Sphinx-compliant[26] documentation strings to all public functions and classes;
    • complied with PEP 257[27] and PEP 287;[28]
    • converted and uplifted relevant pages of the GitHub Wiki into Python documentation;
    • customized the look of the documentation theme;
    • deployed the documentation to Read the Docs;[29]
    • completed the validator module;
    • completed the Wikidata module;
    • completed the linker module: this activity required extra efforts, since it is soweego's core;
    • main command line documentation;
    • full README.
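
    For example, a documentation string in the style adopted across the code base (Sphinx, PEP 257 and PEP 287 reStructuredText fields, plus the type hints mentioned in the next section); the function itself is a hypothetical illustration, not an actual soweego function.

<syntaxhighlight lang="python">
def perfect_name_match(wikidata_name: str, target_name: str) -> bool:
    """Check whether a Wikidata label and a target catalog name match exactly.

    This is a hypothetical example of the documentation style adopted
    throughout the code base.

    :param wikidata_name: label of a Wikidata item
    :param target_name: name of a target catalog entity
    :return: True if the names are identical after lowercasing
    """
    return wikidata_name.lower() == target_name.lower()
</syntaxhighlight>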

    Code style

    • Complied with PEP 8[30] and Wikimedia[31] conventions;
    • added type hints[32] to public function signatures.

    Refactoring

    • fixed pylint errors and relevant warnings;
    • reduced code complexity;
    • applied relevant pylint refactoring suggestions.
    1. Grants:Project/Hjfocs/soweego#Work_package
    2. Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability
    3. Select datatype set to ExternalId, Used for class set to human Q5
    4. https://github.com/MaxFrax/Evaluation
    5. https://tools.wmflabs.org/mix-n-match/
    6. http://magnusmanske.de/wordpress/?p=471
    7. http://magnusmanske.de/wordpress/?p=478
    8. Grants_talk:Project/Hjfocs/soweego#Coverage_statistics
    9. https://iswc2017.semanticweb.org/wp-content/uploads/papers/MainProceedings/441.pdf
    10. https://tools.wmflabs.org/soweego/MaxFrax96_BSc_thesis.pdf
    11. Grants_talk:Project/Hjfocs/soweego#Round_2_2017_decision
    12. https://github.com/Wikidata/soweego/issues/19#issuecomment-413622924
    13. Grants_talk:Project/Hjfocs/soweego#Target_databases_scalability
    14. https://etherpad.wikimedia.org/p/WikiCite18Day3sparql
    15. en:Record_linkage
    16. http://axon.cs.byu.edu/~randy/pubs/wilson.ijcnn2011.beyondprl.pdf
    17. https://recordlinkage.readthedocs.io
    18. https://github.com/Wikidata/soweego/wiki/Notes-on-the-recordlinkage-Python-library
    19. https://github.com/Wikidata/soweego/issues/146
    20. https://pandas.pydata.org/
    21. https://black.readthedocs.io/
    22. https://pypi.org/project/autoflake/
    23. https://pylint.readthedocs.io/
    24. https://travis-ci.com/
    25. https://tools.wmflabs.org/openstack-browser/project/soweego
    26. https://www.sphinx-doc.org/
    27. https://www.python.org/dev/peps/pep-0257/
    28. https://www.python.org/dev/peps/pep-0287/
    29. https://soweego.readthedocs.io/
    30. https://www.python.org/dev/peps/pep-0008/
    31. https://www.mediawiki.org/wiki/Manual:Coding_conventions/Python
    32. https://docs.python.org/3/library/typing.html