Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets
problem: four layers of complexity
- identity check / check for ambiguity -- Are we talking about the same entity? Names alone are ambiguous: e.g., Hamburg is a city in Germany, but there are also 24 cities/places named Hamburg in the United States, 3 other places elsewhere in the world, 3 entries of Hamburg linked to animals and plants, 7 entries for ships, and 1 for the Hamburg port. However, they all have separate wiki entries.
- fixed vs. varying property -- Some properties vary depending on nationality (e.g., release dates for different countries) or point in time (e.g., population count).
- reference -- Depending on the entity's identity check and the property's fixed or varying state the reference might vary. Also, for some targets no query-able online reference might be available.
- normalization / conversion of values -- Depending on language/nationality of the article some properties have varying units (e.g., currency, metric vs imperial system, ...).
Keep in mind:
- Properties (targets) can be divided into object properties (e.g., employer, NBA player birthplace, name of company CEO) and data properties (population count, NBA player height, release year for album, single, or game)
- Data properties belong to one of the XML data types; depending on whether, e.g., a double precision or a float type is chosen, the last digits of a number might differ
- Data properties that were measured or calculated depend by default on the methodology used
- Different references might use different methodologies (e.g., data acquisition for the US census changes with the years)
- Dates and times depend on the location on the globe and its respective time zone (e.g., Tokyo is 16 to 17 hours ahead of LA, depending on daylight saving time). In the worst-case scenario, the date can differ by 1 day if the time zone or location is not specified. Best: use UTC, which provides some means of normalization.
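The time zone point can be illustrated with a minimal Python sketch. It assumes fixed UTC offsets (UTC+9 for Tokyo, UTC-7 for Los Angeles in summer) instead of full time zone rules, and the timestamp is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets instead of full time zone rules.
TOKYO = timezone(timedelta(hours=9))         # JST, no daylight saving time
LOS_ANGELES = timezone(timedelta(hours=-7))  # PDT (summer); PST would be -8


def calendar_dates(moment):
    """Return the local calendar date of one instant in Tokyo, LA, and UTC."""
    return {
        "tokyo": moment.astimezone(TOKYO).date().isoformat(),
        "los_angeles": moment.astimezone(LOS_ANGELES).date().isoformat(),
        "utc": moment.astimezone(timezone.utc).date().isoformat(),
    }


# Hypothetical event: 08:00 in Tokyo on 12 July 2019.
event = datetime(2019, 7, 12, 8, 0, tzinfo=TOKYO)
dates = calendar_dates(event)
print(dates)
```

The same instant falls on 12 July in Tokyo but on 11 July in Los Angeles and in UTC, so a date-valued property given without a time zone can legitimately differ by one day between sources.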
Interest from the Wikiverse:
- Wikidata users are concerned with 1 property across all articles
- Wikipedia users are concerned with all properties of 1 article
Solution: start with easy sync targets
NBA Players
- ambiguity
- clearly defined group of people
- Currently no active players with same name
- BUT there used to be e.g. 4 Charles Smiths (each has an individual English Wikipedia page, though)
- property variability:
- team and number can change due to trades and free agency
- career highlights and awards can change, become more over time
- Stats - change after every game and are thus only provided for retired players
- (trade deadline in February; free agency and the new league year start in July)
- after each NBA draft (around June) there will be new players
- reference:
- https://www.nba.com/players/ - only active players, though
- using the Google structured data tool: no data available/detected
- https://www.basketball-reference.com/ - Eng. wiki uses it for stats
- https://en.hispanosnba.com/
- according to the Google structured data tool, data can be extracted
- you can see all active players
- normalization issues:
- height and weight - metric vs imperial system
- choice of XML data type can have an impact on the exact height or weight (dependency on the reference and its chosen data type)
- to check for completeness:
- all active players (30 teams - 450)
- all players of the eastern or western conference (225)
- all players in one division (Atlantic, Central, Southeast, Northwest, Pacific, Southwest - each division has 5 teams - 75)
- all players of one team (15)
- GOOD EXAMPLE: Al-Farouq Aminu - discrepancies in height, weight, place of birth
- NBA.com: 6ft9, 2.06m / 220lbs, 99.8kg, no info on place of birth
- basketball-reference.com: 6-9, 206cm / 220lbs, 99kg, Atlanta, Georgia
- birthplace - en, fr, WD: Atlanta, Georgia
- birthplace de, es: Stone Mountain, Georgia
language | weight | in FCF? |
---|---|---|
en | 220lbs / 100kg | yes |
fr | 220lbs / 100kg | no |
es | 216lbs / 98kg | no |
it | 100kg | no |
pt | 220lbs / 98kg | yes |
pl | 98kg | no |
WD | 100kg | yes |
- QUESTIONS: @Mrvnhfr: @JohannesFre:
- Why is there no entry in the FCF for fr, es, pl, it?
- Where are the entries for lbs?
- Why are there two different predicates for weight?
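The weight discrepancies in the table above can be checked mechanically once both units are converted to a common one. The following Python sketch is only an illustration: the function names and the 1 kg tolerance (chosen to allow for rounding or truncation to whole kilograms) are assumptions, not part of the sync tooling.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound
INCH_TO_CM = 2.54      # exact definition of the inch


def lbs_to_kg(lbs):
    return lbs * LB_TO_KG


def feet_inches_to_cm(feet, inches):
    return (feet * 12 + inches) * INCH_TO_CM


def weights_consistent(lbs, kg, tolerance_kg=1.0):
    """True if both values describe the same weight within the tolerance."""
    return abs(lbs_to_kg(lbs) - kg) <= tolerance_kg


# (lbs, kg) pairs from the table above, for the editions that give both units.
infobox_weights = {"en": (220, 100), "fr": (220, 100), "es": (216, 98), "pt": (220, 98)}

inconsistent = [lang for lang, (lbs, kg) in infobox_weights.items()
                if not weights_consistent(lbs, kg)]
print(inconsistent)                     # editions whose two units disagree
print(round(feet_inches_to_cm(6, 9)))  # 6ft9 expressed in whole centimetres
```

With these inputs only the pt pair is flagged: 220 lbs is about 99.8 kg, whereas 98 kg corresponds to roughly 216 lbs.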
Video Games
E.g., DOOM3
- ambiguity in the broader sense (DOOM (the franchise) vs DOOM (the 1993 video game) vs DOOM (the 2016 video game) vs DOOM (the 2005 movie)) - in Wikipedia/Wikidata entities can be singled out clearly, but needs to be checked for other sources
- property variability:
- different publishers for different regions and platforms
- different composers for different platforms
- different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs EU)
- reference: multitude of different sources
- https://www.gameinformer.com/ - but no relevant info extractable via GSDT
- https://www.gamereactor.de/ - but no relevant info extractable via GSDT
- normalization: no issues
- note: the infobox refers to the respective game, but oftentimes the article also includes the whole game series and its impact, often in the abstract already
Films
- ambiguity:
- potential issues: remakes
- property variability:
- release date for individual countries
- revenue depends on date
- different dubbing actors for releases in different countries
- reference:
- IMDb
- https://www.boxofficemojo.com/ has revenue, but according to GSDT no data available
- normalization:
- potential for mixup of currency for budget and revenue
- Choice of XML data type and precision can have an impact on the exact budget and revenue (dependency on the reference and its chosen data type) - e.g., $463.4 million vs $463,406,268
- notes: infoboxes can be extensive, including cast, dubbing actors, individual release dates depending on country/language
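To compare a rounded and an exact revenue figure such as the two values in the normalization note above, both can be parsed into a decimal amount and compared at the coarser precision. This is a minimal Python sketch; the function names, the accepted input formats (comma as thousands separator), and the choice of rounding unit are assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

MULTIPLIERS = {
    "thousand": Decimal(10) ** 3,
    "million": Decimal(10) ** 6,
    "billion": Decimal(10) ** 9,
}


def parse_usd(text):
    """Parse strings such as '$463.4 million' or '$463,406,268' into a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    parts = cleaned.split()
    amount = Decimal(parts[0])
    if len(parts) > 1:
        amount *= MULTIPLIERS[parts[1].lower()]
    return amount


def equal_at_precision(a, b, unit):
    """Compare two amounts after rounding both to whole multiples of `unit`."""
    def rounded(value):
        return (value / unit).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    return rounded(a) == rounded(b)


rounded_value = parse_usd("$463.4 million")
exact_value = parse_usd("$463,406,268")

print(equal_at_precision(rounded_value, exact_value, Decimal(100000)))  # 0.1 million steps
print(equal_at_precision(rounded_value, exact_value, Decimal(1)))       # whole dollars
```

At a precision of 0.1 million the two values agree, while at whole dollars they do not, so a sync check needs to know the precision of the coarser source before reporting a conflict.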
IMDB
IMDB publishes very good data in the HTML using JSON-LD, example:
Rambo: Last Blood, see IMDB data
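As a rough illustration of how JSON-LD embedded in an HTML page can be read, the following Python sketch collects all `application/ld+json` script blocks using only the standard library. The class name, the function name, and the sample HTML are hypothetical; the sample is a hand-written stand-in and not actual IMDB output.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_json_ld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page content, made up for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Movie",
 "name": "Example Film", "datePublished": "2019-01-01"}
</script>
</head><body></body></html>
"""

items = extract_json_ld(SAMPLE_HTML)
print(items[0]["@type"], items[0]["name"], items[0]["datePublished"])
```

Which properties a real page exposes (and under which schema.org types) would have to be checked per source.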
More difficult / complex sync targets:
Music albums
- ambiguity:
- "High Voltage" by AC/DC - 1975 (Australia) vs 1976 (international)
- German wiki: only 1 page for both albums (with two infoboxes) https://de.wikipedia.org/wiki/High_Voltage - links to the 1976 album in English and French
- English wiki: https://en.wikipedia.org/wiki/High_Voltage_(1975_album) & https://en.wikipedia.org/wiki/High_Voltage_(1976_album)
- French wiki: https://fr.wikipedia.org/wiki/High_Voltage_(album_australien) & https://fr.wikipedia.org/wiki/High_Voltage
- also: https://en.wikipedia.org/wiki/High_Voltage_(song)
- property variability:
- release date might vary depending on country [[1]]
- references: https://musicbrainz.org/
- normalization: no issues
Music singles
- ambiguity: singles with same name but different artists
- "Stairway to Heaven" https://en.wikipedia.org/wiki/Stairway_to_Heaven_(disambiguation), but clearly separated
- "God is a DJ" by Faithless and by Pink, but clearly separated
- ambiguity within the pages, which refer to the song, but the infobox describes the single, and within the abstract "song" and "single" are mentioned interchangeably
- BUT: "Every Single is a Song but not every Song is a Single." - https://www.quora.com/Whats-the-difference-between-a-song-and-a-single
- "Boys don't cry" - song by multiple artists - https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song) is page for the song by Moulin Rouge but with infobox from song by Wink, and it's linked to other language wikis also referring to the Wink song
- property variability:
- release date might vary depending on country [[2]]
- e.g., "The Real Slim Shady" from "The Marshall Mathers LP": April 15, 2000 (English wiki) vs May 16, 2000 (German wiki), but no reference to a country
- music videos for different countries (e.g., UK vs US version, https://en.wikipedia.org/wiki/Want_U_Back)
- references: MusicBrainz
- normalization: no issues
- notes:
Cloud Types
- ambiguity: clearly defined topic, no ambiguities
- property variability: no issue
- reference: no easily accessible data formats
- normalization: metric vs imperial system
- notes:
- differentiation between genera, species, varieties, supplementary features and accessory clouds - once you go further than the 10 genera it becomes quite complex
- most languages don't have an infobox (e.g. German), French: infobox is editable using visual editor, code, or wikidata
- how do we compare the cloud symbols?
- wiki-project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Meteorology
- using GFS Data Browser it seems that mappings need to be improved (only "type", "label", and "see also" are available)
Cars
- e.g. Volkswagen Polo https://en.wikipedia.org/wiki/Volkswagen_Polo
- ambiguity: the car has had various models over the years. The wiki page "VW Polo" describes the car type in general and its various models, and the infobox describes the car type VW Polo (not a certain model). Some of the models also have individual wiki pages, so it seems that there is no ambiguity.
- property variability: no issues
- references:
- normalization: metric vs imperial system for car dimensions and weight for some car models
- e.g. BMW M5:
- ambiguity:
- the German wiki page https://de.wikipedia.org/wiki/BMW_M5 has multiple infoboxes for the general car model and for its various generations, all in one page
- same with the English page - How does the extraction work if there are multiple infoboxes?
- property variability: no issues
- references:
- normalization: metric vs imperial system for car dimensions and weight for some car models
Organizations / Companies
- e.g., BMW, IBM, Bank of America, Bayer
- ambiguity: no issues, companies and e.g., their product are clearly separated
- I beg to differ: company disambiguation can be a hard problem, depending on how much data is present and what are the source datasets. Specifically, official trade registers are often polluted by small/ inactive/ variant entities and it's not easy to distinguish the interesting/ large entities amongst the total.
- Disambiguating between companies, divisions, products, and companies after significant events (merger/acquisition, renaming, reorganization) is also not so easy.
- Eg many people assume a stock ticker is a fairly unique identifier. But it can be sold together with the exchange seat to another company... --Vladimir Alexiev (talk) 06:39, 22 August 2019 (UTC)
- Hi @Vladimir Alexiev:, thank you for your input! We're currently looking deeper into individual sync targets and any advice or information is very much appreciated. Tina Schmeissner (talk) 08:37, 22 August 2019 (UTC)
- property variability: production output, revenue and properties referring to monetary value depend on a point in time
- references: no database to harvest information about organizations or companies
- normalization: potential for mixup of currencies
Relevant: https://www.wikidata.org/wiki/Wikidata:WikiProject_Companies
cities - property: population count
- ambiguity
- cities themselves are easy to link by name only, with only very few duplicates for notable cities within one country
- for pop. count it is important to know the area (inner/outer city+/-surrounding areas) - different counts could be accurate depending on reference area
- who is being counted (citizens, temporary workers, refugees)
- property variability:
- time dependence
- dependent on measurement methodology (and thus on reference)
- reference:
- United States Census - according to Google structured data tool data can be extracted, e.g. for Baltimore city
- normalization issues:
- population density #/km2 or #/sq_mi
- notes:
- to check for completeness:
- 206 sovereign states
- cities of 1 country
- e.g. Grimma:
- property: population - in the German Wikipedia https://de.wikipedia.org/wiki/Grimma data for this property is derived from a PDF document storing the key for this municipality (the key is given in the infobox template), therefore we are not able to derive any information from the German Wiki page for this property
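The density normalization mentioned above (#/km2 vs #/sq_mi) comes down to a fixed conversion factor. A minimal Python sketch, with hypothetical function names:

```python
# 1 sq mi = (1.609344 km)^2, which is exact by definition of the mile.
KM2_PER_SQ_MI = 2.589988110336


def per_sq_mi_to_per_km2(density_per_sq_mi):
    return density_per_sq_mi / KM2_PER_SQ_MI


def per_km2_to_per_sq_mi(density_per_km2):
    return density_per_km2 * KM2_PER_SQ_MI


def density(population, area_km2):
    """Population density in inhabitants per km2."""
    return population / area_km2


# 1000 inhabitants per km2 expressed per square mile.
print(per_km2_to_per_sq_mi(1000))
```

The conversion only removes the unit difference; differing reference areas or counting methodologies, as noted above, still lead to different but equally valid values.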
Employer
- WD description: person or organization for which the subject works or worked
- ambiguity:
- no...?
- property variability:
- interlinked with employment period
- previous employments should be fixed
- current employment: start date - present
- ideally functional property with respect to time, BUT there are people that work two jobs and therefore have 2 employers at the same time!
- reference:
- LinkedIn - Structured Data Testing Tool shows multiple employers (e.g., Sebastian works for DBpedia Association, AKSW/Kilt and University of Leipzig)
- normalization issues:
- no units, no currency
- do names of employers vary depending on language or country?
- intensional vs extensional context, especially with subsidiary companies (e.g., "Sebastian works for InfAI" vs "Sebastian works for DBpedia": extensional context where the DBpedia Association is affiliated with InfAI)
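The observation that employer is not strictly functional with respect to time can be made concrete by looking for overlapping employment periods. The following Python sketch uses a made-up employment history; the tuple layout `(employer, start, end)` with `end=None` meaning "present" is an assumption for illustration.

```python
from datetime import date


def overlaps(a, b):
    """True if two (employer, start, end) periods overlap; end=None means 'present'."""
    a_end = a[2] or date.max
    b_end = b[2] or date.max
    return a[1] <= b_end and b[1] <= a_end


def concurrent_employers(periods):
    """Return pairs of employers whose employment periods overlap in time."""
    pairs = []
    for i, first in enumerate(periods):
        for second in periods[i + 1:]:
            if overlaps(first, second):
                pairs.append((first[0], second[0]))
    return pairs


# Hypothetical employment history; names and dates are made up.
history = [
    ("Employer A", date(2010, 1, 1), date(2014, 12, 31)),
    ("Employer B", date(2015, 1, 1), None),
    ("Employer C", date(2017, 6, 1), None),
]

print(concurrent_employers(history))
```

A non-empty result means that a sync check must accept several simultaneous values for the property instead of treating them as a conflict.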
Geo Coordinates
- Ambiguity: no
- Property variability:
- pos#lat for Leipzig
- pos#long for Leipzig
- Accuracy of GPS: ~10m (4 decimal places)
- Precision = the number of decimal places
- Accuracy = conformance with reality
- GPS-enabled smartphones are accurate to within a ~5m radius (under ideal conditions) https://www.gps.gov/systems/gps/performance/accuracy/
- Reference: GeoNames - All lat/long coordinates in WGS84 (World Geodetic System 1984)
- Normalization:
- Choice of precision (number of digits)
- Choice of XML data type can have an impact on the exact coordinate value (dependency on the reference and its chosen data type), e.g., 12.37475 vs 12.3748 vs 12.374722222222223
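The three longitude values above can be compared with an explicit tolerance instead of exact equality. The Python sketch below also shows how storing a value as a 32-bit float changes its last digits. The tolerance of 1e-4 degrees (roughly the ~10m mentioned above) and the reading of the third value as a degrees/minutes/seconds conversion (12°22'29") are assumptions.

```python
import struct


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def as_float32(value):
    """Round-trip a Python float (64-bit double) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def agree(a, b, tolerance_deg=1e-4):
    """True if two coordinates differ by no more than the tolerance."""
    return abs(a - b) <= tolerance_deg


values = [12.37475, 12.3748, 12.374722222222223]

# All three values agree at 4 decimal places of tolerance ...
print(all(agree(a, b) for a in values for b in values))
# ... but not all of them agree at 5 decimal places.
print(all(agree(a, b, 1e-5) for a in values for b in values))
# The same number stored as float32 differs in its last digits.
print(values[2], as_float32(values[2]))
print(dms_to_decimal(12, 22, 29))
```

Comparing at the precision of the least precise source avoids reporting differences that come only from rounding or from the chosen data type.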
Nobel Prize
- ambiguity:
- property variability:
- reference:
- normalization issues:
Difficult sync targets (due to ambiguity issues)
- So far I could not find a target group that would present major issues, especially with respect to ambiguity.
- If you move away from these concrete and materialistic targets to more theoretical topics such as mathematics (or mathematical equations), linguistics, or natural sciences as a target, which might have a bigger potential for ambiguity, you find that the wiki pages don't have infoboxes anymore. (Some have nav-boxes or sidebars for orientation/overview.)
What we've learned so far
- there are wiki pages that have multiple infoboxes for some languages (especially for cars)
- reference extraction can extract information from pages with multiple infoboxes
- fact extraction can also extract information from pages with multiple infoboxes, but during the fusion this will all be handled as 1 infobox which will cause problems (e.g., two different release dates and fusion might pick the older one)
- there are instances when information is not stored directly in the infobox but via some kind of sub-routine, which renders the information non-extractable for us (see Grimma example)
Tina Schmeissner (talk) 10:47, 12 July 2019 (UTC)