Grants:Project/Information extraction and replacement/Intuition on templates
An intuition for template filling goes like this. Assume that we analyze articles on lakes in Oppland for elevation; we would then get something like the following table. This is just an example: in real code we would use the special page "What links here" on d:elevation above sea level, and then limit the set to a specific type of instance. For this example it is simpler to use the category. Note that some elevations are not linked, because they do not exist in Wikidata.
Lake | Initial constituents (prefix) | Value | Unit | Trailing constituents (postfix) |
---|---|---|---|---|
Aursjoen | It lies at an elevation of | 1,098 | m | above sea level. |
Aursjøen | The 36.38-square-kilometre (14.05 sq mi) lake sits at an elevation of | 856 | metres | (2,808 ft) above sea level and is about 70.67 kilometres (43.91 mi) around. |
Bygdin | Bygdin is regulated and its normal level lies between | 1,048 and 1,057 | meters | above sea level. |
Dokkfløyvatn | It lies at an elevation of | 735 | m | above sea level |
Einavatnet | It lies at an elevation of | 398 | metres | (1,306 ft) above sea level. |
Helin | It is located at | 870 | m | above the sea, and has a volume of 18.6 million m³. |
Langvatnet | It has an area of 0.3505 square kilometers (0.1353 sq mi) and is located at | 1,422 | meters | (4,665 ft) above mean sea level. |
Losna | It lies | 181 | m | above sea level. |
Mjøsa | It is 365 km² in area and its volume is estimated at 56 km³; normally its surface is | 123 | metres | above sea level, and its greatest depth is 468 metres. |
Nedre Heimdalsvatn | It lies at an elevation of | 1,053 | m | above sea level. |
Prestesteinsvatnet | The 4.12-square-kilometre (1,020-acre) lake sits at an elevation of | 1,357 | metres | (4,452 ft) above sea level. |
Randsfjorden | The lake is | 135 | metres | (443 ft) above sea level. |
Rauddalsvatn | It lies at an elevation of | 916 | m | above sea level. |
Sandvatnet/Kaldfjorden/Øyvatnet | It is at an elevation of | 1,019 | m | above sea level. |
Slidrefjord | It is at an elevation of | 366 | m | above sea level. |
Steinbusjøen | It lies at an elevation of | 1,211 | m | above sea level. |
Strondafjorden | It lies at an elevation of | 355 | m | above sea level. |
Tisleifjorden | It has an elevation of | 819 | m | above sea level. |
Tyin | The lake serves as a reservoir for Tyin kraftverk and the water level is regulated between | 1082.84 and 1072.50 | m | above sea level. |
Vågåvatn | The lake is | 362 | meters | above sea level and has a surface area of 14.76 km², making it one of the 200 largest lakes in Norway. |
Vangsmjøse | It is at an elevation of | 466 | m | above sea level. |
Vinstre | It is at an elevation of | 1,032 | m | above sea level. |
By inspection we find some repeated prefix constituents, «The lake is», «is located at», and «at an elevation of», and a repeated postfix constituent, «above sea level». The value can be a single value or a combined value, that is, a list. The units also vary: «m», «meters», and «metres».
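The split into prefix, value, unit, and postfix can be sketched with a regular expression. This is only a minimal illustration, not the project's code: the pattern, the unit list, and the function name `split_sentence` are assumptions made for this example.

```python
import re

# Hypothetical pattern: a number such as "1,098" or "1082.84",
# optionally followed by "and" plus a second number (a combined value).
NUMBER = r"\d[\d,]*(?:\.\d+)?"
PATTERN = re.compile(
    rf"^(?P<prefix>.*?)\s*"
    rf"(?P<value>{NUMBER}(?:\s+and\s+{NUMBER})?)\s*"
    rf"(?P<unit>metres|meters|m)\b\s*"
    rf"(?P<postfix>.*)$"
)


def split_sentence(sentence):
    """Split a sentence around the first number that is followed by a length unit.

    Returns a dict with the keys prefix, value, unit and postfix,
    or None if no such number is found.
    """
    match = PATTERN.match(sentence)
    return match.groupdict() if match else None


parts = split_sentence("It lies at an elevation of 1,098 m above sea level.")
print(parts)
```

Because the prefix is matched lazily, numbers that are not followed by one of the listed units, such as an area in square kilometres, end up in the prefix rather than in the value.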
It is possible to estimate probabilities for observing the constituents in specific relative positions around the value, and thereby to estimate probabilities for observing the overall patterns on external pages. With this we can search inside pages found on the net. It is not difficult to find pages mentioning w:Bygdin, but weeding out the pages that actually mention the elevation is more difficult, or at least very time-consuming.
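One simple way to estimate such probabilities is to count how often each constituent appears directly before the value and divide by the number of observed sentences. The sketch below does this for prefixes only. The estimator (plain relative frequency), the window of three words, and the function names are assumptions for illustration; the text does not specify them.

```python
from collections import Counter


def trailing_words(phrase, n):
    """Return the last n words of a phrase (the part closest to the value)."""
    return " ".join(phrase.split()[-n:])


def constituent_probabilities(prefixes, n=3):
    """Relative frequency of each trailing n-word constituent (assumed estimator)."""
    counts = Counter(trailing_words(prefix, n) for prefix in prefixes)
    total = sum(counts.values())
    return {constituent: count / total for constituent, count in counts.items()}


# A few prefixes taken from the table above.
prefixes = [
    "It lies at an elevation of",
    "It is at an elevation of",
    "It has an elevation of",
    "The lake is",
]
probabilities = constituent_probabilities(prefixes)
print(probabilities)
```

The same counting could be applied to the leading words of the postfix to get probabilities for constituents such as «above sea level».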
External pages are kept if they contain text fragments that satisfy a minimum probability; that is, one or more previously found constituents occur on the page, and the overall probability can then be calculated. Note that the actual value for the article in question is used together with the found constituents. Matching pages can then be ranked according to the probability, with the more likely pages on top. The user can then check the excerpt, or even open the page, to manually accept the page as a source. If it is accepted, a reference is made by mw:citoid and inserted at the end of the sentence that contains the factoid.
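The filtering and ranking step might look roughly like the sketch below. It assumes that constituent probabilities are already available as dictionaries, and it combines the prefix and postfix probabilities with a noisy-OR; that combination rule, the function names, and the sample pages are all assumptions made for illustration, not the method used by the project.

```python
import re


def score_page(text, value, prefix_probs, postfix_probs,
               units=("metres", "meters", "m")):
    """Score a page by the constituents found directly around the known value."""
    best = 0.0
    for match in re.finditer(re.escape(value), text):
        before = text[:match.start()].rstrip()
        after = text[match.end():].lstrip()
        # Drop the unit if one directly follows the value (longest unit first).
        for unit in sorted(units, key=len, reverse=True):
            if re.match(rf"{re.escape(unit)}\b", after):
                after = after[len(unit):].lstrip()
                break
        p_pre = max((p for c, p in prefix_probs.items()
                     if before.endswith(c)), default=0.0)
        p_post = max((p for c, p in postfix_probs.items()
                      if after.startswith(c)), default=0.0)
        # Noisy-OR combination of the two pieces of evidence (an assumption).
        best = max(best, 1 - (1 - p_pre) * (1 - p_post))
    return best


def rank_pages(pages, value, prefix_probs, postfix_probs, min_probability=0.5):
    """Return (page id, score) pairs above the threshold, most likely first."""
    scored = [(page_id, score_page(text, value, prefix_probs, postfix_probs))
              for page_id, text in pages.items()]
    kept = [item for item in scored if item[1] >= min_probability]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical probabilities and page excerpts, for illustration only.
prefix_probs = {"at an elevation of": 0.6, "The lake is": 0.1}
postfix_probs = {"above sea level": 0.9}
pages = {
    "a": "Bygdin is at an elevation of 1,048 m above sea level.",
    "b": "Bygdin had 1,048 visitors last year.",
    "c": "The lake is 1,048 metres deep.",
}
print(rank_pages(pages, "1,048", prefix_probs, postfix_probs))
```

Only page "a" passes the threshold here: page "b" contains the value but none of the constituents, and page "c" matches only a weak prefix.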
Notes
- It is not necessary to rewrite any text to insert the reference; it is only necessary to scan forward to the end of the sentence and insert the reference there.
- It will not be necessary to manually verify this insertion; only the insertion of the final references tag, if it is missing, needs to be verified manually.
- It might be necessary to add the reference to the statement at Wikidata, and then reimport the value with the reference.
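The first note, scanning forward to the end of the sentence and inserting the reference there, can be sketched as follows. The function name `insert_reference`, the rule that a full stop ends a sentence only when followed by whitespace or the end of the text, and the placeholder reference markup are assumptions for this example.

```python
import re

# A full stop counts as the end of a sentence only if it is followed by
# whitespace or the end of the text, so "18.6" is not split (assumed rule).
SENTENCE_END = re.compile(r"\.(?=\s|$)")


def insert_reference(text, value, reference):
    """Insert the reference after the sentence that contains the value."""
    start = text.find(value)
    if start == -1:
        return text  # value not present, leave the text untouched
    match = SENTENCE_END.search(text, start + len(value))
    if match is None:
        return text + reference  # no closing full stop, append at the end
    return text[:match.end()] + reference + text[match.end():]


text = ("Helin is a lake. It is located at 870 m above the sea, "
        "and has a volume of 18.6 million m³. It is regulated.")
print(insert_reference(text, "870", "<ref>example</ref>"))
```

The surrounding text is left unchanged; only the reference is spliced in after the closing full stop.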