Research:Language-Agnostic Topic Classification/Outlink model performance
This page provides details about the current (as of December 2020) model for Wikipedia topic classification based on outlinks (as represented by their corresponding Wikidata IDs):
- Code: https://github.com/geohci/wikipedia-language-agnostic-topic-classification
- Model architecture: multi-label fastText supervised model
- Training data: a 90% sample from every language edition of Wikipedia. In practice, English Wikipedia provides 11.4% of the data, followed by Cebuano (8.8%), Swedish (6.4%), German (4.6%), and French (4.2%); every other language contributes less than 4%.
- Epochs: 2
- Learning rate: 0.1
- Window size: 20
- Min count (under which QID is not retained in vocab): 20
- No pre-trained embeddings used
- Embeddings dimension: 50
- Total number of model params (excluding embeddings): 3,200 (50 x 64, i.e. embeddings dimension x number of topic labels)
- Vocab size: 4,145,064
- Total number of embeddings params: 207,253,200 (vocab size * embeddings dimension)
- Model size on disk: 863 MB
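As a rough illustration, the settings listed above can be written as a configuration and the parameter counts checked by hand. This is a minimal sketch, not the project's actual training script: the `train` helper and its `train_path` argument are hypothetical, `loss="ova"` (one-vs-all) is an assumption about how the multi-label objective was configured, and the 4-bytes-per-parameter figure assumes unquantized 32-bit floats.

```python
# Hyperparameters from the list above, keyed by the argument names that
# fastText's Python `train_supervised` function accepts.
FASTTEXT_ARGS = {
    "epoch": 2,      # Epochs
    "lr": 0.1,       # Learning rate
    "ws": 20,        # Window size
    "minCount": 20,  # QIDs seen fewer than 20 times are not kept in the vocab
    "dim": 50,       # Embeddings dimension
}

VOCAB_SIZE = 4_145_064  # from the list above
NUM_TOPICS = 64         # one label per row of the per-topic results table

# Embedding matrix: one `dim`-sized vector per QID in the vocabulary.
embedding_params = VOCAB_SIZE * FASTTEXT_ARGS["dim"]
# Output layer: one `dim`-sized vector per topic label.
output_params = FASTTEXT_ARGS["dim"] * NUM_TOPICS

# Assumption: every parameter is stored as a 4-byte float.
approx_bytes = (embedding_params + output_params) * 4

print(f"{embedding_params:,}")         # 207,253,200
print(f"{output_params:,}")            # 3,200
print(f"{approx_bytes / 1e6:.0f} MB")  # 829 MB


def train(train_path):
    """Hypothetical helper: train on a fastText-format file at `train_path`.

    Assumes each line holds `__label__<topic>` tokens followed by the
    article's outlink QIDs. `loss="ova"` is an assumption, not taken
    from this page.
    """
    import fasttext  # third-party; imported lazily so the arithmetic above runs without it

    return fasttext.train_supervised(input=train_path, loss="ova", **FASTTEXT_ARGS)
```

Under the float32 assumption, the raw weights come to about 829 MB, somewhat below the 863 MB reported on disk. This sketch does not account for the difference.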
Test Results
The test results below have a significant limitation: there is no labeled data for articles that do not have English Wikipedia equivalents. It is very possible that performance degrades for these articles (assuming they also have outlinks to different types of articles).
Overall results:
- Precision: 0.877 (micro); 0.836 (macro)
- Recall: 0.793 (micro); 0.678 (macro)
- F1: 0.830 (micro); 0.744 (macro)
- Average precision: 0.891 (micro); 0.795 (macro)
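The gap between the micro and macro figures comes from how the per-topic results are combined. The sketch below illustrates the two averaging schemes using the TP/FP/FN counts of three topics copied from the per-topic table on this page. Because it uses only three of the 64 topics, it will not reproduce the overall numbers above; the helper names are illustrative.

```python
# (TP, FP, FN) for three topics, copied from the per-topic table on this page.
COUNTS = {
    "biography": (621664, 49035, 55429),
    "sports": (315905, 10497, 18869),
    "education": (5774, 3312, 17549),
}


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Macro: compute the metric per topic, then take the unweighted mean,
# so a small, hard topic counts as much as a large, easy one.
macro_precision = sum(precision(tp, fp) for tp, fp, _ in COUNTS.values()) / len(COUNTS)
macro_recall = sum(recall(tp, fn) for tp, _, fn in COUNTS.values()) / len(COUNTS)

# Micro: pool the counts across topics, then compute the metric once,
# so large topics dominate the result.
total_tp = sum(tp for tp, _, _ in COUNTS.values())
total_fp = sum(fp for _, fp, _ in COUNTS.values())
total_fn = sum(fn for _, _, fn in COUNTS.values())
micro_precision = precision(total_tp, total_fp)
micro_recall = recall(total_tp, total_fn)

print(f"precision: {micro_precision:.3f} (micro); {macro_precision:.3f} (macro)")
print(f"recall:    {micro_recall:.3f} (micro); {macro_recall:.3f} (macro)")
```

In this subset the small, low-recall `education` topic pulls the macro averages well below the micro averages, which matches the direction of the gap in the overall results.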
For qualitative evaluations (of a slightly older model), see task T266201. Overall, precision was 94.7% in Arabic, 81.4% in Czech, 89.5% in English, 88.0% in French, and 91.3% in Vietnamese.
Topic | n | TP | FP | TN | FN | Precision | Recall | F1 | Avg. precision |
---|---|---|---|---|---|---|---|---|---|
europe | 783307 | 681220 | 90139 | 1544254 | 102087 | 0.883 | 0.870 | 0.876 | 0.952 |
biography | 677093 | 621664 | 49035 | 1691572 | 55429 | 0.927 | 0.918 | 0.922 | 0.971 |
stem | 480236 | 431709 | 24123 | 1913341 | 48527 | 0.947 | 0.899 | 0.922 | 0.973 |
asia | 360865 | 304097 | 35810 | 2021025 | 56768 | 0.895 | 0.843 | 0.868 | 0.933 |
sports | 334774 | 315905 | 10497 | 2072429 | 18869 | 0.968 | 0.944 | 0.956 | 0.980 |
media | 327692 | 281754 | 37092 | 2052916 | 45938 | 0.884 | 0.860 | 0.872 | 0.938 |
western-europe | 290047 | 243700 | 28115 | 2099538 | 46347 | 0.897 | 0.840 | 0.867 | 0.944 |
north-america | 282981 | 216595 | 40953 | 2093766 | 66386 | 0.841 | 0.765 | 0.801 | 0.895 |
biology | 247481 | 232241 | 6260 | 2163959 | 15240 | 0.974 | 0.938 | 0.956 | 0.985 |
geographical | 213947 | 145537 | 33160 | 2170593 | 68410 | 0.814 | 0.680 | 0.741 | 0.825 |
eastern-europe | 176961 | 149351 | 14603 | 2226136 | 27610 | 0.911 | 0.844 | 0.876 | 0.938 |
southern-europe | 172638 | 138080 | 19706 | 2225356 | 34558 | 0.875 | 0.800 | 0.836 | 0.913 |
northern-europe | 161724 | 118894 | 21354 | 2234622 | 42830 | 0.848 | 0.735 | 0.787 | 0.869 |
history | 142121 | 83506 | 20703 | 2254876 | 58615 | 0.801 | 0.588 | 0.678 | 0.757 |
music | 127211 | 110574 | 9764 | 2280725 | 16637 | 0.919 | 0.869 | 0.893 | 0.939 |
women | 113835 | 47597 | 20674 | 2283191 | 66238 | 0.697 | 0.418 | 0.523 | 0.612 |
films | 112002 | 90799 | 15972 | 2289726 | 21203 | 0.850 | 0.811 | 0.830 | 0.904 |
east-asia | 97601 | 80108 | 10181 | 2309918 | 17493 | 0.887 | 0.821 | 0.853 | 0.909 |
military-and-warfare | 103937 | 64034 | 15022 | 2298741 | 39903 | 0.810 | 0.616 | 0.700 | 0.774 |
politics-and-government | 95746 | 53169 | 15573 | 2306381 | 42577 | 0.773 | 0.555 | 0.646 | 0.724 |
west-asia | 91404 | 73965 | 8718 | 2317578 | 17439 | 0.895 | 0.809 | 0.850 | 0.914 |
philosophy-and-religion | 89395 | 51774 | 14468 | 2313837 | 37621 | 0.782 | 0.579 | 0.665 | 0.717 |
visual-arts | 86498 | 51830 | 14574 | 2316628 | 34668 | 0.781 | 0.599 | 0.678 | 0.744 |
transportation | 84279 | 71175 | 6045 | 2327376 | 13104 | 0.922 | 0.845 | 0.881 | 0.921 |
literature | 75739 | 41881 | 11840 | 2330121 | 33858 | 0.780 | 0.553 | 0.647 | 0.718 |
south-asia | 72071 | 60734 | 5569 | 2340060 | 11337 | 0.916 | 0.843 | 0.878 | 0.919 |
africa | 71004 | 49691 | 8820 | 2337876 | 21313 | 0.849 | 0.700 | 0.767 | 0.835 |
south-america | 61705 | 46888 | 7646 | 2348349 | 14817 | 0.860 | 0.760 | 0.807 | 0.877 |
north-asia | 63870 | 47955 | 7789 | 2346041 | 15915 | 0.860 | 0.751 | 0.802 | 0.871 |
oceania | 60421 | 46368 | 5013 | 2352266 | 14053 | 0.902 | 0.767 | 0.829 | 0.876 |
business-and-economics | 53360 | 25192 | 9361 | 2354979 | 28168 | 0.729 | 0.472 | 0.573 | 0.614 |
technology | 44378 | 25983 | 8303 | 2365019 | 18395 | 0.758 | 0.585 | 0.661 | 0.721 |
engineering | 45147 | 30979 | 4465 | 2368088 | 14168 | 0.874 | 0.686 | 0.769 | 0.820 |
architecture | 42920 | 22882 | 7432 | 2367348 | 20038 | 0.755 | 0.533 | 0.625 | 0.684 |
medicine-and-health | 41775 | 28278 | 4943 | 2370982 | 13497 | 0.851 | 0.677 | 0.754 | 0.815 |
earth-and-environment | 40405 | 26281 | 4867 | 2372428 | 14124 | 0.844 | 0.650 | 0.735 | 0.779 |
television | 38695 | 26503 | 5099 | 2373906 | 12192 | 0.839 | 0.685 | 0.754 | 0.806 |
society | 38816 | 10600 | 6410 | 2372474 | 28216 | 0.623 | 0.273 | 0.380 | 0.407 |
southeast-asia | 38135 | 27982 | 3772 | 2375793 | 10153 | 0.881 | 0.734 | 0.801 | 0.856 |
space | 36009 | 32368 | 1331 | 2380360 | 3641 | 0.961 | 0.899 | 0.929 | 0.960 |
linguistics | 32034 | 20520 | 2452 | 2383214 | 11514 | 0.893 | 0.641 | 0.746 | 0.772 |
computing | 25619 | 18487 | 3769 | 2388312 | 7132 | 0.831 | 0.722 | 0.772 | 0.837 |
central-america | 24075 | 14708 | 2659 | 2390966 | 9367 | 0.847 | 0.611 | 0.710 | 0.762 |
entertainment | 22751 | 9633 | 4024 | 2390925 | 13118 | 0.705 | 0.423 | 0.529 | 0.587 |
internet-culture | 23239 | 17083 | 2064 | 2392397 | 6156 | 0.892 | 0.735 | 0.806 | 0.872 |
education | 23323 | 5774 | 3312 | 2391065 | 17549 | 0.635 | 0.248 | 0.356 | 0.381 |
chemistry | 22115 | 16800 | 2614 | 2392971 | 5315 | 0.865 | 0.760 | 0.809 | 0.884 |
northern-africa | 20544 | 11847 | 3437 | 2393719 | 8697 | 0.775 | 0.577 | 0.661 | 0.706 |
food-and-drink | 19547 | 11808 | 2595 | 2395558 | 7739 | 0.820 | 0.604 | 0.696 | 0.731 |
performing-arts | 17512 | 8030 | 2796 | 2397392 | 9482 | 0.742 | 0.459 | 0.567 | 0.587 |
physics | 16666 | 9738 | 2722 | 2398312 | 6928 | 0.782 | 0.584 | 0.669 | 0.730 |
books | 17010 | 9479 | 2561 | 2398129 | 7531 | 0.787 | 0.557 | 0.653 | 0.686 |
video-games | 16301 | 14382 | 700 | 2400699 | 1919 | 0.954 | 0.882 | 0.917 | 0.947 |
mathematics | 15628 | 10965 | 2110 | 2399962 | 4663 | 0.839 | 0.702 | 0.764 | 0.820 |
eastern-africa | 16386 | 10657 | 1831 | 2399483 | 5729 | 0.853 | 0.650 | 0.738 | 0.793 |
comics-and-anime | 14774 | 11416 | 1248 | 2401678 | 3358 | 0.901 | 0.773 | 0.832 | 0.862 |
software | 14288 | 8377 | 3391 | 2400021 | 5911 | 0.712 | 0.586 | 0.643 | 0.692 |
western-africa | 14481 | 9804 | 1640 | 2401579 | 4677 | 0.857 | 0.677 | 0.756 | 0.816 |
southern-africa | 10012 | 6318 | 1018 | 2406670 | 3694 | 0.861 | 0.631 | 0.728 | 0.758 |
central-asia | 9549 | 5304 | 1602 | 2406549 | 4245 | 0.768 | 0.555 | 0.645 | 0.687 |
central-africa | 6698 | 3932 | 985 | 2410017 | 2766 | 0.800 | 0.587 | 0.677 | 0.722 |
fashion | 5789 | 2679 | 892 | 2411019 | 3110 | 0.750 | 0.463 | 0.572 | 0.579 |
radio | 4373 | 2349 | 522 | 2412805 | 2024 | 0.818 | 0.537 | 0.649 | 0.636 |
libraries-and-information | 3765 | 1449 | 537 | 2413398 | 2316 | 0.730 | 0.385 | 0.504 | 0.470 |
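The columns in the table above are related by simple identities, which makes individual rows easy to cross-check. The following is a minimal sketch using three rows copied from the table: it checks that n = TP + FN, that the four counts in each row add up to the same total, and recomputes precision, recall, and F1 from the counts. Treating that common total as the size of the test set is an inference from the table, not something this page states.

```python
# topic: (n, TP, FP, TN, FN), copied from three rows of the table above.
ROWS = {
    "europe": (783307, 681220, 90139, 1544254, 102087),
    "sports": (334774, 315905, 10497, 2072429, 18869),
    "libraries-and-information": (3765, 1449, 537, 2413398, 2316),
}

totals = set()
metrics = {}
for topic, (n, tp, fp, tn, fn) in ROWS.items():
    # n is the number of test articles that carry this topic label.
    assert n == tp + fn
    # Every article falls into exactly one of the four cells.
    totals.add(tp + fp + tn + fn)

    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    metrics[topic] = (p, r, f1)
    print(f"{topic}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# All three rows sum to the same number of articles.
print(totals)  # {2417700}
```

The average-precision column cannot be recomputed this way, since it depends on the ranking of per-article prediction scores rather than on the thresholded counts.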