Research:Language-Agnostic Topic Classification/Outlink model performance

This page provides details about the current (as of December 2020) model for Wikipedia topic classification based on outlinks (as represented by their corresponding Wikidata IDs):

  • Code: https://github.com/geohci/wikipedia-language-agnostic-topic-classification
  • Model architecture: multi-label fastText supervised model
  • Training data: a 90% sample of every language edition of Wikipedia. In practice, English Wikipedia provides 11.4% of the data, followed by Cebuano (8.8%), Swedish (6.4%), German (4.6%), and French (4.2%); every other language contributes less than 4%.
  • Epochs: 2
  • Learning rate: 0.1
  • Window size: 20
  • Min count (under which QID is not retained in vocab): 20
  • No pre-trained embeddings used
  • Embeddings dimension: 50
  • Total number of model params: 3,200 (embeddings dimension * number of topics, 50 x 64)
  • Vocab size: 4,145,064
  • Total number of embeddings params: 207,253,200 (vocab size * embeddings dimension)
  • Model size on disk: 863 MB
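
The parameter counts above follow from simple arithmetic, sketched here. The keys in `hyperparams` are argument names of the fastText Python `train_supervised` function; the mapping of the listed settings onto them, the reading of 64 as the number of topic labels, and the 4-byte float32 storage are assumptions rather than details confirmed on this page.

```python
# Hyperparameters from the list above, keyed by fastText train_supervised
# argument names (the mapping is an assumption, not taken from the repo).
hyperparams = {
    "epoch": 2,
    "lr": 0.1,
    "ws": 20,
    "minCount": 20,
    "dim": 50,
}

vocab_size = 4_145_064
num_topics = 64  # assumption: one output unit per topic in the results table

# One embedding vector of length `dim` per retained QID.
embedding_params = vocab_size * hyperparams["dim"]

# One weight vector of length `dim` per topic in the output layer.
output_params = hyperparams["dim"] * num_topics

# Assumption: every parameter is stored as a 4-byte float32.
approx_megabytes = (embedding_params + output_params) * 4 / 1_000_000

print(embedding_params)
print(output_params)
print(round(approx_megabytes))
```

Under these assumptions the weights alone account for roughly 829 MB of the 863 MB on disk; the remainder would presumably be the stored vocabulary and model metadata.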

Test Results

The test results below have an important limitation: there is no labeled data for articles that lack an English Wikipedia equivalent. Performance may well degrade for these articles (assuming they also have outlinks to different types of articles).

Overall results:

  • Precision: 0.877 (micro); 0.836 (macro)
  • Recall: 0.793 (micro); 0.678 (macro)
  • F1: 0.830 (micro); 0.744 (macro)
  • Average precision: 0.891 (micro); 0.795 (macro)
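
The gap between the micro and macro figures comes from how per-topic scores are combined. The sketch below illustrates the two averaging schemes on three topics whose counts are taken from the per-topic results on this page; it is an illustration on a subset, not a reproduction of the overall numbers, and the helper names are hypothetical.

```python
# (TP, FP, FN) for three topics from the per-topic results on this page.
counts = {
    "sports": (315905, 10497, 18869),
    "society": (10600, 6410, 28216),
    "education": (5774, 3312, 17549),
}


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Micro-averaging: pool the counts across topics, then compute one score.
tp_sum = sum(tp for tp, _, _ in counts.values())
fp_sum = sum(fp for _, fp, _ in counts.values())
fn_sum = sum(fn for _, _, fn in counts.values())
micro_precision = precision(tp_sum, fp_sum)
micro_recall = recall(tp_sum, fn_sum)

# Macro-averaging: compute a score per topic, then take the unweighted mean.
macro_precision = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)
macro_recall = sum(recall(tp, fn) for tp, _, fn in counts.values()) / len(counts)

print(f"micro: precision={micro_precision:.3f} recall={micro_recall:.3f}")
print(f"macro: precision={macro_precision:.3f} recall={macro_recall:.3f}")
```

Because micro-averaging pools counts, large and well-predicted topics such as sports dominate it, while macro-averaging gives small, harder topics such as society and education equal weight. This is consistent with the macro scores being lower than the micro scores above.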

For qualitative evaluations (of a slightly older model), see task T266201. Overall, precision was 94.7% in Arabic, 81.4% in Czech, 89.5% in English, 88.0% in French, and 91.3% in Vietnamese.

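Per-topic results (n is the number of test articles labeled with the topic, i.e. TP + FN; Ave. Pre. is average precision):

Topic n TP FP TN FN Precision Recall F1 Ave. Pre.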
europe 783307 681220 90139 1544254 102087 0.883 0.870 0.876 0.952
biography 677093 621664 49035 1691572 55429 0.927 0.918 0.922 0.971
stem 480236 431709 24123 1913341 48527 0.947 0.899 0.922 0.973
asia 360865 304097 35810 2021025 56768 0.895 0.843 0.868 0.933
sports 334774 315905 10497 2072429 18869 0.968 0.944 0.956 0.980
media 327692 281754 37092 2052916 45938 0.884 0.860 0.872 0.938
western-europe 290047 243700 28115 2099538 46347 0.897 0.840 0.867 0.944
north-america 282981 216595 40953 2093766 66386 0.841 0.765 0.801 0.895
biology 247481 232241 6260 2163959 15240 0.974 0.938 0.956 0.985
geographical 213947 145537 33160 2170593 68410 0.814 0.680 0.741 0.825
eastern-europe 176961 149351 14603 2226136 27610 0.911 0.844 0.876 0.938
southern-europe 172638 138080 19706 2225356 34558 0.875 0.800 0.836 0.913
northern-europe 161724 118894 21354 2234622 42830 0.848 0.735 0.787 0.869
history 142121 83506 20703 2254876 58615 0.801 0.588 0.678 0.757
music 127211 110574 9764 2280725 16637 0.919 0.869 0.893 0.939
women 113835 47597 20674 2283191 66238 0.697 0.418 0.523 0.612
films 112002 90799 15972 2289726 21203 0.850 0.811 0.830 0.904
east-asia 97601 80108 10181 2309918 17493 0.887 0.821 0.853 0.909
military-and-warfare 103937 64034 15022 2298741 39903 0.810 0.616 0.700 0.774
politics-and-government 95746 53169 15573 2306381 42577 0.773 0.555 0.646 0.724
west-asia 91404 73965 8718 2317578 17439 0.895 0.809 0.850 0.914
philosophy-and-religion 89395 51774 14468 2313837 37621 0.782 0.579 0.665 0.717
visual-arts 86498 51830 14574 2316628 34668 0.781 0.599 0.678 0.744
transportation 84279 71175 6045 2327376 13104 0.922 0.845 0.881 0.921
literature 75739 41881 11840 2330121 33858 0.780 0.553 0.647 0.718
south-asia 72071 60734 5569 2340060 11337 0.916 0.843 0.878 0.919
africa 71004 49691 8820 2337876 21313 0.849 0.700 0.767 0.835
south-america 61705 46888 7646 2348349 14817 0.860 0.760 0.807 0.877
north-asia 63870 47955 7789 2346041 15915 0.860 0.751 0.802 0.871
oceania 60421 46368 5013 2352266 14053 0.902 0.767 0.829 0.876
business-and-economics 53360 25192 9361 2354979 28168 0.729 0.472 0.573 0.614
technology 44378 25983 8303 2365019 18395 0.758 0.585 0.661 0.721
engineering 45147 30979 4465 2368088 14168 0.874 0.686 0.769 0.820
architecture 42920 22882 7432 2367348 20038 0.755 0.533 0.625 0.684
medicine-and-health 41775 28278 4943 2370982 13497 0.851 0.677 0.754 0.815
earth-and-environment 40405 26281 4867 2372428 14124 0.844 0.650 0.735 0.779
television 38695 26503 5099 2373906 12192 0.839 0.685 0.754 0.806
society 38816 10600 6410 2372474 28216 0.623 0.273 0.380 0.407
southeast-asia 38135 27982 3772 2375793 10153 0.881 0.734 0.801 0.856
space 36009 32368 1331 2380360 3641 0.961 0.899 0.929 0.960
linguistics 32034 20520 2452 2383214 11514 0.893 0.641 0.746 0.772
computing 25619 18487 3769 2388312 7132 0.831 0.722 0.772 0.837
central-america 24075 14708 2659 2390966 9367 0.847 0.611 0.710 0.762
entertainment 22751 9633 4024 2390925 13118 0.705 0.423 0.529 0.587
internet-culture 23239 17083 2064 2392397 6156 0.892 0.735 0.806 0.872
education 23323 5774 3312 2391065 17549 0.635 0.248 0.356 0.381
chemistry 22115 16800 2614 2392971 5315 0.865 0.760 0.809 0.884
northern-africa 20544 11847 3437 2393719 8697 0.775 0.577 0.661 0.706
food-and-drink 19547 11808 2595 2395558 7739 0.820 0.604 0.696 0.731
performing-arts 17512 8030 2796 2397392 9482 0.742 0.459 0.567 0.587
physics 16666 9738 2722 2398312 6928 0.782 0.584 0.669 0.730
books 17010 9479 2561 2398129 7531 0.787 0.557 0.653 0.686
video-games 16301 14382 700 2400699 1919 0.954 0.882 0.917 0.947
mathematics 15628 10965 2110 2399962 4663 0.839 0.702 0.764 0.820
eastern-africa 16386 10657 1831 2399483 5729 0.853 0.650 0.738 0.793
comics-and-anime 14774 11416 1248 2401678 3358 0.901 0.773 0.832 0.862
software 14288 8377 3391 2400021 5911 0.712 0.586 0.643 0.692
western-africa 14481 9804 1640 2401579 4677 0.857 0.677 0.756 0.816
southern-africa 10012 6318 1018 2406670 3694 0.861 0.631 0.728 0.758
central-asia 9549 5304 1602 2406549 4245 0.768 0.555 0.645 0.687
central-africa 6698 3932 985 2410017 2766 0.800 0.587 0.677 0.722
fashion 5789 2679 892 2411019 3110 0.750 0.463 0.572 0.579
radio 4373 2349 522 2412805 2024 0.818 0.537 0.649 0.636
libraries-and-information 3765 1449 537 2413398 2316 0.730 0.385 0.504 0.470
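
The Precision, Recall, and F1 columns can be rederived from the confusion counts in each row. A minimal sketch using the europe row (variable names are illustrative):

```python
# Confusion counts for the "europe" row of the table above.
n, tp, fp, tn, fn = 783307, 681220, 90139, 1544254, 102087

# n counts the test articles that carry the topic label.
labeled_positive = tp + fn

# All four cells together give the number of test articles for this row.
test_set_size = tp + fp + tn + fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(labeled_positive == n)
print(test_set_size)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")
```

For this row the four cells sum to 2,417,700 articles, and the computed scores round to the 0.883, 0.870, and 0.876 shown in the table.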