Research:Understanding Engagement with Images in Wikipedia/First Round of Analysis
On this page, we summarize the results of the first round of data analysis: an exploratory analysis of readers' engagement with images in Wikipedia, based on data collected from the webrequest table.
Data & Methods
Data
For this first exploratory study, we collected 2 weeks of data, from May 1st 2020 to May 15th 2020. From the webrequest logs, we collected all the page views and image views for each reading session, identifying reading sessions by concatenating client_ip+user_agent. We discarded views generated by bots and logged-in users, and we considered only views to the main namespace (namespace 0). We aggregated data by access method (desktop vs. mobile) and by location of the views (at the country level, excluding from the analysis all countries with fewer than 500 daily page views for privacy reasons). We collected data for 4 Wikipedia language editions: English (enwiki), French (frwiki), Spanish (eswiki), and Arabic (arwiki). These languages are among the most spoken worldwide, and their Wikipedia editions each have more than 1M articles and are among the most viewed.
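The filtering and session-identification steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the record layout and the field names `is_bot`, `logged_in`, `namespace`, and `page` are assumptions made for the example, and only `client_ip` and `user_agent` come from the text.

```python
from collections import defaultdict

# Hypothetical, simplified request records (not the real webrequest schema).
records = [
    {"client_ip": "192.0.2.1", "user_agent": "UA-1", "is_bot": False,
     "logged_in": False, "namespace": 0, "page": "Paris"},
    {"client_ip": "192.0.2.1", "user_agent": "UA-1", "is_bot": False,
     "logged_in": False, "namespace": 0, "page": "France"},
    {"client_ip": "192.0.2.2", "user_agent": "UA-2", "is_bot": True,
     "logged_in": False, "namespace": 0, "page": "Paris"},
    {"client_ip": "192.0.2.3", "user_agent": "UA-1", "is_bot": False,
     "logged_in": True, "namespace": 0, "page": "Paris"},
    {"client_ip": "192.0.2.4", "user_agent": "UA-3", "is_bot": False,
     "logged_in": False, "namespace": 1, "page": "Talk:Paris"},
    {"client_ip": "192.0.2.5", "user_agent": "UA-1", "is_bot": False,
     "logged_in": False, "namespace": 0, "page": "Paris"},
]


def session_key(record):
    """Identify a reading session by concatenating client_ip and user_agent."""
    return record["client_ip"] + record["user_agent"]


def keep(record):
    """Keep anonymous, non-bot views to the main namespace (namespace 0)."""
    return (not record["is_bot"]
            and not record["logged_in"]
            and record["namespace"] == 0)


# Group the retained page views by session.
sessions = defaultdict(list)
for record in records:
    if keep(record):
        sessions[session_key(record)].append(record["page"])
```

In this toy input, only two sessions survive the filters: one that viewed two pages and one that viewed a single page.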
Total number of page and image views
The following figures show the total number of page and image views collected. We observe:
- the total number of page loads is significantly higher for English Wikipedia compared to the other languages and for mobile compared to desktop;
- the total number of image loads is higher for desktop than for mobile (for enwiki and frwiki) and, in general, it is one order of magnitude lower than page loads.
Methods
To quantify readers' engagement with images, we defined two metrics: the page-specific click-through rate (CTR) and the image-specific click-through rate. The page-specific CTR is defined, for each page, as the ratio between the number of sessions that clicked on at least one image in that page and the total number of sessions that visited that page at least once. It can be interpreted as the probability of observing at least one click on an image in a given page. The image-specific CTR is defined as the ratio between the total number of image views for a given image and the total number of page views of the pages containing that image; it quantifies the popularity of the image.
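As an illustration of the two definitions, the sketch below computes both metrics on a toy event log. The event tuples, the page and image names, and the `pages_with_image` mapping are all hypothetical; the study's actual implementation is not shown in this page.

```python
from collections import defaultdict

# Hypothetical toy log: (session, page, event, image).
# event is "pageview" or "imageview"; image is None for page views.
events = [
    ("s1", "Paris", "pageview", None),
    ("s1", "Paris", "imageview", "Eiffel.jpg"),
    ("s1", "Paris", "imageview", "Louvre.jpg"),
    ("s2", "Paris", "pageview", None),
    ("s3", "Paris", "pageview", None),
    ("s3", "Paris", "imageview", "Eiffel.jpg"),
    ("s3", "France", "pageview", None),
    ("s4", "France", "pageview", None),
    ("s4", "France", "imageview", "Eiffel.jpg"),
]

# Hypothetical mapping from each image to the pages that contain it.
pages_with_image = {
    "Eiffel.jpg": ["Paris", "France"],
    "Louvre.jpg": ["Paris"],
}


def page_specific_ctr(events):
    """Sessions with at least one image click on a page, divided by
    the sessions that viewed that page."""
    viewers = defaultdict(set)
    clickers = defaultdict(set)
    for session, page, event, _image in events:
        if event == "pageview":
            viewers[page].add(session)
        elif event == "imageview":
            clickers[page].add(session)
    return {page: len(clickers[page] & sessions) / len(sessions)
            for page, sessions in viewers.items()}


def image_specific_ctr(events, pages_with_image):
    """Total views of an image, divided by the total page views of
    the pages containing it."""
    page_views = defaultdict(int)
    image_views = defaultdict(int)
    for _session, page, event, image in events:
        if event == "pageview":
            page_views[page] += 1
        elif event == "imageview":
            image_views[image] += 1
    ctr = {}
    for image, pages in pages_with_image.items():
        denominator = sum(page_views[page] for page in pages)
        if denominator:  # skip images whose pages were never viewed
            ctr[image] = image_views[image] / denominator
    return ctr


page_ctr = page_specific_ctr(events)
image_ctr = image_specific_ctr(events, pages_with_image)
```

Note that the page-specific CTR counts sessions (a session with two clicks on the same page counts once), while the image-specific CTR counts raw views.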
Results
Average page-specific click-through rate
We compute the page-specific CTR and we take the average across all pages. The figure below shows the daily average page-specific CTR for each Wikipedia edition for desktop access (on the left) and mobile web access (on the right).
We find:
- the average page-specific CTR shows a weekly pattern with an increased probability of clicking on images over weekends with respect to weekdays, except for Arabic Wikipedia;
- on average, the page-specific CTR is 3.5% for English, 3.7% for French, 2.9% for Spanish, and 2.2% for Arabic Wikipedia, for desktop;
- on average, the page-specific CTR is 1% lower for mobile web.
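The daily averaging and the weekday/weekend comparison behind these observations could look like the sketch below. The CTR values are invented for the example and do not reproduce the study's figures; only the dates fall inside the collection period.

```python
from datetime import date
from statistics import mean

# Hypothetical per-page CTR values for four days (not the study's data).
daily_page_ctr = {
    date(2020, 5, 1): [0.02, 0.04],  # Friday
    date(2020, 5, 2): [0.03, 0.05],  # Saturday
    date(2020, 5, 3): [0.04, 0.06],  # Sunday
    date(2020, 5, 4): [0.01, 0.03],  # Monday
}

# Average the page-specific CTR across all pages, day by day.
daily_average = {day: mean(values) for day, values in daily_page_ctr.items()}


def is_weekend(day):
    """date.weekday() returns 5 for Saturday and 6 for Sunday."""
    return day.weekday() >= 5


weekend_average = mean(v for d, v in daily_average.items() if is_weekend(d))
weekday_average = mean(v for d, v in daily_average.items() if not is_weekend(d))
```

With these made-up numbers the weekend average is higher than the weekday average, which is the kind of weekly pattern described in the first bullet.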
Page-specific click-through rate distribution for English Wikipedia
As an example, we plot the page-specific CTR distribution for English Wikipedia for desktop access. Note that the average page-specific CTR for images is one order of magnitude larger than for citations[1].
Topic analysis
To break down our analysis by topic, we assign each page a topic using the Wikidata topic model, choosing the topic with the highest probability. We compute the page-specific click-through rate for each topic by averaging the click-through rate over all pages associated with the same topic.
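A minimal sketch of this topic assignment and averaging is shown below. The probabilities, page titles, and CTR values are hypothetical, and the dictionary layout is an assumption about what a topic model's output might look like, not the model's actual interface.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical topic probabilities per page (illustrative values only).
topic_scores = {
    "Mona Lisa": {"Culture.Visual_arts": 0.9, "Culture.Literature": 0.2},
    "Boeing 747": {"History_and_Society.Transportation": 0.8,
                   "STEM.Engineering": 0.7},
    "The Starry Night": {"Culture.Visual_arts": 0.95},
}

# Hypothetical page-specific CTR per page.
page_ctr = {"Mona Lisa": 0.08, "Boeing 747": 0.06, "The Starry Night": 0.10}


def top_topic(scores):
    """Return the topic with the highest probability."""
    return max(scores, key=scores.get)


# Group page CTRs under each page's top topic, then average per topic.
ctr_by_topic = defaultdict(list)
for page, scores in topic_scores.items():
    ctr_by_topic[top_topic(scores)].append(page_ctr[page])

average_ctr_by_topic = {topic: mean(values)
                        for topic, values in ctr_by_topic.items()}
```

Because each page contributes only to its single most probable topic, a page such as the hypothetical "Boeing 747" counts toward History_and_Society.Transportation and not toward STEM.Engineering.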
The following figure shows the average page-specific CTR by topic for each language and desktop access (for mobile access the figure is similar).
We find:
- the distribution of the click-through rate is similar across Wikipedia editions, but it differs across topics. This suggests significant differences in the way readers engage with images depending on the topic of interest;
- the topics whose images are clicked most often are Culture.Visual_arts, History_and_Society.Transportation, and STEM.Engineering. While Culture.Visual_arts matches our expectations, the latter two can be explained by their association with pages containing a large number of images that show vehicles, maps, and, in general, content that may be worth examining in detail;
- on the other hand, images are clicked less often on topics such as STEM.Maths_and_Physics, Culture.Literature, and Culture.Linguistics, where pages contain fewer images on average;
- counterintuitively, Culture.Sports has a low click-through rate. This may be explained, for example, by the presence of a large number of pages containing small .svg images of team and league flags, which we have not removed;
- for mobile web access, the considerations remain the same.
Average page-specific click-through rate by country
To look deeper into the role of language and location in engagement with visual content, we break down our analysis by country. Namely, we compute the average page-specific CTR for each country in the world. We remove all countries for which the total average daily views are fewer than 500, for privacy reasons. The following figures show the average page-specific CTR by country for desktop access for each of the four Wikipedia languages considered.
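The country-level aggregation with the privacy threshold can be sketched as follows. The country records and the placeholder code "XX" are hypothetical; only the threshold of 500 average daily views comes from the text.

```python
from statistics import mean

MIN_DAILY_VIEWS = 500  # privacy threshold stated in the text

# Hypothetical per-country records: daily page views and per-page CTRs.
country_data = {
    "FR": {"daily_views": [1200, 1100, 1300], "page_ctr": [0.03, 0.05]},
    "XX": {"daily_views": [300, 450, 400], "page_ctr": [0.10]},
}


def average_ctr_by_country(country_data, min_daily_views=MIN_DAILY_VIEWS):
    """Average page-specific CTR per country, dropping countries whose
    average daily views fall below the threshold."""
    result = {}
    for country, record in country_data.items():
        if mean(record["daily_views"]) < min_daily_views:
            continue  # excluded for privacy reasons
        result[country] = mean(record["page_ctr"])
    return result


ctr_by_country = average_ctr_by_country(country_data)
```

Here the placeholder country "XX" is dropped because its average daily views are below 500, so it would appear as missing on a map.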
We find:
- for all Wikipedia editions, European countries have high CTR with few exceptions, like Germany for eswiki and arwiki;
- for English Wikipedia, the CTR is high for all English-speaking countries, as expected, and, in general, for countries in Northern Europe, North America, the Middle East, Southeast Asia, and in Russia and Australia. It is particularly low in the majority of countries in Africa and in some countries in South America, even though English Wikipedia is the most read version of Wikipedia. We observe significantly high values in Eastern and Northern European countries;
- for French Wikipedia, the pattern is similar to that of English Wikipedia, but the overall coverage is lower;
- for Spanish Wikipedia, the CTR is high in Europe (except for Germany and Russia) and North America. In South America, counterintuitively, the CTR is low overall, with the only exception of Brazil. The remaining countries are either missing or have a low CTR;
- for Arabic Wikipedia, the CTR is high for Arabic speaking countries, Europe, Australia, Japan, and it shows two outliers: Canada and Finland.
We plan to expand this geographical analysis by taking into account country-level statistics.
Image-specific click-through rate
The following figure shows the distribution of the number of views per image. The majority of images (92%) were viewed fewer than 100 times during the data collection period, with an average of 39 views per image.
We compute the image-specific click-through rate for each image having at least 1 click. The following figure shows the distribution of the click-through rate.
We find the click-through rate is 0.18 on average.
In order to understand the role of page popularity in consumption of images, we analyse the distribution of click-through rate by pageviews. We plot, for each image, the total number of pageviews of the pages containing that image vs. its click-through rate.
We find an inverse relation between these two dimensions: readers of pages with more views tend to click less on images.
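One simple way to inspect such a relation numerically is to bin images by the order of magnitude of their pageviews and average the CTR within each bin. The sketch below does this on invented (pageviews, CTR) pairs; the binning scheme is an assumption chosen for illustration and is not necessarily how the figure was produced.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical (total pageviews of containing pages, image CTR) pairs.
images = [(50, 0.30), (80, 0.20), (500, 0.10), (900, 0.12), (20000, 0.02)]


def mean_ctr_by_magnitude(images):
    """Group images by floor(log10(pageviews)) and average the CTR per bin.
    Assumes every pageview count is at least 1."""
    bins = defaultdict(list)
    for pageviews, ctr in images:
        bins[math.floor(math.log10(pageviews))].append(ctr)
    # Key each bin by its lower bound (10, 100, 1000, ...).
    return {10 ** exponent: mean(values)
            for exponent, values in sorted(bins.items())}


ctr_by_magnitude = mean_ctr_by_magnitude(images)
```

In this toy input the mean CTR decreases from one bin to the next as pageviews grow, mirroring the inverse relation described above.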
The role of the Main Page
During our analyses, we found that images placed on the Main Page receive a significant proportion of the total number of image views. Therefore, we inspect the role of the Main Page in increasing readers' engagement with images.
We find:
- on average, images placed on the Main Page receive significantly more views than others (385 vs. 6);
- moreover, on average 58% of these views are from the Main Page;
- in contrast, their click-through rate is 50% lower on average. This can be explained by the fact that when a page is featured on the Main Page, readers tend to click through to the page without further viewing the images.
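The first two comparisons can be sketched as follows, assuming that image views are available broken down by the page on which the click happened. The image names, page names, and counts are hypothetical and unrelated to the figures reported above.

```python
from statistics import mean

# Hypothetical view counts per image, split by the page where the click occurred.
views_by_source = {
    "Featured.jpg": {"Main Page": 60, "Article A": 40},
    "Picture_of_the_day.jpg": {"Main Page": 30, "Article B": 70},
    "Ordinary.jpg": {"Article C": 5},
}


def total_views(sources):
    """Total views of one image across all pages."""
    return sum(sources.values())


# Split images by whether they received views from the Main Page.
on_main_page = {image: sources for image, sources in views_by_source.items()
                if "Main Page" in sources}
others = {image: sources for image, sources in views_by_source.items()
          if "Main Page" not in sources}

average_views_main = mean(total_views(s) for s in on_main_page.values())
average_views_other = mean(total_views(s) for s in others.values())

# Average share of each Main Page image's views that come from the Main Page.
main_page_share = mean(s["Main Page"] / total_views(s)
                       for s in on_main_page.values())
```

The share is averaged per image, so an image with few views weighs as much as a heavily viewed one; a view-weighted share would be a different, equally plausible choice.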