Plotting Wikipedia Data

Posted at — Jun 15, 2020

Since January I have been building Namespace Database, a tool that creates a database of Wikipedia edits. With that database in place, I have been able to create the following plots. This is a work in progress and will evolve over time :)

💡 Tip: Click and hold the images to zoom

Glossary

Namespace

A namespace on Wikipedia is the section that the page is in. Most people only know about the article or main namespace, where the articles are. There are other namespaces such as User for profiles, Wikipedia for policy discussion etc.


Talk space

The last third of the graphs focuses on talk space. This is the section for discussing a page, and every namespace has a corresponding talk space. In my graphs I only look at the talk space of articles, for example people arguing over whether to use BC or AD on the Jesus article.


Special Users

Put simply, these are users you could think of as admins. They have the privileges to edit protected pages, confirm that edits aren’t spam, and so on. You can view all of the groups in figures 18.1 and 18.2.


Bot editors

Bots on Wikipedia don’t directly add content, but they help by doing the tedious tasks like fixing broken links and adding templates, which help humans fix articles.

Pages

Chart
Figure 3.1 linear

The namespaces with the most pages are Article and User talk. User talk has more pages than User as users themselves usually create their own pages, whereas anyone can write on your talk page.

Chart
Figure 3.2 log

With a log view of the same plot, we see that the number of pages varies widely between namespaces. Two namespaces have no pages and are not listed here: the virtual namespaces Special (-1) and Media (-2).

Chart
Figure 11

Most of the pages in the top four namespaces only have a single edit. Main talk space however more often has a couple edits than a single one, potentially due to a talk edit provoking conversation. Unsurprisingly, main space has the highest ratio of pages with over 100 edits. It’s worth noting that the Y-axes are consistent across all plots.
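
The bucketing behind a plot like this can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the figure: the bucket boundaries, the function names and the `edit_counts` sample are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical bucket boundaries (inclusive); the figure's real bins may differ.
BUCKETS = [(1, 1, "1"), (2, 5, "2-5"), (6, 100, "6-100")]


def bucket_label(edits):
    """Return the label of the bucket that a page's edit count falls into."""
    if edits < 1:
        raise ValueError("a page has at least one edit")
    for low, high, label in BUCKETS:
        if low <= edits <= high:
            return label
    return "over 100"


def bucket_shares(edit_counts):
    """Map each bucket label to the share of pages that fall into it."""
    counts = Counter(bucket_label(n) for n in edit_counts)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up edit counts for eight pages in a single namespace.
edit_counts = [1, 1, 1, 1, 3, 7, 150, 2]
shares = bucket_shares(edit_counts)
```

Running `bucket_shares` once per namespace gives shares that can be compared directly, which is what a shared Y-axis across the plots relies on.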

Chart
Figure 16
This plot is an alpha version of the next plot. We see that the most prolific editors have edited in pretty much all namespaces except MediaWiki, where you need to be a sysadmin to edit; TimedText and TimedText talk, which are for audio and video files; Education Program and Education Program talk, which were superseded by WikiEdu; Gadget and Gadget talk, for JavaScript tools for Wikipedia; and Gadget definition and Gadget definition talk, for the JavaScript itself.
Chart
Figure 29

Unsurprisingly, the namespace with the most editors is main article space, with all other namespaces having less engagement. Bots are active in a lot of namespaces, and ~40% of them don’t edit main space.

Users

Chart
Figure 1

Across all users, including IP editors, most only make a single edit.

Chart
Figure 2

94.3% of users do not edit in talk space.

Chart
Figure 4

Showing the intersection of the above graphs, 3.6% of users edit both main and talk space.
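
The overlap itself is a set intersection. Here is a minimal sketch, assuming the editors of each space are available as sets of user ids; the function name `space_overlap` and the ten sample users are hypothetical and not taken from the database.

```python
def space_overlap(main_editors, talk_editors, all_users):
    """Return the share of all users in each main/talk category."""
    total = len(all_users)
    return {
        "main only": len(main_editors - talk_editors) / total,
        "talk only": len(talk_editors - main_editors) / total,
        "both": len(main_editors & talk_editors) / total,
        "neither": len(all_users - main_editors - talk_editors) / total,
    }


# Ten made-up users: eight edit main space, two edit talk space, one does both.
all_users = set(range(1, 11))
main_editors = {1, 2, 3, 4, 5, 6, 7, 8}
talk_editors = {8, 9}
overlap = space_overlap(main_editors, talk_editors, all_users)
```

The four categories are disjoint and cover every user, so their shares always sum to 1.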

Chart
Figure 5.1

Ser Amantio di Nicolao, Koavf, and Rich Farmbrough are the only editors to appear in both graphs.

Chart
Figure 5.2
Chart
Figure 5.3
Chart
Figure 5.4

208.81.184.4, 129.33.19.294 and 204.153.84.10 are the only editors to appear in both graphs.

Chart
Figure 5.5
Chart
Figure 5.6

AnomieBOT, Yobot and InternetArchiveBot are the only editors to appear in both graphs.

Chart
Figure 8

Mainspace edits are on the left and talkpage edits on the right, and each column is a different user group. Each group seems to have a different peak in main space edits, with blocked users being a subset of the user distribution. As intuition suggests, bots are the group that differs most from the other editors.

Chart
Figure 9

This is a breakdown of figure 4 for the different user groups. Blocked users proportionally edit both main and talk compared to the wider population of users in general.

Chart
Figure 10.1

Users with special privileges have the longest maximum time between edits and also edit more frequently than users in general. Blocked users have shorter account durations, which follows intuition.

Chart
Figure 10.2
Chart
Figure 18.1 linear

Unlike the other plots, the data in this chart comes directly from the dumps with no processing needed. These are all of the user groups on Wikipedia; every one of them is positive, denoting a privilege that the user gains. Extended confirmed is the largest group, as all non-blocked accounts that reach 500 edits and are 30 days old become extended confirmed.
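
The extended confirmed rule described here is simple enough to write down. The sketch below only restates the thresholds from the text (500 edits, 30 days, not blocked); the function name and its parameters are hypothetical, and this is not how Wikipedia itself implements the check.

```python
from datetime import date

# Thresholds as stated in the text above.
MIN_EDITS = 500
MIN_AGE_DAYS = 30


def is_extended_confirmed(edit_count, registered, today, blocked=False):
    """Return True if an account meets the extended confirmed thresholds."""
    age_days = (today - registered).days
    return (not blocked) and edit_count >= MIN_EDITS and age_days >= MIN_AGE_DAYS


# A made-up account registered on 1 January 2020 with 500 edits.
qualifies = is_extended_confirmed(500, date(2020, 1, 1), date(2020, 1, 31))
too_young = is_extended_confirmed(500, date(2020, 1, 1), date(2020, 1, 15))
```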

Chart
Figure 18.2 log

A log version of the previous plot. Details of the groups can be found on the User access levels page on Wikipedia.

Chart
Figure 20

Only 93,851 IPs have been indefinitely blocked on Wikipedia, far fewer than the 41 million IPs that have ever made edits.

Chart
Figure 21

The 14,074 special users (excluding extended confirmed, confirmed and bots) make 42% of talkpage edits. IP users see the greatest reduction, making up only 9% of talkpage edits.

Chart
Figure 37.1
Chart
Figure 37.2
Chart
Figure 24

This shows the proportion of users that have a specific boolean feature. Autoconfirmed is the largest, as all users with over 5 edits are autoconfirmed, which shows that the overwhelming majority of users make fewer edits than that.

Chart
Figure 38.1
Chart
Figure 38.2
Chart
Figure 38.3
Chart
Figure 38.4
Chart
Figure 38.5
Chart
Figure 38.6
Chart
Figure 38.7
Chart
Figure 39

Talkpage Edits

Chart
Figure 12

As an experiment, we applied TextBlob to get the sentiment of talkpage edits. On average, all edits add or remove positive content. As the average removed content is more neutral, we can imagine that more negative content is removed.

Chart
Figure 13.1

To investigate whether there was a bias towards adding positive or negative content, I separated edits from different groups into subsets where the content added/deleted was positive/negative. Additionally, only edits where both the added and the deleted content had sentiment were considered. On average, when blocked IP users add negative content they also delete positive content.
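
The split into subsets can be sketched as follows. This is only an illustration under stated assumptions: each edit is taken to carry precomputed polarity scores between -1 and 1 for its added and deleted text (the range TextBlob reports), "had sentiment" is read as a nonzero score, and the group names, function names and sample values are made up.

```python
from collections import defaultdict


def sentiment_subset(added, deleted):
    """Name the subset an edit belongs to, or None if either side is neutral."""
    if added == 0 or deleted == 0:
        return None  # assumption: a zero score means "no sentiment"
    added_sign = "positive" if added > 0 else "negative"
    deleted_sign = "positive" if deleted > 0 else "negative"
    return f"added {added_sign}, deleted {deleted_sign}"


def mean_by_subset(edits):
    """Average (added, deleted) polarity for each (group, subset) pair."""
    grouped = defaultdict(list)
    for group, added, deleted in edits:
        subset = sentiment_subset(added, deleted)
        if subset is not None:
            grouped[(group, subset)].append((added, deleted))
    return {
        key: (
            sum(a for a, _ in values) / len(values),
            sum(d for _, d in values) / len(values),
        )
        for key, values in grouped.items()
    }


# Made-up edits: (user group, polarity of added text, polarity of deleted text).
edits = [
    ("blocked IP", -0.5, 0.25),
    ("blocked IP", -0.25, 0.75),
    ("user", 0.5, 0.25),
    ("user", 0.0, 0.5),  # skipped: the added text has no sentiment
]
means = mean_by_subset(edits)
```

Each key of `means` is one point in a plot like the one above: the average added polarity against the average deleted polarity for one group and one subset.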

Chart
Figure 13.2

It’s clearer here that blocked IPs are the only editors that “cross the line”, signalling that on average they deleted content of the opposite sentiment to the other groups.

Chart
Figure 14

IPs use similar amounts of profanity whether they are blocked or not.

Chart
Figure 15

This shows the average value for all edit fields with error bars.

Chart
Figure 17

The average number of inserted internal links must be at least 1, as a link to the user’s page is automatically inserted into every comment that isn’t signed. I expected users with special privileges to insert far more links.

Chart
Figure 19

It’s clear that edits that we think of as spammy (high vulgarity, high reversions, blanking, i.e. removing pages) are performed by blocked IPs and not blocked users. Blocked IPs also delete the least but add the most; think copying and pasting lots of random text. Special users surprisingly don’t appear at any extremes, other than being reverted the least. The vertical bars are the average of all users.

Chart
Figure 22

This shows the range of all the integer values. Sentiment is constrained to a very tight range, suggesting people aren’t usually emotive in talkpage edits.

Chart
Figure 32.1
Chart
Figure 32.2
Chart
Figure 23

This shows the proportion of edits which have a specific boolean feature.

Chart
Figure 25

The large spike is from the 14th-17th of September 2012 and was the result of Yobot making 329,486 edits in 4 days, phew!
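
To put that number in perspective, here is the arithmetic as a tiny Python snippet. It assumes the edits were spread evenly over four full days, which is a simplification and not something the data confirms.

```python
SECONDS_PER_DAY = 24 * 60 * 60

edits = 329_486  # Yobot's edits during the spike, from the text above
days = 4

# Assumption: the edits are spread evenly over four full days.
per_day = edits / days
per_second = per_day / SECONDS_PER_DAY
```

Under that assumption the bot averaged a little over 82,000 edits per day, or just under one edit per second around the clock.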

Chart
Figure 30.1

This is figure 25 but per year and per month. The large spike in 2012 even seems to appear in the yearly plot. It does seem that talkpage engagement has fallen since 2007, when it stood at 6,072,548 edits.
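
Grouping edits per year or per month comes down to counting timestamps by a truncated date. The sketch below assumes the edit times are available as Python `datetime` objects; the function name `edits_per_period` and the four sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def edits_per_period(timestamps, fmt):
    """Count edits per period, where fmt is a strftime pattern such as '%Y'."""
    return Counter(ts.strftime(fmt) for ts in timestamps)


# Four made-up edit timestamps.
timestamps = [
    datetime(2007, 3, 1, 12, 0),
    datetime(2007, 3, 15, 8, 30),
    datetime(2007, 11, 2, 9, 0),
    datetime(2012, 9, 14, 0, 5),
]
per_year = edits_per_period(timestamps, "%Y")
per_month = edits_per_period(timestamps, "%Y-%m")
```

Using the same function with a different pattern keeps the yearly and monthly plots consistent with each other.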

Chart
Figure 31

The plot is much less noisy when bots are excluded. The dip after 2020 arises because most of the dataset was generated from the dump up to January 1st, 2020; once that dump stopped being hosted, the April dataset was used instead.

Chart
Figure 30.2
Chart
Figure 34
Chart
Figure 26

A similar plot, but showing how editing changed between the first 9 years and the second 9 years of the dataset. There’s less vulgarity, less reversion, less blanking and longer comments, so at this level Wikipedia talkpage edits seem healthier on average.

Chart
Figure 27

All of the features in the previous plots over time.

Chart
Figure 28

Same as the previous plot but averaged per year.

Chart
Figure 35.1.1
Chart
Figure 35.1.2
Chart
Figure 35.1.3
Chart
Figure 35.1.4
Chart
Figure 35.2.1
Chart
Figure 35.2.2
Chart
Figure 35.2.3
Chart
Figure 35.2.4