Plotting Wikipedia Data
Since January I have been building Namespace Database, a tool for creating a database of Wikipedia edits. This page highlights the best plots from my database of talk page edits. All of the plots can be seen here.
💡 Tip: Click and hold the images to zoom
Glossary
Namespace
A namespace on Wikipedia is the section that the page is in. Most people only know about the article or main namespace, where the articles are. There are other namespaces such as User for profiles, Wikipedia for policy discussion etc.
Talk space
The last third of the graphs focus on talk space. This is the section for discussing a page, and every namespace has a corresponding talk space. In my graphs I am only looking at the talk space of articles, for example people arguing over whether to use BC or AD on the Jesus article.
Special Users
Put simply, these are users you could think of as admins. They have privileges such as editing protected pages and confirming that edits aren’t spam. You can view all of the groups in figures 18.1 and 18.2.
Bot editors
Bots on Wikipedia don’t directly add content, but they help by doing the tedious tasks like fixing broken links and adding templates, which help humans fix articles.
Users
Demonstrating the population of each group with a log axis.
Ser Amantio di Nicolao, Koavf, and Rich Farmbrough are the only editors to appear in both graphs
208.81.184.4, 129.33.19.294 and 204.153.84.10 are the only editors to appear in both graphs
Blocked users split their edits between main and talk space in roughly the same proportion as users in general.
Users with special privileges have the longest maximum time between edits, yet also edit more frequently than users in general. Blocked users have shorter account durations, which follows intuition.
Unlike the other plots, the data in this chart comes directly from the dumps and needed no processing. These are all of the user groups on Wikipedia; every one of them is positive, denoting a privilege that the user gains. Extended confirmed is the largest group, as all non-blocked accounts that reach 500 edits and are 30 days old become extended confirmed. Details of the groups can be found on the User access levels page at Wikipedia.
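The extended confirmed rule described above can be written as a small check. This is only a sketch of the two thresholds quoted in the text (500 edits, 30 days old, not blocked); the function name and parameters are hypothetical and are not part of Namespace Database or of MediaWiki itself.

```python
from datetime import datetime, timedelta

# Thresholds taken from the paragraph above; all names here are hypothetical.
MIN_EDITS = 500
MIN_ACCOUNT_AGE = timedelta(days=30)


def is_extended_confirmed(edit_count, registered, now, blocked=False):
    """Return True if an account meets both thresholds and is not blocked."""
    if blocked:
        return False
    old_enough = (now - registered) >= MIN_ACCOUNT_AGE
    return edit_count >= MIN_EDITS and old_enough


# Made-up example accounts, evaluated at an arbitrary reference date.
now = datetime(2020, 1, 31)
print(is_extended_confirmed(750, datetime(2019, 6, 1), now))   # enough edits, old account
print(is_extended_confirmed(750, datetime(2020, 1, 15), now))  # enough edits, too new
```

Because both conditions are easy to reach for any long-lived, active account, it is unsurprising that this is the largest group.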
Only 93,851 IPs have been indefinitely blocked on Wikipedia, far fewer than the 41 million IPs that have ever made edits.
The 14,074 special users (excluding extended confirmed, confirmed and bots) make 42% of talkpage edits. IP users show the greatest reduction, making up only 9% of talkpage edits. There seems to be an inverse correlation between group size and number of talkpage edits made.
This shows the previous figure over time, demonstrating that talkpage activity peaked in 2007 and has steadily decreased since. On the positive side, special user activity seems to be decreasing at a slower rate. The number of editors has plateaued in recent years.
This plots the first and last edits of each user, and how many edits they’ve made.
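As a rough illustration of the aggregation behind a plot like this, the sketch below reduces a list of edits to one row per user: first edit, last edit and edit count. The sample rows and the function name are made up for the example, and this is not necessarily how Namespace Database computes it.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows of (username, edit timestamp); not real data.
edits = [
    ("ExampleUserA", datetime(2008, 3, 1)),
    ("ExampleUserA", datetime(2015, 7, 9)),
    ("ExampleUserB", datetime(2012, 5, 2)),
    ("ExampleUserA", datetime(2011, 1, 20)),
]


def summarise(rows):
    """Map each user to (first edit, last edit, number of edits)."""
    stamps = defaultdict(list)
    for user, timestamp in rows:
        stamps[user].append(timestamp)
    return {user: (min(ts), max(ts), len(ts)) for user, ts in stamps.items()}


summary = summarise(edits)
for user, (first, last, count) in sorted(summary.items()):
    print(user, first.date(), last.date(), count)
```

Each resulting row gives one point in the plot: first edit on one axis, last edit on the other, with the edit count available for colour or size.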
Most special users created their accounts in 2008 and have continued editing since. This is very positive and shows the commitment of these editors. We can also see that new editors are joining at a consistent rate.
In contrast, we can see that while blocked users do make a lot of edits, they don’t stick around for very long. This is positive, as it would be concerning to see many long-standing and productive users getting blocked.
Talkpage Edits
IPs use similar amounts of profanity whether they are blocked or not.
It’s clear that the edits we think of as spammy (high vulgarity, high reversions, blanking, which means removing pages) are mainly made by IP and blocked IP users. Blocked users are sometimes higher in these regards, but in general are very similar to users. Blocked IPs also delete the least but add the most (think of copying and pasting lots of random text). Special users surprisingly don’t appear at any extreme, other than being reverted the least. As they edit the most, they generally pull the all-user average (vertical bars) towards them. Finally, it seems that on average all user groups add content with positive sentiment. This suggests that people are generally positive and nice when talking to each other on talk pages.
A dendrogram created from the features from the previous plot
The large spike is from the 14th-17th of September 2012 and was the result of Yobot making 329,486 edits in 4 days, phew!
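To put that spike in perspective, the figures above can be turned into an average rate. This is plain arithmetic on the two numbers quoted in the text, assuming the edits were spread evenly over four full days.

```python
YOBOT_EDITS = 329_486  # edit count quoted above
DAYS = 4               # 14th to 17th of September 2012, inclusive

# Average rates, assuming edits were spread evenly over four full days.
edits_per_day = YOBOT_EDITS / DAYS
edits_per_second = edits_per_day / (24 * 60 * 60)

print(f"{edits_per_day:,.1f} edits per day")
print(f"{edits_per_second:.2f} edits per second")
```

That works out to roughly 82,000 edits a day, or just under one edit every second.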
This is Figure 25, but per year and per month. The large spike in 2012 even seems to appear in the yearly plot. Talkpage engagement does seem to have fallen since 2007, when it stood at 6,072,548 edits.
A similar plot, but showing how editing changed between the first 9 years and the second 9 years of the dataset. There’s less vulgarity, less reversion, less blanking and longer comments, so at this level Wikipedia talkpage edits seem healthier on average.
Pages
The namespaces with the most pages are Article and User talk. User talk has more pages than User because users usually create their own user pages themselves, whereas anyone can write on your talk page. The log scale on the x axis lets us see that the number of pages varies widely between namespaces. Two namespaces have no pages and are not listed here: the virtual namespaces -1 Special and -2 Media.
Unsurprisingly, the namespace with the most editors is the main article space, with all other namespaces having less engagement. Bots are active in a lot of namespaces, and ~40% of them don’t edit main space.
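For readers unfamiliar with namespace numbers, here is a minimal sketch of how they fit together. It assumes the standard MediaWiki numbering, which the text above does not spell out: subject namespaces have even IDs, each talk space has the next odd ID, and virtual namespaces are negative. Only a small subset is listed, and the helper names are hypothetical.

```python
# A small subset of the standard MediaWiki namespace IDs, for illustration.
NAMESPACES = {
    -2: "Media",
    -1: "Special",
    0: "Article (main)",
    1: "Talk",
    2: "User",
    3: "User talk",
    4: "Wikipedia",
    5: "Wikipedia talk",
}


def is_virtual(ns):
    """Virtual namespaces have negative IDs and hold no pages."""
    return ns < 0


def is_talk(ns):
    """Talk namespaces have odd, positive IDs."""
    return ns > 0 and ns % 2 == 1


def talk_namespace(ns):
    """Return the ID of the talk space that corresponds to a namespace."""
    if is_virtual(ns):
        raise ValueError("virtual namespaces have no talk space")
    return ns if is_talk(ns) else ns + 1


for ns in (0, 2, 4):
    print(NAMESPACES[ns], "->", NAMESPACES[talk_namespace(ns)])
```

Under this numbering, the article talk space these plots focus on is namespace 1.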
If you would like to see the rest of the plots, they are available here