|
I am a UK journalist, writing for the Mail on Sunday. I work on infographics, and I'm looking to do something on the history and evolution of Wikipedia. What I'm wondering is: is it possible to build a scraper or similar tool to go over every Wikipedia article and note when it was first created and the category it belongs to? This way I'm hoping to come up with a dataset that shows not only how quickly Wikipedia expanded, but which subject areas grew first and quickest, and which have the most articles. I'm not code-literate; I'm looking for someone who knows a reliable way to do this, and who is prepared to sit and carry out the data-gathering, for which we are prepared to pay. For more info, email me at christopher.hall@mailonsunday.co.uk Thanks |
|
I've started a database dump containing the data you are looking for on the Wikimedia Toolserver. It should be done tomorrow, unless the server gods smite it for using too many resources. I'll post the URL if it goes through. You'll have to normalize the categories yourself, though.
Amazing! Any pointers to a README on how to play around with that dump (field headings, SQL queries that would list page creation dates, etc.)?
(07 Feb '11, 22:45)
rgrp ♦♦
1
OK, data is here: http://toolserver.org/~magnus/data/en_created_cats.tab.gz
Format is TITLE - CREATION_DATE - CATEGORIES
SQL query was:
select page_title, min(rev_timestamp) AS created, group_concat(distinct cl_to)
from page, revision, categorylinks
where page_id=rev_page and page_namespace=0 and page_is_redirect=0 and cl_from=page_id
group by page_id
(08 Feb '11, 09:14)
Magnus Manske
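For anyone who wants to work with the file outside a spreadsheet, here is a minimal Python sketch that streams it and counts new articles per month. It assumes the three fields described above are tab-separated (as the .tab.gz name suggests) and that the file has already been downloaded; the function names and the local path are my own, and the sample rows are made up for illustration.

```python
import gzip
from collections import Counter


def count_by_month(lines):
    """Count articles per creation month (YYYYMM) from tab-separated rows.

    Each row is assumed to be TITLE <tab> CREATION_DATE <tab> CATEGORIES.
    Rows without a usable timestamp are skipped rather than guessed at.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        month = fields[1][:6]
        if len(month) < 6 or not month.isdigit():
            continue
        counts[month] += 1
    return counts


def count_file(path):
    """Stream a gzipped dump line by line so it never has to fit in memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_by_month(handle)


# Made-up sample rows in the assumed layout, just to show the call.
sample = [
    "Example_article_one\t20011105102233\tCategory_A,Category_B\n",
    "Example_article_two\t20011008035408\tCategory_C\n",
    "Example_article_three\t20011120220000\tCategory_A\n",
]
for month, total in sorted(count_by_month(sample).items()):
    print(month, total)
```

To run it on the real data, something like count_file("en_created_cats.tab.gz") should do, with the path pointing at wherever you saved the download.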
Magnus, Thanks for helping with my request. I'm not sure I fully understand the database you've created - but it looks like it's on the right lines. How do you read the 'date created' column, eg 20010803163502? Also, this file contains details for some 65000 pages - what does that set represent? Recently updated pages, or a random subset, or something else? Finally, the categories data is useful, but there are just too many categories. Is it possible to categorise pages according to Wiki's contents divisions? (http://en.wikipedia.org/wiki/Portal:Contents) Thanks again, Chris
(08 Feb '11, 11:29)
Chris Hall
The date format is Year - Month - Day - Hour - Minute - Second (four digits for the year, two digits for each of the others).
The file I linked to contains data on all 3.5 million articles on en.wikipedia. If you only see 65000, your software (Excel?) has truncated them.
Backtracking categories to the "divisions" is not trivial technically; also, there is no guarantee that an article will fall into any of the divisions, and it might fall into more than one. I might have a look at that later today, but no promises (/me eyes stack of urgent to-do things on desk).
(08 Feb '11, 11:46)
Magnus Manske
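As a small illustration of the date format described above, this Python sketch turns a 14-digit value such as 20010803163502 into a proper date. The function name is my own; the format string simply spells out year, month, day, hour, minute, second.

```python
from datetime import datetime


def parse_created(stamp: str) -> datetime:
    """Parse a 14-digit YYYYMMDDHHMMSS creation timestamp."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


created = parse_created("20010803163502")
print(created.isoformat())  # 2001-08-03T16:35:02
```

So the example value from the question reads as 3 August 2001, 16:35:02.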
Ah thanks, I thought that date format was something like that. You're right, Excel is truncating the data. Re. categories, any help is much appreciated, but I understand you've got more important things to do. Chris
(08 Feb '11, 14:33)
Chris Hall
Chris: if Magnus is busy (he has already done an amazing job for you!) I can help out with the data-wrangling here (Excel really isn't going to cut it given the size of the data).
(08 Feb '11, 15:04)
rgrp ♦♦
There you go: http://toolserver.org/~magnus/data/New_articles_by_topic.xlsx
I have summarized articles by day, so the whole thing is only 1MB and won't be eaten by Excel. Each article is counted only once, even if it tracks to several categories (in these cases, I counted it for the "majority root" of the individual category trees). Most articles (>3.4M) could be tracked to such a "root" topic, but a few had no discernible category associated.
(08 Feb '11, 21:08)
Magnus Manske
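The code behind the "majority root" step isn't shown in this thread, so the following is only a guess at the idea, written as a Python sketch: given a lookup from each category to a root topic, count which root an article's categories point to most often. The root_of lookup and the category names in it are hypothetical placeholders; building the real lookup from the category tree is the hard part and is not covered here.

```python
from collections import Counter


def majority_root(categories, root_of):
    """Return the root topic that most of an article's categories map to.

    root_of is a hypothetical dict from category name to root topic.
    Categories missing from it are ignored; if none can be traced,
    the result is None.
    """
    roots = Counter(root_of[c] for c in categories if c in root_of)
    if not roots:
        return None
    return roots.most_common(1)[0][0]


# Hypothetical lookup, for illustration only.
root_of = {
    "Theoretical_physicists": "Science",
    "Nobel_laureates_in_Physics": "Science",
    "1879_births": "People",
}
article_categories = ["Theoretical_physicists", "Nobel_laureates_in_Physics", "1879_births"]
print(majority_root(article_categories, root_of))  # Science
```

An article whose categories cannot be traced to any root comes back as None, which would correspond to the few articles with no root topic mentioned above.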
Magnus, Thanks, but I can't download the file. Your link just displays random characters. As it's not a large file, you could try emailing it to me if that's easier? Really appreciate your help here. @rgrp thanks, will let you know if I need help with the data once I get it.
(09 Feb '11, 10:36)
Chris Hall
@Chris, I was talking about processing the raw data (e.g. the 600Mb tab-separated dump) -- I don't tend to use Excel ;)
(09 Feb '11, 16:13)
rgrp ♦♦
|
|
Hi Christopher, To answer your first question: yes, it is possible, with a few exceptions. The data dumps required to answer the question can be found at http://dumps.wikimedia.org/enwiki/20110115/ One caveat, however: they do not contain the version histories of articles that were created and have since been deleted.
Matthias, Thanks for your help. So, just to be clear, if I wanted the article title, date of creation and category info for every English Wikipedia page, I would download from here? http://dumps.wikimedia.org/enwiki/20110115/ I'm only interested in what exists on Wikipedia now, so if a page was created and deleted, it's not a problem if it doesn't show up in the data. Chris
(08 Feb '11, 11:33)
Chris Hall
|
|
As detailed on the CKAN Wikipedia entry, Wikipedia does provide database dumps, and you would definitely want to use these rather than scraping the Wikipedia site (which is forbidden!). The dumps are very large, but you only need the metadata (when a page was first created, when edits were made, categories, etc.), so the problem is rather more tractable.
I'm just going to fold my efforts into Magnus and Matthias' great contributions. |
Get the Data