Data Mining Wikipedia


20 Feb 2013


V1.0 Product - Transform Wikipedia Articles into "Interactive Videos"

Qwiki on Bing Video

What data do we need?

  • Article Text
  • Structured Data
  • Media: Images/Videos


Always remember your goal!

It's very easy to get lost in the data

Wikipedia's data is open

Article Text

HTML format: not available in bulk since 2008.


Articles are available in bulk XML format.
Article text itself is in wikitext markup:

The '''Boston Red Sox''' are a [[professional baseball|professional baseball team]] based in [[Boston]], [[Massachusetts]], and a member of [[Major League Baseball]]'s [[American League East|American League Eastern Division]]. Founded in {{by|1901}} as one of the American League's eight charter franchises,
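A minimal sketch of streaming pages out of the bulk XML without loading the whole file. It assumes the usual MediaWiki export layout (`<page>` containing `<title>` and `<revision><text>`); the namespace version in the sample is an assumption, so the code strips namespaces instead of hard-coding one. `iter_pages` and `SAMPLE` are illustrative names, not part of any Wikipedia tooling.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


# Tiny stand-in for a real dump file (assumed layout, shortened text).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Boston Red Sox</title>
    <revision>
      <text>The '''Boston Red Sox''' are a [[professional baseball]] team.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump, pass an open (decompressed) file object instead of `io.BytesIO(SAMPLE)`. The text that comes out is still raw wikitext.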

AVOID parsing wikitext if you can...

If you just need the abstract text:
<title>Wikipedia: Peter Duchan</title>
<abstract>Peter Duchan is an American script writer. He was a writer of the 2009 film Breaking Upwards.</abstract>
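A sketch of pulling title/abstract pairs from the abstracts file, so no wikitext parsing is needed. The `<feed>`/`<doc>` wrapper elements in the sample are an assumption about the file layout; only `<title>` and `<abstract>` appear above. `iter_abstracts` is an illustrative name.

```python
import io
import xml.etree.ElementTree as ET

TITLE_PREFIX = "Wikipedia: "


def iter_abstracts(source):
    """Yield (title, abstract) pairs, one per <doc> element."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "doc":
            continue
        title = elem.findtext("title") or ""
        if title.startswith(TITLE_PREFIX):
            title = title[len(TITLE_PREFIX):]
        abstract = (elem.findtext("abstract") or "").strip()
        yield title, abstract
        elem.clear()  # release the processed <doc>


# Assumed wrapper structure around the record shown on the slide.
SAMPLE = b"""<feed>
<doc>
<title>Wikipedia: Peter Duchan</title>
<abstract>Peter Duchan is an American script writer. He was a writer of the 2009 film Breaking Upwards.</abstract>
</doc>
</feed>"""

abstracts = list(iter_abstracts(io.BytesIO(SAMPLE)))
```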

Template Expansion is difficult

Templates can be custom functions to embed/eval content.

{{for|the card game|Contract bridge}} 

{{#ifeq: yes | yes | Hooray...! | Darn...! }}
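To see why expansion is hard, here is a toy evaluator for the `#ifeq` example only. It is a sketch under heavy assumptions: plain string comparison, no nesting, no numeric comparison, and no lookup of custom templates such as `{{for|...}}`, all of which a real parser has to handle. `expand_ifeq` is a hypothetical helper, not a MediaWiki API.

```python
import re

# Matches a single, non-nested {{#ifeq: a | b | then | else }} call.
IFEQ = re.compile(
    r"\{\{#ifeq:([^|{}]*)\|([^|{}]*)\|([^|{}]*)(?:\|([^|{}]*))?\}\}"
)


def expand_ifeq(wikitext):
    """Replace each simple #ifeq call with its 'then' or 'else' branch."""
    def repl(match):
        left, right, then, otherwise = (
            g.strip() if g else "" for g in match.groups()
        )
        return then if left == right else otherwise

    return IFEQ.sub(repl, wikitext)


expanded_equal = expand_ifeq("{{#ifeq: yes | yes | Hooray...! | Darn...! }}")
expanded_differ = expand_ifeq("{{#ifeq: yes | no | Hooray...! | Darn...! }}")
```

Anything beyond this trivial case (templates calling templates, arguments containing links or pipes) is the reason to reach for an existing parser.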

Wikitext Parsers

MediaWiki parser (PHP)
bliki parser (Java)
Parsoid (JavaScript on Node.js/C++)
and many others...

We used Sweble (Java) with Scala.

Alternatively, you can set up a Wikipedia mirror and parse the HTML. (slow!)

Structured Data

Birth and death dates

Location coordinates

City/Country statistics (population)

Structured Data Output


But structured data is in wikitext

Can we use another data source?
Freebase maybe?


Easily accessible HTTP API - Yes
Formatted JSON - Yes
Data Quality - No!

Argentina Population (Wikipedia)

Argentina Population (Freebase)

Argentina Population (Google)

Solution: parse the wikitext Infobox

We used DBpedia mappings to help determine the right key-value pairs.
New York City:
{{Infobox settlement
|population_total                = 8,175,133

New York:
{{Infobox U.S. state
|2000Pop         = 19,465,197 (2011 est)<ref name=PopEstUS/>

{{Infobox country
|population_estimate     = 81,799,600<ref name=population />
|population_estimate_year = 2010
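A sketch of the key-value idea for the three infoboxes above: each infobox type stores population under a different key, so a per-type mapping picks the right one, and the value is cleaned of `<ref>` tags and trailing notes. `POPULATION_KEYS` is a hand-written, hypothetical stand-in for what the DBpedia mappings provide, not their actual format, and the regex only handles flat one-line parameters.

```python
import re

# Hypothetical mapping: infobox type -> parameter that holds the population.
POPULATION_KEYS = {
    "Infobox settlement": "population_total",
    "Infobox U.S. state": "2000Pop",
    "Infobox country": "population_estimate",
}

PARAM = re.compile(r"^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*)$", re.MULTILINE)
REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL)
NUMBER = re.compile(r"\d[\d,]*")


def infobox_params(wikitext):
    """Collect '|key = value' lines into a dict (flat parameters only)."""
    return {key: value.strip() for key, value in PARAM.findall(wikitext)}


def extract_population(infobox_name, wikitext):
    """Return the population as an int, or None if it cannot be found."""
    key = POPULATION_KEYS.get(infobox_name)
    raw = infobox_params(wikitext).get(key) if key else None
    if not raw:
        return None
    cleaned = REF.sub("", raw)        # drop footnote markers
    match = NUMBER.search(cleaned)    # first number; ignores "(2011 est)"
    return int(match.group().replace(",", "")) if match else None


nyc = extract_population(
    "Infobox settlement",
    "{{Infobox settlement\n|population_total                = 8,175,133",
)
state = extract_population(
    "Infobox U.S. state",
    "{{Infobox U.S. state\n|2000Pop         = 19,465,197 (2011 est)<ref name=PopEstUS/>",
)
country = extract_population(
    "Infobox country",
    "{{Infobox country\n|population_estimate     = 81,799,600<ref name=population />\n"
    "|population_estimate_year = 2010",
)
```

Note the edge case in the state example: the key is named `2000Pop` but the value is a 2011 estimate, so key names alone cannot be trusted for the year.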


Images found on Wikipedia mostly originate from Wikimedia Commons

You will need to download the data dump for the image caption/metadata.

Beware Large Images

30,000 x 27,904 pixels; file size 225 MB
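A rough sketch of why the file size understates the problem: a compressed 225 MB file expands to width x height x bytes-per-pixel once decoded. The 4 bytes per pixel (RGBA) and the `MAX_PIXELS` cutoff are assumptions for illustration; pick limits that fit your own pipeline.

```python
MAX_PIXELS = 50_000_000  # hypothetical cutoff, tune for your workers


def decoded_size_bytes(width, height, bytes_per_pixel=4):
    """Approximate in-memory size of the fully decoded bitmap."""
    return width * height * bytes_per_pixel


def should_skip(width, height, max_pixels=MAX_PIXELS):
    """Decide from metadata alone, before downloading or decoding."""
    return width * height > max_pixels


big = decoded_size_bytes(30_000, 27_904)  # the image from the slide
skip_big = should_skip(30_000, 27_904)
skip_small = should_skip(1_024, 768)
```

Under these assumptions the slide's image needs over 3 GB decoded, so check the dimensions in the image metadata first and skip or downscale.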

Parallel Processing

Our Sweble parsing was easily integrated into a Hadoop job.

Processed using Amazon EMR with spot instances.

Multiple passes through the dataset. 
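A sketch of what one pass can look like in map/reduce form, simulated locally in Python. This is not the Sweble/Hadoop job from the talk; it only illustrates the shape of a first pass that counts infobox types, which helps decide what later passes should extract. The regex, `mapper`, `reducer`, and `run_local` are all illustrative.

```python
import re
from itertools import groupby
from operator import itemgetter

INFOBOX = re.compile(r"\{\{\s*(Infobox[^|}\n]*)")


def mapper(title, wikitext):
    """Emit (infobox type, 1) for every infobox found on a page."""
    for name in INFOBOX.findall(wikitext):
        yield name.strip(), 1


def reducer(key, values):
    """Sum the counts for one infobox type."""
    return key, sum(values)


def run_local(pages):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    emitted = sorted(kv for title, text in pages for kv in mapper(title, text))
    return dict(
        reducer(key, (count for _, count in group))
        for key, group in groupby(emitted, key=itemgetter(0))
    )


pages = [
    ("New York City", "{{Infobox settlement\n|population_total = 8,175,133"),
    ("New York", "{{Infobox U.S. state\n|2000Pop = 19,465,197"),
    ("Boston", "{{Infobox settlement\n|name = Boston"),
]
counts = run_local(pages)
```

On a cluster the sort and grouping are done by the framework; only the mapper and reducer logic carries over.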


Know what you want.

Consider the edge cases.

Use available open tools. 
(and customize them when necessary)

Questions about mining Wikipedia?

contact info