Data Mining Wikipedia


@tommychheng

http://tommychheng.com

20 Feb 2013

Qwiki

   V1.0 Product - Transform Wikipedia Articles into "Interactive Videos"

Qwiki on Bing Video



What data do we need?


  • Article Text
  • Structured Data
  • Media: Images/Videos

BEFORE CONTINUING...


Always remember your goal!

It's very easy to get lost in the data

Wikipedia details are open

Article Text

HTML format
...

not available in bulk since 2008.

WikiText

Articles are available in bulk XML format.
Article text itself is in wikitext markup:

The '''Boston Red Sox''' are a [[professional baseball|professional baseball team]] based in [[Boston]], [[Massachusetts]], and a member of [[Major League Baseball]]'s [[American League East|American League Eastern Division]]. Founded in {{by|1901}} as one of the American League's eight charter franchises,
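
To make the markup above concrete, here is a toy sketch that flattens only the simplest constructs (plain `[[target]]` and `[[target|label]]` links plus bold/italic quotes). The function name and regexes are illustrative assumptions, not part of any wikitext library, and templates such as `{{by|1901}}` are deliberately left untouched.

```python
import re

# [[target]] or [[target|label]]; nested links, files and templates are NOT handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")
# Runs of 2 to 5 apostrophes mark italic/bold text in wikitext.
BOLD_ITALIC = re.compile(r"'{2,5}")

def flatten_simple_markup(wikitext: str) -> str:
    """Replace simple wikilinks with their display text and drop bold/italic quotes."""
    text = WIKILINK.sub(lambda m: m.group(2) or m.group(1), wikitext)
    return BOLD_ITALIC.sub("", text)

sample = (
    "The '''Boston Red Sox''' are a "
    "[[professional baseball|professional baseball team]] based in "
    "[[Boston]], [[Massachusetts]]. Founded in {{by|1901}}"
)
print(flatten_simple_markup(sample))
```

Even on this one sentence the template call survives unexpanded, which hints at why a regex-only approach does not scale to full articles.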

AVOID parsing wikitext if you can...

If you just need the abstract text:
<doc>
<title>Wikipedia: Peter Duchan</title>
<url>http://en.wikipedia.org/wiki/Peter_Duchan</url>
<abstract>Peter Duchan is an American script writer. He was a writer of the 2009 film Breaking Upwards.</abstract>
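
A minimal sketch of reading records shaped like the one above with Python's standard library. The `<feed>` wrapper element and the closing `</doc>` tag are assumptions added to make the sample well-formed; the function name is hypothetical. For a full dump file, a streaming reader such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading everything at once.

```python
import xml.etree.ElementTree as ET

# Sample record from the slide, wrapped so that it is well-formed XML.
SAMPLE = """<feed>
<doc>
<title>Wikipedia: Peter Duchan</title>
<url>http://en.wikipedia.org/wiki/Peter_Duchan</url>
<abstract>Peter Duchan is an American script writer. He was a writer of the 2009 film Breaking Upwards.</abstract>
</doc>
</feed>"""

TITLE_PREFIX = "Wikipedia: "

def read_abstracts(xml_text):
    """Yield one dict per <doc> element with title, url and abstract."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        title = doc.findtext("title", default="")
        if title.startswith(TITLE_PREFIX):
            title = title[len(TITLE_PREFIX):]
        yield {
            "title": title,
            "url": doc.findtext("url", default=""),
            "abstract": doc.findtext("abstract", default=""),
        }

records = list(read_abstracts(SAMPLE))
print(records[0]["title"])
```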

Template Expansion is difficult

Templates can be custom functions to embed/eval content.

{{for|the card game|Contract bridge}} 

{{#ifeq: yes | yes | Hooray...! | Darn...! }}

http://en.wikipedia.org/wiki/Help:Template
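
To illustrate why expansion is hard, here is a toy sketch that handles only flat (non-nested) calls. `parse_template` and `expand` are hypothetical helpers, and the `#ifeq` branch does a plain string comparison only, which is a simplification of what MediaWiki does. An ordinary template such as `for` cannot be expanded at all without fetching its definition from the wiki.

```python
def parse_template(call: str):
    """Split a flat call like {{name|a|b}} into (name, [args]). No nesting support."""
    if not (call.startswith("{{") and call.endswith("}}")):
        raise ValueError("not a template call")
    inner = call[2:-2]
    if inner.lstrip().startswith("#"):
        # Parser functions put a colon between the name and the first argument.
        name, _, rest = inner.partition(":")
        parts = rest.split("|")
    else:
        name, *parts = inner.split("|")
    return name.strip(), [p.strip() for p in parts]

def expand(call: str) -> str:
    """Evaluate the one parser function this sketch knows about."""
    name, args = parse_template(call)
    if name == "#ifeq":
        # {{#ifeq: left | right | value if equal | value if not equal }}
        left, right, then, otherwise = (args + ["", "", "", ""])[:4]
        return then if left == right else otherwise
    # Anything else needs the template's own source text from the wiki.
    raise NotImplementedError(name)

print(parse_template("{{for|the card game|Contract bridge}}"))
print(expand("{{#ifeq: yes | yes | Hooray...! | Darn...! }}"))
```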

Wikitext Parsers


MediaWiki parser (PHP)
bliki parser (Java)
Parsoid (JavaScript Node.js/C++)
and many others...

We used Sweble (Java) with Scala.

Alternatively, you can set up a Wikipedia mirror and parse the HTML (slow!).

Structured Data

Birth and death dates

Location coordinates

City/Country statistics (population)

Structured Data Output

Visualizations

But structured data is in WikiText


Can we use another data source?
Freebase maybe?

Freebase

Easily accessible HTTP API - Yes
Formatted JSON - Yes
Data Quality - No!


Argentina population (Wikipedia)


Argentina population (Freebase)

Argentina population (Google)

Our solution was parsing the wikitext infobox.

We used DBpedia mappings to help determine the right key-value pairs.
New York City
{{Infobox settlement
|population_total                = 8,175,133
                                    
New York
{{Infobox U.S. state
|2000Pop         = 19,465,197 (2011 est)<ref name=PopEstUS/>
                                        
Germany
{{Infobox country
|population_estimate     = 81,799,600<ref name=population />
|population_estimate_year = 2010
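
The three excerpts above use a different key for the same concept, which is the problem a mapping table addresses. Below is a minimal sketch of the idea: `POPULATION_KEYS` is a hand-written, hypothetical stand-in for the real DBpedia mappings, and the line-based parser only copes with one `|key = value` pair per line.

```python
import re

# Hypothetical mapping from infobox type to the field that holds the population.
POPULATION_KEYS = {
    "Infobox settlement": "population_total",
    "Infobox U.S. state": "2000Pop",
    "Infobox country": "population_estimate",
}

FIELD = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*)$")
# Self-closing <ref .../> tags or paired <ref>...</ref> footnotes.
REF = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL)

def parse_infobox(wikitext):
    """Return (infobox name, {key: raw value}) for a one-field-per-line infobox."""
    lines = wikitext.strip().splitlines()
    name = lines[0].lstrip("{").strip()
    fields = {}
    for line in lines[1:]:
        m = FIELD.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return name, fields

def population(wikitext):
    """Look up the mapped key and return the first number in it as an int, or None."""
    name, fields = parse_infobox(wikitext)
    key = POPULATION_KEYS.get(name)
    raw = REF.sub("", fields.get(key, "")) if key else ""
    m = re.search(r"\d[\d,]*", raw)
    return int(m.group(0).replace(",", "")) if m else None

nyc = "{{Infobox settlement\n|population_total                = 8,175,133"
state = "{{Infobox U.S. state\n|2000Pop         = 19,465,197 (2011 est)<ref name=PopEstUS/>"
germany = "{{Infobox country\n|population_estimate     = 81,799,600<ref name=population />"
print(population(nyc), population(state), population(germany))
```

Note the edge case in the state infobox: the key is named `2000Pop` but the value is a 2011 estimate, so the key name alone cannot be trusted for the year.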

Images

Images found on Wikipedia mostly originate from Wikimedia Commons.

You will need to download the data dump for the image caption/metadata.


Beware Large Images


30,000 x 27,904 pixels; file size 225 MB
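
A quick back-of-the-envelope check shows why the image above is dangerous to decode blindly: the 225 MB file is compressed, while the decoded bitmap is far larger. The 4 bytes per pixel (RGBA) figure and the `MAX_PIXELS` budget below are assumptions chosen for illustration.

```python
MAX_PIXELS = 50_000_000  # hypothetical per-image budget

def decoded_size_bytes(width, height, bytes_per_pixel=4):
    """Approximate in-memory size of the fully decoded bitmap."""
    return width * height * bytes_per_pixel

def should_skip(width, height, max_pixels=MAX_PIXELS):
    """Decide from metadata alone, before downloading or decoding the file."""
    return width * height > max_pixels

w, h = 30_000, 27_904
print(w * h)                                       # 837120000 pixels
print(round(decoded_size_bytes(w, h) / 2**30, 2))  # about 3.12 GiB at 4 bytes/pixel
print(should_skip(w, h))                           # True
```

Checking dimensions from the image metadata first lets a pipeline skip or downscale such files instead of running out of memory mid-job.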

Parallel Processing

Our Sweble parsing was easily integrated into a Hadoop job.

Processed using Amazon EMR with spot instances.

Multiple passes through the dataset. 
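
As a rough illustration of what one pass might look like, here is a map/reduce pair written as plain Python functions and run locally. This is not the Sweble/Hadoop job described above: the function names are hypothetical, the "Boston" page is made-up sample input, and the regex stands in for a real parser. Counting infobox types like this could be a plausible first pass before deciding which infoboxes deserve a mapping.

```python
import re
from collections import defaultdict

INFOBOX = re.compile(r"\{\{\s*(Infobox[^|\n}]*)")

def map_page(title, wikitext):
    """Mapper: emit (infobox name, 1) for each infobox found in one page."""
    for m in INFOBOX.finditer(wikitext):
        yield m.group(1).strip(), 1

def reduce_counts(pairs):
    """Reducer: sum the counts per infobox name."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Tiny in-memory stand-in for the dump; the third page is invented sample data.
pages = [
    ("New York City", "{{Infobox settlement\n|population_total = 8,175,133\n}}"),
    ("Germany", "{{Infobox country\n|population_estimate = 81,799,600\n}}"),
    ("Boston", "{{Infobox settlement\n|name = Boston\n}}"),
]
pairs = [kv for title, text in pages for kv in map_page(title, text)]
counts = reduce_counts(pairs)
print(counts)
```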

Takeaway

Know what you want.

Consider the edge cases.

Use available open tools. 
(and customize them when necessary)

Questions about mining Wikipedia?


Contact info
@tommychheng
http://tommy.chheng.com/2013/02/12/data-mining-wikipedia-notes/