Qwiki
V1.0 Product - Transform Wikipedia Articles into
"Interactive Videos"
What data do we need?
- Article Text
- Structured Data
- Media: Images/Videos
BEFORE CONTINUING...
Always remember your goal!
It's very easy to get lost in the data
Article Text
HTML format
...
WikiText
The '''Boston Red Sox''' are a [[professional baseball|professional baseball team]] based in [[Boston]], [[Massachusetts]], and a member of [[Major League Baseball]]'s [[American League East|American League Eastern Division]]. Founded in {{by|1901}} as one of the American League's eight charter franchises,
AVOID parsing wikitext if you can...
If you just need the abstract text:
<doc>
<title>Wikipedia: Peter Duchan</title>
<url>http://en.wikipedia.org/wiki/Peter_Duchan</url>
<abstract>Peter Duchan is an American script writer. He was a writer of the 2009 film Breaking Upwards.</abstract>
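A minimal sketch of pulling title/abstract pairs out of the abstracts dump with the JDK's streaming XML reader, so the whole file never sits in memory. The file name and the tab-separated output are just example assumptions, not part of the slides:

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object AbstractDump {
  def main(args: Array[String]): Unit = {
    // Stream the abstracts dump (e.g. enwiki-latest-abstract.xml) instead of loading it whole.
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream("enwiki-latest-abstract.xml"))

    var tag = ""
    var title, abstractText = ""
    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          tag = reader.getLocalName
        case XMLStreamConstants.CHARACTERS =>
          if (tag == "title") title += reader.getText
          if (tag == "abstract") abstractText += reader.getText
        case XMLStreamConstants.END_ELEMENT =>
          if (reader.getLocalName == "doc") {
            // One <doc> finished: emit its title and abstract.
            println(s"$title\t$abstractText")
            title = ""; abstractText = ""
          }
          tag = ""
        case _ =>
      }
    }
    reader.close()
  }
}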
Template Expansion is difficult
Templates can be custom functions that embed or evaluate content.
{{for|the card game|Contract bridge}}
{{#ifeq: yes | yes | Hooray...! | Darn...! }}
Wikitext Parsers
MediaWiki parser (PHP)
We used Sweble (Java) with Scala (see the sketch below).
Alternatively, you can set up a Wikipedia mirror and parse the HTML. (Slow!)
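A rough sketch of driving Sweble from Scala, following Sweble's published example code. The class names (DefaultConfigEnWp, WtEngineImpl, PageId) are from Sweble 2.x and may differ in other versions, so treat this as an outline rather than the exact code behind the talk:

import org.sweble.wikitext.engine.{PageId, PageTitle, WtEngineImpl}
import org.sweble.wikitext.engine.utils.DefaultConfigEnWp

object ParseArticle {
  // Parse one article's wikitext into Sweble's AST; callers then walk the AST
  // (e.g. with a visitor) to pull out plain text, links, or templates.
  def parse(title: String, wikitext: String) = {
    val config = DefaultConfigEnWp.generate()  // stock English-Wikipedia configuration
    val engine = new WtEngineImpl(config)
    val pageId = new PageId(PageTitle.make(config, title), -1)
    engine.postprocess(pageId, wikitext, null) // returns the processed page (AST root)
  }
}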
Structured Data
Birth and death dates
Location coordinates
City/Country statistics (population)
Structured Data Output
Visualizations
But structured data is in WikiText.
Can we use another data source?
Freebase, maybe?
Freebase
Easily accessible HTTP API - Yes
Formatted JSON - Yes
Data Quality - No!
Argentina population (Wikipedia)
Argentina population (Freebase)
Argentina population (Google)
Our solution was parsing the wikitext Infobox (see the sketch after these examples).
New York City
{{Infobox settlement
|population_total = 8,175,133
New York
{{Infobox U.S. state
|2000Pop = 19,465,197 (2011 est)<ref name=PopEstUS/>
Germany
{{Infobox country
|population_estimate = 81,799,600<ref name=population />
|population_estimate_year = 2010
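Since the population field name differs per infobox type (population_total, 2000Pop, population_estimate above), one workable approach is to split the infobox body into |key = value pairs and try the candidate keys in order. A rough sketch, assuming the infobox text has already been cut out of the article; the helper names are illustrative, not from the talk:

object Infobox {
  // Population fields differ per infobox type; try them in order of preference.
  val populationKeys = Seq("population_total", "population_estimate", "2000Pop")

  // Index an infobox body as "|key = value" pairs.
  def fields(infobox: String): Map[String, String] =
    infobox.split("\n").iterator
      .map(_.trim)
      .filter(_.startsWith("|"))
      .flatMap { line =>
        line.drop(1).split("=", 2) match {
          case Array(k, v) => Some(k.trim -> v.trim)
          case _           => None
        }
      }
      .toMap

  // First matching population value, e.g. "19,465,197 (2011 est)<ref .../>" -> 19465197
  def population(infobox: String): Option[Long] = {
    val fs = fields(infobox)
    populationKeys
      .flatMap(fs.get)                               // values for known keys, in preference order
      .flatMap(v => "[0-9][0-9,]*".r.findFirstIn(v)) // leading numeric part, drops refs and notes
      .headOption
      .map(_.replace(",", "").toLong)
  }
}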
Images
Images found on Wikipedia mostly originate from Wikimedia Commons.
You will need to download the data dump for the image caption/metadata.
Beware Large Images
30,000 x 27,904 pixels; file size 225 MB
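One way to guard against such files is to read only the image header for dimensions before decoding any pixel data. A minimal sketch using javax.imageio; the 50-megapixel cap is an arbitrary example, not a value from the talk:

import java.io.File
import javax.imageio.ImageIO

object ImageGuard {
  val maxPixels = 50L * 1000 * 1000 // arbitrary cap: ~50 megapixels

  // Read only the image header (not the pixel data) to get width and height.
  def dimensions(file: File): Option[(Int, Int)] = {
    val input = ImageIO.createImageInputStream(file)
    try {
      val readers = ImageIO.getImageReaders(input)
      if (readers.hasNext) {
        val reader = readers.next()
        reader.setInput(input)
        try Some((reader.getWidth(0), reader.getHeight(0)))
        finally reader.dispose()
      } else None
    } finally input.close()
  }

  def safeToProcess(file: File): Boolean =
    dimensions(file).exists { case (w, h) => w.toLong * h <= maxPixels }
}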
Parallel Processing
Our Sweble parsing was easily integrated into a Hadoop job (sketch below).
Processed using Amazon EMR with spot instances.
Multiple passes through the dataset.
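A rough sketch of what such a map task can look like: each record carries one article, and the mapper emits whatever fields it extracts. The tab-separated input format and the emitted value are placeholder assumptions, not the actual Qwiki job:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Assumes an input format that delivers one article per record as "title<TAB>wikitext".
class ArticleMapper extends Mapper[LongWritable, Text, Text, Text] {
  type Ctx = Mapper[LongWritable, Text, Text, Text]#Context

  override def map(key: LongWritable, value: Text, context: Ctx): Unit =
    value.toString.split("\t", 2) match {
      case Array(title, wikitext) =>
        // Stand-in extraction: emit the article length. In a real job this is
        // where the wikitext parsing (e.g. the infobox sketch above) would run.
        context.write(new Text(title), new Text(wikitext.length.toString))
      case _ => // malformed record: skip
    }
}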
Takeaway
Know what you want.
Consider the edge cases.
Use available open tools.
(and customize them when necessary)
Questions about mining Wikipedia?
contact info