The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. They are the same datasets used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
- Wikipedia English-language Article Corpus (`wikipedia_corpus`; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.
- Wikipedia Pagelink Graph (`wikipedia_pagelinks`): every page-to-page hyperlink in Wikipedia.
- Wikipedia Pageview Stats (`wikipedia_pageviews`; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageviews for all of Wikipedia.
- ASA SC/SG Data Expo Airline Flights (`airline_flights`; 12 GB, 120 million records): every US airline flight from 1987 to 2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.
- NCDC Hourly Global Weather Measurements, 1929-2009 (`ncdc_weather_hourly`; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.
- 1998 World Cup access logs (`access_logs/ita_world_cup_apachelogs`; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format (a parsing sketch follows this list).
- 60,000 UFO Sightings
- Retrosheet Game Logs: every recorded baseball game back to the 1890s.
- Gutenberg Corpus
- Trendingtopics.org: a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
- Twilio/Wigle.net Street Vector Data Set (geo): database of mapped US street names and address ranges.
- 2008 TIGER/Line Shapefiles (125 GB; geo): a complete set of Census 2000 and current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
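As a taste of what working with these files looks like, here is a minimal sketch of parsing the World Cup access logs. It assumes the standard Apache common log format; the regular expression, field names, and the sample filename are illustrative assumptions, not part of the dataset's documentation.

```python
import re

# Apache Common Log Format, e.g.:
# 123.45.67.89 - - [30/Apr/1998:21:30:17 +0000] "GET /images/102325.gif HTTP/1.0" 200 1555
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't parse."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec['size'] = 0 if rec['size'] == '-' else int(rec['size'])
    return rec

# Hypothetical sample file pulled from access_logs/ita_world_cup_apachelogs
with open('ita_world_cup_sample.log') as f:
    for line in f:
        rec = parse_line(line)
        if rec:
            print(rec['time'], rec['status'], rec['path'])
```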
This data set is from the [ASA Statistical Computing / Statistical Graphics](http://stat-computing.org/dataexpo/2009/the-data.html) section's 2009 contest, "Airline Flight Status — Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed.
The data comes originally from the DOT’s [Research and Innovative Technology Administration (RITA)](http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp) group, where it is [described in detail](http://www.transtats.bts.gov/Fields.asp?Table_ID=236). You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
Here are a few ideas to get you started exploring the data:
- When is the best time of day/day of week/time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
Accompanying reference data comes from several sources:

- Openflights.org (ODbL-licensed): user-generated datasets on the world of air flight.
- Dataexpo (public domain): the core airline flights database, which includes
  - `dataexpo_airports.tsv` (original): info on about 3400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.
  - `dataexpo_airplanes.tsv` (original): info on about 5030 US commercial airplanes, by tail number.
  - `dataexpo_airlines.tsv` (original): info on about 1500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.
- Wikipedia.org (CC-BY-SA license): airport identifiers.
The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file `wikipedia_conflicting.tsv` shows where our patience wore out.
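If you want to see how such conflicts surface, a minimal sketch along these lines works; the `openflights_airports.tsv` filename, the `iata` join column, and the `name` columns are assumptions about the schema, not documented fields.

```python
import pandas as pd

# Hypothetical: compare airport names from two sources on a shared IATA code column.
dataexpo = pd.read_csv('dataexpo_airports.tsv', sep='\t')
openflights = pd.read_csv('openflights_airports.tsv', sep='\t')

merged = dataexpo.merge(openflights, on='iata', suffixes=('_dataexpo', '_openflights'))
conflicts = merged[merged['name_dataexpo'] != merged['name_openflights']]
print(len(conflicts), 'airports with conflicting names')
```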
| Size | Dataset | Location |
| --- | --- | --- |
| 25 | Retrosheet: MLB play-by-play, high detail, 1840-2011 | `ripd/www.retrosheet.org-2007/boxesetc/2006` |
| 25 | Retrosheet: MLB box scores, 1871-2011 | `ripd/www.retrosheet.org-2007/boxesetc/2006` |
The main Gutenberg collection is about 650 GB (as of October 2011), with nearly 2 million files, 60 languages, and dozens of different file formats.
- Gutenberg collection catalog as RDF
- rsync, repeatable: Mirroring How-To
- wget, zip files: Gutenberg corpus download instructions. From a helpful Stack Overflow thread, you can run
  `wget -r -w 2 -m 'http://www.gutenberg.org/robot/harvest?filetypes=txt&langs[]=en'`
- see also ZIP file input format
```sh
# Mirror part 2 of the Gutenberg collection, skipping media and derived formats.
cd /mnt/gutenberg/
mkdir -p logs/gutenberg
nohup rsync -aviHS --max-size=10M --delete --delete-after \
  --exclude=\*.{m4a,m4b,ogg,spx,tei,md5,db,mus,mid,rst,sib,xml,zip,pdf,jpg,jpeg,png,htm,html,mp3,rtf,wav,mov,mp4,avi,mpg,mpeg,doc,docx,xml,tex,ly,eps,iso,rar} \
  --exclude={\*-h*,old,images} --exclude '*cache/generated' \
  {ftp.ibiblio.org::gutenberg,/mnt/gutenberg/ftp.ibiblio.org}/2 \
  >> /mnt/gutenberg/logs/gutenberg/rsync-gutenberg-`datename`-part_2.log 2>&1 &
```
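Once you've pulled down a batch of the zipped plain-text books (via the wget harvest above, say), reading them is straightforward. Here is a minimal sketch, assuming a directory of `.zip` files each containing text files; the `gutenberg_harvest` directory name is hypothetical, and layouts and encodings vary across the corpus.

```python
import glob
import zipfile

def gutenberg_texts(dirname):
    """Yield (filename, text) for each .txt file in each zip under dirname."""
    for zpath in sorted(glob.glob(dirname + '/*.zip')):
        with zipfile.ZipFile(zpath) as zf:
            for name in zf.namelist():
                if name.endswith('.txt'):
                    # Encodings vary across the corpus; latin-1 never raises.
                    yield name, zf.read(name).decode('latin-1')

for name, text in gutenberg_texts('gutenberg_harvest'):  # hypothetical directory
    print(name, len(text.split()), 'tokens')
```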