The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. They are the same datasets used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
- Wikipedia English-language Article Corpus (`wikipedia_corpus`; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.
- Wikipedia Pagelink Graph (`wikipedia_pagelinks`): every page-to-page hyperlink in Wikipedia.
- Wikipedia Pageview Stats (`wikipedia_pageviews`; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageviews for all of Wikipedia.
- ASA SC/SG Data Expo Airline Flights (`airline_flights`; 12 GB, 120 million records): every US airline flight from 1987 to 2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.
- NCDC Hourly Global Weather Measurements, 1929-2009 (`ncdc_weather_hourly`; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.
- 1998 World Cup access logs (`access_logs/ita_world_cup_apachelogs`; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format (a parsing sketch follows this list).
- 60,000 UFO Sightings
- Retrosheet Game Logs: every recorded baseball game back to the 1890s.
- Gutenberg Corpus
- Trendingtopics.org: a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
- Twilio/Wigle.net Street Vector Data Set (geo): database of mapped US street names and address ranges.
- 2008 TIGER/Line Shapefiles (125 GB; geo): a complete set of Census 2000 and current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
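As a taste of what working with these files looks like, here is a minimal sketch of parsing the World Cup access logs. It assumes the standard Apache common log format; the regular expression, field names, and the sample filename are illustrative assumptions, not part of the dataset's documentation.

```python
import re

# Apache Common Log Format, e.g.:
# 123.45.67.89 - - [30/Apr/1998:21:30:17 +0000] "GET /images/102325.gif HTTP/1.0" 200 1555
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't parse."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec['size'] = 0 if rec['size'] == '-' else int(rec['size'])
    return rec

# Hypothetical sample file pulled from access_logs/ita_world_cup_apachelogs
with open('ita_world_cup_sample.log') as f:
    for line in f:
        rec = parse_line(line)
        if rec:
            print(rec['time'], rec['status'], rec['path'])
```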
This data set is from the [ASA Statistical Computing / Statistical Graphics](http://stat-computing.org/dataexpo/2009/the-data.html) section's 2009 contest, "Airline Flight Status — Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed.
The data comes originally from the DOT’s [Research and Innovative Technology Administration (RITA)](http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp) group, where it is [described in detail](http://www.transtats.bts.gov/Fields.asp?Table_ID=236). You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
Here are a few ideas to get you started exploring the data:
- When is the best time of day/day of week/time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
Accompanying reference data comes from several sources:

- Openflights.org (ODbL-licensed): user-generated datasets on the world of air flight.
- Dataexpo (public domain): the core airline flights database, which includes
  - `dataexpo_airports.tsv` (original): info on about 3400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.
  - `dataexpo_airplanes.tsv` (original): info on about 5030 US commercial airplanes, by tail number.
  - `dataexpo_airlines.tsv` (original): info on about 1500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.
- Wikipedia.org (CC-BY-SA license): airport identifiers.
The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file `wikipedia_conflicting.tsv` shows where our patience wore out.
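If you want to see how such conflicts surface, a minimal sketch along these lines works; the `openflights_airports.tsv` filename, the `iata` join column, and the `name` columns are assumptions about the schema, not documented fields.

```python
import pandas as pd

# Hypothetical: compare airport names from two sources on a shared IATA code column.
dataexpo = pd.read_csv('dataexpo_airports.tsv', sep='\t')
openflights = pd.read_csv('openflights_airports.tsv', sep='\t')

merged = dataexpo.merge(openflights, on='iata', suffixes=('_dataexpo', '_openflights'))
conflicts = merged[merged['name_dataexpo'] != merged['name_openflights']]
print(len(conflicts), 'airports with conflicting names')
```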
| Size | Dataset | Location |
| --- | --- | --- |
| 25 | Retrosheet: MLB play-by-play, high detail, 1840-2011 | `ripd/www.retrosheet.org-2007/boxesetc/2006` |
| 25 | Retrosheet: MLB box scores, 1871-2011 | `ripd/www.retrosheet.org-2007/boxesetc/2006` |
The main Gutenberg collection is about 650 GB (as of October 2011), with nearly 2 million files, 60 languages, and dozens of different file formats.
- Gutenberg collection catalog as RDF
- rsync, repeatable: Mirroring How-To
- wget, zip files: Gutenberg corpus download instructions. From a helpful Stack Overflow thread, you can run
  `wget -r -w 2 -m 'http://www.gutenberg.org/robot/harvest?filetypes=txt&langs[]=en'`
- see also ZIP file input format
```sh
# Mirror part 2 of the Gutenberg collection, skipping media and derived formats.
cd /mnt/gutenberg/
mkdir -p logs/gutenberg
nohup rsync -aviHS --max-size=10M --delete --delete-after \
  --exclude=\*.{m4a,m4b,ogg,spx,tei,md5,db,mus,mid,rst,sib,xml,zip,pdf,jpg,jpeg,png,htm,html,mp3,rtf,wav,mov,mp4,avi,mpg,mpeg,doc,docx,xml,tex,ly,eps,iso,rar} \
  --exclude={\*-h*,old,images} --exclude '*cache/generated' \
  {ftp.ibiblio.org::gutenberg,/mnt/gutenberg/ftp.ibiblio.org}/2 \
  >> /mnt/gutenberg/logs/gutenberg/rsync-gutenberg-`datename`-part_2.log 2>&1 &
```
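Once you've pulled down a batch of the zipped plain-text books (via the wget harvest above, say), reading them is straightforward. Here is a minimal sketch, assuming a directory of `.zip` files each containing text files; the `gutenberg_harvest` directory name is hypothetical, and layouts and encodings vary across the corpus.

```python
import glob
import zipfile

def gutenberg_texts(dirname):
    """Yield (filename, text) for each .txt file in each zip under dirname."""
    for zpath in sorted(glob.glob(dirname + '/*.zip')):
        with zipfile.ZipFile(zpath) as zf:
            for name in zf.namelist():
                if name.endswith('.txt'):
                    # Encodings vary across the corpus; latin-1 never raises.
                    yield name, zf.read(name).decode('latin-1')

for name, text in gutenberg_texts('gutenberg_harvest'):  # hypothetical directory
    print(name, len(text.split()), 'tokens')
```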