== Other Datasets on the Web

    approx size (MB)	  Mrecs	Dataset                                             	Source / URL
           huge		US Patent Data from Google                          	www.google.com/googlebooks/uspto-patents.html[Google Patent Collection]
           huge	      1	Mathematical constants to billions of digits        	www.numberworld.org/ftp
      2_300_000	 250000	Wikipedia Pageview Stats                           	dumps.wikimedia.org/other/pagecounts-raw
       470_000	      	Wikibench.eu Wikipedia Log traces                   	Wikibench.eu
       124_000	1300000	Access Logs, 1998 World Cup (Internet Traffic Archive) 	access_logs/ita/ita_world_cup
        40_000	 B	NCDC: Hourly Weather (full)                         	ftp.ncdc.noaa.gov/pub/data/noaa
        34_000	     10	MLB Gameday Pitch-by-pitch data, 2007-2011          	gd2.mlb.com/components/game/mlb
        16_000	    619	Wikipedia corpus and pagelinks                      	dumps.wikimedia.org/enwiki/20120601
        14_000	      	NCDC: Hourly weather (simplified)                   	ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite
        14_000	       	Memetracker                                         	snap.stanford.edu/data/bigdata/memetracker9
        14_000	      	Amazon Co-Purchasing Data                           	snap.stanford.edu/data/bigdata/amazon0312.html
        11_000	      	Crosswikis                                          	nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
         6_400	      	NCDC: Daily Weather                                 	ftp.ncdc.noaa.gov/pub/data/gsod
         6_300	      	Berkeley Earth Surface Temperature                  	stats/earth_surface_temperature
         2_900	      	Twilio TigerLINE US Street Map                      	geo/us_street_map/addresses
         1_900	      	All US Airline Flights 1987-2009 (ASA Data Expo)    	stat-computing.org/dataexpo/2009
         1_300	      	Geonames Points of Interest                         	geo/geonames/info
         1_300	      	Daily Prices for all US stocks, 1962-2011           	stats/stock_prices
         1_040	      	Patent data (see Google data too)                   	www.nber.org/~jbessen
           573	      	TAKS Exam Scores for all Texas students, 2007-2010  	ripd/texas_taks_exam
           571	      	Pi to 1 Billion decimal places                      	ja0hxv.calico.jp/value/pai/val01/pi
           419	      	Enron Email Corpus                                  	lang/corpora/enron_trial_coporate_email_corpus
           362	      	DBpedia Wikipedia Article Features                  	downloads.dbpedia.org/3.7/links
           331	      	DBpedia                                             	spotlight.dbpedia.org/datasets
           310	       	Grouplens: User-Movie affinity                      	graph/grouplens_movies
           223	 	Geonames Postal Codes                               	geo/geonames/postal_codes
           121	 	Book Crossing: User-Book affinity                   	graph/book_crossing
           111		Maxmind GeoLite (IP-Geo) data                       	ripd/geolite.maxmind.com/download
            91	 	Access Logs: waxy.org's Star Wars Kid logs          	access_logs/star_wars_kid
            62	 	Metafilter corpus of postings with metadata         	ripd/stuff.metafilter.com/infodump
            47	 	Word frequencies from the British National Corpus   	ucrel.lancs.ac.uk/bncfreq/lists
            36	 	Mobywords thesaurus                                 	lang/corpora/thesaurus_mobywords
            25	 	Retrosheet: MLB play-by-play, high detail, 1840-2011	ripd/www.retrosheet.org-2007/boxesetc/2006
            25	 	Retrosheet: MLB box scores, 1871-2011               	ripd/www.retrosheet.org-2007/boxesetc/2006
            20	 	US Federal Reserve Bank Loans (Bloomberg)           	misc/bank_loans_by_fed
            11	 	Scrabble dictionaries                               	lang/corpora/scrabble
            11	 	All Scrabble tile combinations with rack value      	misc/words_quackle
          1000	 	Marvel Universe Social Graph
             . 		Materials Safety Datasheets
             .	 	UFO Sightings (UFORC)                               	geo/ufo_sightings
             . 		Crunchbase
             . 		Natural Earth detailed geographic boundaries
             . 		US Census 2009 ACS (Long-form census)
             .		US Census Geographic boundaries
             .		Zillow US Neighborhood Boundaries
             . 		Open Street Map
    2_000_000		Google Books N-Grams                                	aws.amazon.com/datasets/8172056142375670
   60_000_000		Common Crawl Web Corpus
      600_000		Apache Software Foundation Public Mail Archives 	aws.amazon.com/datasets/7791434387204566
      300_000		Million-Song dataset                             	labrosa.ee.columbia.edu/millionsong
             .		Reference Energy Disaggregation Dataset (REDD)      	redd.csail.mit.edu/
             .   	US Legislation Co-Sponsorship                        	jhfowler.ucsd.edu/cosponsorship.htm
              .   	VoteView: Political Spectrum Rank of US Legislators/Laws	voteview.org/downloads.asp                       	DW-NOMINATE Rank Orderings for all Houses and Senates
             .   	World Bank                                           	data.worldbank.org
             .      	Record of American Democracy                         	road.hmdc.harvard.edu/pages/road-documentation     	The Record Of American Democracy (ROAD) data includes election returns, socioeconomic summaries, and demographic measures of the American public at unusually low levels of geographic aggregation. The NSF-supported ROAD project covers every state in the country from 1984 through 1990 (including some off-year elections). One collection of data sets includes every election at and above State House, along with party registration and other variables, in each state for the roughly 170,000 precincts nationwide (about 60 times the number of counties). Another collection has added to these (roughly 30-40) political variables an additional 3,725 variables merged from the 1990 U.S. Census for 47,327 aggregate units (about 15 times the number of counties) about the size one or more cities or towns. These units completely tile the U.S. landmass. The collection also includes geographic boundary files so users can easily draw maps with these data.
             .		Human Mortality DB    	                             	www.mortality.org/                                  	The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity. The project began as an outgrowth of earlier projects in the Department of Demography at the University of California, Berkeley, USA, and at the Max Planck Institute for Demographic Research in Rostock, Germany (see history). It is the work of two teams of researchers in the USA and Germany (see research teams), with the help of financial backers and scientific collaborators from around the world (see acknowledgements).
             .		FCC Antenna locations                                	transition.fcc.gov/mb/databases/cdbs
             .		Pew Research Datasets                                	pewinternet.org/Static-Pages/Data-Tools/Download-Data/Data-Sets.aspx
             .		Youtube Related Videos                                	netsg.cs.sfu.ca/youtubedata
              .		Westbury Usenet Archive (2005-2010)                  	www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html 	This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2011, and covers 47860 English language, non-binary-file news groups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.
             .		Wikipedia Page Traffic Statistics                	aws.amazon.com/datasets/2596              	snap-753dfc1c
             .   	Wikipedia Traffic Statistics V2                 	aws.amazon.com/datasets/4182            	snap-0c155c67
             .   	Wikipedia Page Traffic Statistic V3                	aws.amazon.com/datasets/6025882142118545	snap-f57dec9a
             .   	Marvel Universe Social Graph                      	aws.amazon.com/datasets/5621954952932508	snap-7766d116
       10_000      	Daily Global Weather, 1929-2009                   	aws.amazon.com/datasets/2759             	snap-ac47f4c5
      220_000		Twilio/Wigle.net Street Vector Data Set         	aws.amazon.com/datasets/2408             	snap-5eaf5537	MySQL	geo	A complete database of US street names and address ranges mapped to zip codes and latitude/longitude ranges, with DTMF key mappings for all street names.
              .		US Economic Data 2003-2006                      	aws.amazon.com/datasets/2341             	snap-0bdf3f62		stats	US Economic Data for 2003-2006 from the US Census Bureau -- raw census data (ACS2002-2006)
	     .		Github Archive                                  	githubarchive.org
	     		2012 Election Results, by County			https://docs.google.com/spreadsheet/lv?key=0AjYj9mXElO_QdHpla01oWE1jOFZRbnhJZkZpVFNKeVE&toomany=true#gid=19


* yahoo stocks
* mathematical_constants
* ACS 2009
* zillow_neighborhoods
* marvel comics

* time
  - timezone
  - calendars
  - lunar_eclipses
* historical currency
* sports


'''

==== Github Archive ====

https://github.com/igrigorik/githubarchive.org

http://www.githubarchive.org

Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
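
As a minimal access sketch (assuming the hourly `data.githubarchive.org/YYYY-MM-DD-H.json.gz` naming documented on githubarchive.org; the exact hour used here is only an example, and very old archive hours may be a single concatenated JSON stream rather than one event per line), here is one way to pull and count a single hour of events in Python:

[source,python]
----
# Hedged sketch: fetch one hour of the GitHub Archive timeline and count its events.
import gzip
import io
import json
import urllib.request

url = 'http://data.githubarchive.org/2012-04-11-15.json.gz'  # one hour of events (example)

buf = io.BytesIO(urllib.request.urlopen(url).read())
events = []
with gzip.GzipFile(fileobj=buf) as archive:
    for raw_line in archive:
        try:
            events.append(json.loads(raw_line))
        except json.JSONDecodeError:
            pass  # skip anything that is not a standalone JSON document per line

print(len(events), 'events in this hour')
----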


==== Wikibench.eu Wikipedia Log traces ====

* `logs/wikibench_logtraces` (470 GB)

==== Amazon Co-Purchasing Data ====

* http://snap.stanford.edu/data/amazon0312.html

==== Patents ====

* http://www.google.com/googlebooks/uspto-patents.html[Google Patent Collection]

====  Marvel Universe Social Graph ====

* 1 GB
* graph
* Social collaboration network of the Marvel comic book universe based on co-appearances.

==== Google Books Ngrams ====

* http://aws.amazon.com/datasets/8172056142375670[Google Books Ngrams]
* 2_000 GB
* graph, linguistics

==== Common Crawl web corpus ====

http://aws.amazon.com/datasets/41740

s3://aws-publicdatasets/common-crawl/crawl-002

A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 and formatted in the ARC (.arc) file format.

**Details**
* Size:  60 TB
* Source:        Common Crawl Foundation -- http://commoncrawl.org
* Created On:   February 15, 2012 2:23 AM GMT
* Last Updated: February 15, 2012 2:23 AM GMT
* Available at: s3://aws-publicdatasets/common-crawl/crawl-002/


Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.

The ARC (.arc) file format used by Common Crawl was developed by the Internet Archive to store their archived crawl data. It is essentially a multi-part gzip file, with each entry in the master gzip (ARC) file being an independent gzip stream in itself. You can use a tool like zcat to spill the contents of an ARC file to stdout. For more information see the Internet Archive's http://www.archive.org/web/researcher/ArcFileFormat.php[ARC File Format description].
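
Because each ARC entry is its own gzip member, any gzip reader that handles concatenated members can stream the file directly. A minimal sketch in Python (the filename is hypothetical; substitute any `.arc.gz` segment fetched from the S3 bucket above):

[source,python]
----
# Hedged sketch: peek at an ARC file the way `zcat file.arc.gz | head` would.
# Python's gzip module decompresses the multi-member gzip stream transparently.
import gzip

with gzip.open('crawl-002-segment.arc.gz', 'rb') as arc:   # hypothetical local copy
    for lineno, raw_line in enumerate(arc):
        print(raw_line.decode('utf-8', errors='replace').rstrip())
        if lineno >= 20:   # just peek at the first few record headers and bytes
            break
----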

Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing here in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to access the S3-resident data, end users avoid costly network transfer charges.


Common Crawl's Hadoop classes and other code can be found in its https://github.com/commoncrawl/commoncrawl[GitHub repository].

A tutorial on analyzing the Common Crawl dataset with Amazon Elastic MapReduce, http://www.commoncrawl.org/mapreduce-for-the-masses/[MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl], can be found on the Common Crawl blog.


==== Apache Software Foundation Public Mail Archives ====

* Original: http://aws.amazon.com/datasets/7791434387204566[Apache Software Foundation Public Mail Archives]
* 200 GB
* corpus
* A collection of all publicly available mail archives from the Apache Software Foundation (ASF)

==== Reference Energy Disaggregation Dataset (REDD) ====

http://redd.csail.mit.edu/[Reference Energy Disaggregation Data Set]

Initial REDD Release, Version 1.0

This is the home page for the REDD data set. Below you can download an initial version of the data set, containing several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. The data itself and the hardware used to collect it are described more thoroughly in the Readme below and in the paper:

J. Zico Kolter and Matthew J. Johnson. REDD: A public data set for energy disaggregation research. In Proceedings of the SustKDD Workshop on Data Mining Applications in Sustainability, 2011. [pdf]

Those wishing to use the dataset in academic work should cite this paper as the reference. Although the data set is freely available, for the time being we still ask those interested in downloading the data to email us ([email protected]) to receive the username/password to download the data. See the readme.txt file for a full description of the different downloads and their formats.

==== The Book-Crossing dataset ====

* http://www.informatik.uni-freiburg.de/~cziegler/BX/[Book Crossing] Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication): Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear. As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Book-Crossing dataset comprises 3 tables.

* BX-Users: Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.
* BX-Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.
* BX-Book-Ratings: Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
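
A minimal loading sketch with pandas follows. The file names, semicolon separator, and Latin-1 encoding are assumptions about how the download is packaged (that is how the dump has typically shipped); check the archive before relying on them.

[source,python]
----
# Hedged sketch: join ratings to book titles and split explicit from implicit ratings.
import pandas as pd

# on_bad_lines requires pandas >= 1.3; the raw files contain a few malformed rows.
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='latin-1',
                      on_bad_lines='skip')
books = pd.read_csv('BX-Books.csv', sep=';', encoding='latin-1',
                    on_bad_lines='skip')

# Ratings may arrive as quoted strings; coerce to numbers before comparing.
ratings['Book-Rating'] = pd.to_numeric(ratings['Book-Rating'], errors='coerce')

explicit = ratings[ratings['Book-Rating'] > 0]    # explicit 1-10 ratings
implicit = ratings[ratings['Book-Rating'] == 0]   # implicit interactions

# Mean explicit rating per ISBN, with titles attached for readability.
mean_ratings = (explicit.groupby('ISBN')['Book-Rating'].mean()
                        .reset_index()
                        .merge(books[['ISBN', 'Book-Title']], on='ISBN', how='left'))
print(mean_ratings.sort_values('Book-Rating', ascending=False).head())
print(len(explicit), 'explicit and', len(implicit), 'implicit ratings')
----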

==== Westbury Usenet Archive ====

* http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html[Westbury Usenet Archive] -- USENET corpus (2005-2010). This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2011, and covers 47860 English language, non-binary-file news groups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

==== Million Song Dataset ====

* http://labrosa.ee.columbia.edu/millionsong/[Million Song Dataset] (beta version)

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

* To encourage research on algorithms that scale to commercial sizes
* To provide a reference dataset for evaluating research
* As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
* To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

* SecondHandSongs dataset: cover songs
* musiXmatch dataset: lyrics
* Last.fm dataset: song-level tags and similarity
* Taste Profile subset: user data


**Fields**

From the http://labrosa.ee.columbia.edu/millionsong/pages/field-list[original documentation]:

----
Field name                      Type            Description                                     Link
analysis sample rate            float           sample rate of the audio used                   url
artist 7digitalid               int             ID from 7digital.com or -1                      url
artist familiarity              float           algorithmic estimation                          url
artist hotttnesss               float           algorithmic estimation                          url
artist id                       string          Echo Nest ID                                    url
artist latitude                 float           latitude
artist location                 string          location name
artist longitude                float           longitude
artist mbid                     string          ID from musicbrainz.org                         url
artist mbtags                   array string    tags from musicbrainz.org                       url
artist mbtags count             array int       tag counts for musicbrainz tags                 url
artist name                     string          artist name                                     url
artist playmeid                 int             ID from playme.com, or -1                       url
artist terms                    array string    Echo Nest tags                                  url
artist terms freq               array float     Echo Nest tags freqs                            url
artist terms weight             array float     Echo Nest tags weight                           url
audio md5                       string          audio hash code
bars confidence                 array float     confidence measure                              url
bars start                      array float     beginning of bars, usually on a beat            url
beats confidence                array float     confidence measure                              url
beats start                     array float     result of beat tracking                         url
danceability                    float           algorithmic estimation
duration                        float           in seconds
end of fade in                  float           seconds at the beginning of the song            url
energy                          float           energy from listener point of view
key                             int             key the song is in                              url
key confidence                  float           confidence measure                              url
loudness                        float           overall loudness in dB                          url
mode                            int             major or minor                                  url
mode confidence                 float           confidence measure                              url
release                         string          album name
release 7digitalid              int             ID from 7digital.com or -1                      url
sections confidence             array float     confidence measure                              url
sections start                  array float     largest grouping in a song, e.g. verse          url
segments confidence             array float     confidence measure                              url
segments loudness max           array float     max dB value                                    url
segments loudness max time      array float     time of max dB value, i.e. end of attack        url
segments loudness start         array float     dB value at onset                               url
segments pitches                2D array float  chroma feature, one value per note              url
segments start                  array float     musical events, ~ note onsets                   url
segments timbre                 2D array float  texture features (MFCC+PCA-like)                url
similar artists                 array string    Echo Nest artist IDs (sim. algo. unpublished)   url
song hotttnesss                 float           algorithmic estimation
song id                         string          Echo Nest song ID
start of fade out               float           time in sec                                     url
tatums confidence               array float     confidence measure                              url
tatums start                    array float     smallest rhythmic element                       url
tempo                           float           estimated tempo in BPM                          url
time signature                  int             estimate of number of beats per bar, e.g. 4     url
time signature confidence       float           confidence measure                              url
title                           string          song title
track id                        string          Echo Nest track ID
track 7digitalid                int             ID from 7digital.com or -1                      url
year                            int             song release year from MusicBrainz or 0         url
----


An http://labrosa.ee.columbia.edu/millionsong/pages/example-track-description[Example Track Description]

Below is a list of all the fields associated with each track in the database. This is simply an annotated version of the output of the example code display_song.py. For the fields that include a large amount of numerical data, we indicate only the shape of the data array. Since most of these fields are taken directly from the Echo Nest Analyze API, more details can be found at the Echo Nest Analyze API documentation.

A more technically-oriented list of these fields is given on the field list page.

This example data is shown for the track whose track_id is TRAXLZU12903D05F94 - namely, "Never Gonna Give You Up" by Rick Astley.

    artist_mbid:                    db92a151-1ac2-438b-bc43-b82e149ddd50            the musicbrainz.org ID for this artists is db9...
    artist_mbtags:                  shape = (4,)                                    this artist received 4 tags on musicbrainz.org
    artist_mbtags_count:            shape = (4,)                                    raw tag count of the 4 tags this artist received on musicbrainz.org
    artist_name:                    Rick Astley                                     artist name
    artist_playmeid:                1338                                            the ID of that artist on the service playme.com
    artist_terms:                   shape = (12,)                                   this artist has 12 terms (tags) from The Echo Nest
    artist_terms_freq:              shape = (12,)                                   frequency of the 12 terms from The Echo Nest (number between 0 and 1)
    artist_terms_weight:            shape = (12,)                                   weight of the 12 terms from The Echo Nest (number between 0 and 1)
    audio_md5:                      bf53f8113508a466cd2d3fda18b06368                hash code of the audio used for the analysis by The Echo Nest
    bars_confidence:                shape = (99,)                                   confidence value (between 0 and 1) associated with each bar by The Echo Nest
    bars_start:                     shape = (99,)                                   start time of each bar according to The Echo Nest, this song has 99 bars
    beats_confidence:               shape = (397,)                                  confidence value (between 0 and 1) associated with each beat by The Echo Nest
    beats_start:                    shape = (397,)                                  start time of each beat according to The Echo Nest, this song has 397 beats
    danceability:                   0.0                                             danceability measure of this song according to The Echo Nest (between 0 and 1, 0 := not analyzed)
    duration:                       211.69587                                       duration of the track in seconds
    end_of_fade_in:                 0.139                                           time of the end of the fade in, at the beginning of the song, according to The Echo Nest
    energy:                         0.0                                             energy measure (not in the signal processing sense) according to The Echo Nest (between 0 and 1, 0 := not analyzed)
    key:                            1                                               estimation of the key the song is in by The Echo Nest
    key_confidence:                 0.324                                           confidence of the key estimation
    loudness:                       -7.75                                           general loudness of the track
    mode:                           1                                               estimation of the mode the song is in by The Echo Nest
    mode_confidence:                0.434                                           confidence of the mode estimation
    release:                        Big Tunes - Back 2 The 80s                      album name from which the track was taken, some songs / tracks can come from many albums, we give only one
    release_7digitalid:             786795                                          the ID of the release (album) on the service 7digital.com
    sections_confidence:            shape = (10,)                                   confidence value (between 0 and 1) associated with each section by The Echo Nest
    sections_start:                 shape = (10,)                                   start time of each section according to The Echo Nest, this song has 10 sections
    segments_confidence:            shape = (935,)                                  confidence value (between 0 and 1) associated with each segment by The Echo Nest
    segments_loudness_max:          shape = (935,)                                  max loudness during each segment
    segments_loudness_max_time:     shape = (935,)                                  time of the max loudness during each segment
    segments_loudness_start:        shape = (935,)                                  loudness at the beginning of each segment
    segments_pitches:               shape = (935, 12)                               chroma features for each segment (normalized so max is 1.)
    segments_start:                 shape = (935,)                                  start time of each segment (~ musical event, or onset) according to The Echo Nest, this song has 935 segments
    segments_timbre:                shape = (935, 12)                               MFCC-like features for each segment
    similar_artists:                shape = (100,)                                  a list of 100 artists (their Echo Nest ID) similar to Rick Astley according to The Echo Nest
    song_hotttnesss:                0.864248830588                                  according to The Echo Nest, when downloaded (in December 2010), this song had a 'hotttnesss' of 0.8 (on a scale of 0 and 1)
    song_id:                        SOCWJDB12A58A776AF                              The Echo Nest song ID, note that a song can be associated with many tracks (with very slight audio differences)
    start_of_fade_out:              198.536                                         start time of the fade out, in seconds, at the end of the song, according to The Echo Nest
    tatums_confidence:              shape = (794,)                                  confidence value (between 0 and 1) associated with each tatum by The Echo Nest
    tatums_start:                   shape = (794,)                                  start time of each tatum according to The Echo Nest, this song has 794 tatums
    tempo:                          113.359                                         tempo in BPM according to The Echo Nest
    time_signature:                 4                                               time signature of the song according to The Echo Nest, i.e. usual number of beats per bar
    time_signature_confidence:      0.634                                           confidence of the time signature estimation
    title:                          Never Gonna Give You Up                         song title
    track_7digitalid:               8707738                                         the ID of this song on the service 7digital.com
    track_id:                       TRAXLZU12903D05F94                              The Echo Nest ID of this particular track on which the analysis was done
    year:                           1987                                            year when this song was released, according to musicbrainz.org
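
Each track ships as a single HDF5 file carrying the fields above. The sketch below reads a few of them with `h5py`; the group/field layout (compound `songs` tables under `metadata` and `analysis` groups, plus array datasets such as `segments_pitches`) is an assumption about the per-track files, so cross-check it against the dataset's own `display_song.py` / `hdf5_getters.py` helpers before relying on it.

[source,python]
----
# Hedged sketch: pull a few scalar fields and one array out of a per-track HDF5 file.
import h5py

with h5py.File('TRAXLZU12903D05F94.h5', 'r') as h5:   # hypothetical local filename
    meta = h5['metadata']['songs'][0]
    analysis = h5['analysis']['songs'][0]

    print('artist :', meta['artist_name'].decode())
    print('title  :', meta['title'].decode())
    print('tempo  :', analysis['tempo'], 'BPM')
    print('key    :', analysis['key'], '(confidence', analysis['key_confidence'], ')')

    pitches = h5['analysis']['segments_pitches'][:]    # shape (n_segments, 12)
    print('segments_pitches shape:', pitches.shape)
----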

==== Google / Stanford Crosswikis ====

http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2[Crosswikis data]

This data set accompanies

   Valentin I. Spitkovsky and Angel X. Chang. 2012.
   A Cross-Lingual Dictionary for English Wikipedia Concepts.
   In Proceedings of the Eighth International
     Conference on Language Resources and Evaluation (LREC 2012).

Please cite the appropriate publication if you use this data.  (See
  http://nlp.stanford.edu/publications.shtml for .bib entries.)


There are six line-based (and two other) text files, each of them
lexicographically sorted, encoded with UTF-8, and compressed using
bzip2 (-9).  One way to view the data without fully expanding it
first is with the bzcat command, e.g.,

  bzcat dictionary.bz2 | grep ... | less

Note that raw data were gathered from heterogeneous sources, at
different points in time, and are thus sometimes contradictory.
We made a best effort at reconciling the information, but likely
also introduced some bugs of our own, so be prepared to write
fault-tolerant code...  keep in mind that even tiny error rates
translate into millions of exceptions, over billions of datums.
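
A minimal sketch of doing that in Python: stream one of the bzip2-compressed files line by line (the programmatic analogue of the `bzcat` pipeline above), counting rather than crashing on the occasional malformed record. Nothing is assumed about the column layout beyond tab-separated UTF-8 text.

[source,python]
----
# Hedged sketch: stream dictionary.bz2 without expanding it, tolerating bad lines.
import bz2

good, bad = 0, 0
with bz2.open('dictionary.bz2', 'rb') as f:
    for raw_line in f:
        try:
            fields = raw_line.decode('utf-8').rstrip('\n').split('\t')
            good += 1
        except UnicodeDecodeError:
            bad += 1        # tiny error rates still add up over billions of lines

print(good, 'lines parsed,', bad, 'skipped')
----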


==== English Gigaword Dataset (LDC) ====

The http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13[English Gigaword] corpus, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. The fourth edition includes all of the contents of the English Gigaword Third Edition (LDC2007T07) plus new data covering the 24-month period of January 2007 through December 2008. Portions of the dataset are © 1994-2008 Agence France Presse, © 1994-2008 The Associated Press, © 1997-2008 Central News Agency (Taiwan), © 1994-1998, 2003-2008 Los Angeles Times-Washington Post News Service, Inc., © 1994-2008 New York Times, © 1995-2008 Xinhua News Agency, © 2009 Trustees of the University of Pennsylvania. The six distinct international sources of English newswire included in this edition are the following:

* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)

For an example of the data in this corpus, please review http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2009T13.html[this sample file].





=== Sources of Public and Commercial Data

((data_commons))

* Infochimps
* Factual
* CKAN
* Get.theinfo
* Microsoft Azure Data Marketplace