Appendix: DBpedia Datasets

The DBpedia project extracts various kinds of structured information from Wikipedia editions in 97 languages and combines this information into a huge, cross-domain knowledge base. [^fn1]

The DBpedia knowledge base currently describes more than

3.64 million total things — see, for example, the DBpedia entry for "Chimpanzee"
1.83 million of those things are classified in a consistent Ontology, including
- 416,000 persons,
- 526,000 places (including 360,000 populated places),
- 106,000 music albums,
- 60,000 films,
- 17,500 video games,
- 169,000 organizations (including 40,000 companies and 38,000 educational institutions),
- 183,000 species and 5,400 diseases.
Labels and abstracts for these 3.64 million things in up to 97 different languages
2,724,000 links to images
6,300,000 links to external web pages
6,200,000 external links into other RDF datasets
740,000 Wikipedia categories
2,900,000 YAGO categories.

The dataset consists of 1 billion pieces of information (RDF triples), of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. Note: only the English-language versions are used in the Infochimps collection.

^fn1 This documentation was extracted from the [DBpedia web site].

Datasets Used

DBpedia Core Datasets, version 3.7:

Content:

Titles — Titles of all Wikipedia Articles in the corresponding language
Extended Abstracts — Additional, extended English abstracts of Wikipedia Articles
Page IDs — Wikipedia’s Numeric Page IDs
Revision IDs — Wikipedia Revision IDs as of this DBpedia version

Extracted Properties:

These are hand-generated mappings of Wikipedia infoboxes/templates to the DBpedia ontology. The mappings compensate for weaknesses in the Wikipedia infobox system, such as using different infoboxes for the same type of thing (class) or using different property names for the same property. As a result, the instance data within the infobox ontology is much cleaner and better structured than the Infobox Dataset, but it currently doesn’t cover all infobox types and infobox properties within Wikipedia. There are three different Infobox Ontology data sets:

Ontology Infobox Types — the rdf:types of the instances which have been extracted from the infoboxes.
Ontology Infobox Properties (Strict) — The actual data values that have been extracted from infoboxes. The data values are represented using ontology properties (e.g., 'volume') that may be applied to different things (e.g., the volume of a lake and the volume of a planet). This restricts the number of different properties to a minimum, but has the drawback that it is not possible to automatically infer the class of an entity based on a property. For instance, an application that discovers an entity described using the volume property cannot infer that the entity is a lake and then, for example, use a map to visualize the entity. Properties follow the http://dbpedia.org/ontology/{propertyname} naming schema. All values are normalized to their respective SI unit.
Ontology Infobox Properties (Loose) — Properties which have been specialized for a specific class using a specific unit, e.g., the property height is specialized on the class Person using the unit centimetres instead of metres. Specialized properties follow the http://dbpedia.org/ontology/{Class}/{property} naming schema (e.g., http://dbpedia.org/ontology/Person/height). The properties have a single class as rdfs:domain and rdfs:range and can therefore be used for classification reasoning. This makes it easier to express queries against the data, e.g., finding all lakes whose volume is in a certain range. Typically, the range of these properties does not use SI units, but a unit that is more appropriate in the specific domain.
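
The two naming schemas above can be told apart mechanically. The following is a minimal sketch, not part of DBpedia's tooling: the function name `classify_property` and the labels "strict" and "loose" are illustrative choices borrowed from this appendix. It assumes the URI is already known to be a property, since ontology classes live under the same prefix.

```python
ONTOLOGY_PREFIX = "http://dbpedia.org/ontology/"


def classify_property(uri):
    """Split an ontology property URI into (kind, class_name, property_name).

    kind is "strict" for {propertyname} URIs and "loose" for
    {Class}/{property} URIs. The function name and labels are hypothetical.
    """
    if not uri.startswith(ONTOLOGY_PREFIX):
        raise ValueError(f"not a DBpedia ontology URI: {uri}")
    parts = uri[len(ONTOLOGY_PREFIX):].split("/")
    if len(parts) == 1 and parts[0]:
        # Generic property shared by many classes, values in SI units.
        return ("strict", None, parts[0])
    if len(parts) == 2 and all(parts):
        # Property specialized for one class, possibly in a domain unit.
        return ("loose", parts[0], parts[1])
    raise ValueError(f"unexpected ontology URI shape: {uri}")


print(classify_property("http://dbpedia.org/ontology/volume"))
print(classify_property("http://dbpedia.org/ontology/Person/height"))
```

For the loose form, the class name embedded in the URI is what lets a consumer guess the type of the entity without any further lookup.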

Categories:

Articles Categories — Links from concepts to categories using the SKOS vocabulary.
Categories (Labels) — Labels for Categories
Categories (Skos) — Information about which concept is a category and how categories are related, using the SKOS vocabulary.
YAGO — derived from the Wikipedia category system using WordNet. Please refer to "Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia" for more details.

Hyperlinks:

(To avoid confusion, we’ll specifically use 'hyperlink' to refer to links from Wikipedia pages to other pages inside or outside Wikipedia.)

External Hyperlinks — Hyperlinks to external web pages about a concept.
Links to Wikipedia Article — Links to corresponding Articles in Wikipedia
Wikipedia Pagelinks — internal links between DBpedia instances. The dataset was created from the internal pagelinks between Wikipedia articles.
Redirects — redirects between Articles in Wikipedia, used when one topic might be known under several distinct names.
Disambiguation Links — used when distinct, unrelated topics might be known by the same page name.
Images — Thumbnail Links from Wikipedia Articles

Entity Metadata:

Geographic Coordinates — geo-coordinates for 697,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary.
Homepages — Links to external webpages nominated as the 'homepage' of an entity.
Persondata — Information about persons (date, place of birth, etc.).
PND — Personennamendatei identifiers, a uniform identifier set for notable people.
DBpedia External references: Eurostat — Links between countries and regions in DBpedia and data about them from Eurostat. Links were created manually.
DBpedia External references: CIA Factbook — Links between countries in DBpedia and data about them from CIA Factbook. Links were created manually.
DBpedia External references: flickr wrappr — Links between DBpedia concepts and photo collections depicting them, generated by the flickr wrappr.
DBpedia External references: Freebase — Links between DBpedia and Freebase (MIDs).
DBpedia External references: Geonames — Links between geographic places in DBpedia and data about them in the Geonames database. Provided by the Geonames people.
DBpedia External references: MusicBrainz — Links between artists, albums and songs in DBpedia and data about them from MusicBrainz. Created manually using the result of SPARQL queries.
DBpedia External references: New York Times — Links between New York Times subject headings and DBpedia concepts.
DBpedia External references: US Census — Links between US cities and states in DBpedia and data about them from US Census.
DBpedia External references: WordNet — WordNet synset references, generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.

NLP datasets:

DBpedia NLP: Lexicalizations Dataset
DBpedia NLP: Topic Signatures
DBpedia NLP: Thematic Concept
DBpedia NLP: People’s Grammatical Genders

Not used:

Short Abstracts
Raw Infobox Properties
Raw Infobox Property Definitions
Many of the "external link" datasets

Note: Where available, we use the "N-Quads" datasets. These include a provenance URI, composed of the URI of the Wikipedia article from which the statement was extracted; the absolute line in the article source (the first line of a source has line number 1); the line relative to the current section; and the section within the article. For example, in http://en.wikipedia.org/wiki/BMW_7_Series#section=E23&relative-line=1&absolute-line=23 the statement can be found on line 23 overall, which is the first line of the section "E23".
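
A provenance URI of this shape can be taken apart with the Python standard library. This is a minimal sketch under the assumption that the fragment always uses the `section`, `relative-line` and `absolute-line` keys shown in the example; the function name `parse_provenance` and the returned dictionary keys are illustrative, not part of any DBpedia tool.

```python
from urllib.parse import parse_qs, urldefrag


def parse_provenance(uri):
    """Split a provenance URI into the article URL and its line information.

    Missing fragment fields come back as None. The function and key names
    are hypothetical and chosen only for this illustration.
    """
    article, fragment = urldefrag(uri)
    fields = parse_qs(fragment)

    def first(name):
        values = fields.get(name)
        return values[0] if values else None

    def as_int(value):
        return int(value) if value is not None else None

    return {
        "article": article,
        "section": first("section"),
        "relative_line": as_int(first("relative-line")),
        "absolute_line": as_int(first("absolute-line")),
    }


uri = ("http://en.wikipedia.org/wiki/BMW_7_Series"
       "#section=E23&relative-line=1&absolute-line=23")
print(parse_provenance(uri))
```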

License

DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself. DBpedia version 3.7 is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.

Detailed Descriptions

DBpedia Core Datasets

The core datasets from DBpedia include an ontology to model the extracted information from Wikipedia, general facts about extracted resources, as well as inter-language links. More information is available on the Core Datasets page.

If you use DBpedia core data sets in your research, please cite:

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, Sebastian Hellmann: DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154-165, 2009.

DBpedia NLP Datasets

Each and every dataset from DBpedia is potentially useful for several tasks related to Natural Language Processing (NLP) and Computational Linguistics. We have described in Datasets/NLP a few examples of how to use these datasets. Moreover, we describe a number of extended datasets that were generated during the creation of DBpedia Spotlight and other NLP-related projects.

If you use DBpedia NLP data sets in your research, please cite:

Pablo N. Mendes, Max Jakob and Christian Bizer. DBpedia for NLP: A Multilingual Cross-domain Knowledge Base. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2012, 21–27 May 2012, Istanbul, Turkey. (to appear)

DBpedia NLP: Lexicalizations Dataset

Contains mappings between surface forms and URIs. A surface form is a term that has been used to refer to an entity in text. Names and nicknames of people are examples of surface forms. We store the number of times a surface form was used to refer to a DBpedia resource in Wikipedia, and we compute statistics from that. Created by the DBpedia Spotlight team. Authors: Pablo N. Mendes, Max Jakob.

Download the DBpedia Lexicalizations Dataset

Example Data:

dbpedia:Apple_Inc. lexvo:label "Apple computer"@en graph:Apple_Inc.---Apple_computer .
graph:Apple_Inc.---Apple_computer :pmi "9.867346749590263"^^xsd:double :score .
dbpedia:Apple_Inc. lexvo:label "Apple, Inc"@en graph:Apple_Inc.---Apple,_Inc .
graph:Apple_Inc.---Apple,_Inc :pmi "9.867346749590263"^^xsd:double :score .

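The data above describes the entity Apple_Inc. and two surface forms used to refer to it: "Apple computer" and "Apple, Inc". Each surface form is stored in its own named graph, and the :pmi statement attaches a score to that graph.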
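
Lines shaped like the example data can be joined on the graph name to pair each surface form with its score. The following is a rough sketch that assumes the exact line layout of the sample (prefixed names, straight ASCII quotes, one statement per line); the regular expressions and the function name `read_lexicalizations` are illustrative, and a real N-Quads file would be better handled with an RDF library.

```python
import re

# Patterns fitted to the sample lines only; they are assumptions,
# not a general parser for the dataset.
LABEL_RE = re.compile(r'^(\S+) lexvo:label "([^"]*)"@([A-Za-z-]+) (\S+) \.$')
SCORE_RE = re.compile(r'^(\S+) :pmi "([^"]*)"\^\^xsd:double :score \.$')


def read_lexicalizations(lines):
    """Return one record per surface form, joined with its PMI score."""
    labels = {}  # graph name -> (resource, surface form, language)
    scores = {}  # graph name -> PMI value
    for line in lines:
        line = line.strip()
        match = LABEL_RE.match(line)
        if match:
            resource, form, lang, graph = match.groups()
            labels[graph] = (resource, form, lang)
            continue
        match = SCORE_RE.match(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return [
        {"resource": res, "surface_form": form, "lang": lang,
         "pmi": scores.get(graph)}
        for graph, (res, form, lang) in labels.items()
    ]


SAMPLE = [
    'dbpedia:Apple_Inc. lexvo:label "Apple computer"@en graph:Apple_Inc.---Apple_computer .',
    'graph:Apple_Inc.---Apple_computer :pmi "9.867346749590263"^^xsd:double :score .',
    'dbpedia:Apple_Inc. lexvo:label "Apple, Inc"@en graph:Apple_Inc.---Apple,_Inc .',
    'graph:Apple_Inc.---Apple,_Inc :pmi "9.867346749590263"^^xsd:double :score .',
]

for record in read_lexicalizations(SAMPLE):
    print(record)
```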

DBpedia NLP: Topic Signatures

We tokenize all Wikipedia paragraphs linking to DBpedia resources and aggregate them in a Vector Space Model of terms weighted by their co-occurrence with the target resource. We use those vectors to select the most strongly related terms and build topic signatures for those entities. Created by the DBpedia Spotlight team. Author: Pablo N. Mendes.

Download the DBpedia Topic Signatures

Example Data:

Apple_Inc.	+"Apple Inc." computer from mac
Apple_sauce	+"Apple sauce" pudding butter pie
Apple_Records	+"Apple Records" beatles album released
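
Each line of the sample can be split into a resource name, a quoted phrase, and the signature terms. This is a minimal sketch assuming the layout seen above: a resource name with no spaces, whitespace, a `+"..."` token, then space-separated terms. Reading the quoted token as a label-like phrase is an interpretation, and the function name `parse_signature` is hypothetical.

```python
import re

# Matches the +"quoted phrase" token followed by the remaining terms.
SIGNATURE_RE = re.compile(r'^\+"([^"]*)"\s*(.*)$')


def parse_signature(line):
    """Return (resource, quoted_phrase, terms) for one topic signature line."""
    # The resource name contains no whitespace, so one split is enough
    # whether the separator is a tab or a space.
    resource, rest = line.strip().split(None, 1)
    match = SIGNATURE_RE.match(rest)
    if match is None:
        raise ValueError(f"unexpected signature layout: {line!r}")
    return resource, match.group(1), match.group(2).split()


SAMPLE = [
    'Apple_Inc.\t+"Apple Inc." computer from mac',
    'Apple_sauce\t+"Apple sauce" pudding butter pie',
    'Apple_Records\t+"Apple Records" beatles album released',
]

for line in SAMPLE:
    print(parse_signature(line))
```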

DBpedia NLP: Thematic Concept

Thematic Concepts are DBpedia resources that are the main subject of a Wikipedia Category. Created by the DBpedia Spotlight team. Authors: Pablo N. Mendes, Max Jakob.

Download the DBpedia Thematic Concepts

Example Data:

dbpedia:Adolescence rdf:type skos:Concept
dbpedia:Adoption rdf:type skos:Concept
dbpedia:Biodiversity rdf:type skos:Concept

DBpedia NLP: People’s Grammatical Genders

This dataset can be used for anaphora resolution and coreference resolution tasks. Created by the DBpedia Spotlight team. Author: Pablo N. Mendes.

Download the DBpedia People’s Grammatical Genders

Example Data:

<http://dbpedia.org/resource/Britney_Spears> :gender :Female
<http://dbpedia.org/resource/Brigitte_Bardot> :gender :Female
<http://dbpedia.org/resource/Michiel_Smit> :gender :Male
<http://dbpedia.org/resource/David_Duke> :gender :Male
<http://dbpedia.org/resource/Jack_Aubrey> :gender :Male
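
As an illustration of the anaphora use case, the gender statements can be loaded into a lookup table and used to narrow down which previously mentioned entities a pronoun could refer to. This is a toy sketch, not the Spotlight team's method: the pronoun table, the function names, and the line layout (resource URI in angle brackets, then `:gender`, then the value) are all assumptions based on the sample above.

```python
import re

# Fitted to the sample layout; a trailing " ." is tolerated if present.
GENDER_RE = re.compile(r'^<([^>]+)>\s+:gender\s+:(\w+)\s*\.?$')

# Deliberately tiny pronoun table, for illustration only.
PRONOUN_GENDER = {
    "he": "Male", "him": "Male", "his": "Male",
    "she": "Female", "her": "Female", "hers": "Female",
}


def load_genders(lines):
    """Map resource URI -> gender label for every line that matches."""
    genders = {}
    for line in lines:
        match = GENDER_RE.match(line.strip())
        if match:
            genders[match.group(1)] = match.group(2)
    return genders


def candidates_for_pronoun(pronoun, mentioned, genders):
    """Keep the mentioned resources whose gender agrees with the pronoun."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    if wanted is None:
        return []
    return [uri for uri in mentioned if genders.get(uri) == wanted]


SAMPLE = [
    "<http://dbpedia.org/resource/Britney_Spears> :gender :Female",
    "<http://dbpedia.org/resource/David_Duke> :gender :Male",
]

genders = load_genders(SAMPLE)
mentioned = list(genders)
print(candidates_for_pronoun("she", mentioned, genders))
```

Gender agreement only filters candidates; an actual coreference system would combine it with other signals such as recency and syntactic role.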
