Wikipedia is full of data hidden in tables. The aim of this project is to explore the possibilities of exploiting all the data represented as tables in Wiki pages, in order to populate the different chapters of DBpedia with new data of interest. The Table Extractor is meant to be the engine of this data "revolution": its final purpose is to extract the semi-structured data from all those tables now scattered across most Wiki pages.
As said previously, Wikipedia is full of data hidden in tables, so we need general software that can run over different domains and languages. That is the target of this year's project (by "domain" we mean a set of resources, which can range from basketball players to TV series).
To reach this objective, I need the user's help in order to know all the mapping rules defined for a particular domain (remember that a mapping rule is an association between a table header and a DBpedia ontology property, e.g. the header "Team" mapped to the ontology property team).
I have built two modules, which can be summarized as follows:
- `pyDomainExplorer`: this module surfs over the resources of the domain chosen by the user (a single resource, a DBpedia mapping class or a SPARQL where clause) and collects every section and table met during its work. It then prints an output file (called `domain_settings.py` by default, created in the `domain_explorer` folder) that contains all the sections (grouped to simplify the user's work) and table headers. When you open this file you can see that each section has its own dictionary that has to be filled in. You can observe that some fields are already filled: this means that `pyDomainExplorer` has found that header in the `pyTableExtractor` dictionary, or that there is a DBpedia property with the same name (an illustrative fragment of the file is shown after this list). After you have written the mapping rules you need, you can start `pyTableExtractor` (which doesn't need any parameters).
- `pyTableExtractor`: this module takes its parameters (such as the chapter or the output format value) from `domain_settings.py`. It then reads all the rules defined by the user and updates its own dictionary. After that it picks up the resources analyzed by `pyDomainExplorer` and extracts all the tables from their Wikipedia pages. With a simple search over the dictionary of mapping rules it can create RDF triples: first it creates a bridge between the current resource and its table rows (here I use the `sectionProperty` value from `domain_settings`), and then it maps each row's cells onto the respective row resource.
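As an illustration, a generated `domain_settings.py` could look roughly like the fragment below. This is only a hypothetical sketch: the exact variable names and layout produced by `pyDomainExplorer` may differ, and the section and header names are invented for the example.

```python
# Hypothetical fragment of a generated domain_settings.py (illustrative only).

# Parameters that pyTableExtractor reads back.
CHAPTER = "en"          # Wikipedia/DBpedia chapter (language) used for the exploration
OUTPUT_FORMAT = 2       # value of the output format parameter (-f)

# One dictionary per section found in the explored resources.
# Keys are table headers, values are the DBpedia ontology properties chosen by the user.
SECTION_REGULAR_SEASON = {
    "sectionProperty": "regularSeason",  # property bridging the resource to each table row
    "Year": "Year",                      # already filled: found in the pyTableExtractor dictionary
    "Team": "team",                      # already filled: a DBpedia property with the same name exists
    "PPG": "",                           # empty: no rule found, to be filled in by the user
}
```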
- It checks the parameters given by the user in order to perform a correct exploration and extraction of the domain (for example, I verify whether the mapping class written by the user exists in DBpedia or not).
- The lists of resources involved in your extractions are stored, to help the user understand whether the targeted scope has been correctly hit.
- `domain_settings.py` contains a lot of comments that help in filling in all the required fields, so as to obtain a higher effectiveness of the extraction process.
- Mapping rules are easy to write thanks to the intuitive structure of `domain_settings.py` ("table header": "ontology property"). You are also given an example Wikipedia page where each particular section was found.
- Both modules produce log files that contain information about exploring and extracting the domains selected by the user. You can see each operation for each resource involved.
- `domain_settings.py` allows you to analyze any domain in several languages.
- At the end of each log file there is a statistics section that explains how well the exploration and extraction went over the domain (two example reports are shown below).
- Scripts can be customized to your needs through `settings.py`: for example, you can activate or deactivate the filter on table data, and you can also disable the check on the properties written by the user (see the sketch after this list).
- Use the output format parameter to adapt the output of `pyDomainExplorer` to your work (I recommend using output format 2 and Notepad++ to fill in `domain_settings.py`).
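For instance, the kind of switches mentioned above might look like the following excerpt; the actual option names in `settings.py` may be different, these are just placeholders.

```python
# Hypothetical excerpt of settings.py (option names are illustrative).

# Enable or disable the filter applied to table data before mapping.
USE_DATA_FILTER = True

# Enable or disable the check that properties written by the user exist in the DBpedia ontology.
CHECK_USER_PROPERTIES = True
```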
```
REPORT:
Total # resources analyzed: 100
Total # tables found: 83
Total # tables analyzed: 83
Total # of rows extracted: 534
Total # of data cells extracted: 5788
Total # of exceptions extracting data: 0
Total # of 'header not resolved' errors: 0
Total # of 'no headers' errors: 9
```
```
REPORT:
Total # resources analyzed: 100
Total # tables found: 83
Total # tables analyzed: 83
Total # of rows extracted: 534
Total # of data cells extracted: 5788
Total # of exceptions extracting data: 0
Total # of 'header not resolved' errors: 0
Total # of 'no headers' errors: 9
Total # of 'no mapping rule' errors for sections: 0
Total # of 'no mapping rule' errors for headers: 0
Total # of data cells extracted that need to be mapped: 5788
Total # of table row triples serialized: 534
Total # of table cell triples serialized: 5788
Total # of triples serialized: 6322
Percentage of mapping effectiveness: 1.000
```
Note that the effectiveness of the project mostly depends on how many properties are written in the `domain_settings.py` file. Effectiveness is calculated as the ratio between the data cells extracted that need to be mapped and the table cell triples serialized: in the report above, 5788 / 5788 = 1.000.
Here is an example of the RDF triples serialized for the resource Larry_Bird:

```turtle
<http://dbpedia.org/resource/Larry_Bird> ns1:regularSeason <http://dbpedia.org/resource/Larry_Bird__10>,
        <http://dbpedia.org/resource/Larry_Bird__5>,
        <http://dbpedia.org/resource/Larry_Bird__6>,
        <http://dbpedia.org/resource/Larry_Bird__7>,
        <http://dbpedia.org/resource/Larry_Bird__8>,
        <http://dbpedia.org/resource/Larry_Bird__9> .

<http://dbpedia.org/resource/Larry_Bird__10> ns1:Year "1984–85"^^xsd:string ;
        ns1:assistsPerGame "6.6"^^xsd:float ;
        ns1:blocksPerGame "1.2"^^xsd:float ;
        ns1:fieldGoal "0.522"^^xsd:float ;
        ns1:freeThrow "0.882"^^xsd:float ;
        ns1:gamesPlayed "80.0"^^xsd:float ;
        ns1:gamesStarted "77.0"^^xsd:float ;
        ns1:minutesPerGame "39.5*"^^xsd:string ;
        ns1:pointsPerGame "28.7"^^xsd:float ;
        ns1:reboundsPerGame "10.5"^^xsd:float ;
        ns1:stolePerGame "1.6"^^xsd:float ;
        ns1:team <http://dbpedia.org/resource/Boston> ;
        ns1:threePoints "0.427"^^xsd:float .
```
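The snippet below is a minimal sketch, written with `rdflib`, of the two-step serialization described above (bridge triple first, then one triple per table cell). It is not the project's actual code, and treating `ns1` as the DBpedia ontology namespace is my assumption.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

DBR = Namespace("http://dbpedia.org/resource/")
NS1 = Namespace("http://dbpedia.org/ontology/")  # assumption: ns1 is the DBpedia ontology

g = Graph()
resource = DBR["Larry_Bird"]
row = DBR["Larry_Bird__10"]  # "bridge" resource standing for one table row

# 1) bridge triple: resource -> sectionProperty -> row
g.add((resource, NS1["regularSeason"], row))

# 2) cell triples: row -> header property -> cell value
g.add((row, NS1["Year"], Literal("1984–85", datatype=XSD.string)))
g.add((row, NS1["pointsPerGame"], Literal("28.7", datatype=XSD.float)))

print(g.serialize(format="turtle"))
```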
I think it is worth explaining how the Mapper class searches for mapping rules. In my work you can observe two types of rule for table headers: one is "header": "property" and the other is "section name + _ + header": "property".
When Mapper analyzes a header, it first searches the dictionary for a key named section name + _ + header. If it doesn't find such a key, it searches for the header string alone.
In this way, if the user hasn't defined a strict rule (section name + _ + header), I search for a less strict rule (header only) that could have been defined previously in another exploration or extraction.
These different rules (strict and less strict) are written depending on the output format parameter: `-f` equal to 1 will define only strict rules, while `-f` equal to 2 will write less strict rules. A minimal sketch of this lookup order is shown below.
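The following sketch is illustrative only and does not reproduce the actual Mapper code; the rule dictionary is invented for the example.

```python
def find_property(mapping_rules, section, header):
    """Return the ontology property mapped to a table header, if any."""
    strict_key = section + "_" + header       # strict rule, as written with -f 1
    if strict_key in mapping_rules:
        return mapping_rules[strict_key]
    return mapping_rules.get(header)           # less strict rule, as written with -f 2

# Example usage with invented rules:
rules = {"Regular season_Year": "Year", "Team": "team"}
print(find_property(rules, "Regular season", "Year"))  # -> "Year" (strict rule matched)
print(find_property(rules, "Playoffs", "Team"))        # -> "team" (fallback on the header alone)
```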
In my opinion there are two points of the project that can be improved:
- Facilitating the user's work: I have already added comments, progress bars and a search over DBpedia ontology properties, but maybe there are other ways to help the user fill in the `domain_settings.py` file.
- HTML table parser: I think that the current parser (first implemented by Simone, then improved a bit by me) performs really well, but as with all parsers it can be improved. For example, it could filter out tables that are used as legends instead of raising errors (many E2 errors depend on this aspect).
Note on languages: my project obviously works on languages that use the Latin alphabet; Greek and Russian, for example, are not supported. If you need to extract data in those languages, you have to add them to my scripts.
If you want to read about all the steps taken in this two-year project, take a look at: