GSoC 2017: Luca Virgili progress
- Name: Luca Virgili
- Mentor: Emanuele Storti
- Co-mentor: Domenico Potena
- Proposal: DBpedia proposal: The table extractor
- DBpedia support page: https://dbpedia.atlassian.net/wiki/questions
[25/08/2017] A few days until the last evaluation, and I'm running a lot of tests on my project. I'm also waiting for my mentors' feedback on some aspects of my work.
[23/08/2017] I contacted my mentors and corrected some aspects of the project:
- I changed the README to make it more readable and to help users with the usage of this project.
- Renamed `verbose` to `output format` to better convey what this parameter represents. Next I will fix some bugs that my mentors reported to me.
[22/08/2017] I addressed a possible ambiguity in this project: a section and a header that have the same name. To remove this potential conflict (even if I think it is really rare), I added a prefix (`SECTION`) for section rules in the dictionary, so that a rule used for a header can be distinguished from one related to a section.
Example: in the `pyTableExtractor` dictionary you will see something like this: a line `SECTION_Playoff:playoffMatch` related to the playoff section, while a header named "Playoff" will appear as `Playoff:playoffHeader`.
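A minimal sketch of how this convention might look in code (the dictionary content and the `lookup_rule` helper are illustrative, not the project's actual implementation):

```python
# Illustrative mapping-rules dictionary: the SECTION_ prefix separates
# section rules from header rules that happen to share the same name.
MAPPING_RULES = {
    'SECTION_Playoff': 'playoffMatch',   # rule for the "Playoff" section
    'Playoff': 'playoffHeader',          # rule for a header named "Playoff"
}

def lookup_rule(name, is_section):
    """Return the mapping rule for a header or a section."""
    key = ('SECTION_' + name) if is_section else name
    return MAPPING_RULES.get(key)

print(lookup_rule('Playoff', is_section=True))   # playoffMatch
print(lookup_rule('Playoff', is_section=False))  # playoffHeader
```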
[20/08/2017] Final days before the last evaluation: I'm running a lot of tests and I have added comments to the code. I will continue like this until the last day.
[18/08/2017] These days I have spent a lot of time on a method that detects whether a table row is a summary of the previous ones. Since tables in Wikipedia differ greatly from each other, my filter may not work well in several domains. For this reason I added a new option in the settings file that allows the user to disable this filter. It works in this way: for each row, I compute the sum and the mean of the previous values in the same column. If I find two cells with this property, I analyze their row; in that row I check whether the cells hold values that change radically compared to the previous row. If so, I delete that row (see the sketch after this entry).
I added a new entry in the log file to warn the user about the row's deletion.
I also added a new setting that lets you skip checking whether the properties written by the user already exist in DBpedia (it's called `CHECK_USER_INPUT_PROPERTY`). In this way you can create whatever dataset you want, and add the properties to DBpedia afterwards.
(Examples of summary rows are the "Career" rows in pages like this.)
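A simplified sketch of the first criterion described above (the function name, the tolerance parameter, and the sample table are illustrative; the real method in `pyTableExtractor` adds the second check on how radically the row's values change):

```python
# Hypothetical sketch of the summary-row filter: a cell "matches" if it
# equals the sum or the mean of the values above it in the same column.
def is_summary_row(rows, index, tolerance=0.05):
    """Guess whether rows[index] summarizes the rows before it.

    Two or more matching cells mark the row as a summary row.
    """
    if index < 2:
        return False
    matches = 0
    for col in range(len(rows[index])):
        try:
            previous = [float(r[col]) for r in rows[:index]]
            cell = float(rows[index][col])
        except (ValueError, IndexError):
            continue  # non-numeric or missing cell: skip this column
        total = sum(previous)
        mean = total / len(previous)
        if abs(cell - total) <= tolerance * max(abs(total), 1) or \
           abs(cell - mean) <= tolerance * max(abs(mean), 1):
            matches += 1
    return matches >= 2

# Usage: drop summary rows before mapping the table to RDF.
table = [[10, 2], [12, 3], [11, 4], [33, 3]]  # last row: sum / mean
filtered = [r for i, r in enumerate(table) if not is_summary_row(table, i)]
```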
[17/08/2017] Reviewed the method that filters table data. Now I look at the difference among cells in the same column, and if this difference is above a certain threshold (chosen by the user in the settings file) I consider that row as one that sums up the previous ones.
[14/08/2017] Fixed some bugs that I found in printing the final report. I split mapping-rule errors into two categories: errors due to sections and errors due to headers. I made a small code optimization to search rules in the dictionary more easily. I also reviewed the README.
[13/08/2017] Today I uploaded another example of a dataset created with the table extractor project. I analyzed the Broadcaster domain (100 resources) in the English language; you can see it here: Broadcaster - English. I also ran other tests and fixed some bugs: for example, I added an in-depth verification of what the user has written in `domain_settings.py`, because punctuation characters (such as `'` or `_`) are not allowed in ontology properties.
[12/08/2017] Added a progress bar to `pyTableExtractor` as well. I think it is a useful way to see how much work is left. I also updated the project's README and the home page of this wiki; in particular, I wrote "Idea behind the project", which can help users work with my project. Finally, I started building another example RDF dataset, which will contain 100 resources of the Broadcaster class of DBpedia. Tomorrow I will upload it to GitHub.
[10/08/2017] During the tests I fixed some bugs (for example, the `pyDomainExplorer` module didn't find sections in the mapping-rules dictionary) and added some features to improve the user experience. The user can now delete mapping rules from the dictionary (they only have to delete what they want from `domain_settings`). I also updated the project's README. In the next days I will add a progress bar to `pyTableExtractor` too.
[09/08/2017] I finished testing the `pyDomainExplorer` module. I added a progress bar to show how the script is progressing over the resources. I also fixed some bugs in resource collection and in the `Utilities` class.
[08/08/2017] I started to edit some Wikipedia pages in order to get well-formed tables. In the log file I found a lot of problems extracting headers from tables, and most of the time this is due to human error. Example pages I modified: https://en.wikipedia.org/wiki/Charlie_Parker , https://en.wikipedia.org/wiki/Bessie_Smith.
[07/08/2017] During the tests I noticed some bugs in writing the settings file and in printing the report in `pyDomainExplorer`. I fixed them, and today and tomorrow I will continue testing `pyDomainExplorer` as previously planned.
[05/08/2017] I will first test `pyDomainExplorer` on these languages and domains, then fix all the bugs I observe. Second, I will test the second module, `pyTableExtractor`, with the results obtained previously, and add the resulting dataset to the GitHub repository.
Finally, I will evaluate with my mentor whether I have to map an entire domain: in that case I will add the properties that I consider appropriate to the DBpedia ontology, and then produce the related dataset containing all resources of the basketball player domain (probably English and Italian).
[03/08/2017] I'm in the last part of this amazing experience. Now I'm running a lot of tests, focusing on these languages: English, Italian, and French (I'm giving less importance to Spanish and German, because in both some types useful to this project are not defined, as explained before; you can bypass this obstacle by passing a SPARQL query to my project through the `w` option). To test my work on different domains, I chose four fields: Broadcaster, MilitaryPerson, BasketballPlayer, and Person. These domains have different types of tables and different data. In all cases I will analyze 1000 resources.
[01/08/2017] I met my mentor and we decided that I don't have to read tables that have no headers. This simplifies my work: tables that act as legends will be discarded automatically this way. In the coming days I will add more comments and test my project on different domains.
[30/07/2017] Previously I said I would delete tables that represent legends. This is not so easy, because I have to identify unique features of this type of table; otherwise I would delete structures that are meaningful. I established that legend tables can only look like this:
| Character | Meaning |
|---|---|
| * | A |
| & | B |
In a nutshell: the table has to have two columns, where the first one contains a character and the second one the meaning of that character. I will add this recognition method in a few days.
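A rough sketch of what the recognition method could check (the function name and the "short symbol" heuristic are illustrative assumptions, not the project's actual code):

```python
# Hypothetical legend-table check: exactly two columns, with a short
# non-alphanumeric symbol in the first column of every data row.
def looks_like_legend(table):
    """table: list of rows, each a list of cell strings."""
    if not table or any(len(row) != 2 for row in table):
        return False
    # Skip the header row; every remaining first cell must be a short
    # symbol such as '*' or '&'.
    return all(len(row[0]) <= 2 and not row[0].isalnum()
               for row in table[1:])

print(looks_like_legend([['Character', 'Meaning'], ['*', 'A'], ['&', 'B']]))  # True
```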
[27/07/2017] During these evaluation days I ran a lot of tests to find bugs and errors. Furthermore, I'm thinking about a way to recognize tables like this: https://en.wikipedia.org/wiki/Kobe_Bryant#NBA_career_statistics . It represents a legend, and it's useless to map it or print it in `domain_settings.py`. I'm also working on making it easier for the user to fill in the settings file.
[25/07/2017] In the log file produced by `pyTableExtractor`, I added new rows that report which ontology property is used for a particular header. I also changed part of the code to optimize it and reduce execution time.
[23/07/2017] Added some examples of my project run on the English and Italian languages over the basketball players domain. You can see the extracted RDF triples here. I also added log files to give a complete view of how `pyDomainExplorer` and `pyTableExtractor` work.
[22/07/2017] I'm running a lot of tests on my project, trying it on different languages and topics. In this way I will adjust all the relevant aspects before the second evaluation. Note: I found that in some DBpedia chapters (es, de) some resources don't have a proper ontology class defined. I discovered this problem during an analysis of the basketball player domain. As you can see, a resource like http://es.dbpedia.org/page/Kobe_Bryant lacks the ontology class BasketballPlayer, so you can't analyze this domain in the es language unless you write a suitable SPARQL query to get these resources.
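A sketch of the kind of workaround query I mean, using SPARQLWrapper; the proxy pattern `dbo:team` is an illustrative assumption, since the right pattern depends on what the chapter's data actually contains:

```python
# Illustrative workaround: when dbo:BasketballPlayer is missing on a
# chapter, select resources through another property instead of rdf:type.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?player WHERE {
  ?player dbo:team ?team .   # proxy pattern instead of the missing class
}
LIMIT 100
"""

sparql = SPARQLWrapper("http://es.dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["player"]["value"])
```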
[21/07/2017] I added comments to all scripts, and now I will update the `README.md` to explain how the project works. Secondly, I thought of a different organization for the properties dictionary, and I think I will implement it this evening or tomorrow.
[20/07/2017] These are the days before the second evaluation, so I will add comments to my code and run several tests on different languages and domains (I also added a vector containing all the available languages so that I can better check the user's input).
[19/07/2017] In `pyTableExtractor` I added effectiveness metrics so that you can get a better view of how the script performed. These metrics are printed in the log file. Here is a little example:
```
Total # resources analyzed: 1
Total # tables found: 5
Total # tables analyzed: 5
Total # of rows extracted: 37
Total # of data cells extracted: 404
Total # of exceptions extracting data: 0
Total # of 'header not resolved' errors: 0
Total # of 'no headers' errors: 0
Total # of 'no mapping rule' errors: 0
Total # of table's rows triples serialized: 33
Total # of table's cells triples serialized: 404
Total # of triples serialized: 437
Percentage of mapping effectiveness: 100.00
```
Of course, the effectiveness of my project depends on how many properties the user defines in the settings file.
[18/07/2017] I asked Marco Fossati some questions to resolve my doubts about the DBpedia organization. Now my project works this way:
- Table rows are defined as resources.
- The user only has to write the ontology property that connects the resource (e.g. Kobe Bryant) with the tables (e.g. playoff matches).
- The user has to write ontology properties for all table headers (I can help with this task by searching the current dictionary to see whether a header is already defined).
[17/07/2017] Changed the SPARQL query used to find an ontology property; the previous one had a small problem. I also added an association between a table's header and its section, so that you can assign different ontology properties to the same header when it appears in different sections.
[14/07/2017] Today I worked on a small aspect: in many tables, the last row is a summary of each column (e.g. in domains like soccer or basketball, the last row is often "Career", containing the sum or the mean of each column's values). I built a method that finds and deletes this type of row.
Furthermore, I added blank nodes to the RDF triples: if the user doesn't write anything to represent a table row, I add a blank node (containing the section's property and a sequential number) in the corresponding RDF triple.
[13/07/2017] I added a new type of wikitable that can be analyzed. Before my review, `pyTableExtractor` worked on two different wiki table classes: `wikitable` and `wikitable sortable`. Now the script also works on `wikitable sortable collapsible`.
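For illustration, a minimal way to select only these three classes with BeautifulSoup (this is a sketch under my own assumptions, not the project's actual parser):

```python
# Select only the three supported wikitable classes from a Wikipedia page.
import requests
from bs4 import BeautifulSoup

SUPPORTED = {'wikitable', 'wikitable sortable', 'wikitable sortable collapsible'}

html = requests.get('https://en.wikipedia.org/wiki/Kobe_Bryant').text
soup = BeautifulSoup(html, 'html.parser')

# t.get('class') returns the list of CSS classes; join them back into
# the attribute string and keep the table only if it matches exactly.
tables = [t for t in soup.find_all('table')
          if ' '.join(t.get('class', [])) in SUPPORTED]
print(len(tables), 'tables selected')
```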
[12/07/2017] I adapted the log functions to show the user all the relevant actions of `pyDomainExplorer` and `pyTableExtractor`.
[11/07/2017] I met my mentor and we discussed the dataset organization. The meeting can be summarized in these points:
- In the configuration file, the user has to add only `ontology` properties; I only have to check that the user's input exists in DBpedia.
- The table structure can be built in two different ways; we decided to contact Marco Fossati to learn more about the DBpedia organization.
- Resources found by `pyDomainExplorer` are locally defined (e.g. if I find Kobe_Bryant as a resource while analyzing BasketballPlayer in the Italian language, the related URI will be `http://it.dbpedia.org/resource/Kobe_Bryant`), while ontology properties are defined only on `dbpedia.org`. A tiny sketch of this URI convention follows.
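A minimal sketch of the convention in the last point (the helper name is illustrative; the URIs follow the example above):

```python
# Resources are minted on the chapter's namespace; English resources
# live on the main dbpedia.org namespace.
def resource_uri(name, lang):
    base = ('http://dbpedia.org/resource/' if lang == 'en'
            else 'http://%s.dbpedia.org/resource/' % lang)
    return base + name

print(resource_uri('Kobe_Bryant', 'it'))
# http://it.dbpedia.org/resource/Kobe_Bryant
```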
[07-10/07/2017] I created the basic structure of the project: it can now analyze wiki tables and produce the related RDF dataset. In the next weeks, until the second evaluation, I have to define a lot of details: whether the user can write resources or only ontology properties in the configuration file, how to represent tables in the DBpedia organization, and how to adapt all the scripts to work with different languages (I think I won't consider languages such as Greek or Russian, because there are many problems related to their different alphabets).
[05-06/07/2017] Started to create the RDF dataset. I have to add a lot of checks on the data in order to print correct triples. I will also meet my mentors next week to establish a set of rules for this dataset.
[04/07/2017] I'm trying to make the association between a table's headers and the DBpedia ontology as general as possible.
WORKFLOW MAPPER
Input: the table, the table's section, and the URI of the resource being analyzed.
First of all, I need to link the resource's URI to every table row.
For each table row, the RDF triple will be: `<resource's uri> <sectionProperty> <rowTableProperty + count>`
Then, for each table cell, I produce an RDF triple like this:
`<rowTableProperty + count> <table's cell property> <table's cell value>`
(Remember that you can fill in `sectionProperty`, `rowTableProperty`, and the cell properties in `domain_settings.py` for each table.)
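A sketch of this workflow with rdflib; the property names, the row-numbering scheme, and the sample table are illustrative assumptions, not the project's exact output:

```python
# Link the resource to a node per table row, then the row node to its cells.
from rdflib import Graph, Namespace, Literal

DBO = Namespace('http://dbpedia.org/ontology/')
RES = Namespace('http://dbpedia.org/resource/')

g = Graph()
resource = RES['Kobe_Bryant']
section_property = DBO['playoffMatch']  # sectionProperty from the settings
row_property_base = 'playoffMatch'      # rowTableProperty from the settings

rows = [{'Year': '2000', 'Points': '15.1'},
        {'Year': '2001', 'Points': '22.1'}]

for count, row in enumerate(rows, start=1):
    row_node = RES['%s__%d' % (row_property_base, count)]
    g.add((resource, section_property, row_node))  # resource -> row
    for header, value in row.items():
        # Cell property per header; here just a lowercased guess.
        g.add((row_node, DBO[header.lower()], Literal(value)))

print(g.serialize(format='turtle'))
```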
[03/07/2017] I started to study how Simone Papalini creates RDF triples. I think I will change all the functions of `Mapper.py`, because I saw that this class has methods tied to a single domain, while I want to make this feature as general as possible.
[26/06/2017] In these days before the first evaluation, I'm working on how to improve the search for a particular ontology property. There are a lot of properties in DBpedia, and it's difficult to create a good way to find the right "element". From my point of view, there are three different categories of data: `property`, `ontology`, and `resource`. I agreed with my mentors that I should not use the `property` type for this purpose, but only `ontology` elements. Now I'm undecided whether I should consider `resource` elements; I don't know if they could be used for a table header. I will discuss this doubt with my mentors.
[25/06/2017] Today I edited the method that prints `domain_settings.py`, adding more comments to help the user. Furthermore, I added a function that checks whether the user has edited the research parameters of `domain_settings.py` (like `chapter` or `topic`) in a wrong way. `pyTableExtractor` won't start if it recognizes this type of error.
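A hypothetical sketch of such a start-up check (the `validate_settings` helper and the exact parameter semantics are my assumptions, not the project's code):

```python
# Refuse to run if the research parameters look corrupted.
import sys

def validate_settings(settings):
    """settings: dict of values parsed from domain_settings.py."""
    for name in ('chapter', 'topic'):
        value = settings.get(name)
        if not value or not isinstance(value, str):
            sys.exit("domain_settings.py: invalid value for '%s'" % name)

validate_settings({'chapter': 'en', 'topic': 'BasketballPlayer'})  # passes
```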
[24/06/2017] Code review. I changed and deleted some functions of last year's project in order to get cleaner code. Furthermore, I reorganized my scripts so that the code is easier to read and understand. During the day I will also add comments to functions and variables.
[23/06/2017] I'm working on a way to obtain the DBpedia ontology property that matches a table's header. The user will get a little help in filling in the settings file.
[22/06/2017] Previously I said that I would evaluate whether it's convenient to add a translator to my project. I found some libraries that link Python with Google Translate, but you have to pay for the service, so I won't add this feature.
[21/06/2017] Code review; created a new class whose purpose is to print the settings file.
[20/06/2017] Today and tomorrow I will prepare the code for the first evaluation. I will add comments and review the scripts' structure to make the project simple and readable.
[19/06/2017] I met my mentor today and we decided on some new details to add to my project. He also said that we are going in the right direction to build good scripts.
[18/06/2017] Added initial support for analyzing the Greek DBpedia. Tomorrow I'll evaluate the project's progress with my mentors and we will decide on the next steps.
[17/06/2017] Moved the `Selector` class to the `domain_explorer` package. This way, the .txt file containing all the resources requested by the user is created only once. I'm rebuilding the whole project to establish a better structure.
Update: I discovered that resources on es.dbpedia aren't built very well: basketball players like Kobe Bryant or soccer players like Cristiano Ronaldo don't have `rdf:type` defined. This is why my script didn't work on es.dbpedia.
[16/06/2017] I'm working on automatically finding the right DBpedia ontology property for a particular header. I implemented a function that searches for a `dbo` property whose label (in the language specified by the user) matches the header. I'll evaluate with my mentor whether I should add Google Translate to this project to simplify this property search.
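A sketch of the label-based lookup idea with SPARQLWrapper (the exact query the project uses may differ; the function name and the `LIMIT 1` choice are mine):

```python
# Find a dbo property whose rdfs:label equals the table header.
from SPARQLWrapper import SPARQLWrapper, JSON

def find_dbo_property(header, lang='en'):
    """Return the first dbo property labeled like the header, or None."""
    sparql = SPARQLWrapper('http://dbpedia.org/sparql')
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?prop WHERE {
          ?prop rdfs:label "%s"@%s .
          FILTER(STRSTARTS(STR(?prop), "http://dbpedia.org/ontology/"))
        } LIMIT 1
    """ % (header, lang))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()['results']['bindings']
    return bindings[0]['prop']['value'] if bindings else None

print(find_dbo_property('team'))  # e.g. http://dbpedia.org/ontology/team
```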
[15/06/2017] I'm trying to build the script in a language-independent way. It works on English, French, Italian, and German, but it refuses to work on the Spanish DBpedia, and I have no idea why.
[14/06/2017] I changed some aspects of last year's project. I removed the JSONpedia support, because it's not useful for this work: I have reported JSONpedia's limits many times, for example that it can't always find all the tables in a Wikipedia page. Furthermore, it gives no information on how tables are built, so it's impossible to reconstruct a table without knowing the number of rows and columns.
[13/06/2017] Now the `pyTableExtractor` script can read the settings produced by `pyDomainExplorer`. This way a user can easily modify the mapping rules. Tomorrow I will work on making these scripts language-independent.
[12/06/2017] I developed a function that searches for an existing property with the same name as a table header. It can be useful for easily filling in the settings file produced by `pyDomainExplorer`.
[11/06/2017] `pyDomainExplorer` will be the first script the user has to launch. I have to adapt last year's script to work together with mine.
[10/06/2017] I started to develop the idea mentioned previously. I want to produce an output file that is organized by sections, with each section listing its own headers. This organization can help the user easily map all the tables in a particular domain.
[09/06/2017] This is the heart of my project: I will develop a script called `pyDomainExplorer` that analyzes a domain (a DBpedia ontology class) and prints out a configuration file where the user can specify the relation between a table header and an ontology property.
[07/06/2017] I met my mentor and we decided how to implement the heart of my project: general mapping rules. I'll update this blog soon with all the news about the idea's development.
[03/06/2017] I'm studying how the HTML parser works on the different templates found in Wikipedia pages. Below are the analysis results on the DBpedia property wikiPageUsesTemplate. (I tested only on it.dbpedia.org, because dbpedia.org no longer exposes this property.)
| Template | Frequency | E2 | E3 |
|---|---|---|---|
| Portale | 1072.748 | 4 | 0 |
| InterProgetto | 412.243 | 3 | 10 |
| S | 318.220 | 0 | 1 |
| Bio | 287.249 | 2 | 3 |
| Divisione_amministrativa | 177.184 | 2 | 1 |
| En | 110.610 | 2 | 10 |
| F | 103.196 | 1 | 2 |
| Sportivo | 102.650 | 3 | 15 |
Legend:
- Frequency: number of times the template is used in a Wikipedia page.
- E2: errors where the script didn't find the table's headers.
- E3: errors where the script didn't succeed in extracting data from the table's cells.
Note 1: a single Wikipedia page can contain one or more templates (there are pages with seven templates).
Note 2: a template is a set of rules defining a structure (like tables or lists) in a Wikipedia page.
[31/05/2017] Today I met Marco Fossati to explain the tasks I had previously done and to talk about some ideas for implementing the mapping rules. We also discussed table templates. I'm going to evaluate all the Wikipedia table templates to estimate which types of schema are most frequently used. I'm also thinking that JSONpedia is not useful in this project, so I will consider removing it from the table extractor.
[25/05/2017] I created new cases to include tables that were not understood before. I introduced new types of data structure:
- Tables that have headers and data cells in the same row (e.g. https://it.wikipedia.org/wiki/I_Griffin#Staff_del_doppiaggio_italiano).
- Tables where the headers have the same structure as data cells (e.g. https://it.wikipedia.org/wiki/Isiah_Thomas#Statistiche).
I think I can stop improving the HTML parser now. Next, I will work on creating general mapping rules.
[19/05/2017] In Wikipedia pages, we can search for a resource using all the accented characters our keyboard provides. In my code it's important to remove all these characters, because they can lead to a lot of errors. So I added a new function that removes the accented characters from the resource the script is analyzing (a sketch follows).
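A minimal sketch of accent stripping with Python's standard library (the project's actual function may differ; the function name is illustrative):

```python
# Decompose characters with NFKD and drop the combining accent marks,
# e.g. 'José Calderón' -> 'Jose Calderon'.
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Kobe Bryant'))     # unchanged
print(strip_accents('José Calderón'))   # Jose Calderon
```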
[15/05/2017] In these days I studied how the HTML parser works, and I tested it on different domains, such as tennis players, basketball players, and TV shows. It can't read tables like https://it.wikipedia.org/wiki/I_Griffin#Staff_del_doppiaggio_italiano or like https://it.wikipedia.org/wiki/Isiah_Thomas#Statistiche . These tables are particular cases: the first has headers and associated data in the same row, while the second has headers that look like a data row. Next week I'll work on fixing this.
[11/05/2017] I contacted my mentors and we are choosing which idea is best to implement. Meanwhile, I'm studying Simone Papalini's code to understand completely how it works.
[04/05/2017] My proposal to DBpedia has been accepted. I want to begin to develop my project as soon as possible.