#!/usr/bin/env python2.7
# coding=utf-8
from table_extractor import Analyzer
from table_extractor import Utilities
__author__ = 'papalinis - Simone Papalini - [email protected]'
def main():
    """
    The table_extractor is a 2016 & 2017 GSoC project; here is the [first project idea]
    (wiki.dbpedia.org/ideas/idea/59/the-table-extractor/), assigned to Simone Papalini (s.papalini) in 2016
    and to Luca Virgili in 2017. Developed for the DBpedia Spotlight organization.

    Note: Python 2.7, RDFlib and lxml are strictly required, so please install them in your environment.

    LICENSE: Refer to the LICENSE file. GNU GENERAL PUBLIC LICENSE. This is an OPEN SOURCE project.
    Feel free to download the project, use it, enhance its functionality, add your creativity to the project,
    but please SHARE your efforts with the rest of the world.
    Refer to the [license file](https://github.com/dbpedia/table-extractor/blob/master/LICENSE).

    CONTACTS: feel free to e-mail me at: [email protected]

    pyTableExtractor is a Python script used to carve data out of the tables you can find in wiki pages
    and to compose an RDF data set (a .ttl file) from them.
    Please refer to the project's README, the [GitHub project page](https://github.com/dbpedia/table-extractor)
    and the [2017 GSoC progress page](https://github.com/dbpedia/table-extractor/wiki/GSoC-2017:-Luca-Virgili-progress)
    to understand the script's usage, purposes and options.
    """
"""
Instancing a Utilities object using correct language and topic.
Utilities would be used from other classes for different purposes (Internet connection, object testing,
get time and date..) and to print a final report.
"""
utils = Utilities.Utilities(None, None, None)
"""
Check if domain_settings.py has valid input
"""
check_parameters = utils.validate_user_input()
if check_parameters == "valid_input":
topic = utils.topic
language = utils.chapter
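        # Illustrative values only (the real ones come from domain_settings.py): topic could be a
        # DBpedia class such as "BasketballPlayer", while language/chapter could be a wiki chapter
        # code such as "en" or "it".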
"""
res_list_filename is created but not set. In fact, if a user want to test a single resource, there isn't a
list of resources to collect.
"""
res_list_filename = None
# Test if the user chose a single resource
if not utils.research_type == "s":
single_res = ""
res_list_filename = utils.get_resource_file()
else:
single_res = topic
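        # Illustrative outcomes of the branch above (resource and file names are hypothetical):
        #   - whole-domain run:    res_list_filename = "<some resource list file>", single_res = ""
        #   - single-resource run: res_list_filename = None, single_res = topic, e.g. "Kobe_Bryant"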
"""
Now we want to analyze the set of resources (or the single one) we just retrieved, so an analyzer is created.
Parameters passed: language, topic, utilities object, mode, list filename and the name of the single resource.
Note: the last two parameters (res_list_filename and single_res) are actually mutual exclusive, so
in both cases, one of them is None.
"""
analyzer = Analyzer.Analyzer(language, topic, utils, res_list_filename, single_res)
"""
To actually analyze the wiki pages and tables in them, you have to call the analyze() method.
Note: Once analyze has started, it carves out tables from every single resource passed to the Analyzer,
and then trying to apply mapping rules to every single data cells of those tables.
See Mapper class to get an idea of the decision algorithm for the mapping.
"""
analyzer.analyze()
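        # At this point the extracted triples live only in memory, in the RDF graph the Analyzer has
        # built (with RDFlib, a strict requirement per the docstring above); nothing is written to
        # disk until serialize() is called below.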
"""
At last, you surely want to serialize the RDF graph obtained with serialize() method.
You can find the .ttl file containing the graph serialized in /Extractions/ along with
the corresponding log file.
"""
analyzer.serialize()
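        # For example, a run on topic "BasketballPlayer" would leave a Turtle file and its log under
        # /Extractions/ (the exact file-naming scheme is an assumption; check the Analyzer and
        # Utilities code for the actual pattern).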
        # Finally, print a report for the current extraction, then exit.
        utils.print_report()
    else:
        # Print the parameter validation error.
        print check_parameters
if __name__ == "__main__":
main()