INTRODUCTION
------------

techtc-builder is a tool for building a dataset collection similar to
TechTC-300 (http://techtc.cs.technion.ac.il/techtc300/).

REQUIREMENTS
------------

* For build-techtc.py
  - Python 2.7 or above
  - python-lxml
  - The DMOZ RDF dump files structure.rdf.u8 and content.rdf.u8, which
    can be downloaded from http://www.dmoz.org/rdf.html
  - One of the following text-based web browsers: w3m, lynx, elinks,
    links or links2
  - wget
* For techtc2CSV_all.sh
  - GNU parallel
  - PyStemmer
* For CSV_MI_filter.sh
  - OpenCog (more specifically its feature-selection tool)

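As an aside, the text-browser requirement above can be checked
programmatically. The sketch below (Python 3, not code from
build-techtc.py itself, whose detection logic may differ) locates the
first installed browser from that list:

```python
# Sketch: find the first installed command from a list of candidates.
# This mirrors the requirement that at least one text-based browser be
# present; it is NOT code from build-techtc.py itself.
from shutil import which

def first_available(candidates):
    """Return the first command found on PATH, or None if none exist."""
    for cmd in candidates:
        if which(cmd) is not None:
            return cmd
    return None

browser = first_available(["w3m", "lynx", "elinks", "links", "links2"])
if browser is None:
    print("no text-based browser found; install one of the above")
```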
INSTALLATION
------------

** This is not required if you run the scripts from the project root **

Run the following script (most likely as super user):

  # ./install.sh

You can specify the installation prefix with the --prefix option, for
instance:

  # ./install.sh --prefix /usr

USAGE
-----

1) Download structure.rdf.u8 and content.rdf.u8 from
   http://www.dmoz.org/rdf.html

2) Strip the useless content from the DMOZ files:

   $ strip_dmoz_rdf.py structure.rdf.u8 -o structure_stripped.rdf.u8
   $ strip_dmoz_rdf.py content.rdf.u8 -o content_stripped.rdf.u8

   This takes a few minutes but greatly speeds up the next step and
   lowers its memory usage.
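The memory savings come from streaming the RDF rather than loading it
whole. The project's strip_dmoz_rdf.py relies on python-lxml; the
hedged sketch below shows the same streaming idea using the standard
library's ElementTree instead, clearing each element once processed so
memory stays bounded:

```python
# Streaming sketch (not the project's strip_dmoz_rdf.py): parse a large
# XML/RDF file incrementally and free each element after use, so memory
# stays flat no matter how big the dump is.
import io
import xml.etree.ElementTree as ET

def count_topics(xml_source):
    """Count elements whose tag ends in 'Topic', discarding each
    subtree as soon as it has been seen."""
    count = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag.endswith("Topic"):
            count += 1
        elem.clear()  # drop children/text so the in-memory tree never grows
    return count

# Tiny stand-in for a DMOZ-style dump:
sample = io.BytesIO(b"<RDF><Topic id='a'/><Topic id='b'/><Page/></RDF>")
print(count_topics(sample))  # -> 2
```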
3) Create a TechTC-300-like collection:

   $ ./build-techtc.py -c content_stripped.rdf.u8 -s structure_stripped.rdf.u8 -S 300

   This creates a directory techtc300 containing the dataset
   collection. These options are in fact the defaults, so the
   following does the same thing:

   $ ./build-techtc.py
   There are many more options; run with --help to list them. The
   default (and recommended) tool for converting HTML to text is w3m;
   you can change it with the -H option, for instance:

   $ build-techtc.py -H html2text

   will use html2text (http://www.mbayer.de/html2text/files.shtml)
   instead of w3m.

4) Remove ill-formed directories (those missing positive or negative
   text files). Here the collection directory is techtc300; replace as
   appropriate:

   $ ./strip_techtc techtc300
5) -- optional -- Convert the text files into feature vectors in CSV
   format. Each row corresponds to a document and each column to a
   word (0 if it does not appear in the document, 1 otherwise). The
   first row lists the corresponding words. Running

   $ ./techtc2CSV.py techtc300

   creates a data.csv file under each Exp_XXXX_XXXX directory.
   Then, if you no longer need the intermediate text files, you can
   run

   $ ./strip_csv.sh techtc300

   Beware that this permanently removes everything except the CSV
   files.
   Since the web is full of mess, you may end up with CSV files
   containing thousands of words, many of them gibberish. To filter
   these out you can keep only the features whose mutual information
   with the target is above a certain threshold:

   $ ./CSV_MI_filter.sh techtc300 0.05

   The script copies the collection to

   techtc300_MI_0.05

   where 0.05 is the threshold, then rewrites all data.csv files with
   only the filtered features. This step uses feature-selection, a
   tool provided by the OpenCog project
   (http://wiki.opencog.org/w/Building_OpenCog).

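The actual filtering is done by OpenCog's feature-selection tool; for
reference, the mutual-information score that gets thresholded can be
computed for a binary feature and a binary target as in this sketch
(an illustration of the formula, not OpenCog code):

```python
# Mutual information (in bits) between a binary feature column and the
# binary target -- the quantity CSV_MI_filter.sh thresholds. This
# illustrates the formula; it is not OpenCog's feature-selection code.
from math import log2

def mutual_information(feature, target):
    n = len(feature)
    mi = 0.0
    for f in (0, 1):
        for t in (0, 1):
            # joint and marginal probabilities estimated by counting
            p_ft = sum(x == f and y == t
                       for x, y in zip(feature, target)) / n
            p_f = sum(x == f for x in feature) / n
            p_t = sum(y == t for y in target) / n
            if p_ft > 0:
                mi += p_ft * log2(p_ft / (p_f * p_t))
    return mi

# A feature identical to the target carries maximal information:
print(mutual_information([1, 0, 1, 0], [1, 0, 1, 0]))  # -> 1.0
# An uninformative feature scores zero and would be filtered out:
print(mutual_information([1, 1, 0, 0], [1, 0, 1, 0]))  # -> 0.0
```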
AUTHOR
------
Nil Geisweiller
email: ngeiswei@gmail.com