NAME
Text::NLP::Stanford::EntityExtract - Talks to a stanford-ner socket
server to get named entities back
VERSION
Version 0.02
Quick Start:
* Grab the Stanford Named Entity recogniser from
http://nlp.stanford.edu/ner/index.shtml.
* Run the server, for example:
java -server -mx400m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/ner-eng-ie.crf-4-conll-distsim.ser.gz 1234
* Write a script to extract the named entities from the text, like the
following:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::NLP::Stanford::EntityExtract;
my $ner = Text::NLP::Stanford::EntityExtract->new;
my $server = $ner->server;
my @txt = ("Some text\n\n", "Treated as \\n\\n delimited paragraphs");
my @tagged_text = $ner->get_entities(@txt);
# entities_list takes a tagged line and returns a rather complicated
# array-of-arrays data structure for further processing
my $entities = $ner->entities_list($tagged_text[0]);
METHODS
new ( host => '127.0.0.1', port => '1234');
server
Gets the socket connection. The NER server appears to handle only one
line per connection, so you want a new connection for every line of
text.
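The one-connection-per-line pattern described above can be sketched with the core IO::Socket::INET module. This is only an illustration, not the module's own code: tag_line is a hypothetical helper, and it assumes the server replies and then closes the connection.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper (not part of the module): open a fresh connection,
# send one line, and read back whatever the server returns.
sub tag_line {
    my ($host, $port, $line) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
    ) or die "Cannot connect to $host:$port: $@";

    $line =~ s/\s+/ /g;              # collapse newlines so the request is one line
    print {$sock} "$line\n";

    my $tagged = join '', <$sock>;   # assumes the server closes the socket when done
    close $sock;
    return $tagged;
}

# One connection per line of text, e.g.:
# my @tagged = map { tag_line('127.0.0.1', 1234, $_) } @lines;
```

Opening a socket per line costs a little more than reusing one, but it matches the server behaviour described above.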
get_entities(@txt)
Grabs the tagged text for an arbitrary number of paragraphs of text, and
returns it as NER-tagged text.
_process_line ($line)
Processes a single line of text into tagged text.
entities_list($tagged_line)
Returns a rather arcane data structure of the entities from the text. The
position of each word in the line is recorded along with its entity type,
so that the line of text can be recovered in full from the data structure.
TODO: This needs some utility subs around it to make it more useful.
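The README does not show the structure itself, so the following is only a sketch of the idea (position plus entity type per word, with the line recoverable from the records). It assumes the server returns whitespace-separated word/TAG tokens. The names tokens_with_positions and untagged_line are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for entities_list: one [position, word, tag]
# record per token, assuming whitespace-separated "word/TAG" input.
sub tokens_with_positions {
    my ($tagged_line) = @_;
    my @records;
    my $pos = 0;
    for my $token (split ' ', $tagged_line) {
        # Split on the LAST slash so words containing "/" stay intact;
        # skip anything that is not in word/TAG form.
        my ($word, $tag) = $token =~ m{^(.*)/([^/]+)$}
            or next;
        push @records, [ $pos++, $word, $tag ];
    }
    return \@records;
}

# Rebuild the plain line from the records, in position order.
sub untagged_line {
    my ($records) = @_;
    return join ' ', map { $_->[1] } sort { $a->[0] <=> $b->[0] } @$records;
}

my $records = tokens_with_positions('Bruce/PERSON Lee/PERSON said/O hello/O ./O');
print untagged_line($records), "\n";
```

Keeping the position in each record means the records can be filtered or reordered by entity type and still be put back into the original word order.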
list_entities($self->entities_list($line))
Lists the entities contained within a line, based on the data structure
provided by entities_list($line).
If passed an existing list of entities, it adds to that list, including
counts of the number of each entity already found.
The data structure returned looks like this:
$list_data = {
    'LOCATION' => {
        'Outer Mongolia' => 1,
        'Location Location Location' => 1,
        'Chinese Mainland' => 1,
        'Britney' => 1
    },
    'O' => {
        'may have returned from the' => 1,
        'said from his home in' => 1,
        '. Test a three word entity' => 1,
        'faith that she follows . Now she is attempting , for a second time , to persuade' => 1,
        '. There is a question that' => 1,
        'blah blah' => 1,
        'to the controversial' => 1,
        '.' => 1,
        'to follow suit , reports said .' => 1
    },
    'PERSON' => {
        'Bruce Lee' => 1,
        'Gwyneth Paltrow' => 1,
        'Lord Lucan' => 1
    },
    'MISC' => {
        'Jewish-based' => 1
    }
};
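A structure of this shape can be built by grouping consecutive words that share a tag into one phrase and counting each phrase under its tag. The sketch below shows that idea only. count_entities is a hypothetical name, it works directly on word/TAG text (an assumed input format) rather than on the output of entities_list, and it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for list_entities: group runs of tokens that share
# a tag into phrases and count each phrase under its tag.
sub count_entities {
    my ($tagged_line, $counts) = @_;
    $counts ||= {};                 # start a new list unless one is passed in

    my ($phrase_tag, @phrase_words);
    my $flush = sub {
        $counts->{$phrase_tag}{ join ' ', @phrase_words }++ if @phrase_words;
        @phrase_words = ();
    };

    for my $token (split ' ', $tagged_line) {
        my ($word, $tag) = $token =~ m{^(.*)/([^/]+)$}
            or next;                # skip anything that is not word/TAG
        $flush->() if defined $phrase_tag && $tag ne $phrase_tag;
        $phrase_tag = $tag;
        push @phrase_words, $word;
    }
    $flush->();                     # record the final run
    return $counts;
}

my $counts = count_entities(
    'Bruce/PERSON Lee/PERSON said/O from/O Outer/LOCATION Mongolia/LOCATION ./O');

# Passing the existing list back in accumulates counts across lines.
$counts = count_entities('Bruce/PERSON Lee/PERSON left/O ./O', $counts);
```

After the second call, the 'Bruce Lee' entry under 'PERSON' has a count of 2, while 'Outer Mongolia' under 'LOCATION' stays at 1.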
AUTHOR
Kieren Diment, "<zarquon at cpan.org>"
BUGS
Please report any bugs or feature requests to
"bug-text-nlp-stanford-entityextract at rt.cpan.org", or through the web
interface at
<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-NLP-Stanford-EntityExtract>.
I will be notified, and then you'll automatically be notified
of progress on your bug as I make changes.
SUPPORT
The git repository for this code is available from
git://github.com/singingfish/text-nlp-stanford-entityextract.git
You can find documentation for this module with the perldoc command.
perldoc Text::NLP::Stanford::EntityExtract
You can also look for information at:
* RT: CPAN's request tracker
<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-NLP-Stanford-EntityExtract>
* AnnoCPAN: Annotated CPAN documentation
<http://annocpan.org/dist/Text-NLP-Stanford-EntityExtract>
* CPAN Ratings
<http://cpanratings.perl.org/d/Text-NLP-Stanford-EntityExtract>
* Search CPAN
<http://search.cpan.org/dist/Text-NLP-Stanford-EntityExtract/>
ACKNOWLEDGEMENTS
COPYRIGHT & LICENSE
Copyright 2008 Kieren Diment, all rights reserved.
This program is released under the following license: GPL