-
This looks like a bug. I have opened #93.
-
I'm working on a resume-parsing project using GateNLP. I have several gazetteer lists to match, and this works well for 2-3 page resumes. However, when the resume is very long, I encounter the following IndexError from TokenGazetteer. Any help or suggestions would be highly appreciated.
2021-04-11 11:32:36,944 [MainThread ] [WARNI] Failed to see startup log message; retrying...
2021-04-11 11:32:36,944|WARNING|tika.tika|Failed to see startup log message; retrying...
2021-04-11 11:32:51,999|DEBUG|urllib3.connectionpool|Starting new HTTP connection (1): localhost:9998
2021-04-11 11:32:52,569|DEBUG|urllib3.connectionpool|http://localhost:9998 "PUT /rmeta/xml HTTP/1.1" 200 None
Trying to start GATE Worker on port=25335 host=127.0.0.1 log=false keep=false
PythonWorkerRunner.java: starting server with 25335/127.0.0.1/sNHip6pVztTKavzaS2-W6TtM8dg/false
Trying to start GATE Worker on port=25335 host=127.0.0.1 log=false keep=false
PythonWorkerRunner.java: starting server with 25335/127.0.0.1/PmBj0aQy1AQk12LoPnfVjy9NIh8/false
2021-04-11 11:33:04,221|INFO|gatenlp.processing.gazetteer|Reading list file data\certification.lst
2021-04-11 11:33:04,270|INFO|gatenlp.processing.gazetteer|Reading list file data\education.lst
2021-04-11 11:33:04,309|INFO|gatenlp.processing.gazetteer|Reading list file data\jobs.lst
IndexError Traceback (most recent call last)
in
8 doc2 = Annie(doc1)
9 properdoc = ProperDoc(doc1)
---> 10 gazdoc = GazDet(properdoc)
11 for ann in gazdoc.annset("Resume"):
12 doc2.annset("Resume").add_ann(ann)
in GazDet(doc)
5 for typ in details:
6 tgaz = TokenGazetteer("data/" + typ + ".def", fmt="gate-def", annset="", outset="Resume", outtype=typ)
----> 7 gazdoc = tgaz(doc)
8 return gazdoc
~\miniconda3\lib\site-packages\gatenlp\processing\gazetteer.py in __call__(self, doc, annset, tokentype, septype, splittype, withintype, all, skip)
697 for segment_start, segment_end in segment_offs:
698 tokens = list(anns.within(segment_start, segment_end))
--> 699 for matches in self.find_all(tokens, doc=doc):
700 for match in matches:
701 starttoken = tokens[match.start]
~\miniconda3\lib\site-packages\gatenlp\processing\gazetteer.py in find_all(self, tokens, doc, all, skip, fromidx, toidx, endidx, matchfunc)
617 idx = fromidx
618 while idx <= toidx:
--> 619 matches, maxlen, idx = self.find(
620 tokens,
621 doc=doc,
~\miniconda3\lib\site-packages\gatenlp\processing\gazetteer.py in find(self, tokens, doc, all, fromidx, toidx, endidx, matchfunc)
550 endidx = len(tokens)
551 while idx <= toidx:
--> 552 matches, long = self.match(
553 tokens, idx=idx, doc=doc, all=all, endidx=endidx, matchfunc=matchfunc
554 )
~\miniconda3\lib\site-packages\gatenlp\processing\gazetteer.py in match(self, tokens, doc, all, idx, endidx, matchfunc)
454 while j <= endidx:
455 if node.nodes:
--> 456 token = tokens[j]
457 if token.type == self.splittype:
458 break
IndexError: list index out of range
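Reading the frames above, one possible explanation is an off-by-one bound: `find` sets `endidx = len(tokens)`, while `match` loops with `while j <= endidx` and then reads `tokens[j]`, so `j` can reach `len(tokens)` whenever a partial gazetteer match runs up to the last token of a segment. The sketch below is not gatenlp code. `scan_inclusive` and `scan_exclusive` are hypothetical names that only reproduce that loop shape on a plain list, assuming the visible traceback lines are the whole story.

```python
def scan_inclusive(tokens, idx):
    """Hypothetical stand-in for the loop shape seen in the traceback."""
    endidx = len(tokens)
    j = idx
    seen = []
    while j <= endidx:           # inclusive bound: j may equal len(tokens)
        seen.append(tokens[j])   # raises IndexError when j == len(tokens)
        j += 1
    return seen


def scan_exclusive(tokens, idx):
    """Same loop with an exclusive bound, so j never leaves the list."""
    endidx = len(tokens)
    j = idx
    seen = []
    while j < endidx:
        seen.append(tokens[j])
        j += 1
    return seen


tokens = ["Senior", "Data", "Engineer"]

try:
    scan_inclusive(tokens, 1)
    raised = False
except IndexError:
    raised = True

result = scan_exclusive(tokens, 1)
print(raised, result)
```

If this reading is right, the error would depend on where a partial match happens to end rather than on document length as such; longer resumes would simply make that situation more likely.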