I get a KeyError while calculating the threshold in dedupe. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the Person Name FieldType.)
The following exception is raised:
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'
Either parseratorvariable does not handle the case where probablepeople returns an unexpected label, or probablepeople does not raise an error when the label doesn't correspond to the type 'person'.
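To make the failure mode concrete, here is a minimal sketch of what the last frame of the traceback seems to be doing. The dictionary contents and the function names are hypothetical stand-ins, not the actual parseratorvariable internals; only the `'Corporation'` label and the dictionary lookup come from the traceback.

```python
# Hypothetical stand-in for the comparator's table of known types.
# The keys and values are assumptions made for illustration only.
variable_types = {
    "Person": {"offset": 0},
    "Household": {"offset": 5},
}


def lookup_unguarded(label):
    # Mirrors `self.variable_types[variable_type_1]` from the traceback:
    # a label missing from the table raises KeyError.
    return variable_types[label]


def lookup_guarded(label):
    # Returns None for labels the field type does not know about,
    # so the caller can fall back instead of crashing.
    return variable_types.get(label)


try:
    lookup_unguarded("Corporation")
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

print(lookup_guarded("Corporation"))
```

Any label that the tagger returns but the table does not contain hits the same path, which is why a single bad record is enough to kill the worker thread.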
For those who want to patch this with a workaround: I replaced the comparator method in parseratorvariable/__init__ (line 54).
Add these lines where you use the dedupe library:
import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError


def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    # A missing value on either side: return all zeros.
    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
        if self.log_file:
            import csv
            with open(self.log_file, 'a') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string.encode('utf8')])
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    if 'Ambiguous' in (variable_type_1, variable_type_2):
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 == variable_type_2:
        distances[i:3] = [0, 1]

    # The fix: a label such as 'Corporation' that this field type does not
    # know is handled like the 'Ambiguous' case instead of raising KeyError.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    # Zero out the parts that were not observed and record which ones were.
    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances


# Replace the library method with the patched version.
parseratorvariable.ParseratorType.comparator = comparator
Then you can use dedupe.
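If you would rather find the offending records than patch the library, a pre-scan along the following lines may help. This is only a sketch: `find_unsupported`, `stub_tagger`, the sample records, and the set of supported labels are all hypothetical. It assumes the tagger returns a `(parsed parts, label)` pair, as `self.tagger` does in the comparator above, and that records are a dict of dicts keyed by record id.

```python
def find_unsupported(records, field, tagger, supported_labels):
    """Return (record_id, value, label) for values whose label is unsupported.

    If the tagger raises, the exception class name is reported in place
    of the label.
    """
    flagged = []
    for record_id, record in records.items():
        value = record.get(field)
        if not value:
            continue  # missing values are skipped, as in the comparator
        try:
            _, label = tagger(value)
        except Exception as exc:  # e.g. a repeated-label parsing error
            flagged.append((record_id, value, type(exc).__name__))
            continue
        if label not in supported_labels:
            flagged.append((record_id, value, label))
    return flagged


def stub_tagger(value):
    # Hypothetical stand-in for a real tagger, used only for this demo.
    label = "Corporation" if value.endswith("Inc.") else "Person"
    return {}, label


records = {
    1: {"name": "Jane Doe"},
    2: {"name": "Acme Widgets Inc."},
    3: {"name": None},
}
flagged = find_unsupported(records, "name", stub_tagger, {"Person", "Household"})
print(flagged)
```

In real use you would pass your own tagger in place of `stub_tagger` (for example a thin wrapper around the probablepeople tagger, assuming it returns the same kind of pair) and then fix or drop the flagged records before running dedupe.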
mlollo changed the title from "Propablepeople is breaking parseratorvariable" to "KeyError exception not handled caused by a Corporation label returned by propablepeople" on May 1, 2018.
mlollo changed the title from "KeyError exception not handled caused by a Corporation label returned by propablepeople" to "KeyError exception caused by a Corporation label returned by propablepeople" on May 2, 2018.
For more details, see datamade/probablepeople#74.