KeyError exception caused by a Corporation label returned by probablepeople #3

Open
mlollo opened this issue May 1, 2018 · 0 comments

mlollo commented May 1, 2018

I get a KeyError exception while dedupe is calculating the threshold. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the Person Name FieldType.)

The following exception is raised:

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'

Either parseratorvariable does not handle the case where probablepeople returns an unexpected label, or probablepeople should raise an error when the label doesn't correspond to the type 'person'.
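To illustrate the failure mode in isolation, here is a minimal sketch that does not use parseratorvariable itself. `VARIABLE_TYPES`, `lookup_unguarded` and `lookup_guarded` are hypothetical names standing in for the `self.variable_types` lookup shown in the traceback; the table contents are made up.

```python
# Hypothetical stand-in for the comparator's type table. Assumption: the real
# table only contains the labels the field type was configured for.
VARIABLE_TYPES = {"Person": {"offset": 0}}


def lookup_unguarded(label):
    # Mirrors the failing line: indexing with an unknown label raises KeyError.
    return VARIABLE_TYPES[label]


def lookup_guarded(label):
    # Returns None for unknown labels, so the caller can fall back to a
    # plain string comparison instead of crashing.
    return VARIABLE_TYPES.get(label)


if __name__ == "__main__":
    try:
        lookup_unguarded("Corporation")
    except KeyError as exc:
        print("unguarded lookup failed:", exc)
    print("guarded lookup returned:", lookup_guarded("Corporation"))
```

The guarded variant only shows the idea; where the check belongs (parseratorvariable or probablepeople) is the open question of this issue.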

For more details, see datamade/probablepeople#74.

For anyone who wants a workaround, I replaced the comparator method in parseratorvariable/__init__ (line 54). Add these lines wherever you use the dedupe library:

import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError

def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
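        if self.log_file:
            import csv
            # On Python 3, write the string itself; encoding it to bytes
            # first would log the repr "b'...'" instead of the text.
            with open(self.log_file, 'a', encoding='utf8', newline='') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string])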
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

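    if 'Ambiguous' in (variable_type_1, variable_type_2):
        # The parser could not decide on a type: fall back to string comparison
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        # The two fields parsed as different types: string comparison only
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    else:
        distances[i:3] = [0, 1]

    # Workaround: a label this field type does not know (e.g. 'Corporation')
    # is handled like an ambiguous parse instead of raising KeyError below.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
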
    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances

parseratorvariable.ParseratorType.comparator = comparator

Then you can use dedupe.
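To find the offending records before training, something like the following sketch may help. It assumes the data is a dict of record id to field dict and that the tagger returns a `(parsed, label)` pair, as `self.tagger` does in the patched comparator; `probablepeople.tag` should fit that shape, but I have not checked every version. `find_unexpected_labels` and `fake_tagger` are hypothetical helpers, not part of any library.

```python
def find_unexpected_labels(records, field, tagger, expected=("Person",)):
    """Return (record_id, value, label) for records whose label is unexpected."""
    flagged = []
    for record_id, record in records.items():
        value = record.get(field)
        if not value:
            continue  # the comparator returns early for empty fields as well
        try:
            _, label = tagger(value)
        except Exception as exc:  # e.g. RepeatedLabelError raised by the parser
            label = type(exc).__name__
        if label not in expected:
            flagged.append((record_id, value, label))
    return flagged


def fake_tagger(value):
    # Stand-in with the assumed (parsed, label) return shape; swap in the
    # real tagger (for example probablepeople.tag) when running on real data.
    label = "Corporation" if value.endswith("Inc") else "Person"
    return {}, label


if __name__ == "__main__":
    records = {1: {"name": "Jane Doe"}, 2: {"name": "Acme Inc"}, 3: {"name": ""}}
    print(find_unexpected_labels(records, "name", fake_tagger))
    # prints [(2, 'Acme Inc', 'Corporation')]
```

This only reports the records; whether to fix, drop or keep them is up to you, and the patch above is still needed if such records stay in the data.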

mlollo changed the title from "Probablepeople is breaking parseratorvariable" to "KeyError exception not handled caused by a Corporation label returned by probablepeople" May 1, 2018
mlollo changed the title from "KeyError exception not handled caused by a Corporation label returned by probablepeople" to "KeyError exception caused by a Corporation label returned by probablepeople" May 2, 2018