UPPERCASE text #22

Closed
adamajm opened this issue Jun 10, 2016 · 2 comments

@adamajm

adamajm commented Jun 10, 2016

I've found that in all-UPPERCASE text, almost every word is replaced with the {{NAME}} placeholder. For example:
PANKETS 621 LLC - Business Phone: 919-256-2873 (JESSICA)
621 CHAPPELL DR
She would like to have BINS SENT OUT FOR UNITS 101-105
STATED THEY ARE BILLED FOR RECYCLING BUT DON'T HAVE BINS
SHE WANTED TO KNOW IF SHE COULD GET A CALL BACK ABOUT CHARGES FOR RECYCLING

Returns:

{{NAME}} 621 {{NAME}} - Business Phone: {{PHONE}} ({{NAME}})
621 {{NAME}} {{NAME}}
She would like to have {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} 101-105
{{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} DON'T {{NAME}} {{NAME}}
{{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} GET {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}}

@deanmalmgren
Collaborator

Yeah, that's the trouble with relying only on part-of-speech tags from NLP to detect names: PoS taggers tend to tag capitalized words as proper nouns, so all-uppercase text gets flagged nearly wholesale. I think that using machine learning for name detection (#16) will be quite a bit better, or possibly using spaCy instead of textblob/nltk (#18).
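
To make the failure mode concrete, here is a toy sketch in Python. It is not the tagger that textblob or nltk actually use; `toy_scrub` and its capitalization rule are hypothetical stand-ins, written on the assumption that a name detector treats any capitalized word that does not start the line as a proper noun.

```python
PLACEHOLDER = "{{NAME}}"


def toy_scrub(line: str) -> str:
    """Replace name-like tokens with a placeholder.

    Hypothetical heuristic (an assumption, not the real tagger): a token is
    name-like if it is purely alphabetic, starts with an uppercase letter,
    and is not the first token of the line.
    """
    tokens = line.split(" ")
    cleaned = []
    for index, token in enumerate(tokens):
        # The first token is skipped because its capital letter only marks
        # the start of the sentence.
        looks_like_name = (
            index > 0 and token.isalpha() and token[:1].isupper()
        )
        cleaned.append(PLACEHOLDER if looks_like_name else token)
    return " ".join(cleaned)


# Mixed case: only the actual name is capitalized, so only it is redacted.
print(toy_scrub("Please call Jessica back about the bins"))
# Please call {{NAME}} back about the bins

# All caps: every alphabetic word now looks like a name.
print(toy_scrub("She would like to have BINS SENT OUT FOR UNITS 101-105"))
# She would like to have {{NAME}} {{NAME}} {{NAME}} {{NAME}} {{NAME}} 101-105
```

The second call happens to match the corresponding line of the reported output, which suggests capitalization is doing most of the work in the real detector as well, though that is an inference from this toy rule rather than a statement about the library's internals.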

The good news is that this isn't a terrible default. It does redact a lot of text that you wouldn't want redacted (each ALLCAPS word that isn't a name is a false positive), but it is better to err on the conservative side and not release any names if you can avoid it. I just added the precision label to this issue so that we look into this more carefully.

@thomasbird
Member

We've released some new detectors based on the spaCy and Stanford NER models, which should fix this issue. You can check out the docs: https://scrubadub.readthedocs.io/en/stable/names.html
