
Compatibility with non-Hugging Face libraries #10

Open
NimaBoscarino opened this issue Nov 15, 2022 · 3 comments
@NimaBoscarino
Contributor

NimaBoscarino commented Nov 15, 2022

(As suggested in #8)

This issue may be split into multiple issues if needed.

  • Pandas
  • spaCy
  • etc.

Ideally the API should support these kinds of integrations out of the box, so this is mostly a matter of verifying that they work and then documenting them, or making small changes for compatibility where needed. If it's absolutely necessary, we can consider adding special methods to bridge things.
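As a rough illustration of what "working out of the box" with tabular libraries could mean, a disaggregator can be thought of as a plain callable mapping text to a dict of boolean features, applied row by row the way `df.apply` or `Dataset.map` would. The function and feature names below are hypothetical, not the actual disaggregators API:

```python
# Hypothetical sketch: applying a disaggregator to tabular rows, the way a
# pandas df.apply or datasets Dataset.map call would. The function and the
# "gender.*" feature names are illustrative, not the real library API.

def gender_disaggregator(text):
    """Map raw text to a dict of boolean disaggregation features."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {
        "gender.female": bool(words & {"she", "her", "woman"}),
        "gender.male": bool(words & {"he", "him", "man"}),
    }

rows = [
    {"text": "She, the woman, went to the park."},
    {"text": "He waved from across the street."},
]

# Equivalent of df.apply / Dataset.map: enrich each row with the features.
enriched = [{**row, **gender_disaggregator(row["text"])} for row in rows]
```

The point is only that a module with this shape composes with pandas or datasets without any bridging code.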

@NimaBoscarino added the documentation ("Improvements or additions to documentation") and enhancement ("New feature or request") labels Nov 15, 2022
@NimaBoscarino NimaBoscarino added this to the v0.1.3 milestone Nov 15, 2022
@davidberenstein1957 davidberenstein1957 self-assigned this Feb 8, 2023
@davidberenstein1957
Member

davidberenstein1957 commented Feb 8, 2023

I just made a proposal for the new API structure. For now, it is structured as a standalone pipeline that processes documents using spaCy and later on infers knowledge from the pre-processed docs. However, I can also see it being set up as a fully integrated spaCy pipeline component, but that would require rewriting a bit more of the code.

Note that the current set-up only works for Gender; we will need to iterate on it a bit.

from disaggregators import Disaggregator

text = ["She, the woman, went to the park."]

# Load the gender module, backed by a spaCy language model
disaggregator = Disaggregator("gender", language="en_core_web_md")

# Process a single document
doc = disaggregator(text[0])
print(doc.spans["sc"])  # recognized (potentially overlapping) spans
print(doc.cats)         # multi-label classifications

# Batch processing
docs = disaggregator.pipe(text)

new features:

  • pipe for batch processing
  • call and pipe process (tokenize etc.) each document only once, and the processed docs are shared
  • call and pipe do not need to share docs if components are set up individually (we need to define something to merge the individual docs per module)
  • predictions are split into recognized, potentially overlapping spans (doc.spans["sc"]) and multi-label classifications (doc.cats)

things to consider:

  • instead of docs, we could also just return lists of spans and classes
  • semi-standalone vs. spaCy components per module
  • word matches vs. fuzzy matches vs. pattern matches vs. semantic matches
  • expanding word sets via word2vec
  • a weighted counter for classification over the number of hits, n_i / sum(n_ij)
  • TODOs in the code
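The weighted-counter idea from the list above could look roughly like this: normalize the raw span hits per label into classification weights n_i / sum(n_ij). The hit labels are invented for the example:

```python
from collections import Counter

# Sketch of the "weighted counter" idea: turn raw span hits per label into
# normalized classification weights n_i / sum(n_ij). Hit data is invented.
def weighted_scores(hits):
    counts = Counter(hits)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# e.g. labels of the spans matched by the gender module in one document
scores = weighted_scores(["female", "female", "male"])
```

This would give soft per-document scores instead of the current hard True/False flags.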

@NimaBoscarino
Contributor Author

Awesome, thank you so much for this! Jotting a couple thoughts down here:

  • Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.
  • The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.
  • What does sc stand for?
  • To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of <DisaggregationModuleLabel.VALUE>: True/False. So for additional contextual info (e.g. doc.spans["sc"]), I think it would be good to have that as a secondary return value; e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.
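A minimal sketch of that (ResultDict, ResultContext) return shape, assuming the class names floated above are adopted (they are hypothetical, and the token matching stands in for real spaCy span matching):

```python
from typing import Dict, List, Tuple

# Hedged sketch of the proposed return shape: every module returns a dict of
# <label>: True/False, plus a secondary context value. Class names follow the
# discussion but are hypothetical; matching is a toy stand-in for spaCy.

class DisaggregationModule:
    def __call__(self, text: str):
        raise NotImplementedError

class SpacyDisaggregationModule(DisaggregationModule):
    TERMS = {"gender.female": ["she", "woman"], "gender.male": ["he", "man"]}

    def __call__(self, text: str) -> Tuple[Dict[str, bool], Dict[str, List[str]]]:
        tokens = {w.strip(".,!?").lower() for w in text.split()}
        # ResultContext: which terms matched, per label
        context = {label: [t for t in terms if t in tokens]
                   for label, terms in self.TERMS.items()}
        # ResultDict: a boolean flag per label
        result = {label: bool(matched) for label, matched in context.items()}
        return result, context

module = SpacyDisaggregationModule()
result, context = module("She, the woman, went to the park.")
```

Callers that only want the simple output can ignore the second element of the tuple.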

I have some more thoughts which I'll write up soon – thanks again for doing this!!! 🤗

@davidberenstein1957
Member

davidberenstein1957 commented Feb 9, 2023

Hi, thanks for the input! I will rewrite the code a bit this week.

Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.

I assumed that we also wanted to optimize for speed and reduce dependency overhead, so I opted for using spaCy as the default tokenizer and pre-processor (to avoid re-processing each document for each potential module). For now, I will assume that flexibility and adaptability take precedence over speed and efficiency.
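The shared-preprocessing idea can be sketched as follows; the tokenize function is a toy stand-in for a spaCy nlp pipeline, and the module callables are illustrative, not the real module API:

```python
# Sketch of shared preprocessing: tokenize each document once and hand the
# same processed doc to every module, instead of re-tokenizing per module.
# tokenize() stands in for spaCy's nlp pipeline; the modules are invented.

def tokenize(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def run_modules(texts, modules):
    results = []
    for text in texts:
        doc = tokenize(text)  # done once per document...
        # ...and the same doc is shared by every module
        results.append({name: module(doc) for name, module in modules.items()})
    return results

modules = {
    "gender": lambda doc: bool(set(doc) & {"she", "he", "woman", "man"}),
    "age": lambda doc: bool(set(doc) & {"child", "elderly"}),
}
out = run_modules(["She, the woman, went to the park."], modules)
```

With n modules, the document is processed once rather than n times, which is the speed trade-off being discussed.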

The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.

I wanted to re-use this language param wherever possible, while still allowing custom language configs per module.

What does sc stand for?

This is a spaCy-specific key for handling overlapping spans (the default span key used by the SpanCategorizer component), which enables direct visualization with displaCy.

To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of <DisaggregationModuleLabel.VALUE>: True/False. So for additional contextual info (e.g doc.spans["sc"]) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.

I agree with this. However, I limited the generalizability because I wanted to wrap the modules within the spaCy ecosystem; as mentioned, I will refactor the code to allow for more flexibility.
