TrustyAI-Detoxify

Algorithms and tools for detecting and fixing hate speech, abuse and profanity in content generated by Large Language Models (LLMs).

T-MaRCo

T-MaRCo is an extension of the work *Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts*; it makes it possible to use multiple combinations of experts and anti-experts to score and (incrementally) rephrase text generated by LLMs.

In addition, it can combine rephrasing with the base model's self-reflection capabilities (see the papers *Towards Mitigating Hallucination in Large Language Models via Self-Reflection* and *N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics*).

T-MaRCo provides the following features:

  • content scoring: computes a disagreement score for each input token; high disagreement is often associated with toxic content.
  • content masking: produces a masked version of the input content, where all tokens considered toxic are replaced with the <mask> token.
  • content redirection: generates a non-toxic, "regenerated" version of the original content.
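The intuition behind content scoring can be illustrated with a toy example. This is not the library's implementation: it is a minimal sketch, assuming disagreement is modeled as the divergence (here Jensen-Shannon) between the next-token probability distributions of an expert and an anti-expert; all vocabularies and probabilities below are made up for illustration.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions over a 3-word vocabulary (illustrative numbers).
expert = [0.7, 0.2, 0.1]       # the expert steers away from the toxic token
anti_expert = [0.1, 0.2, 0.7]  # the anti-expert favors it
benign = [0.6, 0.3, 0.1]       # a model that roughly agrees with the expert

print(js_divergence(expert, anti_expert))  # high disagreement: likely toxic position
print(js_divergence(expert, benign))       # low disagreement: likely benign position
```

Positions where the expert and anti-expert disagree strongly are exactly the positions where toxicity is likely driving the difference, which is what the per-token scores capture.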

How to use T-MaRCo:

from trustyai.detoxify import TMaRCo

# instantiate T-MaRCo
tmarco = TMaRCo(expert_weights=[-1, 2])

# load pretrained anti-expert and expert models
tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

# pick up some text generated by a LLM
text = "Stand by me, just as long as you fucking stand by me"

# generate T-MaRCo disagreement scores
scores = tmarco.score([text]) # '[0.78664607 0.06622718 0.02403926 5.331921 0.49842355 0.46609956 0.22441313 0.43487906 0.51990145 1.9062967  0.64200985 0.30269763 1.7964466 ]' 

# mask tokens having high disagreement scores
masked_text = tmarco.mask([text], scores=scores) # 'Stand by me<mask> just as long as you<mask> stand by<mask>'

# rephrase masked tokens
rephrased = tmarco.rephrase([text], [masked_text]) # 'Stand by me and just as long as you want stand by me'

# combine rephrasing with the base model's self-reflection capabilities
reflected = tmarco.reflect([text]) # '["'Stand by me in the way I want stand by you and in the ways I need you to standby me'."]'
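Conceptually, the masking step turns the per-token scores into a mask by cutting at a threshold. The snippet below is a sketch of that idea only, with an illustrative whitespace tokenization, made-up scores, and a hypothetical threshold value; it does not reproduce TMaRCo's actual tokenizer or internals.

```python
# Illustrative per-token disagreement scores (made up, not TMaRCo output).
tokens = ["Stand", "by", "me,", "just", "as", "long", "as", "you",
          "fucking", "stand", "by", "me"]
scores = [0.79, 0.07, 0.02, 0.33, 0.50, 0.47, 0.22, 0.43,
          5.33, 0.64, 0.30, 1.80]

threshold = 1.2  # hypothetical cut-off for "toxic" disagreement

# Replace every token whose score exceeds the threshold with <mask>.
masked = [t if s < threshold else "<mask>" for t, s in zip(tokens, scores)]
print(" ".join(masked))
```

The masked positions are then the ones the rephrasing step regenerates, while low-scoring tokens are kept verbatim.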

Pretrained T-MaRCo models are available in the TrustyAI Hugging Face space at https://huggingface.co/trustyai/gminus and https://huggingface.co/trustyai/gplus.
