TrustyAI-Detoxify

Algorithms and tools for detecting and fixing hate speech, abuse and profanity in content generated by Large Language Models (LLMs).

T-MaRCo

T-MaRCo extends the work Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts. It makes it possible to use multiple combinations of experts and anti-experts to score and (incrementally) rephrase texts generated by LLMs.
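As a rough illustration of what "combining experts and anti-experts" can mean, the sketch below adds weighted expert logits to a base model's logits for one token position. This is an assumption about the general technique, not a confirmed description of T-MaRCo's internals: the function names, the toy vocabulary and all numbers are hypothetical, and only the weights [-1, 2] are borrowed from the `expert_weights` argument shown in the usage example.

```python
import math


def combine_logits(base_logits, expert_logits, expert_weights):
    """Add each expert's logits, scaled by its weight, to the base logits."""
    combined = list(base_logits)
    for weight, logits in zip(expert_weights, expert_logits):
        for i, value in enumerate(logits):
            combined[i] += weight * value
    return combined


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-word vocabulary and made-up logits for a single position.
vocabulary = ["want", "really", "<toxic word>"]
base = [1.0, 1.0, 1.0]          # base model: no preference
anti_expert = [0.0, 0.0, 2.0]   # anti-expert favours the toxic word
expert = [1.0, 0.5, -1.0]       # expert favours the non-toxic words

# Negative weight pushes away from the anti-expert, positive towards the expert.
combined = combine_logits(base, [anti_expert, expert], [-1, 2])
probs = softmax(combined)
best = max(range(len(probs)), key=probs.__getitem__)
print(vocabulary[best])
```

With a negative weight on the anti-expert and a positive weight on the expert, the token the anti-expert prefers is pushed down and the token the expert prefers is pushed up.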

T-MaRCo can also integrate rephrasing with the base model's self-reflection capabilities (see the papers Towards Mitigating Hallucination in Large Language Models via Self-Reflection and N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics).

T-MaRCo hence provides the following features:

content scoring: provides a disagreement score for each input token; high disagreement is often associated with toxic content.
content masking: provides a masked version of the input content, where all tokens that are considered toxic are replaced with the <mask> token.
content redirection: provides a non-toxic "regenerated" version of the original content.
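The relationship between scoring and masking can be sketched as a simple threshold over per-token scores. This is a minimal illustration, not T-MaRCo's implementation: `mask_by_threshold`, the word-level token split and the threshold of 1.2 are all assumptions. The score values are the ones printed in the usage example in this README.

```python
MASK_TOKEN = "<mask>"


def mask_by_threshold(tokens, scores, threshold):
    """Replace every token whose score exceeds the threshold with <mask>."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [MASK_TOKEN if score > threshold else token
            for token, score in zip(tokens, scores)]


# Assumed word-level split of the example sentence (the real tokenizer may differ).
tokens = ["Stand", "by", "me", ",", "just", "as", "long", "as", "you",
          "fucking", "stand", "by", "me"]

# Disagreement scores copied from the usage example.
scores = [0.78664607, 0.06622718, 0.02403926, 5.331921, 0.49842355,
          0.46609956, 0.22441313, 0.43487906, 0.51990145, 1.9062967,
          0.64200985, 0.30269763, 1.7964466]

# The threshold value is an assumption chosen for this illustration.
masked = mask_by_threshold(tokens, scores, threshold=1.2)
print(masked)
```

Under these assumptions, the three highest-scoring positions are replaced, which lines up with the three <mask> tokens in the masked text of the usage example.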

How to use T-MaRCo:

from trustyai.detoxify import TMaRCo

# instantiate T-MaRCo
tmarco = TMaRCo(expert_weights=[-1, 2])

# load pretrained anti-expert and expert models
tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

# pick some text generated by an LLM
text = "Stand by me, just as long as you fucking stand by me"

# generate T-MaRCo disagreement scores (one score per input token)
scores = tmarco.score([text]) # [0.78664607 0.06622718 0.02403926 5.331921 0.49842355 0.46609956 0.22441313 0.43487906 0.51990145 1.9062967 0.64200985 0.30269763 1.7964466]

# mask tokens that have high disagreement scores
masked_text = tmarco.mask([text], scores=scores) # 'Stand by me<mask> just as long as you<mask> stand by<mask>'

# rephrase masked tokens
rephrased = tmarco.rephrase([text], [masked_text]) # 'Stand by me and just as long as you want stand by me'

# combine rephrasing with the base model's self-reflection capabilities
reflected = tmarco.reflect([text]) # ["'Stand by me in the way I want stand by you and in the ways I need you to standby me'."]

Pretrained T-MaRCo models are available in the TrustyAI Hugging Face space at https://huggingface.co/trustyai/gminus and https://huggingface.co/trustyai/gplus.
