-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/multilingual #943
base: main
Are you sure you want to change the base?
Feature/multilingual #943
Conversation
DCO Assistant Lite bot: I have read the DCO Document and I hereby sign the DCO 2 out of 3 committers have signed the DCO. |
- Integrated support for multiple translation services including local and external APIs. - local: Huggingface model uses for translation - deepl:DeepL uses for translation - nim: NIM uses for translation - Implemented utility functions for language detection and text processing. Signed-off-by: Masaya Ogushi <[email protected]>
- Addd translation function for base probe class - prompts and triggers translate by base class method - attempt_descr translation Signed-off-by: Masaya Ogushi <[email protected]>
- Translation handling for detector keywords and substrings, triggers. Signed-off-by: Masaya Ogushi <[email protected]>
- Added support for specifying translation services directly from the CLI. - Implemented options to set target languages for translation. Signed-off-by: Masaya Ogushi <[email protected]>
- Added new dependencies required for enhanced translation features. Signed-off-by: Masaya Ogushi <[email protected]>
- Added detailed explanations of the translationn method - Included examples of how translation services are configured and utilized within the codebase. Signed-off-by: Masaya Ogushi <[email protected]>
b781a6f
to
6bb7da3
Compare
I have read the DCO Document and I hereby sign the DCO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, testing is still in progress.
The test failures in macOS
look to be an incomplete dependency requirement that may need to be reworked or removed. A default installation should not require install of an external library. Hence the dependency on pyenchant
here may be problematic.
Traceback:
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py:90: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/detectors/test_detectors_riskywords.py:8: in <module>
import garak.detectors.base
garak/detectors/__init__.py:1: in <module>
from .base import *
garak/detectors/base.py:17: in <module>
from garak.translator import SimpleTranslator, LocalTranslator, is_english
garak/translator.py:15: in <module>
import enchant
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/__init__.py:81: in <module>
from enchant import _enchant as _e
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/_enchant.py:[157](https://github.com/leondz/garak/actions/runs/11246708414/job/31296143918?pr=943#step:5:158): in <module>
raise ImportError(msg)
E ImportError: The 'enchant' C library was not found and maybe needs to be installed.
E See https://pyenchant.github.io/pyenchant/install.html
E for details
requirements.txt
Outdated
pyenchant==3.2.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These entires are new direct requirements and need to be above the # tests
section in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I move it.
garak/probes/base.py
Outdated
if hasattr(config_root, 'run'): | ||
if hasattr(config_root.run, 'translation_service'): | ||
translation_service = config_root.run.translation_service | ||
if translation_service == "local": | ||
if probename == "encoding": | ||
self.translator = LocalEncodingTranslator(config_root) | ||
elif probename == "goodside": | ||
self.translator = LocalGoodsideTranslator(config_root) | ||
elif probename == "dan": | ||
self.translator = LocalDanTranslator(config_root) | ||
else: | ||
self.translator = LocalTranslator(config_root) | ||
elif translation_service == "deepl" or translation_service == "nim": | ||
if probename == "encoding": | ||
self.translator = EncodingTranslator(config_root) | ||
elif probename == "goodside": | ||
self.translator = GoodsideTranslator(config_root) | ||
elif probename == "dan": | ||
self.translator = DanTranslator(config_root) | ||
else: | ||
self.translator = SimpleTranslator(config_root) | ||
|
||
if hasattr(config_root, 'run'): | ||
if hasattr(config_root.run, 'lang_spec'): | ||
self.target_lang = config_root.run.lang_spec | ||
|
||
if hasattr(self, 'triggers') and len(self.triggers) > 0: | ||
if self.is_nested_list(self.triggers): | ||
trigger_list = [] | ||
for trigger in self.triggers: | ||
trigger_words = self._translate(trigger) | ||
for word in trigger_words: | ||
trigger_list.append([word]) | ||
self.triggers = trigger_list | ||
else: | ||
self.triggers = self._translate(self.triggers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes assumptions that config_root
will be the global _config
object, this will not always be valid. The _translate()
method below assumes self.translator will always exist.
Probes should not be responsible for creation of the translator object, since the service will be utilized by both probes and detectors. This may be important for impacts to local
translation as It looks like the current implementation would instantiate and load the model object for each probe.
One option may be to have the translator package retain a singleton translator built from _config.run
on first access request by passing the probe.bcp47
value and the target language to get a translator object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I separate translation responsibility form probes and detectors.
I add the check function self.translator
exist or not each instance.
I will develop for singleton translator built from _config.run.
garak/probes/base.py
Outdated
for prompt in prompts: | ||
mean_word_judge = is_english(prompt) | ||
if mean_word_judge: | ||
translate_prompt = self.translator._get_response(prompt, self.bcp47, lang) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no guarantee self.translator
will exist. Probes are not required
to call super().__init__()
, while they should in most cases this needs a guard to ensure the attribute is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will set up the check function for self.translator
tests/detectors/test_detectors.py
Outdated
def detector(): | ||
_config.run.translation_service = 'local' | ||
detector = Detector(_config) | ||
detector.bcp47 = "en" | ||
detector.target_lang = "jap" | ||
return detector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to reset the _config
via importlib.reload
after the request completes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I use importlib.reload
for reset the _config
tests/detectors/test_leakreplay.py
Outdated
|
||
|
||
def test_startswith_detect(): | ||
_config.run.translation_service = 'local' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impacts to _config
inside a test need to be reset after the test completes, this may be a case where a shared fixture could be created in a conftest.py
here in the detectors
path or added to the existing global conftest.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I move set up _config
part to conftest.py
tests/test_translator.py
Outdated
@pytest.fixture | ||
def config(): | ||
config = _config | ||
return config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More _config
impacts that need to be reset after test completion. Likely will benefit from a shared fixture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I move the set up _config
part to conftest.py
tests/test_translator.py
Outdated
@pytest.mark.parametrize("translator_class, lang_specs, model_name, tokenizer_name", [ | ||
(LocalTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalDanTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalEncodingTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalGoodsideTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalDanTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalEncodingTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalGoodsideTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
]) | ||
def test_local_translate_multiple_languages(translator_class, lang_specs, model_name, tokenizer_name): | ||
config = _config | ||
input_text = "Hello, how are you?" | ||
config.run.local_model_name = model_name | ||
config.run.local_tokenizer_name = tokenizer_name | ||
|
||
for lang_spec in lang_specs: | ||
config.run.lang_spec = lang_spec | ||
translator = translator_class(config_root=config) | ||
translated_text = translator._get_response(input_text, | ||
source_lang="en", | ||
target_lang=lang_spec) | ||
assert isinstance(translated_text, str) | ||
assert translated_text != input_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test requires download of three models, facebook/m2m100_1.2B
~ 10GB, opus-mt-en-jap
~ 800MB and opus-mt-en-fr
~ 1GB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make the storage check function. If the environment does not have enough storage, those test skip.
garak/cli.py
Outdated
parser.add_argument( | ||
"--local_model_name", | ||
type=str, | ||
help="Model name", | ||
) | ||
parser.add_argument( | ||
"--local_tokenizer_name", | ||
type=str, | ||
help="Tokenizer name", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be exposed here? Should we accept a json option_file
here to allow for a generic generator
to be used here instead of restricting to HF models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with it. I think it will expand it for other eco-system such as ollama, ggml and so on.
tests/test_translator.py
Outdated
@pytest.mark.parametrize("translator_class, lang_specs, model_name, tokenizer_name", [ | ||
(LocalTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalDanTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalEncodingTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalGoodsideTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"), | ||
(LocalTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalDanTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalEncodingTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
(LocalGoodsideTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"), | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing in github
actions does not have access to GPU enabled systems, also this project cannot expect a GPU to be available in default environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to use CPU
default value for self.device
in the translator.py
.
garak/translator.py
Outdated
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") | ||
lang_specs = getattr(config_root.run, 'lang_spec', "jap") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuda
capability is not a default expectation for this project, while this does fall back to "cpu" there is likely work needed to enable this object on all supported platforms.
lang_spec
by default based on this PR should just be "en"
and should not load a model when no translation will ever occur. The translator services should be agnostic of any actual translation and source_lang
to target_lang
should just pass thru when both values are equal returning the original values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuda capability is not a default expectation for this project, while this does fall back to "cpu" there is likely work needed to enable this object on all supported platforms.
I try to change the default value "cpu". I think it can work if machine does not have GPU.
lang_spec by default based on this PR should just be "en" and should not load a model when no translation will ever occur.
OK. I change the default value of lang_spec
. And if the lang_spec
is "en", it will pass to load a model
source_lang to target_lang should just pass thru when both values are equal returning the original values.
OK. I add the check function.
update remove punctuation update english judge add translate function add logging translate result add Reverse translate for hf detector and snowball probes Signed-off-by: Masaya Ogushi <[email protected]>
check translator instance remove translate function reset config Signed-off-by: Masaya Ogushi <[email protected]>
check translator instance add reverse translator add test reverse translator Signed-off-by: Masaya Ogushi <[email protected]>
remove argument using generator_option_file Signed-off-by: Masaya Ogushi <[email protected]>
add load translator instance Signed-off-by: Masaya Ogushi <[email protected]>
check storage size set up each instance for each test Signed-off-by: Masaya Ogushi <[email protected]>
remove pyenchant Using nltk instead of pyenchant Signed-off-by: Masaya Ogushi <[email protected]>
update how to use translation function Signed-off-by: Masaya Ogushi <[email protected]>
…garak into feature/multilingual
Signed-off-by: SnowGushiGit <[email protected]>
Signed-off-by: Masaya Ogushi <[email protected]>
…garak into feature/multilingual
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this is a significant amount of work.
This is for translation-based multilingual, rather than general multilingual support, so we should be careful to separate these two functions. The ability to select which languages garak uses is distinct from how that language is achieved (in prompts, detectors, and so on).
There are some refactorings needed to get the PR to a sustainable state. We may need a couple more hooks, either injected into base classes or added there. It might be simplest to add these hooks to the harness. Happy to set up a call or instant message chat to discuss this. Given the breadth of the changes, it is likely to be beneficial to discuss plans and get good alignment while doing the rest of the changes.
garak/detectors/base.py
Outdated
if hasattr(config_root, 'plugins'): | ||
if hasattr(config_root.plugins, 'generators'): | ||
if "translation_service" in config_root.plugins.generators.keys(): | ||
translation_service = config_root.plugins.generators["translation_service"] | ||
self.translator = _config.load_translator(translation_service=translation_service, | ||
classname="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the base detector need to look for generators configurations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I defined the translation instance on the _config.py
.
It defined by the generators configurations.
I think your advice to add these hooks to the harness is the better.
garak/detectors/base.py
Outdated
if hasattr(config_root, 'plugins'): | ||
if hasattr(config_root.plugins, 'generators'): | ||
if "translation_service" in config_root.plugins.generators.keys(): | ||
translation_service = config_root.plugins.generators["translation_service"] | ||
self.reverse_translator = _config.load_translator(translation_service=translation_service, | ||
classname="reverse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be factored out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move this part to harness.
garak/detectors/base.py
Outdated
if hasattr(self, 'reverse_translator'): | ||
if self.reverse_translator is not None: | ||
non_none_outputs = self.reverse_translator.translate_prompts(non_none_outputs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we handle this in the harness rather than the detector? It's an orchestration issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I try it.
garak/detectors/base.py
Outdated
if hasattr(self, 'translator'): | ||
if self.translator is not None: | ||
self.substrings = self.translator.translate_prompts(self.substrings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we handle this in the harness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I try it.
garak/detectors/leakreplay.py
Outdated
if hasattr(self, 'translator'): | ||
if self.translator is not None: | ||
triggers = self.translator.translate_prompts(triggers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this piece of code appears a lot. we should really avoid this, and use/add/inject a hook in the base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't understand how to implement it.
Could you share with me your example code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider factoring up to harness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider factoring up to harness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
garak/detectors/base.py
Outdated
@@ -14,6 +14,7 @@ | |||
from garak.configurable import Configurable | |||
from garak.generators.huggingface import HFCompatible | |||
import garak.attempt | |||
from garak.translator import SimpleTranslator, LocalTranslator, is_english |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factoring this up to the harness keeps base class module-level imports light, which they must be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleTranslator
, LocalTranslator
does not need to import into the garak/detectors/base.py
.
I remove it.
docs/source/translator.rst
Outdated
DeepL | ||
~~~~~ | ||
|
||
.. code-block:: bash | ||
|
||
export DEEPL_API_KEY=xxxx | ||
|
||
NIM | ||
~~~ | ||
|
||
.. code-block:: bash | ||
|
||
export NIM_API_KEY=xxxx | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-standard - we should follow the same pattern as in garak.generators.base.Generator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Please share the example code how to set up the API_KEY value.
docs/source/translator.rst
Outdated
- translation_service: "nim" or "deepl", "local" | ||
- lang_spec: "ja", "ja,fr" etc. (you can set multiple language codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where in the config does this go?
recommend something like:
run:
language:
translation:
translation_service: nim
api_key: xxx
lang_from: en
lang_to: ja,fr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sets up value by the generator_option_file
.
It needs set up the generator json file
{
"lang_spec": "ja",
"translation_service": "local",
"local_model_name": "facebook/m2m100_418M",
"local_tokenizer_name": "facebook/m2m100_418M"
}
I follow this advice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SnowMasaya thank you, this is a significant benefit for the project.
Adding a number of my pending comments here, please be aware these are likely similar to many @leondz made, although possibly from different reasoning. Happy to continue iterating.
garak/detectors/base.py
Outdated
if hasattr(config_root, 'plugins'): | ||
if hasattr(config_root.plugins, 'generators'): | ||
if "translation_service" in config_root.plugins.generators.keys(): | ||
translation_service = config_root.plugins.generators["translation_service"] | ||
self.translator = _config.load_translator(translation_service=translation_service, | ||
classname="") | ||
if hasattr(self, 'substrings'): | ||
if hasattr(self, 'translator'): | ||
if self.translator is not None: | ||
self.substrings = self.translator.translate_prompts(self.substrings) | ||
|
||
if hasattr(config_root, 'plugins'): | ||
if hasattr(config_root.plugins, 'generators'): | ||
if "translation_service" in config_root.plugins.generators.keys(): | ||
translation_service = config_root.plugins.generators["translation_service"] | ||
self.reverse_translator = _config.load_translator(translation_service=translation_service, | ||
classname="reverse") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted before base.Detector
should not access _config
directly.
Additionally the location where language configuration is being placed in the config file structure is inconsistent with intended use, config_root.plugins.generators
is where plugins of type Generator
reference to obtain injected properties used for the target application under test and should not be mixed with language selection for the run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I follow the following code example?
docs/source/translator.rst
Outdated
The translator config writes to a file and the path passed, with | ||
`--generator_option_file` as JSON. An example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--generator_option_file
is for the target generator under test and should not be used to pass run
specific data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I follow this advice.
But I think misunderstanding it...
Could you tell me the best way to set up config value?
garak/detectors/base.py
Outdated
if hasattr(config_root, 'plugins'): | ||
if hasattr(config_root.plugins, 'generators'): | ||
if "translation_service" in config_root.plugins.generators.keys(): | ||
translation_service = config_root.plugins.generators["translation_service"] | ||
self.reverse_translator = _config.load_translator(translation_service=translation_service, | ||
classname="reverse") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detectors should not access _config
directly. The detectors also should not load
the translator at instantiation. The detector should be able to request a compatible translator from the a service that owns the responsibility for handling translation.
Consider:
def detector_translator(self):
from garak import translator
translator.getInstance(self)
and in translator:
@staticmethod
def getInstance(klass):
service = _translator_instance.gel(klass, None)
if service is None:
service = load_translator(klass)
_translator_instance[klass] = service
return service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I try it.
garak/probes/atkgen.py
Outdated
if hasattr(self, 'translator'): | ||
if self.translator is not None: | ||
challenges = self.translator.translate_prompts([challenge]) | ||
else: | ||
challenges = [challenge] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to create a list and then iterate inside the turn?
I suspect this is misplaced, translation in this probe should only need to happen for the initial challenge in the probe and all generators
used to perform red_team actions by manipulating on that prompt using a model in the target language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got same review comment.
I will recheck code and considering which part is the better place to implement translation function.
If you have any recommend way to implement it, please let me know.
garak/probes/atkgen.py
Outdated
if hasattr(config_root, 'run'): | ||
if hasattr(config_root.run, 'translation_service'): | ||
translation_service = config_root.run.translation_service | ||
class_name = self.probename.split(".")[-2] | ||
self.translator = _config.load_translator(translation_service=translation_service, | ||
classname=class_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as detectors, the probe plugins should not own translator configuration logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
tests/probes/test_probes_encoding.py
Outdated
from garak.probes.encoding import BaseEncodingProbe, InjectBase16, InjectBase32, InjectAscii85, \ | ||
InjectHex, InjectQP, InjectUU, InjectMime, \ | ||
InjectROT13, InjectBase2048, InjectBraille, \ | ||
InjectMorse, InjectNato, InjectEcoji, InjectZalgo | ||
from garak import _config, _plugins | ||
from garak.translator import is_english | ||
import pytest | ||
import garak | ||
import importlib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limit imports, the encoding classes here are not need as they are enumerated via the PROBES
constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will rewrite it.
@@ -12,6 +23,7 @@ def test_InjectBase64_len_cap(): | |||
assert len(p.prompts) < num_payloads * num_templates * num_encoders | |||
|
|||
|
|||
@pytest.fixture(scope="function") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this doing? Tests should not be annotated as fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. This test does not need it. I remove it this part.
@@ -20,6 +32,7 @@ def test_InjectBase64_prompt_trigger_match(): | |||
assert len(p.prompts) == len(p.triggers) | |||
|
|||
|
|||
@pytest.fixture(scope="function") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this doing? Tests should not be annotated as fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. This part does not need it. I remove it this part.
tests/test_translator.py
Outdated
_config.plugins.generators["translation_service"] = 'local' | ||
_config.plugins.generators["local_model_name"] = model_name | ||
_config.plugins.generators["local_tokenizer_name"] = tokenizer_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted lang_spec configuration should not be in _config.plugins.generators
. These values will come from config however they should not be a key under generators
as the structure of plugins.<plugin_type>
is meant to encapsulate shared configuration of each plugin of type in this case Generator
.
Possible suggestion:
_config.run.lang_spec = "ja"
_config.run.translation_service = { "type": "local", "model_name", "model_config" }
Also is there a reason for local_model_name
& local_tokenizer_name
? They look to be equal in all supplied tests, this suggest to me there should is not a need for specification of each independently. If we need to expand on the tokenizer configuration it should likely go inside a model_config
object that can supply parameters to a Configurable
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to place this in a dedicated sub-key of _config.run
, perhaps one related to multilingual config using translate (suggestion above #943 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I follow this advice. I set up value by the generator_option_file
.
I think I misunderstand about it.
I try to use _config.run
set up each propriety.
Also is there a reason for local_model_name & local_tokenizer_name?
Most hugging face model uses same name tokenizer and model names.
I think model_config
would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not mix configuration and runtime service creation in the same class, the translator package can consume the _config
however the _config
should not instantiate
any object.
Indeed, sorry for the dupe review, there was quite a lot of code so I reviewed directly instead of processing @jmartin-tech 's comments as well |
add mean judge for reverse translation change translation model size Signed-off-by: Masaya Ogushi <[email protected]>
translate trigger words Signed-off-by: Masaya Ogushi <[email protected]>
add reverse translation remove trigger translation fix test code Signed-off-by: Masaya Ogushi <[email protected]>
add translation instance Signed-off-by: Masaya Ogushi <[email protected]>
add library for reverse translation Signed-off-by: Masaya Ogushi <[email protected]>
Thank you for review. |
remove extra test code Signed-off-by: Masaya Ogushi <[email protected]>
Add attributes lang_type, reverse_translation for analysis translation result Signed-off-by: Masaya Ogushi <[email protected]>
move translation check code to another test code Signed-off-by: Masaya Ogushi <[email protected]>
check attempt language save reverse translation for analysis result Signed-off-by: Masaya Ogushi <[email protected]>
Signed-off-by: Masaya Ogushi <[email protected]>
Check StringDetector Check HF detector Check Yes, No start Detector Signed-off-by: Masaya Ogushi <[email protected]>
Only 1st probes translation atkgen goodside.py move translation function to translator.py base.py has get_translator function base.py add lang_type for attempt Signed-off-by: Masaya Ogushi <[email protected]>
save reverse translation which made by the dtector Signed-off-by: Masaya Ogushi <[email protected]>
Don't check non prompt probes Don't check visual probes Signed-off-by: Masaya Ogushi <[email protected]>
update remove english punctuation update english judge update split text add translate description remove some class and integrate with function to base class add lang class for analysis result Signed-off-by: Masaya Ogushi <[email protected]>
remove extra class check encode text translaiton check long sentence translation check add yaml for test without api key Signed-off-by: Masaya Ogushi <[email protected]>
using yaml file for translation Signed-off-by: Masaya Ogushi <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
510e63f
to
e53e7d2
Compare
* refactor for encapsulation * set default config entries for early load support * adds docs for new entries * initializes `lang_spec` to `en` * intializes `translators` to an empty list * revised test config formats * only maintain prompts in one language * limit to only translated prompts * treat "$" as a special line bypassing translation attempt * NLI detector should support None response * check translation required carefully * strings that do not contain "words" should bypass translation * tests of translation configuration require `lang_spec` * add remote translator specific class to tests * clarify base model prefix as opus-mt-* * detectors need translators in config * always return a translator even if just target to target * always have a translator * only attempt to translate output that is not None * force garbage collection after translator tests * validate probe trigger type during tranlation * translation needs lists of strings * support for nested lists is added for existing probes content * remove direct _config access in plugins * remove access to _config.run from `probe` classes * adjust goodside translations to not retain original prompts * refactor probe translation tests for unit testing * In the interest of reasonable execution time test probe call translation instead of executing translation. * probe translation tests as unit testing only * Translation actions are tested with there own tests. * remove side-effects for internal translation methods * latentinjection init adjustment * Remove extra call for translator * Ensure `_build_prompts_triggers` is called only once during init for all implemented classes. * bugfix - goodside instance instead of class attributes * remote test case corrections * extract translator base config restrictions * ENV var needs are handled by `remote` module * adjust docs for each class * match extending class method signature * consolidate nltk overrides in resources.api * remove no longer used "only_translate_word" * remove lang_list references, support a single target language * use pythonic code-style, adjust inline comments * refactor report file to rely on global fixture * rename base class to `Translator` * rename `SimpleTranslator` to `Translator` * source and target language determined via translator held values * update translation configuration docs Signed-off-by: Jeffrey Martin <[email protected]>
Streamline translation use case
Add multilingual support
Verification
python -m pytest tests/
docs/source/translator.rst
Feature explain
You need an API key for the preferred service.
DeepL
export DEEPL_API_KEY=xxxx
Config file
You can pass the translation service, source language, and target language as arguments.
Note: The Helsinki-NLP/opus-mt-en-{lang} case uses different language formats. The language codes used to name models are inconsistent. Two-digit codes can usually be found here, while three-digit codes require a search like "language code {code}".
Examples for multilingual support
DeepL
To use the translation option for garak, run the following command:
If you save the config file as "garak/configs/simple_translate_config_deepl.yaml", use this command:
Example config file:
NIM
For NIM, run the following command:
If you save the config file as "garak/configs/simple_translate_config_nim.yaml", use this command:
Example config file:
Local
For local translation, use the following command:
If you save the config file as "garak/configs/simple_translate_config_local.yaml", use this command:
Example config file:
Specific Hardware Examples:
Local Translation needs GPU
cuda