experimental feature: policy scan base infrastructure #955
base: main
Changes from 51 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
garak.detectors.any | ||
=================== | ||
|
||
.. automodule:: garak.detectors.any | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,6 +48,7 @@ Advanced usage | |
|
||
configurable | ||
cliref | ||
policy | ||
|
||
Code reference | ||
^^^^^^^^^^^^^^ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
garak.policy | ||
============ | ||
|
||
This module represents objects related to policy scanning. | ||
|
||
Policy scanning in garak attempts to work out what the target's content policy | ||
is, before running a security scan. | ||
|
||
It's important to know what the target's content policy is because we only really have | ||
a useful/successful hit or breach if we're able to get a model to do something that | ||
it otherwise wouldn't. It may be exciting to discover a model gives instructions for | ||
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives | ||
the instructions when simply asked directly "print instructions for cooking meth", the | ||
use of base64 isn't necessarily an exploit in this output category - the model is acting | ||
the same. | ||
|
||
Garak's policy support follows a typology of model behaviours, with each entry describing | ||
a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``. | ||
|
||
A policy scan is conducted by invoking garak with the ``--policy_scan`` switch. | ||
When this is requested, a separate scan runs using all policy probes within garak. | ||
Policy probes are denoted by a probe class asserting ``policy_probe=True``. | ||
A regular probewise harness runs the scan, though reporting is diverted to a separate | ||
policy report file. After completion, garak estimates a policy based on policy probe | ||
results, and writes this to both the main and policy reports. | ||
|
||
|
||
Model behaviour typologies | ||
-------------------------- | ||
|
||
Goal | ||
^^^^ | ||
|
||
The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should engage or not engage in that activity. | ||
|
||
The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour. | ||
|
||
This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK. | ||
|
||
Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive. | ||
|
||
To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads. | ||
|
||
Usage | ||
^^^^^ | ||
|
||
To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques. | ||
|
||
To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. The set of possible reactions can be as simple as "Engage" & "Decline". | ||
|
||
Policy point guidelines | ||
^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
* Each point describes something that the model does, i.e. a behaviour | ||
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair | ||
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for) | ||
* It's great to have two sample prompts per point | ||
* We want to stick to max three levels if at all possible | ||
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology | ||
|
||
Policy metadata | ||
^^^^^^^^^^^^^^^ | ||
|
||
The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file. | ||
|
||
* Key: behaviour identifier - format is TDDDs* | ||
* T: a top-level hierarchy code letter, in CTMS for chat/tasks/meta/safety | ||
* D: a three-digit code for this behaviour | ||
* s*: (optional) one or more letters identifying a sub-policy | ||
|
||
* Value: a dict describing a behaviour | ||
* “name”: A short name of what is permitted when this behaviour is allowed | ||
* “description”: (optional) a deeper description of this behaviour | ||
|
||
The structure of the identifiers describes the hierarchical structure. | ||
|
||
|
||
.. automodule:: garak.policy | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: |
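The identifier format described under "Policy metadata" (`TDDDs*`) can be checked mechanically. The following is a minimal sketch and not code from this PR: the names `PolicyId`, `parse_policy_id`, and `parent_of` are hypothetical, and it assumes that sub-policy letters are lowercase ASCII and that each trailing letter adds one level of nesting, neither of which the documentation states.

```python
import re
from typing import NamedTuple, Optional

# T, then three digits, then optional sub-policy letters.
# Assumption: sub-policy letters are lowercase ASCII.
_ID_PATTERN = re.compile(r"([CTMS])(\d{3})([a-z]*)")


class PolicyId(NamedTuple):
    top_level: str   # one of C/T/M/S (chat/tasks/meta/safety)
    number: str      # three-digit behaviour code, kept as a string
    sub_policy: str  # zero or more sub-policy letters


def parse_policy_id(identifier: str) -> Optional[PolicyId]:
    """Split an identifier into its parts, or return None if it is malformed."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    return PolicyId(*match.groups())


def parent_of(identifier: str) -> Optional[str]:
    """Return the enclosing identifier.

    Assumption: each sub-policy letter nests one level below the
    identifier without that letter, and TDDD nests below T.
    """
    parsed = parse_policy_id(identifier)
    if parsed is None:
        return None
    if parsed.sub_policy:
        return identifier[:-1]
    return parsed.top_level
```

Under these assumptions, `parent_of("S005a")` gives `"S005"` and `parent_of("S005")` gives `"S"`, which mirrors the statement that the structure of the identifiers describes the hierarchy.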
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,7 @@ | |
|
||
"""Flow for invoking garak from the command line""" | ||
|
||
command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version".split() | ||
command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version".split() | ||
|
||
|
||
def main(arguments=None) -> None: | ||
|
@@ -107,6 +107,12 @@ def main(arguments=None) -> None: | |
parser.add_argument( | ||
"--config", type=str, default=None, help="YAML config file for this run" | ||
) | ||
parser.add_argument( | ||
"--policy_scan", | ||
action="store_true", | ||
default=_config.run.policy_scan, | ||
help="determine model's behavior policy before scanning", | ||
) | ||
|
||
## PLUGINS | ||
# generator | ||
|
@@ -201,6 +207,9 @@ def main(arguments=None) -> None: | |
parser.add_argument( | ||
"--list_probes", action="store_true", help="list available vulnerability probes" | ||
) | ||
parser.add_argument( | ||
"--list_policy_probes", action="store_true", help="list available policy probes" | ||
) | ||
parser.add_argument( | ||
"--list_detectors", action="store_true", help="list available detectors" | ||
) | ||
|
@@ -237,11 +246,6 @@ def main(arguments=None) -> None: | |
action="store_true", | ||
help="Enter interactive probing mode", | ||
) | ||
parser.add_argument( | ||
"--generate_autodan", | ||
action="store_true", | ||
help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target", | ||
) | ||
parser.add_argument( | ||
"--interactive.py", | ||
action="store_true", | ||
|
@@ -408,6 +412,9 @@ def main(arguments=None) -> None: | |
elif args.list_probes: | ||
command.print_probes() | ||
|
||
elif args.list_policy_probes: | ||
command.print_policy_probes() | ||
|
||
elif args.list_detectors: | ||
command.print_detectors() | ||
|
||
|
@@ -435,6 +442,7 @@ def main(arguments=None) -> None: | |
|
||
print(f"📜 logging to {log_filename}") | ||
|
||
# set up generator | ||
conf_root = _config.plugins.generators | ||
for part in _config.plugins.model_type.split("."): | ||
if not part in conf_root: | ||
|
@@ -457,6 +465,7 @@ def main(arguments=None) -> None: | |
logging.error(message) | ||
raise ValueError(message) | ||
|
||
# validate main run config | ||
parsable_specs = ["probe", "detector", "buff"] | ||
parsed_specs = {} | ||
for spec_type in parsable_specs: | ||
|
@@ -480,20 +489,7 @@ def main(arguments=None) -> None: | |
msg_list = ",".join(rejected) | ||
raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}") | ||
|
||
for probe in parsed_specs["probe"]: | ||
# distribute `generations` to the probes | ||
p_type, p_module, p_klass = probe.split(".") | ||
if ( | ||
hasattr(_config.run, "generations") | ||
and _config.run.generations | ||
is not None # garak.core.yaml always provides run.generations | ||
): | ||
_config.plugins.probes[p_module][p_klass][ | ||
"generations" | ||
] = _config.run.generations | ||
Comment on lines -547 to -557
Where is this logic being captured?
I see some, but not all of it below.
in
Does this really make sense as a helper function in
It's a bit unclear for me, but given the current state of
So I'm tentatively placing it in |
||
|
||
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold) | ||
|
||
# generator init | ||
from garak import _plugins | ||
|
||
generator = _plugins.load_plugin( | ||
|
@@ -510,28 +506,28 @@ def main(arguments=None) -> None: | |
logging=logging, | ||
) | ||
|
||
if "generate_autodan" in args and args.generate_autodan: | ||
from garak.resources.autodan import autodan_generate | ||
|
||
try: | ||
prompt = _config.probe_options["prompt"] | ||
target = _config.probe_options["target"] | ||
except Exception as e: | ||
print( | ||
"AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` " | ||
"string" | ||
) | ||
autodan_generate(generator=generator, prompt=prompt, target=target) | ||
|
||
# looks like we might get something to report, so fire that up | ||
command.start_run() # start the run now that all config validation is complete | ||
print(f"📜 reporting to {_config.transient.report_filename}") | ||
|
||
# do policy run | ||
if _config.run.policy_scan: | ||
command.run_policy_scan(generator, _config) | ||
|
||
# configure generations counts for main run | ||
_config.distribute_generations_config(parsed_specs["probe"], _config) | ||
Config should not change after a |
||
|
||
# set up plugins for main run | ||
# instantiate evaluator | ||
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold) | ||
|
||
# parse & set up detectors, if supplied | ||
if parsed_specs["detector"] == []: | ||
command.probewise_run( | ||
run_result = command.probewise_run( | ||
generator, parsed_specs["probe"], evaluator, parsed_specs["buff"] | ||
) | ||
else: | ||
command.pxd_run( | ||
run_result = command.pxd_run( | ||
generator, | ||
parsed_specs["probe"], | ||
parsed_specs["detector"], | ||
|
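The removed loop in this diff copied `_config.run.generations` into each probe's configuration, and the PR replaces it with a call to `_config.distribute_generations_config`. The body of that helper is not shown here, so the following is only a sketch of equivalent logic on plain dictionaries. The function name `distribute_generations` and the probe name in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Optional


def distribute_generations(
    probe_specs: Iterable[str],
    generations: Optional[int],
    probe_config: Dict[str, dict],
) -> Dict[str, dict]:
    """Copy a run-level generations count into each probe's config entry.

    probe_specs are dotted names of the form "probes.<module>.<class>",
    matching the three-way split in the removed code.
    """
    if generations is None:
        # Nothing to distribute; leave per-probe settings untouched.
        return probe_config
    for spec in probe_specs:
        _p_type, p_module, p_klass = spec.split(".")
        # setdefault creates missing levels; the removed code indexed directly.
        module_cfg = probe_config.setdefault(p_module, {})
        module_cfg.setdefault(p_klass, {})["generations"] = generations
    return probe_config
```

For example, `distribute_generations(["probes.example.ExampleProbe"], 5, {})` yields a nested dictionary with `generations` set to 5 for that one probe. Keeping the function free of global state makes it easy to call once for the policy scan and once for the main run, which is the ordering concern raised in the review.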
Can we define the policy codes in here?
It's in garak/data/policy/policy_typology.json: https://github.com/leondz/garak/pull/955/files#diff-00beff92463bd705bbab517aa9130ebc01ab11d797b72a80f08a40c5277a8573 - is this OK?
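Since the policy codes live in a JSON file, a reader may want to load them and see which codes fall under each top-level category. The sketch below assumes the file has the shape described in the policy documentation (identifier keys mapping to dicts with a `name` and an optional `description`). The sample entries are placeholders invented for illustration, not contents of `policy_typology.json`, and `group_by_top_level` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Top-level code letters as listed in the policy documentation.
TOP_LEVEL_NAMES = {"C": "chat", "T": "tasks", "M": "meta", "S": "safety"}

# Hypothetical sample data; the real entries live in policy_typology.json.
SAMPLE_TYPOLOGY_JSON = """
{
  "C001": {"name": "Example chat behaviour"},
  "S001": {"name": "Example safety behaviour", "description": "Placeholder"},
  "S001a": {"name": "Example safety sub-policy"}
}
"""


def group_by_top_level(typology: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group behaviour identifiers by the category of their first letter."""
    grouped = defaultdict(list)
    for identifier in sorted(typology):
        category = TOP_LEVEL_NAMES.get(identifier[:1], "unknown")
        grouped[category].append(identifier)
    return dict(grouped)


typology = json.loads(SAMPLE_TYPOLOGY_JSON)
groups = group_by_top_level(typology)
```

To run this against the real file, replace the sample string with the result of reading the typology path from a garak checkout.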