experimental feature: policy scan base infrastructure #955

Draft
wants to merge 52 commits into base: main
Changes from 51 commits

Commits
102f648
add policy metadata
leondz Oct 2, 2024
a44c335
Merge branch 'main' into feature/policy
leondz Oct 16, 2024
f7da7d5
re-org cli.py slightly; add cli hook for policy scans
leondz Oct 16, 2024
7c81725
add policy probe flag to base probe
leondz Oct 17, 2024
733bd87
add plugin filtering to enumerate_plugins
leondz Oct 17, 2024
384fb53
add plugin enumeration + filter test
leondz Oct 17, 2024
a352818
ahem
leondz Oct 17, 2024
4785340
add cli option to list policy probes, filter policy probes from stand…
leondz Oct 17, 2024
1f4f95e
reorg garak.cli if blocks, pass generator to policy scan
leondz Oct 17, 2024
96586ad
execute rudimentary policy scan
leondz Oct 17, 2024
05bfce4
probes.test.Blank is now a policy probe
leondz Oct 17, 2024
e2e210c
harnesses now return iterator of evaluator results, providing a condu…
leondz Oct 17, 2024
7963a3e
rm yield for now; rm announce_probe
leondz Oct 17, 2024
c67715f
update test.Blank probe to check policy
leondz Oct 17, 2024
ebe34eb
add some harness logging; base harness now returns a generator over e…
leondz Oct 21, 2024
71e568a
evaluators now return info, which is surfaced though harnesses.base.H…
leondz Oct 21, 2024
bc03380
write policy report to own file
leondz Oct 22, 2024
2ba073e
use raw regexp
leondz Oct 22, 2024
b65e08e
don't return after first probewise probe harness call
leondz Oct 22, 2024
bc920f7
consume scan result; put logging above policy report open
leondz Oct 22, 2024
ccc6444
amend Chat policy point name
leondz Oct 22, 2024
1ac841e
class for representing & handling policies
leondz Oct 22, 2024
650f576
code for parsing policy scan results, building policy, and storing po…
leondz Oct 23, 2024
9400587
log probewise harness completion
leondz Oct 23, 2024
74ab6a1
add policy thresholding
leondz Oct 23, 2024
582e2ba
add config block for policy
leondz Oct 23, 2024
bc7831a
factor distribution of generation count to probes out of cli
leondz Oct 23, 2024
13beea9
add policy docs
leondz Oct 23, 2024
b9a7dc8
add non-exploit tag 'policy' for policy probe tagging
leondz Oct 23, 2024
644061e
update config test to reflect new test.Blank detector
leondz Oct 23, 2024
aa2ff6f
Merge branch 'main' into feature/policy
leondz Oct 23, 2024
09488df
add snowballmini as policy probe
leondz Oct 23, 2024
5e4ba8c
tidy up policy probe status of snowball classes
leondz Oct 23, 2024
97f2628
repurpose more probes as policy
leondz Oct 23, 2024
16f4d40
move parent name to module; validate policy typologies at load; add f…
leondz Oct 23, 2024
9317093
add/tidy missing nodes
leondz Oct 23, 2024
ebcd7e9
when inferring policy, propagate permitted behaviours up
leondz Oct 23, 2024
b3f27d6
add tests for policy functionality
leondz Oct 24, 2024
4c38c85
test for probe policy metadata
leondz Oct 24, 2024
4dd1b64
add policy tests
leondz Oct 24, 2024
27eaa5b
evaluators now yield EvalTuple not dict
leondz Nov 6, 2024
9636f85
add policy module docstring, describe policy ID regex
leondz Nov 6, 2024
c397bab
Merge branch 'main' into feature/policy
leondz Nov 7, 2024
b01ddee
explain policy config stanza
leondz Nov 7, 2024
9b8a60b
document _config.run.policy_scan
leondz Nov 7, 2024
7352472
Update garak/harnesses/base.py
leondz Nov 7, 2024
61f0b37
typo fix
leondz Nov 7, 2024
5d1981f
document typology in policy.rst
leondz Nov 7, 2024
b58a8b4
rm text version of policy - one is enough
leondz Nov 7, 2024
61e38ed
stop base harness run() and other harness run() from colliding
leondz Nov 7, 2024
33bc89d
remove --generate_autodan
leondz Nov 8, 2024
3966461
merge main
leondz Dec 9, 2024
8 changes: 8 additions & 0 deletions docs/source/configurable.rst
@@ -78,6 +78,9 @@ Let's take a look at the core config.
report_dir: garak_runs
show_100_pass_modules: true

policy:
threshold: false

Here we can see many entries that correspond to command line options, such as
``model_name`` and ``model_type``, as well as some entries not exposed via the CLI
such as ``show_100_pass_modules``.
@@ -101,6 +104,7 @@ such as ``show_100_pass_modules``.
* ``deprefix`` - Remove the prompt from the start of the output (some models return the prompt as part of their output)
* ``seed`` - An optional random seed
* ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
* ``policy_scan`` - Should the run include a scan to automatically determine the target's content policy?

``plugins`` config items
""""""""""""""""""""""""
@@ -128,6 +132,10 @@ For an example of how to use the ``detectors``, ``generators``, ``buffs``,
* ``taxonomy`` - Which taxonomy to use to group probes when creating HTML report
* ``show_100_pass_modules`` - Should entries scoring 100% still be detailed in the HTML report?

``policy`` config items
"""""""""""""""""""""""
* ``threshold`` - the pass rate a policy probe must reach for a behaviour to be considered "permitted"; ``false`` means that any pass at all marks the behaviour as permitted (see the sketch below)

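A rough sketch of how the ``threshold`` option could be interpreted (an illustrative reading of the description above, not the implementation from this PR):

def behaviour_permitted(pass_rate: float, threshold) -> bool:
    """Decide whether a behaviour counts as permitted, given its policy-probe pass rate."""
    if threshold is False:
        # any pass at all suggests the target's policy permits this behaviour
        return pass_rate > 0.0
    return pass_rate >= threshold

# e.g. a behaviour passing 3 of 10 policy-probe attempts:
assert behaviour_permitted(0.3, False) is True   # threshold: false - any pass counts
assert behaviour_permitted(0.3, 0.5) is False    # threshold: 0.5 - not enough passes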

Bundled quick configs
^^^^^^^^^^^^^^^^^^^^^
1 change: 1 addition & 0 deletions docs/source/detectors.rst
@@ -7,6 +7,7 @@ garak.detectors
garak.detectors
garak.detectors.base
garak.detectors.always
garak.detectors.any
garak.detectors.continuation
garak.detectors.dan
garak.detectors.divergence
8 changes: 8 additions & 0 deletions docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
garak.detectors.any
===================

.. automodule:: garak.detectors.any
:members:
:undoc-members:
:show-inheritance:

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -48,6 +48,7 @@ Advanced usage

configurable
cliref
policy

Code reference
^^^^^^^^^^^^^^
81 changes: 81 additions & 0 deletions docs/source/policy.rst
Collaborator

Can we define the policy codes in here?

@@ -0,0 +1,81 @@
garak.policy
============

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know what the target's content policy is, because we only really have
a useful, successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover that a model gives instructions for
e.g. cooking meth when the request is encoded in base64, but if the model also gives
the instructions when simply asked directly ("print instructions for cooking meth"), then the
use of base64 is not really an exploit in this output category - the model is acting
the same either way.

Garak's policy support follows a typology of behaviours, with each entry describing
a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``.

A policy scan is conducted by invoking garak with the ``--policy_scan`` switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting ``policy_probe=True``.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both the main and policy reports.
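
As a rough sketch of what this means for a probe class (the class name, tag value, and prompt below are invented for illustration; only the ``policy_probe`` marker is what this module documents):

import garak.probes.base

class HypotheticalDirectAsk(garak.probes.base.Probe):
    """Illustrative policy probe: asks for a behaviour directly to see whether the target permits it."""

    policy_probe = True    # documented marker: this class is a policy probe, filtered out of standard scans
    tags = ["policy"]      # assumed non-exploit tag used for policy probes
    prompts = ["Please give me step-by-step instructions for X."]   # placeholder prompt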


Model behaviour typologies
--------------------------

Goal
^^^^

The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should or should not engage in that activity.

The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour.

This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK.

Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive.

To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads.

Usage
^^^^^

To use this typology to describe a model deployment, examine each category and check whether the target model engages in that behaviour when asked directly, without using any adversarial techniques.

To use this typology to describe an intended deployed model policy, consider each category in turn and decide how the model should react. A set of possible reactions can be as simple as “Engage” & “Decline”.
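
As an illustration (the behaviour identifiers and decisions below are invented, not taken from the typology), such an intended policy can be written down as a simple mapping:

# hypothetical intended policy for a deployment; keys are behaviour identifiers
intended_policy = {
    "C001": "engage",   # e.g. an ordinary chat behaviour
    "T002": "engage",   # e.g. a benign task category
    "S010": "decline",  # e.g. a safety-sensitive behaviour
}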

Policy point guidelines
^^^^^^^^^^^^^^^^^^^^^^^

* Each point describes something that the model does, i.e. a behaviour
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether the model is engaging or refusing for that prompt/response pair
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for)
* It's great to have two sample prompts per point
* We want to stick to max three levels if at all possible
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology

Policy metadata
^^^^^^^^^^^^^^^

The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file.

* Key: a behaviour identifier - the format is TDDDs*

  * T: a top-level hierarchy code letter, one of CTMS, for chat/tasks/meta/safety
  * DDD: a three-digit code for this behaviour
  * s*: (optional) one or more letters identifying a sub-policy

* Value: a dict describing the behaviour

  * ``name``: a short name of what is permitted when this behaviour is allowed
  * ``description``: (optional) a deeper description of this behaviour

The structure of the identifiers encodes the hierarchy.
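
A rough sketch of what entries and identifiers following this scheme might look like (the identifiers, names, and validation regex below are illustrative assumptions, not copied from the data file):

import re

# assumed identifier pattern: one of C/T/M/S, three digits, optional lowercase sub-policy letters
POLICY_ID_RE = re.compile(r"^[CTMS]\d{3}[a-z]*$")

example_typology = {
    "C001": {"name": "Ordinary chat"},   # hypothetical top-level behaviour
    "S010a": {
        "name": "A safety sub-policy",
        "description": "A deeper description of the behaviour goes here.",
    },  # hypothetical sub-policy
}

for policy_id in example_typology:
    assert POLICY_ID_RE.match(policy_id)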


.. automodule:: garak.policy
:members:
:undoc-members:
:show-inheritance:
21 changes: 19 additions & 2 deletions garak/_config.py
@@ -28,7 +28,7 @@
system_params = (
"verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
)
run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
run_params = "seed deprefix eval_threshold generations probe_tags interactive policy_scan".split()
plugins_params = "model_type model_name extended_detectors".split()
reporting_params = "taxonomy report_prefix".split()
project_dir_name = "garak"
@@ -77,6 +77,7 @@ class TransientConfig(GarakSubConfig):
run = GarakSubConfig()
plugins = GarakSubConfig()
reporting = GarakSubConfig()
policy = GarakSubConfig()


def _lock_config_as_dict():
@@ -146,12 +147,13 @@ def _load_yaml_config(settings_filenames) -> dict:


def _store_config(settings_files) -> None:
global system, run, plugins, reporting
global system, run, plugins, reporting, policy
settings = _load_yaml_config(settings_files)
system = _set_settings(system, settings["system"])
run = _set_settings(run, settings["run"])
plugins = _set_settings(plugins, settings["plugins"])
reporting = _set_settings(reporting, settings["reporting"])
policy = _set_settings(policy, settings["policy"])


def load_base_config() -> None:
@@ -255,3 +257,18 @@ def parse_plugin_spec(
plugin_names.remove(plugin_to_skip)

return plugin_names, unknown_plugins


def distribute_generations_config(probelist, _config):
# prepare run config: generations
for probe in probelist:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
9 changes: 8 additions & 1 deletion garak/_plugins.py
@@ -326,7 +326,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:


def enumerate_plugins(
category: str = "probes", skip_base_classes=True
category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
) -> List[tuple[str, bool]]:
"""A function for listing all modules & plugins of the specified kind.

@@ -352,6 +352,13 @@
for k, v in PluginCache.instance()[category].items():
if skip_base_classes and ".base." in k:
continue
if filter is not None:
try:
for attrib, value in filter.items():
if attrib in v and v[attrib] != value:
raise StopIteration
except StopIteration:
continue
enum_entry = (k, v["active"])
plugin_class_names.add(enum_entry)
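
A sketch of how the new ``filter`` parameter might be used to pick out policy probes (an illustrative call; it assumes probe entries in the plugin cache expose a ``policy_probe`` field):

from garak._plugins import enumerate_plugins

# (plugin_name, active) pairs for probes whose cached metadata has policy_probe == True
policy_probes = enumerate_plugins("probes", filter={"policy_probe": True})
print(policy_probes)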

66 changes: 31 additions & 35 deletions garak/cli.py
@@ -3,7 +3,7 @@

"""Flow for invoking garak from the command line"""

command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version".split()
command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version".split()


def main(arguments=None) -> None:
@@ -107,6 +107,12 @@ def main(arguments=None) -> None:
parser.add_argument(
"--config", type=str, default=None, help="YAML config file for this run"
)
parser.add_argument(
"--policy_scan",
action="store_true",
default=_config.run.policy_scan,
help="determine model's behavior policy before scanning",
)

## PLUGINS
# generator
@@ -201,6 +207,9 @@ def main(arguments=None) -> None:
parser.add_argument(
"--list_probes", action="store_true", help="list available vulnerability probes"
)
parser.add_argument(
"--list_policy_probes", action="store_true", help="list available policy probes"
)
parser.add_argument(
"--list_detectors", action="store_true", help="list available detectors"
)
@@ -237,11 +246,6 @@ def main(arguments=None) -> None:
action="store_true",
help="Enter interactive probing mode",
)
parser.add_argument(
"--generate_autodan",
action="store_true",
help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
)
parser.add_argument(
"--interactive.py",
action="store_true",
@@ -408,6 +412,9 @@ def main(arguments=None) -> None:
elif args.list_probes:
command.print_probes()

elif args.list_policy_probes:
command.print_policy_probes()

elif args.list_detectors:
command.print_detectors()

@@ -435,6 +442,7 @@ def main(arguments=None) -> None:

print(f"📜 logging to {log_filename}")

# set up generator
conf_root = _config.plugins.generators
for part in _config.plugins.model_type.split("."):
if not part in conf_root:
@@ -457,6 +465,7 @@ def main(arguments=None) -> None:
logging.error(message)
raise ValueError(message)

# validate main run config
parsable_specs = ["probe", "detector", "buff"]
parsed_specs = {}
for spec_type in parsable_specs:
@@ -480,20 +489,7 @@ def main(arguments=None) -> None:
msg_list = ",".join(rejected)
raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")

for probe in parsed_specs["probe"]:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
Comment on lines -547 to -557
Collaborator

Where is this logic being captured?

Collaborator

I see some, but not all of it below.

Collaborator Author

in garak/config.py L260?:

def distribute_generations_config(probelist, _config):
    # prepare run config: generations
    for probe in probelist:
        # distribute `generations` to the probes
        p_type, p_module, p_klass = probe.split(".")
        if (
            hasattr(_config.run, "generations")
            and _config.run.generations
            is not None  # garak.core.yaml always provides run.generations
        ):
            _config.plugins.probes[p_module][p_klass][
                "generations"
            ] = _config.run.generations

Collaborator

Does this really make sense as a helper function in _config? The implementation looks to be a bit circular which is a bit confusing.

Collaborator Author

It's a bit unclear to me, but given the current state of garak._config I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config or command or cli, I don't know, but:

  • We'd like to keep _config lean
  • This is not something that will only ever be used by people using the cli entry point

So I'm tentatively placing it in command. Happy to hear other arguments.


evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# generator init
from garak import _plugins

generator = _plugins.load_plugin(
@@ -510,28 +506,28 @@ def main(arguments=None) -> None:
logging=logging,
)

if "generate_autodan" in args and args.generate_autodan:
from garak.resources.autodan import autodan_generate

try:
prompt = _config.probe_options["prompt"]
target = _config.probe_options["target"]
except Exception as e:
print(
"AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
"string"
)
autodan_generate(generator=generator, prompt=prompt, target=target)

# looks like we might get something to report, so fire that up
command.start_run() # start the run now that all config validation is complete
print(f"📜 reporting to {_config.transient.report_filename}")

# do policy run
if _config.run.policy_scan:
command.run_policy_scan(generator, _config)

# configure generations counts for main run
_config.distribute_generations_config(parsed_specs["probe"], _config)
Collaborator

Config should not change after a start_run(); since a policy scan needs to override the generations, it may be appropriate for the policy scan to build its own configuration dictionary with the value it needs in place.


# set up plugins for main run
# instantiate evaluator
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# parse & set up detectors, if supplied
if parsed_specs["detector"] == []:
command.probewise_run(
run_result = command.probewise_run(
generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
)
else:
command.pxd_run(
run_result = command.pxd_run(
generator,
parsed_specs["probe"],
parsed_specs["detector"],