
feature: max_workers / give kinda helpful message if too many open files #1110

Merged

Conversation

@leondz leondz commented Feb 24, 2025

OS can get upset if parallel_attempts goes too high. Give a clearer error message about this.

(garak) 09:13:05 x1:~/dev/garak [main] $ python -m garak -m nim -n meta/llama-3.2-3b-instruct -p phrasing.PastTenseMini --parallel_attempts 1000 -g 5
garak LLM vulnerability scanner v0.10.2.post1 ( https://github.com/NVIDIA/garak ) at 2025-02-24T09:13:12.943850
📜 logging to /home/lderczynski/.local/share/garak/garak.log
🦜 loading generator: NIM: meta/llama-3.2-3b-instruct
📜 reporting to /home/lderczynski/.local/share/garak/garak_runs/garak.fb21a28e-16c8-4496-bd9e-b0f694333003.report.jsonl
🕵️  queue of probes: phrasing.PastTenseMini
probes.phrasing.PastTenseMini:   0%|                                                                                                                        | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 14, in <module>
    main()
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 9, in main
    cli.main(sys.argv[1:])
  File "/home/lderczynski/dev/garak/garak/cli.py", line 594, in main
    command.probewise_run(
  File "/home/lderczynski/dev/garak/garak/command.py", line 237, in probewise_run
    probewise_h.run(generator, probe_names, evaluator, buffs)
  File "/home/lderczynski/dev/garak/garak/harnesses/probewise.py", line 107, in run
    h.run(model, [probe], detectors, evaluator, announce_probe=False)
  File "/home/lderczynski/dev/garak/garak/harnesses/base.py", line 123, in run
    attempt_results = probe.probe(model)
                      ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 219, in probe
    attempts_completed = self._execute_all(attempts_todo)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 181, in _execute_all
    with Pool(_config.system.parallel_attempts) as attempt_pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 282, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 65, in _launch
    child_r, parent_w = os.pipe()
                        ^^^^^^^^^
OSError: [Errno 24] Too many open files
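
For illustration, a minimal sketch of the kind of guard described above: wrap the pool creation and translate EMFILE into a clearer hint. The function name and message wording here are assumptions for illustration, not the merged garak change.

import errno
from multiprocessing import Pool

def open_attempt_pool(n_workers):
    # Sketch only: translate the opaque EMFILE failure into a hint that points
    # at the parallelism setting; any other OSError is re-raised unchanged.
    try:
        return Pool(n_workers)
    except OSError as e:
        if e.errno == errno.EMFILE:  # Errno 24: too many open files
            raise OSError(
                e.errno,
                f"Parallelism of {n_workers} needs more file handles than the OS allows; "
                "lower --parallel_attempts or raise the open-file limit (ulimit -n)",
            ) from e
        raise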

Verification

List the steps needed to make sure this thing works

  • try garak -m test -p test.Test --parallel_attempts 1000; the new error should appear on the CLI and in the log. If it doesn't, try a higher worker count, or reduce the open-file limit (ulimit -n); a helper for lowering the limit is sketched below.
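
A small standard-library helper, for illustration, for the "reduce the open-file limit" part of the step above: it lowers the soft open-file limit for the current Python process (and anything it spawns), making the failure easy to reproduce without huge worker counts. POSIX only, and not part of the PR itself.

import resource

# Sketch (POSIX only): shrink the soft RLIMIT_NOFILE so the "too many open files"
# path is reachable even with a modest --parallel_attempts value.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 64 if hard == resource.RLIM_INFINITY else min(64, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("soft open-file limit is now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])

Child processes inherit the lowered limit, so launching garak from this interpreter (for example via subprocess) keeps the reduced cap.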

@jmartin-tech jmartin-tech left a comment

This looks reasonable to me for parallel_attempts. What are your thoughts on adding a similar guard in generators/base.py for parallel_requests as well?

In theory, if both were set, the error would bubble up from the generator sub-processes. However, since parallel_requests is independent, a generator that requires a single request per call could produce a similar error even when parallel_attempts was not set.

At the same time, I wonder about the value of catching OSError like this: are we going down a path that will require additional handlers for various resource-limitation errors across supported operating systems?

Consider the command used to test this; run on a Windows install with only 4 GB of RAM, it can raise:

  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\popen_spawn_win32.py", line 75, in __init__
    hp, ht, pid, tid = _winapi.CreateProcess(
                       ^^^^^^^^^^^^^^^^^^^^^^
OSError: [WinError 1455] The paging file is too small for this operation to complete

leondz commented Feb 26, 2025

Amendments:

  • add the help for parallel_requests also
  • set a configurable max_workers value and check it during CLI validation (so we fail early instead of mid-run; see the validator sketch after this comment)
  • cap worker pool sizes for parallel_requests, parallel_attempts

Validation:

  • set config.system.max_workers to 1000 (on the high side) first

  • requests:

    • garak -m test -p test.Test --parallel_requests 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_requests 1000 - no crash (linux (sometimes))
    • garak -m test -p test.Test --parallel_requests 1000 -g 1000 - crash
  • attempts:

    • garak -m test -p test.Test --parallel_attempts 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_attempts 1000 - no crash (linux (sometimes))
    • garak -m test -p continuation.ContinueSlursReclaimedSlursFull --parallel_attempts 1000 - crash

-- I hope the Windows message is alright - I don't have a great idea of how this goes wrong
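
For illustration, a minimal sketch of the fail-early CLI validation described in the amendments above. The default of 500, the check_worker_count helper, and the exact wiring are assumptions; in garak the cap would come from config.system.max_workers.

import argparse

MAX_WORKERS = 500  # assumed default; in garak this would be read from config.system.max_workers

def check_worker_count(value):
    # Reject oversized parallelism at argument-parsing time, before any run starts.
    n = int(value)
    if n > MAX_WORKERS:
        raise argparse.ArgumentTypeError(
            f"Parallel worker count capped at {MAX_WORKERS} (config.system.max_workers)"
        )
    return n

parser = argparse.ArgumentParser(prog="garak")
parser.add_argument("--parallel_attempts", type=check_worker_count)
parser.add_argument("--parallel_requests", type=check_worker_count)

args = parser.parse_args(["--parallel_attempts", "200"])   # accepted
# parse_args(["--parallel_attempts", "1000"]) would exit with:
#   garak: error: argument --parallel_attempts: Parallel worker count capped at 500 (config.system.max_workers)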

@leondz leondz requested a review from jmartin-tech February 26, 2025 09:15
@leondz leondz changed the title give kinda helpful message if too many open files feature: give kinda helpful message if too many open files Feb 26, 2025
@jmartin-tech jmartin-tech self-assigned this Feb 28, 2025
@jmartin-tech jmartin-tech left a comment

Thanks for extending this to parallel_requests; this looks ready.

@jmartin-tech jmartin-tech left a comment

Sorry for the churn; final validation identified that the new system.max_workers is not taking overrides into account.

garak.site.yaml:

system:
  max_workers: 2000
python -m garak: error: argument --parallel_attempts: Parallel worker count capped at 500 (config.system.max_workers)
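
For illustration, one way to make the cap respect overrides like the garak.site.yaml above: merge site-level config before enforcing the limit, rather than validating against the built-in default at argparse time. All names here are illustrative, not garak's actual config plumbing.

DEFAULTS = {"max_workers": 500}

def effective_max_workers(site_overrides):
    # Merge site-level overrides (e.g. from garak.site.yaml) over the defaults,
    # so a site that raises max_workers is not rejected against the stock value.
    return {**DEFAULTS, **site_overrides}["max_workers"]

def validate_parallelism(requested, site_overrides):
    cap = effective_max_workers(site_overrides)
    if requested > cap:
        raise ValueError(
            f"Parallel worker count capped at {cap} (config.system.max_workers)"
        )
    return requested

print(validate_parallelism(1000, {"max_workers": 2000}))  # 1000 - allowed once the override is seen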

@leondz leondz changed the title feature: give kinda helpful message if too many open files feature: max_workers / give kinda helpful message if too many open files Mar 4, 2025
@jmartin-tech jmartin-tech force-pushed the update/large_parallel_exception_handling branch from 057bfd4 to 2af0fe7 Compare March 18, 2025 17:31
leondz commented Mar 18, 2025

Rebase looks good to me, happy to merge

@jmartin-tech jmartin-tech dismissed their stale review March 24, 2025 17:29

Latest updates look good. There are still some edge cases that might pop up in the future around ensuring a value set in a config file is valid; however, that is out of scope here for now.

@jmartin-tech jmartin-tech force-pushed the update/large_parallel_exception_handling branch from 5677c03 to fae9834 Compare March 25, 2025 17:34
@jmartin-tech jmartin-tech merged commit 3282b9e into NVIDIA:main Mar 25, 2025
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2025