
feature: max_workers / give kinda helpful message if too many open files #1110

Merged

Conversation

@leondz leondz commented Feb 24, 2025

OS can get upset if parallel_attempts goes too high. Give a clearer error message about this.

(garak) 09:13:05 x1:~/dev/garak [main] $ python -m garak -m nim -n meta/llama-3.2-3b-instruct -p phrasing.PastTenseMini --parallel_attempts 1000 -g 5
garak LLM vulnerability scanner v0.10.2.post1 ( https://github.com/NVIDIA/garak ) at 2025-02-24T09:13:12.943850
📜 logging to /home/lderczynski/.local/share/garak/garak.log
🦜 loading generator: NIM: meta/llama-3.2-3b-instruct
📜 reporting to /home/lderczynski/.local/share/garak/garak_runs/garak.fb21a28e-16c8-4496-bd9e-b0f694333003.report.jsonl
🕵️  queue of probes: phrasing.PastTenseMini
probes.phrasing.PastTenseMini:   0%|                                                                                                                        | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 14, in <module>
    main()
  File "/home/lderczynski/dev/garak/garak/__main__.py", line 9, in main
    cli.main(sys.argv[1:])
  File "/home/lderczynski/dev/garak/garak/cli.py", line 594, in main
    command.probewise_run(
  File "/home/lderczynski/dev/garak/garak/command.py", line 237, in probewise_run
    probewise_h.run(generator, probe_names, evaluator, buffs)
  File "/home/lderczynski/dev/garak/garak/harnesses/probewise.py", line 107, in run
    h.run(model, [probe], detectors, evaluator, announce_probe=False)
  File "/home/lderczynski/dev/garak/garak/harnesses/base.py", line 123, in run
    attempt_results = probe.probe(model)
                      ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 219, in probe
    attempts_completed = self._execute_all(attempts_todo)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/dev/garak/garak/probes/base.py", line 181, in _execute_all
    with Pool(_config.system.parallel_attempts) as attempt_pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/context.py", line 282, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/lderczynski/anaconda3/envs/garak/lib/python3.12/multiprocessing/popen_fork.py", line 65, in _launch
    child_r, parent_w = os.pipe()
                        ^^^^^^^^^
OSError: [Errno 24] Too many open files
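
For illustration, a minimal sketch of the kind of guard described above: wrap the pool creation and translate EMFILE into a clearer hint. The function name and message wording here are assumptions for illustration, not the merged garak change.

import errno
from multiprocessing import Pool

def open_attempt_pool(n_workers):
    # Sketch only: translate the opaque EMFILE failure into a hint that points
    # at the parallelism setting; any other OSError is re-raised unchanged.
    try:
        return Pool(n_workers)
    except OSError as e:
        if e.errno == errno.EMFILE:  # Errno 24: too many open files
            raise OSError(
                e.errno,
                f"Parallelism of {n_workers} needs more file handles than the OS allows; "
                "lower --parallel_attempts or raise the open-file limit (ulimit -n)",
            ) from e
        raise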

Verification

List the steps needed to make sure this thing works

  • try garak -m test -p test.Test --parallel_attempts 1000; the new error should appear on the CLI and in the log. If it doesn't, try a higher worker count, or reduce the open-file limit (ulimit -n); a helper for lowering the limit is sketched below.
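
A small standard-library helper, for illustration, for the "reduce the open-file limit" part of the step above: it lowers the soft open-file limit for the current Python process (and anything it spawns), making the failure easy to reproduce without huge worker counts. POSIX only, and not part of the PR itself.

import resource

# Sketch (POSIX only): shrink the soft RLIMIT_NOFILE so the "too many open files"
# path is reachable even with a modest --parallel_attempts value.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 64 if hard == resource.RLIM_INFINITY else min(64, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("soft open-file limit is now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])

Child processes inherit the lowered limit, so launching garak from this interpreter (for example via subprocess) keeps the reduced cap.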

@jmartin-tech jmartin-tech left a comment

This looks reasonable to me for parallel_attempts. What are your thoughts on adding a similar guard in generators/base.py for parallel_requests as well?

In theory, if both were set, the error would bubble up from the generator sub-processes. However, since parallel_requests is independent, a generator that requires a single request per call could produce a similar error even when parallel_attempts was not set.

At the same time, I wonder about the value of catching OSError like this: are we going down a path that will require additional handlers for various resource-limitation errors across supported operating systems?

Consider the command used to test this; run on a Windows install with only 4 GB of RAM, it can raise:

  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Win10x64\miniconda3\Lib\multiprocessing\popen_spawn_win32.py", line 75, in __init__
    hp, ht, pid, tid = _winapi.CreateProcess(
                       ^^^^^^^^^^^^^^^^^^^^^^
OSError: [WinError 1455] The paging file is too small for this operation to complete

leondz commented Feb 26, 2025

Amendments:

  • add the help for parallel_requests also
  • set a configurable max_workers value and check it during CLI validation (so we fail early instead of mid-run; see the validator sketch after this comment)
  • cap worker pool sizes for parallel_requests, parallel_attempts

Validation:

  • set config.system.max_workers to 1000 (on the high side) first

  • requests:

    • garak -m test -p test.Test --parallel_requests 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_requests 1000 - no crash (linux (sometimes))
    • garak -m test -p test.Test --parallel_requests 1000 -g 1000 - crash
  • attempts:

    • garak -m test -p test.Test --parallel_attempts 2000 - rejected before run starts
    • garak -m test -p test.Test --parallel_attempts 1000 - no crash (linux (sometimes))
    • garak -m test -p continuation.ContinueSlursReclaimedSlursFull --parallel_attempts 1000 - crash

-- I hope the Windows message is alright - I don't have a great idea of how this goes wrong
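
For illustration, a minimal sketch of the fail-early CLI validation described in the amendments above. The default of 500, the check_worker_count helper, and the exact wiring are assumptions; in garak the cap would come from config.system.max_workers.

import argparse

MAX_WORKERS = 500  # assumed default; in garak this would be read from config.system.max_workers

def check_worker_count(value):
    # Reject oversized parallelism at argument-parsing time, before any run starts.
    n = int(value)
    if n > MAX_WORKERS:
        raise argparse.ArgumentTypeError(
            f"Parallel worker count capped at {MAX_WORKERS} (config.system.max_workers)"
        )
    return n

parser = argparse.ArgumentParser(prog="garak")
parser.add_argument("--parallel_attempts", type=check_worker_count)
parser.add_argument("--parallel_requests", type=check_worker_count)

args = parser.parse_args(["--parallel_attempts", "200"])   # accepted
# parse_args(["--parallel_attempts", "1000"]) would exit with:
#   garak: error: argument --parallel_attempts: Parallel worker count capped at 500 (config.system.max_workers)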

@leondz leondz requested a review from jmartin-tech February 26, 2025 09:15
@leondz leondz changed the title give kinda helpful message if too many open files feature: give kinda helpful message if too many open files Feb 26, 2025
@jmartin-tech jmartin-tech self-assigned this Feb 28, 2025
@jmartin-tech jmartin-tech left a comment

Thanks for extending this to parallel_requests; this looks ready.

@jmartin-tech jmartin-tech left a comment

Sorry for the churn; final validation identified that the new system.max_workers is not taking overrides into account.

garak.site.yaml:

system:
  max_workers: 2000
python -m garak: error: argument --parallel_attempts: Parallel worker count capped at 500 (config.system.max_workers)
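
For illustration, one way to make the cap respect overrides like the garak.site.yaml above: merge site-level config before enforcing the limit, rather than validating against the built-in default at argparse time. All names here are illustrative, not garak's actual config plumbing.

DEFAULTS = {"max_workers": 500}

def effective_max_workers(site_overrides):
    # Merge site-level overrides (e.g. from garak.site.yaml) over the defaults,
    # so a site that raises max_workers is not rejected against the stock value.
    return {**DEFAULTS, **site_overrides}["max_workers"]

def validate_parallelism(requested, site_overrides):
    cap = effective_max_workers(site_overrides)
    if requested > cap:
        raise ValueError(
            f"Parallel worker count capped at {cap} (config.system.max_workers)"
        )
    return requested

print(validate_parallelism(1000, {"max_workers": 2000}))  # 1000 - allowed once the override is seen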

@leondz leondz changed the title feature: give kinda helpful message if too many open files feature: max_workers / give kinda helpful message if too many open files Mar 4, 2025
@jmartin-tech jmartin-tech force-pushed the update/large_parallel_exception_handling branch from 057bfd4 to 2af0fe7 Compare March 18, 2025 17:31
leondz commented Mar 18, 2025

Rebase looks good to me, happy to merge

@jmartin-tech jmartin-tech dismissed their stale review March 24, 2025 17:29

Latest updates look good. There are still some edge cases that might pop up in the future around ensuring a value set in a config file is valid; however, that is out of scope here for now.

@jmartin-tech jmartin-tech force-pushed the update/large_parallel_exception_handling branch from 5677c03 to fae9834 Compare March 25, 2025 17:34
@jmartin-tech jmartin-tech merged commit 3282b9e into NVIDIA:main Mar 25, 2025
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2025