[Performance Degradation] Unable to reproduce the example #130

Open
jackyk02 opened this issue Nov 16, 2023 · 3 comments

Comments

@jackyk02

Hi Sam, I was trying to reproduce the example:

import sys
from concurrent.futures import ThreadPoolExecutor

print(f"nogil={getattr(sys.flags, 'nogil', False)}")

def fib(n):
    if n < 2: return 1
    return fib(n-1) + fib(n-2)

threads = 8
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

with ThreadPoolExecutor(max_workers=threads) as executor:
    for _ in range(threads):
        executor.submit(lambda: print(fib(34)))

However, I got a completely different outcome. On a 32-thread Intel i9-13950HX (24 cores), a single thread took 0.537 seconds, while 20 threads took 1.445 seconds.

time python3.9 fib.py 1
nogil=True
9227465

real    0m0.537s
user    0m0.493s
sys     0m0.000s

time python3.9 fib.py 20
nogil=True
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465

real    0m1.445s
user    0m26.080s
sys     0m0.037s

@colesbury, I'm truly grateful for your contribution to the no GIL support in Python. It would be great to hear from you!

@colesbury
Owner

Hi @jackyk02,

First, in case it's not clear, the work scales with the number of threads. Running with 20 threads does 20 times more work than running with one thread. So the "ideal" case will take the same amount of time with 20 threads as it does with one thread.
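To put numbers on this, the timings reported above can be turned into a speedup figure. This is a minimal sketch; `weak_scaling` is a hypothetical helper name, and it assumes every thread does exactly one full unit of work, as in the fib example:

```python
def weak_scaling(t_single, t_multi, n_threads):
    """Return (speedup, efficiency) when each thread does one full unit of work."""
    # Running n_threads units of work one after another would take
    # n_threads * t_single seconds of wall-clock time.
    speedup = n_threads * t_single / t_multi
    # An efficiency of 1.0 means the multi-threaded run took as long as one thread.
    efficiency = speedup / n_threads
    return speedup, efficiency


# Wall-clock ("real") times from the report above: 1 thread vs. 20 threads.
speedup, efficiency = weak_scaling(0.537, 1.445, 20)
print(f"speedup={speedup:.2f}x, efficiency={efficiency:.0%}")
```

With the reported timings this works out to roughly 7.4 times the single-thread throughput, so the run is far from a slowdown even though it falls short of the ideal 20x.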

You are not seeing the ideal case because the i9-13950HX has 8 "performance" cores [1] and 16 "efficiency" cores. The efficiency cores run slower than the performance cores, and because of the way the mini benchmark is set up, it's only as fast as the slowest thread. Finally, your CPU will run a single core at a higher frequency than when multiple CPU cores are active, due to thermal throttling.

If you want to see nice linear scaling, you'll need to run only on the 8 performance cores and disable Intel's Turbo Boost as described at the bottom of the README.

Footnotes

  1. The performance cores support "hyper-threading", so they show up as 8+8 logical cores.
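One way to restrict a run to a subset of cores on Linux is to set the CPU affinity from inside the script before any worker threads start. The sketch below is not the README's method. The CPU numbering is an assumption (logical CPUs 0-15 being the hyper-threaded performance cores, with even numbers selecting one hardware thread per core) and should be checked with `lscpu --extended` on the actual machine. `pin_to_cpus` is a hypothetical helper name.

```python
import os

# ASSUMPTION: logical CPUs 0-15 are the hyper-threaded performance cores and
# even numbers select one hardware thread per core. Verify on your machine.
ASSUMED_P_CORE_CPUS = set(range(0, 16, 2))


def pin_to_cpus(wanted):
    """Restrict this process to the wanted CPUs that are available (Linux only).

    Returns the resulting affinity set, or None where affinity is unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    usable = set(wanted) & os.sched_getaffinity(0)
    if usable:
        # Threads started after this call inherit the restricted CPU set.
        os.sched_setaffinity(0, usable)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    cpus = pin_to_cpus(ASSUMED_P_CORE_CPUS)
    print("running on CPUs:", sorted(cpus) if cpus else "affinity not supported")
```

Calling `pin_to_cpus` at the top of fib.py, before the `ThreadPoolExecutor` is created, would keep the worker threads off the efficiency cores under the numbering assumption above. Turbo Boost still has to be disabled separately.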

@jackyk02
Author

Thanks a lot for providing the detailed explanation :)

@jackyk02
Author

Additionally, I've encountered a runtime error involving NumPy while running an OpenAI Gym example with multiple threads. The error occurs only when Gym environments are created from several threads at once, which triggers NumPy's non-re-entrant float printing code.

Steps to Reproduce
The following code snippet can be used to reproduce the issue:

import threading
import gym
import numpy as np

def worker_thread(worker, step_num):
    env = gym.make("CartPole-v1")
    env.reset()
    return env

# Number of threads
num_threads = 4
steps_per_thread = 5

# Creating and starting threads
threads = [threading.Thread(target=worker_thread, args=(i, steps_per_thread)) for i in range(num_threads)]
for thread in threads:
    thread.start()

# Joining threads
for thread in threads:
    thread.join()

Error Message:

RuntimeError: numpy float printing code is not re-entrant. Ping the devs to fix it.
File "/home/user/.local/lib/python3.9/site-packages/gym/envs/classic_control/cartpole.py", line 117, in __init__
File "/home/user/.local/lib/python3.9/site-packages/gym/spaces/box.py", line 25, in _short_repr
  return str(arr)
File "/home/user/.local/lib/python3.9/site-packages/numpy/core/arrayprint.py", line 1592, in _array_str_implementation
  return array2string(a, max_line_width, precision, suppress_small, ' ', "")
...

Environment
Python Version: nogil 3.9.10
OpenAI Gym Version: 0.26.2
Numpy Version: 1.22.3

This issue seems to be related to the re-entrancy of NumPy's float printing code, as suggested by the error message. It would be great if you could also offer some insight into this problem. Once again, thank you for your help!
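A possible workaround, untested against nogil and offered only as a sketch, is to serialize environment construction with a lock, since the traceback shows the float printing being reached from `cartpole.py`'s `__init__`. `make_serialized` and the stand-in factory are hypothetical; with Gym installed the factory would be `lambda: gym.make("CartPole-v1")`. This would not help if float printing is also reached concurrently from other code paths.

```python
import threading

# One process-wide lock guarding environment construction.
_construct_lock = threading.Lock()


def make_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding the construction lock."""
    with _construct_lock:
        return factory(*args, **kwargs)


def worker_thread(factory, results, index):
    # Only construction is serialized; work done with the returned object
    # afterwards still runs in parallel across threads.
    results[index] = make_serialized(factory)


if __name__ == "__main__":
    # Stand-in factory so the sketch runs without Gym installed.
    factory = dict
    num_threads = 4
    results = [None] * num_threads
    threads = [
        threading.Thread(target=worker_thread, args=(factory, results, i))
        for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(f"constructed {len(results)} objects")
```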
