[Performance Degradation] Unable to reproduce the example #130

jackyk02 · 2023-11-16T12:18:10Z

Hi Sam, I was trying to reproduce the example:

import sys
from concurrent.futures import ThreadPoolExecutor

print(f"nogil={getattr(sys.flags, 'nogil', False)}")

def fib(n):
    if n < 2: return 1
    return fib(n-1) + fib(n-2)

threads = 8
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

with ThreadPoolExecutor(max_workers=threads) as executor:
    for _ in range(threads):
        executor.submit(lambda: print(fib(34)))

However, I got a completely different outcome. On an Intel i9-13950HX (32 logical CPUs), a single thread took 0.537 seconds, while 20 threads took 1.445 seconds.

time python3.9 fib.py 1
nogil=True
9227465

real    0m0.537s
user    0m0.493s
sys     0m0.000s

time python3.9 fib.py 20
nogil=True
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465
9227465

real    0m1.445s
user    0m26.080s
sys     0m0.037s

@colesbury, I'm truly grateful for your contribution to the no GIL support in Python. It would be great to hear from you!

colesbury · 2023-11-16T16:12:24Z

Hi @jackyk02,

First, in case it's not clear, the work scales with the number of threads. Running with 20 threads does 20 times more work than running with one thread. So the "ideal" case will take the same amount of time with 20 threads as it does with one thread.
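To put numbers on this, here is a small sketch that uses only the wall-clock ("real") timings reported above. The variable names are illustrative, and the result is a rough estimate because both timings include interpreter startup.

```python
# Wall-clock ("real") timings taken from the report above.
t_one = 0.537    # seconds for 1 thread running 1 fib(34) call
t_many = 1.445   # seconds for 20 threads running 20 fib(34) calls
n_threads = 20

# Throughput in fib(34) calls completed per second.
throughput_one = 1 / t_one
throughput_many = n_threads / t_many

# In the "ideal" case t_many == t_one, so speedup == n_threads.
speedup = throughput_many / throughput_one
efficiency = speedup / n_threads

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints: speedup: 7.43x, efficiency: 37%
```

So the 20-thread run still completes about 7.4 times as much work per second as the single-thread run; it falls short of the ideal 20x rather than being slower overall.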

You are not seeing the ideal case because the i9-13950HX has 8 "performance" ¹ cores and 16 "efficiency" cores. The efficiency cores run slower than the performance cores and, the way the mini benchmark is set up, the run is only as fast as its slowest thread. Finally, your CPU runs a single core at a higher frequency than it can sustain when multiple cores are active, due to thermal throttling.

If you want to see nice linear scaling, you'll need to run only on the 8 performance cores and disable Intel's Turbo Boost as described at the bottom of the README.

¹ The performance cores support "hyper-threading", so they show up as 8+8 cores.
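One way to restrict a run to specific cores from inside the script is CPU affinity. The following is a minimal sketch, assuming Linux and assuming that logical CPUs 0-15 correspond to the 8 hyper-threaded performance cores. That numbering is a guess and should be checked on the actual machine, and this is not necessarily the method the README describes. `ASSUMED_P_CORE_CPUS` and `pin_to_cpus` are hypothetical names.

```python
import os

# Assumption: logical CPUs 0-15 are the 8 hyper-threaded performance cores.
# Check the real layout on your machine before relying on this.
ASSUMED_P_CORE_CPUS = set(range(16))

def pin_to_cpus(wanted):
    """Restrict the calling thread to the CPUs in `wanted`.

    Threads started afterwards inherit the affinity. Only CPUs the process
    is already allowed to use are kept. Returns the resulting affinity set,
    or None on platforms that lack os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    allowed = os.sched_getaffinity(0) & set(wanted)
    if not allowed:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)
```

Calling `pin_to_cpus(ASSUMED_P_CORE_CPUS)` at the top of fib.py, before the `ThreadPoolExecutor` is created, would keep the worker threads on those CPUs. It does not affect Turbo Boost, which has to be disabled separately.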

jackyk02 · 2023-11-17T02:00:44Z

Thanks a lot for providing the detailed explanation :)

jackyk02 · 2023-11-17T02:18:20Z

Additionally, I've encountered a runtime error involving NumPy while running an OpenAI Gym example with multi-threading in Python. The error occurs when Gym environments are created from multiple threads, which leads to a non-reentrant call in NumPy.

Steps to Reproduce
The following code snippet can be used to reproduce the issue:

import threading
import gym
import numpy as np

def worker_thread(worker, step_num):
    env = gym.make("CartPole-v1")
    env.reset()
    return env

# Number of threads
num_threads = 4
steps_per_thread = 5

# Creating and starting threads
threads = [threading.Thread(target=worker_thread, args=(i, steps_per_thread)) for i in range(num_threads)]
for thread in threads:
    thread.start()

# Joining threads
for thread in threads:
    thread.join()

Error Message:

RuntimeError: numpy float printing code is not re-entrant. Ping the devs to fix it.
File "/home/user/.local/lib/python3.9/site-packages/gym/envs/classic_control/cartpole.py", line 117, in __init__
File "/home/user/.local/lib/python3.9/site-packages/gym/spaces/box.py", line 25, in _short_repr
  return str(arr)
File "/home/user/.local/lib/python3.9/site-packages/numpy/core/arrayprint.py", line 1592, in _array_str_implementation
  return array2string(a, max_line_width, precision, suppress_small, ' ', "")
...

Environment
Python Version: nogil 3.9.10
OpenAI Gym Version: 0.26.2
NumPy Version: 1.22.3

This issue seems to be related to the re-entrancy of NumPy's float printing code, as suggested by the error message. It would be great if you could also offer insights regarding the problem. Once again, thank you for your help!
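A possible workaround, not confirmed anywhere in this thread, is to serialize environment construction with a lock so that only one thread at a time reaches the float printing code in `__init__`. The sketch below is an untested assumption about what would help. `make_serialized` and `run_workers` are hypothetical helpers, not part of Gym or NumPy.

```python
import threading

# Hypothetical helper lock; it only protects calls made through make_serialized.
_construct_lock = threading.Lock()

def make_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding a process-wide lock."""
    with _construct_lock:
        return factory(*args, **kwargs)

def run_workers(factory, num_threads=4):
    """Build one object per thread and return them in thread order."""
    results = [None] * num_threads

    def worker(index):
        # Each thread writes to its own slot, so no extra locking is needed here.
        results[index] = make_serialized(factory)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With the reproduction above, this would be used as `run_workers(lambda: gym.make("CartPole-v1"))`. Any float-to-string conversion that happens outside the lock, for example while printing observations in another thread, could still hit the same error.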
