Set CPU affinity and membind for better OOB performance #853
Conversation
@sywangyi @jiqing-feng please review.
Hi, generally I think such logic should not be put in model code; it should be in framework code like transformers, accelerate, deepspeed, etc. Also, model code should be adapted to variant usage like autoTP for inference. If the IPEX model is running with TP=2, this logic would be wrong in that case. @yao-matrix
optimum/intel/ipex/modeling_base.py
Outdated
@@ -153,6 +167,18 @@ def __init__(
else:
    self._device = torch.device("cpu")

import numa
Should add a check that the numa package is available.
optimum/intel/ipex/modeling_base.py
Outdated
numa.set_membind([0])
print("affinity", numa.get_affinity(0))
print("membind", numa.get_membind())
Also, if OMP_NUM_THREADS or membind is already set by the external user, we should not override the user's configuration.
The tensor-parallel case should be considered as well. A minimal sketch of such a guard is shown below.
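For illustration only, a minimal sketch of the guard being asked for here, using a hypothetical helper name; the actual code may structure this differently:

import os

def maybe_set_omp_num_threads(default_threads: int) -> None:
    # Hypothetical helper: only apply our default when the user has not
    # already configured OMP_NUM_THREADS themselves.
    if os.environ.get("OMP_NUM_THREADS") is None:
        os.environ["OMP_NUM_THREADS"] = str(default_threads)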
Hi @rbrugaro. The issue still exists. I think you should check the world size (rank number) first to know how many processes you have, and then you can choose the best binding strategy for each rank.
We plan to integrate the best core-binding strategy into the IPEX backend, so that other users can call this function to get the best performance of IPEXModel on CPU. cc @echarlaix
@jiqing-feng @sywangyi I updated the code to read world_size and rank from environment variables and to set good default settings only if they are unset by the user.
1 process:
2 processes:
Note: if the multi-process run is launched with
Usually we only bind physical cores, which are 0-55 and 56-111 in your case. The memory bind is also required. It's weird that
@jiqing-feng @rbrugaro, this could only happen if local_world_size in accelerate is equal to 1: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/environment.py#L213 and https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L239. It seems MPI is not launched in the env. Have you followed https://github.com/huggingface/accelerate/blob/main/docs/source/usage_guides/ipex.md?
@sywangyi @jiqing-feng It turned out to be a bug in accelerate that was setting the wrong
@sywangyi @jiqing-feng @echarlaix I have addressed the comments, and accelerate PR #3009 has been merged. Please review.
Hi @IlyasMoutawwakil. Could you merge this PR? Thanks!
@sywangyi @jiqing-feng I pushed one commit that remained in my local checkout; the results above correspond to it. I double checked: for rank_id=1, numa.set_membind([node_id]) is applied.
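For illustration, a rough sketch of the per-rank memory binding being discussed; the rank-to-node mapping is an assumption based on the snippets quoted in this PR, not the merged code:

import numa

def bind_rank_to_node(rank_id: int, num_nodes: int) -> None:
    # Map each local rank to one NUMA node, e.g. rank_id=1 on a 2-node host -> node 1,
    # and restrict its memory allocations to that node.
    node_id = rank_id % num_nodes
    numa.set_membind([node_id])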
In the previous logic, with 2 nodes and 4 processes on one dev... In your modified logic, with 2 nodes and 4 processes on one dev...
You are correct @sywangyi, I had missed it because I was testing with 2 machines, 2 processes per node.
import importlib.util
import platform
why not at the top?
moved
import math

import numa
import psutil
math could be at the top here; numa is checked, but psutil is not.
I think we should continue to use methods like the ones in
def is_ipex_available():
Added is_numa_available to import_utils.py.
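Roughly, such a check could mirror the importlib.util pattern shown above; this is a sketch, not necessarily the exact helper that landed in import_utils.py:

import importlib.util

_numa_available = importlib.util.find_spec("numa") is not None

def is_numa_available():
    # True when the optional 'numa' package can be imported.
    return _numa_available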
Thanks for this utility! It's really cool!
One question: does this remove the need for numactl numa-args python script.py args
or python -m torch.backends.xeon.run_cpu xeon-args script.py args
entirely?
@IlyasMoutawwakil thanks for the comments, please check whether I have addressed them properly. This utility assigns a default system configuration that works well for most common scenarios. However, it may not be optimal for every situation or machine configuration, as performance can vary depending on the model, task, and other concurrent processes on the host. What we find is that some users do not use numactl or configure the environment, and they miss out on performance. The utility is designed so that any environment/system configuration by the user takes precedence, but if the user does not specify the settings, the utility applies a default configuration that improves out-of-the-box performance. Specifically, it sets OMP_NUM_THREADS and the torch thread count to an equal split of the available physical cores, divided across the ranks running on the machine, and performs NUMA affinity and memory binding to avoid the overhead of accessing memory on remote NUMA nodes.
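To make the described defaults concrete, here is a condensed sketch of the flow. The env-variable fallbacks, helper names, and the numa/psutil calls are assumptions based on the snippets quoted in this PR; the merged utility may differ in details such as hyper-threading handling and the membind guard.

import math
import os

import numa
import psutil
import torch


def get_int_from_env(env_keys, default):
    # Return the first integer found among the given environment variables.
    for key in env_keys:
        value = os.environ.get(key)
        if value is not None:
            return int(value)
    return default


def apply_default_binding():
    # How many ranks run on this host, and which one we are.
    local_size = get_int_from_env(
        ["LOCAL_WORLD_SIZE", "MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1
    )
    rank_id = get_int_from_env(["LOCAL_RANK", "MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK"], 0)

    # Split the physical cores evenly across the local ranks.
    physical_cores = psutil.cpu_count(logical=False)
    threads_per_rank = max(1, math.floor(physical_cores / local_size))

    # Respect any threading configuration the user already set.
    if os.environ.get("OMP_NUM_THREADS") is None:
        os.environ["OMP_NUM_THREADS"] = str(threads_per_rank)
        torch.set_num_threads(threads_per_rank)

    # Bind memory allocations to this rank's NUMA node to avoid remote accesses;
    # the real utility also checks first that the user has not narrowed the membind.
    num_nodes = numa.get_max_node() + 1
    numa.set_membind([rank_id % num_nodes])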
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
raise ImportError("'numa' module not found, install with 'pip install numa'")
import numa

local_size = get_int_from_env(["MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1)
There's also LOCAL_WORLD_SIZE from torchrun.
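A minimal illustration of the suggestion (the precedence order is an assumption): also consult torchrun's LOCAL_WORLD_SIZE alongside the MPI variables when deriving the per-host rank count.

import os

local_size = int(
    os.environ.get("LOCAL_WORLD_SIZE")
    or os.environ.get("MPI_LOCALNRANKS")
    or os.environ.get("OMPI_COMM_WORLD_LOCAL_SIZE")
    or os.environ.get("MV2_COMM_WORLD_LOCAL_SIZE")
    or 1
)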
@IlyasMoutawwakil I fixed the last style/quality issue and checks are passing. If you think it is ready, please merge. Thanks for all the help!
Sets the number of threads and CPU affinity to the number of physical CPUs in a socket and performs memory binding to a single NUMA node for better out-of-the-box performance.