Set CPU affinity and membind for better OOB performance #853
Conversation
@sywangyi @jiqing-feng please review.
Hi, generally I think such logic should not be put in model code; it should be in framework code like transformers, accelerate, deepspeed, etc. Also, model code should be adapted to variant usage like autoTP for inference. If the IPEX model is running with TP=2, this logic would be wrong in that case. @yao-matrix
optimum/intel/ipex/modeling_base.py
Outdated
@@ -153,6 +167,18 @@ def __init__(
else:
    self._device = torch.device("cpu")

import numa
Should add a check that the numa package is available.
optimum/intel/ipex/modeling_base.py
Outdated
numa.set_membind([0])
print("affinity", numa.get_affinity(0))
print("membind", numa.get_membind())
Also, if OMP_NUM_THREADS or membind is already set by the external user, we should not override the user's configuration.
The tensor-parallel case should be considered as well. A minimal sketch of such a guard is shown below.
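For illustration only, a minimal sketch of the guard being asked for here, using a hypothetical helper name; the actual code may structure this differently:

import os

def maybe_set_omp_num_threads(default_threads: int) -> None:
    # Hypothetical helper: only apply our default when the user has not
    # already configured OMP_NUM_THREADS themselves.
    if os.environ.get("OMP_NUM_THREADS") is None:
        os.environ["OMP_NUM_THREADS"] = str(default_threads)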
Hi @rbrugaro. The issue still exists. I think you should check the world size (rank number) first to know how many processes you have, and then you can choose the best binding strategy for each rank.
We plan to integrate the best core-binding strategy into the IPEX backend, so that other users can call this function to get the best performance of IPEXModel on CPU. cc @echarlaix
@jiqing-feng @sywangyi I updated the code to read world_size and rank from environment variables and to set good default settings only if they are unset by the user.
1 process:
2 processes:
Note: if the multi-process run is launched with
Usually we only bind physical cores, which are 0-55 and 56-111 in your case. The memory bind is also required. It's weird that
@jiqing-feng @rbrugaro, this could only happen if local_world_size in accelerate is equal to 1: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/environment.py#L213 and https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L239. It seems MPI is not launched in the env. Have you followed https://github.com/huggingface/accelerate/blob/main/docs/source/usage_guides/ipex.md?
@sywangyi @jiqing-feng It turned out to be a bug in accelerate that was setting the wrong
@sywangyi @jiqing-feng @echarlaix I have addressed the comments, and accelerate PR #3009 has been merged. Please review.
Hi @IlyasMoutawwakil. Could you merge this PR? Thanks!
@sywangyi @jiqing-feng I pushed one commit that remained in my local checkout; the results above correspond to it. I double checked: for rank_id=1, numa.set_membind([node_id]) is applied.
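For illustration, a rough sketch of the per-rank memory binding being discussed; the rank-to-node mapping is an assumption based on the snippets quoted in this PR, not the merged code:

import numa

def bind_rank_to_node(rank_id: int, num_nodes: int) -> None:
    # Map each local rank to one NUMA node, e.g. rank_id=1 on a 2-node host -> node 1,
    # and restrict its memory allocations to that node.
    node_id = rank_id % num_nodes
    numa.set_membind([node_id])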
In the previous logic, with 2 nodes and 4 processes on one dev... In your modified logic, with 2 nodes and 4 processes on one dev...
You are correct @sywangyi, I had missed it because I was testing with 2 machines, 2 processes per node.
import importlib.util
import platform
why not at the top?
moved
import math

import numa
import psutil
math could be at the top here; numa is checked, but psutil is not.
I think we should continue to use methods like the ones in
def is_ipex_available():
Added is_numa_available to import_utils.py.
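Roughly, such a check could mirror the importlib.util pattern shown above; this is a sketch, not necessarily the exact helper that landed in import_utils.py:

import importlib.util

_numa_available = importlib.util.find_spec("numa") is not None

def is_numa_available():
    # True when the optional 'numa' package can be imported.
    return _numa_available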
Thanks for this utility! It's really cool!
One question: does this remove the need for numactl numa-args python script.py args
or python -m torch.backends.xeon.run_cpu xeon-args script.py args
entirely?
@IlyasMoutawwakil thanks for the comments, please check whether I have addressed them properly. This utility assigns a default system configuration that works well for most common scenarios. However, it may not be optimal for every situation or machine configuration, as performance can vary depending on the model, task, and other concurrent processes on the host. What we find is that some users do not use numactl or configure the environment, and they miss out on performance. The utility is designed so that any environment/system configuration by the user takes precedence, but if the user does not specify the settings, the utility applies a default configuration that improves out-of-the-box performance. Specifically, it sets OMP_NUM_THREADS and the torch thread count to an equal split of the available physical cores, divided across the ranks running on the machine, and performs NUMA affinity and memory binding to avoid the overhead of accessing memory on remote NUMA nodes.
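To make the described defaults concrete, here is a condensed sketch of the flow. The env-variable fallbacks, helper names, and the numa/psutil calls are assumptions based on the snippets quoted in this PR; the merged utility may differ in details such as hyper-threading handling and the membind guard.

import math
import os

import numa
import psutil
import torch


def get_int_from_env(env_keys, default):
    # Return the first integer found among the given environment variables.
    for key in env_keys:
        value = os.environ.get(key)
        if value is not None:
            return int(value)
    return default


def apply_default_binding():
    # How many ranks run on this host, and which one we are.
    local_size = get_int_from_env(
        ["LOCAL_WORLD_SIZE", "MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1
    )
    rank_id = get_int_from_env(["LOCAL_RANK", "MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK"], 0)

    # Split the physical cores evenly across the local ranks.
    physical_cores = psutil.cpu_count(logical=False)
    threads_per_rank = max(1, math.floor(physical_cores / local_size))

    # Respect any threading configuration the user already set.
    if os.environ.get("OMP_NUM_THREADS") is None:
        os.environ["OMP_NUM_THREADS"] = str(threads_per_rank)
        torch.set_num_threads(threads_per_rank)

    # Bind memory allocations to this rank's NUMA node to avoid remote accesses;
    # the real utility also checks first that the user has not narrowed the membind.
    num_nodes = numa.get_max_node() + 1
    numa.set_membind([rank_id % num_nodes])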
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
raise ImportError("'numa' module not found, install with 'pip install numa'")
import numa

local_size = get_int_from_env(["MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1)
There's also LOCAL_WORLD_SIZE from torchrun.
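A minimal illustration of the suggestion (the precedence order is an assumption): also consult torchrun's LOCAL_WORLD_SIZE alongside the MPI variables when deriving the per-host rank count.

import os

local_size = int(
    os.environ.get("LOCAL_WORLD_SIZE")
    or os.environ.get("MPI_LOCALNRANKS")
    or os.environ.get("OMPI_COMM_WORLD_LOCAL_SIZE")
    or os.environ.get("MV2_COMM_WORLD_LOCAL_SIZE")
    or 1
)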
@IlyasMoutawwakil I fixed the last style/quality issue and checks are passing. If you think it is ready, please merge. Thanks for all the help!
Sets the number of threads and CPU affinity to the number of physical CPUs in a socket and performs memory binding to a single NUMA node for better out-of-the-box performance.