Should the number of contexts correspond one-to-one with the number of threads? #759

Open
jin-zhengnan opened this issue Aug 27, 2024 · 4 comments

Comments

@jin-zhengnan commented Aug 27, 2024

Describe the bug
In Mercury 2.3.1, applications using the TensorFlow framework create many threads, roughly 100 per process. If each thread gets its own Mercury context, the application crashes at runtime with the following output.

[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.838843 WARNING: PRINT_BACKTRACE: get a signal(11), pid:2983066, tid:298456in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839184 WARNING: PRINT_BACKTRACE: symbols=0x7fb640080410 pid:2983066, tid:2984568 in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839222 WARNING: Call stack: in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839261 WARNING: /xxx/libxxx-client.so(releasexxxClient+0x82) [0x7fb8bc159882] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839290 WARNING: /usr/lib64/libc.so.6(+0x37400) [0x7fb8ba8ef400] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839319 WARNING: /xxx/mercurylib/libna.so.4(+0x1b6cb) [0x7fb8ba25e6cb] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839346 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1084d) [0x7fb8ba48184d] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839373 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1254a) [0x7fb8ba48354a] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839399 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Core_progress+0x70) [0x7fb8ba48a840] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839433 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Progress+0xe) [0x7fb8ba47a17e] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839460 WARNING: /xxx/libxxx-client.so(+0x5b301) [0x7fb8bc0ae301] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839486 WARNING: /xxx/mercurylib/libmercury_util.so.4(hg_request_wait+0xe6) [0x7fb8ba03e996] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839512 WARNING: /xxx/libxxx-client.so(send_cumemcpyhtodasync_v2+0x119) [0x7fb8bc0b3fa9] in client.c(488)

However, if only one context is shared by all of these threads, the application runs normally but with poor performance.
So how many contexts should be created? Is there a recommended value?

@jin-zhengnan (Author)

@soumagne Hi, could you please take a look at this question? Thanks.

@soumagne (Member)

I can't see much from that trace, but in general you'd want to make sure that you set hg_init_info.na_init_info.max_contexts to the maximum number of contexts that you expect. This is currently a uint8, so depending on the underlying provider it should support 100 contexts. Alternatively, if you want to be able to address each of the contexts separately from the client, the other option is to create a separate class/context pair for each thread.
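
For reference, a minimal sketch of what that first suggestion might look like in code, assuming Mercury 2.3.x; the "na+sm" info string, the NUM_THREADS value, and the helper names are illustrative only, and HG_Context_create_id() is used on the assumption that each per-thread context should be individually addressable:

/* Sketch only: set max_contexts on the class, then give each thread its own
 * context ID. Names and values below are placeholders, not the reporter's code. */
#include <mercury.h>
#include <stdint.h>

#define NUM_THREADS 100 /* expected number of per-thread contexts */

static hg_class_t *
client_init_class(void)
{
    struct hg_init_info init_info = HG_INIT_INFO_INITIALIZER;

    /* Advertise up front how many contexts will be created on this class.
     * max_contexts is a uint8, so up to 255 can be requested; how many are
     * actually usable depends on the underlying NA provider. */
    init_info.na_init_info.max_contexts = NUM_THREADS;

    return HG_Init_opt("na+sm", HG_FALSE, &init_info);
}

/* Called once per thread: create a context with its own ID so the target
 * can address this thread's context separately. */
static hg_context_t *
client_create_thread_context(hg_class_t *hg_class, uint8_t thread_idx)
{
    return HG_Context_create_id(hg_class, thread_idx);
}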

@jin-zhengnan (Author) commented Aug 28, 2024

@soumagne Thanks for your reply.
The client is multithreaded. We create one class for the client, but one context per thread. There is one request_class and one request per thread, and each thread sends multiple messages. For each message, we create a new handle, send it, and destroy the handle after the response arrives.
The code looks like this:

/* Per message: create a handle, forward the RPC, wait for the response,
   then decode and free the output and destroy the handle. */
createMsgHandle(&handle, &info, gettid(), "xxxfunction");
unsigned int completed = 0;
hg_request_t args = {.expected_count = 1, .complete_count = 0, .request = info->request};
hg_request_reset(info->request);                            /* reset the per-thread request object */
ret = HG_Forward(handle, result_handle, &args, in);         /* send the RPC */
hg_request_wait(info->request, HG_MAX_TIMEOUT, &completed); /* block until the response arrives */
ret = HG_Get_output(handle, &out);                          /* decode the response */
HG_Free_output(handle, &out);
HG_Destroy(handle);                                         /* release the handle */

In this situation, we found that once the number of threads reaches a certain level, around 100, crashes become likely. The crash call stack points to the na_sm_progress function. What could be causing this crash? The call stack at the time of the crash is attached.

mercury_crash_stack.txt (see line 1352b)

Another question:
Is our current usage, where one class corresponds to multiple contexts, appropriate? What is the recommended approach, and how can we improve performance?

@soumagne (Member) commented Sep 3, 2024

It seems fine for that particular usage, though the combination of one class + multiple contexts is not extensively tested, so this might be an issue specific to the shared-memory plugin; I will need to see if I can reproduce it. As I was mentioning, you could instead try one class + one context per thread and see if that works better. That would also prevent any contention that might occur underneath when using multiple contexts per class. A sketch of that alternative is below.
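
For illustration, a minimal sketch of the one-class-plus-one-context-per-thread alternative, assuming Mercury 2.3.x and pthreads; the "na+sm" info string, the thread count, and the omitted error handling and RPC registration are placeholders:

/* Sketch only: each thread initializes and finalizes its own class/context pair. */
#include <mercury.h>
#include <pthread.h>
#include <stddef.h>

struct thread_state {
    hg_class_t   *hg_class;
    hg_context_t *hg_context;
};

static void *
thread_main(void *arg)
{
    struct thread_state *ts = (struct thread_state *) arg;

    /* Each thread owns an independent class/context pair, so its progress
     * and trigger calls do not share state with other threads. */
    ts->hg_class = HG_Init("na+sm", HG_FALSE);
    ts->hg_context = HG_Context_create(ts->hg_class);

    /* ... register RPCs, forward requests, run the progress/trigger loop ... */

    HG_Context_destroy(ts->hg_context);
    HG_Finalize(ts->hg_class);
    return NULL;
}

int
main(void)
{
    enum { NUM_THREADS = 100 }; /* placeholder thread count */
    pthread_t threads[NUM_THREADS];
    struct thread_state states[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_main, &states[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}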
