Should the number of contexts correspond one-to-one with the number of threads? #759

Open
jin-zhengnan opened this issue Aug 27, 2024 · 4 comments

Comments

@jin-zhengnan commented Aug 27, 2024

Describe the bug
In Mercury 2.3.1, applications using the TensorFlow framework create many threads, roughly 100 per process. If each thread gets its own Mercury context, the application crashes at runtime with the following output.

[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.838843 WARNING: PRINT_BACKTRACE: get a signal(11), pid:2983066, tid:298456in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839184 WARNING: PRINT_BACKTRACE: symbols=0x7fb640080410 pid:2983066, tid:2984568 in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839222 WARNING: Call stack: in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839261 WARNING: /xxx/libxxx-client.so(releasexxxClient+0x82) [0x7fb8bc159882] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839290 WARNING: /usr/lib64/libc.so.6(+0x37400) [0x7fb8ba8ef400] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839319 WARNING: /xxx/mercurylib/libna.so.4(+0x1b6cb) [0x7fb8ba25e6cb] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839346 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1084d) [0x7fb8ba48184d] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839373 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1254a) [0x7fb8ba48354a] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839399 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Core_progress+0x70) [0x7fb8ba48a840] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839433 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Progress+0xe) [0x7fb8ba47a17e] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839460 WARNING: /xxx/libxxx-client.so(+0x5b301) [0x7fb8bc0ae301] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839486 WARNING: /xxx/mercurylib/libmercury_util.so.4(hg_request_wait+0xe6) [0x7fb8ba03e996] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839512 WARNING: /xxx/libxxx-client.so(send_cumemcpyhtodasync_v2+0x119) [0x7fb8bc0b3fa9] in client.c(488)

However, if only one context is shared by all of these threads, the application runs normally but with poor performance.
So how many contexts should be created? Is there a recommended value?

@jin-zhengnan (Author)

@soumagne Hi, could you please take a look at this question? Thanks.

@soumagne (Member)

I can't see much from that trace, but in general you'd want to make sure that you set hg_init_info.na_init_info.max_contexts to the maximum number of contexts that you expect. This is currently a uint8, so depending on the underlying provider it should support 100 contexts. Alternatively, if you want to be able to address each of the contexts separately from the client, the other option is to create a separate class/context pair for each thread.
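
For reference, a minimal sketch of what that first suggestion might look like in code, assuming Mercury 2.3.x; the "na+sm" info string, the NUM_THREADS value, and the helper names are illustrative only, and HG_Context_create_id() is used on the assumption that each per-thread context should be individually addressable:

/* Sketch only: set max_contexts on the class, then give each thread its own
 * context ID. Names and values below are placeholders, not the reporter's code. */
#include <mercury.h>
#include <stdint.h>

#define NUM_THREADS 100 /* expected number of per-thread contexts */

static hg_class_t *
client_init_class(void)
{
    struct hg_init_info init_info = HG_INIT_INFO_INITIALIZER;

    /* Advertise up front how many contexts will be created on this class.
     * max_contexts is a uint8, so up to 255 can be requested; how many are
     * actually usable depends on the underlying NA provider. */
    init_info.na_init_info.max_contexts = NUM_THREADS;

    return HG_Init_opt("na+sm", HG_FALSE, &init_info);
}

/* Called once per thread: create a context with its own ID so the target
 * can address this thread's context separately. */
static hg_context_t *
client_create_thread_context(hg_class_t *hg_class, uint8_t thread_idx)
{
    return HG_Context_create_id(hg_class, thread_idx);
}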

@jin-zhengnan (Author) commented Aug 28, 2024

@soumagne Thanks for your reply.
The client is multithreaded. We create one class for the client, but one context per thread. There is one request_class and one request per thread, and each thread sends multiple messages. For each message, we create a new handle, send it, and destroy the handle after the response arrives.
The code looks like this:

/* Per message: create a handle, forward the RPC, wait for the response,
   then decode and free the output and destroy the handle. */
createMsgHandle(&handle, &info, gettid(), "xxxfunction");
unsigned int completed = 0;
hg_request_t args = {.expected_count = 1, .complete_count = 0, .request = info->request};
hg_request_reset(info->request);                            /* reset the per-thread request object */
ret = HG_Forward(handle, result_handle, &args, in);         /* send the RPC */
hg_request_wait(info->request, HG_MAX_TIMEOUT, &completed); /* block until the response arrives */
ret = HG_Get_output(handle, &out);                          /* decode the response */
HG_Free_output(handle, &out);
HG_Destroy(handle);                                         /* release the handle */

In this situation, we found that once the number of threads reaches a certain level, around 100, crashes become likely. The crash call stack points to the na_sm_progress function. What could be causing this crash? The call stack at the time of the crash is attached.

mercury_crash_stack.txt (see line 1352b)

Another question:
Is our current usage, where one class corresponds to multiple contexts, appropriate? What is the recommended approach, and how can we improve performance?

@soumagne (Member) commented Sep 3, 2024

It seems fine for that particular usage, though the combination of one class + multiple contexts is not extensively tested, so this might be an issue specific to the shared-memory plugin; I will need to see if I can reproduce it. As I was mentioning, you could instead try one class + one context per thread and see if that works better. That would also prevent any contention that might occur underneath when using multiple contexts per class. A sketch of that alternative is below.
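
For illustration, a minimal sketch of the one-class-plus-one-context-per-thread alternative, assuming Mercury 2.3.x and pthreads; the "na+sm" info string, the thread count, and the omitted error handling and RPC registration are placeholders:

/* Sketch only: each thread initializes and finalizes its own class/context pair. */
#include <mercury.h>
#include <pthread.h>
#include <stddef.h>

struct thread_state {
    hg_class_t   *hg_class;
    hg_context_t *hg_context;
};

static void *
thread_main(void *arg)
{
    struct thread_state *ts = (struct thread_state *) arg;

    /* Each thread owns an independent class/context pair, so its progress
     * and trigger calls do not share state with other threads. */
    ts->hg_class = HG_Init("na+sm", HG_FALSE);
    ts->hg_context = HG_Context_create(ts->hg_class);

    /* ... register RPCs, forward requests, run the progress/trigger loop ... */

    HG_Context_destroy(ts->hg_context);
    HG_Finalize(ts->hg_class);
    return NULL;
}

int
main(void)
{
    enum { NUM_THREADS = 100 }; /* placeholder thread count */
    pthread_t threads[NUM_THREADS];
    struct thread_state states[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_main, &states[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}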
