SPX - Group multicore SDR jobs - Testing & Call Outs #8598
Replies: 12 comments 13 replies
-
Tag updated to https://github.com/filecoin-project/lotus/releases/tag/group-sdr-rc3
-
First confirmation on ZEN3 platform:
Perfect allocation of 16 PC1 jobs in 8 CCXs across two CPUs.
-
Same result here. Works like a charm! 2 x AMD EPYC 7542 32-Core Processor (numactl -H output not preserved).
-
Hi guys, I would like to test it. Could you show me how to merge and compile lotus with this tag?
-
Depending on your available hardware, please also test if you can fit three jobs into a CCX group:
That's 9 jobs in one CCX on each CPU. I also tried running with only 1 producer thread and had it produce 28 sectors in parallel. The results were uneven: half of the sectors finished in 7h, whereas the run above and everything else I have seen finished in around 4h.
-
Is it possible to add a parameter that controls how the cores are used in a dual-processor configuration?
-
There is a problem on dual CPU + SMT=ON. I checked that on single CPU + SMT=ON the problem doesn't appear. Logs: [not preserved]
-
Any chance to use it after the nv16 upgrade?
-
Lotus has run without any problems for the last 24h with 3 workers.
-
@jennijuju is this 1.17.1 material?
-
Lotus is not able to permanently assign a task to a thread on ZEN2 (Threadripper 3960X).
-
Do you want to unleash the full potential of your Zen3 and Xeon 3rd-gen CPUs? Now is the time to try to push them to their limits!
Context:
Currently, multicore SDR assigns at most one multicore SDR task per CCX group until every core complex is full. If more jobs are scheduled than there are core complex groups available, the additional jobs are not pinned to specific cores and instead use whatever cores the operating system gives them. This makes sealing times much longer for the jobs that don't get assigned specific cores. This has been mostly fine on AMD Zen2-based CPUs, which have a lower core count per CCX group, but Zen3-based (and Xeon 3rd gen) CPUs introduced many more cores per CCX group (i.e. fewer CCX groups).
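As a rough illustration of the behaviour described above, here is a minimal Python sketch of the "one job per CCX group" policy. It is only a toy model, not the actual rust-fil-proofs scheduling code; the function name `assign_jobs` and its parameters are hypothetical.

```python
def assign_jobs(num_jobs, num_ccx_groups):
    """Toy model: at most one multicore SDR job is pinned per CCX group.

    Hypothetical helper for illustration only; it does not mirror the
    real implementation.
    """
    pinned = {}    # job index -> CCX group index
    unpinned = []  # jobs that are left to the operating system scheduler
    for job in range(num_jobs):
        if job < num_ccx_groups:
            pinned[job] = job      # one job per free CCX group
        else:
            unpinned.append(job)   # no free group left: job is not pinned
    return pinned, unpinned


# Example: 10 jobs on a machine with 8 CCX groups.
pinned, unpinned = assign_jobs(num_jobs=10, num_ccx_groups=8)
print(len(pinned), "jobs pinned;", "unpinned jobs:", unpinned)
```

In this model, the unpinned jobs are the ones that would see the longer sealing times.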
The group-sdr-rc3 tag should make it possible to put multiple multicore SDR jobs into one CCX group when there are not enough CCX groups available.
1. Upgrade to the group-sdr-rc3 tag
Upgrade your testing machines to group-sdr-rc3.
2. Adjust your producer variable
Activate multicore SDR by setting FIL_PROOFS_USE_MULTICORE_SDR=1 and adjust the number of SDR producers, FIL_PROOFS_MULTICORE_SDR_PRODUCERS, to suit your CPU. We want to test that multiple tasks can be assigned inside one CCX, so if your CPU has 8 cores in a CCX, the default setting of FIL_PROOFS_MULTICORE_SDR_PRODUCERS=3 should be able to fit two multicore SDR jobs into one CCX.
3. RUST logs
Set the Rust logs to RUST_LOG=debug.
4. Seal multiple sectors
Since multicore SDR will first assign jobs to available CCXs, you will need to seal at least one more sector than you have CCXs to test that the +1 job gets properly assigned to a CCX rather than left to the operating system.
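The steps above can be sketched as a small Python planning helper. It assumes that each multicore SDR job occupies one core per producer plus one core for the main thread; this is consistent with the "8 cores, 3 producers, two jobs per CCX" example above, but it is still an assumption you should check against your own hardware. The function name `plan_test` and its parameters are hypothetical; only the environment variable names come from this post.

```python
def plan_test(cores_per_ccx, ccx_groups, producers=3):
    """Estimate how many jobs fit per CCX and how many sectors to seal.

    Assumption: one job uses `producers` cores plus 1 core for the
    main thread. Hypothetical helper, not part of lotus or
    rust-fil-proofs.
    """
    cores_per_job = producers + 1
    jobs_per_ccx = cores_per_ccx // cores_per_job
    env = {
        "FIL_PROOFS_USE_MULTICORE_SDR": "1",
        "FIL_PROOFS_MULTICORE_SDR_PRODUCERS": str(producers),
        "RUST_LOG": "debug",
    }
    return {
        "env": env,
        "jobs_per_ccx": jobs_per_ccx,
        # At least one more sector than CCX groups, so one CCX must be shared.
        "min_sectors": ccx_groups + 1,
        # Upper bound on jobs that can be pinned under the assumption above.
        "max_pinned_jobs": jobs_per_ccx * ccx_groups,
    }


plan = plan_test(cores_per_ccx=8, ccx_groups=8)
for name, value in plan["env"].items():
    print(f"export {name}={value}")
print("jobs per CCX:", plan["jobs_per_ccx"])
print("seal at least", plan["min_sectors"], "sectors")
```

The printed export lines can be pasted into the shell that starts your worker. Lowering `producers` in the call shows how many more jobs could share a CCX under the same assumption.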
Reply card:
Since this testing is more open-ended and hardware-dependent, please state which type of CPU you are testing:
Depending on your available hardware, please also test if you can fit three jobs into a CCX group:
Add some observational notes here
Questions are welcome in this discussion; the team will monitor it and check in regularly.
Issues can be reported in https://github.com/filecoin-project/rust-fil-proofs. Issues should be submitted with:
RUST_LOG=debug output
As always, large log files can be uploaded to this gdrive folder (https://drive.google.com/drive/folders/1YY_HxXLn3MLMboFBNgALx9M-fxOVn8NE) and linked in the issue!
If you think the group-sdr-rc3 tag is not working well, keep all the logs so you can describe the issue, and revert to a stable tag.