Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SmartSwitch] Extend reboot script for rebooting SmartSwitch #3566

Open
wants to merge 50 commits into
base: master
Choose a base branch
from

Conversation

vvolam
Copy link

@vvolam vvolam commented Oct 3, 2024

What I did

Extended reboot script for SmartSwitch cases to reboot entire SmartSwitch or a specific DPU

How I did it

Implemented changes according to https://github.com/sonic-net/SONiC/blob/605c3a56ac2717dbbb638433e7bb13054fc05a31/doc/smart-switch/reboot/reboot-hld.md

How to verify it

  • Verified the script on non-smart switch and didn't find any regressions. Also, script throws errors if any new smart switch related parameters are given by user.
  • Verified on NVIDIA smart switch.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@vvolam vvolam force-pushed the ss-reboot branch 2 times, most recently from 460146c to c72fbc0 Compare October 3, 2024 19:36
@vvolam vvolam force-pushed the ss-reboot branch 3 times, most recently from 8746356 to d6fc624 Compare October 23, 2024 20:33
@vvolam vvolam changed the title Extend reboot script for rebooting SmartSwitch [SmartSwitch] Extend reboot script for rebooting SmartSwitch Nov 4, 2024
@vvolam vvolam marked this pull request as ready for review November 4, 2024 19:39
@vvolam vvolam requested review from qiluo-msft and gpunathilell and removed request for gpunathilell February 5, 2025 04:52
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from ganglyu February 5, 2025 17:27
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Additionally remove informational logs which are flooding with so much info.
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw hdwhdw requested review from hdwhdw and gpunathilell and removed request for gpunathilell February 13, 2025 19:40
REBOOT_SCRIPT_NAME=$(basename $0)
REBOOT_TYPE="${REBOOT_SCRIPT_NAME}"
TAG_LATEST=no
REBOOT_FLAGS=""
FORCE_REBOOT="no"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable

@@ -64,6 +73,9 @@ function stop_pmon_service()
{
CONTAINER_STOP_RC=0
debug "Stopping pmon docker"
if [[ "${PRE_SHUTDOWN}" == "yes" ]]; then
systemctl disable pmon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this done so that we do not automatically restart PMON again? What about all the other non-critical dockers (other than gnmi)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this makes sure that pmon will not be restarted again. Other than pmon docker, there is no other service that is getting killed. Only syncd process is getting killed inside syncd container.

return ${EXIT_ERROR}
fi

if [[ "$REBOOT_TYPE" != $MODULE_REBOOT_SMARTSWITCH ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a check to see if reboot is being done on a DPU which is currently powered off and skipping reboot for that specific DPU?(Admin State=Down Operational State=Down)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in the case of MODULE_REBOOT_SMARTSWITCH we still need to suppresss the pcied warnings

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a check to see if reboot is being done on a DPU which is currently powered off and skipping reboot for that specific DPU?(Admin State=Down Operational State=Down)

We rely on platform API return code to determine this outcome.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in the case of MODULE_REBOOT_SMARTSWITCH we still need to suppresss the pcied warnings

With the discussions that are in process, redis-cli requests will be done inside PCI attach/detach. Once that is finalized, I will modify it here.

Comment on lines +256 to +259
is_smartswitch=$(python3 -c 'from utilities_common.chassis import is_smartswitch; print(is_smartswitch())')
if [ "$is_smartswitch" == "True" ] && [ "$REBOOT_DPU" == "yes" ]; then
exit $smart_switch_result
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is_smartswitch() is already a helper function in reboot_smartswitch_helper

{
local dpu_ip=$1
local port=$2
$(docker exec gnmi gnoi_client -target "${dpu_ip}:${port}" -logtostderr -notls -module System -rpc RebootStatus &>/dev/null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some integration test here to prevent CLI for gnoi_client accidentally changes. I recommend writing our own python grpc client, example: sonic-net/sonic-mgmt#17068, but that can be a long shot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants