OTA Public Release Process

jnorgan edited this page Jun 9, 2015 · 4 revisions

Overview

The OTA (Over The Air) update process is controlled by group membership and upgrade path nodes. Every device belongs to exactly one group. If a device’s group is not explicitly defined in the Admin Teams Tool, the device defaults to the group named ‘release’. Each group may or may not have upgrade path nodes defined. Each upgrade path node specifies a path from one firmware version to another, along with a rollout percentage.

When a device uploads a batch, it provides information to Suripu, including its uptime and current firmware version. Suripu then determines which group the device belongs to and checks whether an upgrade path node exists that matches both that group and the current firmware version. If one exists, the destination FW version is used to determine the S3 bucket that holds the new FW. Otherwise, Suripu falls back to the group name for the same purpose. In either case, Suripu then looks in this S3 bucket for a ‘build_info.txt’ file and any other files that may need to be provided to the device (kitsune.bin, top.bin, etc.). The ‘build_info.txt’ file contains the FW version hash for the files in the bucket. Suripu compares this value to the FW version reported by the device to decide whether to reply to the device with a list of files.
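
The lookup described above can be sketched roughly as follows. This is an illustration only, not Suripu's actual code: `UpgradeNode`, `resolve_bucket_key`, and `should_reply_with_files` are hypothetical names, the nodes are assumed to be held in a dictionary keyed by (group, current FW version), and the rollout percentage is carried but not applied, because this page does not say how Suripu uses it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

DEFAULT_GROUP = "release"  # group used when none is defined for the device


@dataclass(frozen=True)
class UpgradeNode:
    """Hypothetical shape of an upgrade path node."""
    group: str
    from_fw: str
    to_fw: str
    rollout_percent: float  # stored only; not applied in this sketch


def resolve_bucket_key(device_group: Optional[str], current_fw: str,
                       nodes: Dict[Tuple[str, str], UpgradeNode]) -> str:
    """Return the name used to locate the S3 bucket for this device."""
    group = device_group or DEFAULT_GROUP
    node = nodes.get((group, current_fw))
    # A matching node points at the destination FW version;
    # otherwise fall back to the group name.
    return node.to_fw if node is not None else group


def should_reply_with_files(build_info_fw: str, current_fw: str) -> bool:
    """Reply with a file list only when the bucket's FW differs from the device's."""
    # There is no notion of "newer" here, only "different".
    return build_info_fw != current_fw
```

Keeping the bucket lookup and the version comparison as separate steps mirrors the text: the node (or the group name) only selects a bucket, and the decision to reply comes from ‘build_info.txt’.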

Restrictions

There are some restrictions in place that will prevent an OTA from occurring for a particular device. First, the uptime check prevents any device from being updated within the first hour after it is powered on. Second, the device’s timezone is used to determine whether the current time falls within the update window of 11:00AM-8:00PM local time. These restrictions can be bypassed by adding a device ID or group to the ‘bypass_ota_checks’ configuration.
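
A minimal sketch of these checks, under stated assumptions: `ota_allowed` is a hypothetical function, the bypass configuration is modelled as a plain set of device IDs and group names, and both ends of the update window are treated as inclusive, which this page does not specify.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import AbstractSet

MIN_UPTIME = timedelta(hours=1)   # no OTA within the first hour after power-on
WINDOW_START = time(11, 0)        # 11:00AM local time
WINDOW_END = time(20, 0)          # 8:00PM local time


def ota_allowed(uptime: timedelta, now_utc: datetime, device_tz: tzinfo,
                device_id: str, group: str, bypass: AbstractSet[str]) -> bool:
    """Return True if neither restriction blocks an OTA for this device."""
    # A device ID or group listed in the bypass set skips both checks.
    if device_id in bypass or group in bypass:
        return True
    if uptime < MIN_UPTIME:
        return False
    # Convert the current time to the device's local time of day.
    local_time = now_utc.astimezone(device_tz).time()
    return WINDOW_START <= local_time <= WINDOW_END


# Example: a device at UTC-8, up for two hours, checked at 12:00 local time.
utc_minus_8 = timezone(timedelta(hours=-8))
noon_local = datetime(2015, 6, 9, 20, 0, tzinfo=timezone.utc)
allowed = ota_allowed(timedelta(hours=2), noon_local, utc_minus_8,
                      "device-1", "release", frozenset())
```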

Before Rollout

  • Verify the new FW version tag exists in Kitsune and the S3 bucket was properly created.
    • If neither exists, the upgrade node will not resolve to an S3 bucket and no OTA to that version will be possible; Suripu will default to whatever exists in the S3 bucket matching the group name. Note: Suripu only cares that the firmware version provided in an OTA response differs from the one currently reported by the device. There is no understanding of ‘newer’ or ‘better’ on the part of the device. Care should be taken not to create an infinite loop of OTA updates.
  • Run ‘PopulateVersionNumbers’ to create any missing firmware version mappings.
    • This depends on the existence of the build_info.txt file in the S3 bucket. Travis must complete the build and copy the appropriate files into the S3 bucket before this is run.
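
Because only a version difference is checked, one way to guard against the infinite OTA loop mentioned above is to walk a group's upgrade path nodes before rollout and look for a version that repeats. The sketch below assumes the nodes for a single group have been exported as a simple from-version to to-version mapping; `find_upgrade_loop` is a hypothetical helper, not an existing tool, and it does not cover loops caused by the group-name fallback bucket.

```python
from typing import Dict, List, Optional


def find_upgrade_loop(paths: Dict[str, str], start_fw: str) -> Optional[List[str]]:
    """Follow from -> to mappings from start_fw.

    Returns the chain of versions forming a loop, or None if the chain ends.
    """
    seen: List[str] = []
    current = start_fw
    while current in paths:
        if current in seen:
            # We have come back to a version already visited: report the cycle.
            return seen[seen.index(current):] + [current]
        seen.append(current)
        current = paths[current]
    return None
```

For example, a mapping in which version A upgrades to B and B upgrades back to A would be reported as the chain A, B, A.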

Rollout

  • Create an upgrade path node using the Firmware Path Tool
    • Group Name: release
    • From Firmware Version:
    • To Firmware Version:
    • Rollout Percent: 0% (best to always start from a safe state)
  • Prepare to Monitor Rollout
  • Perform the rollout in phases (1%, 3%, 25%, etc.). Always start slow!
  • Create a list of the devices expected to be at the specified version number for this phase of the release, drawn from all devices associated with accounts
  • Set upgrade node rollout percent to desired percentage for this phase using the Firmware Path Tool
  • Monitor Rollout (as described below)
  • When the rate of updated devices for this phase exceeds 90%, move to the next phase.
  • Maintain a list of problematic devices to troubleshoot
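
The phase-advancement rule in the steps above can be sketched as a small helper. The phase list passed in and the function name `next_rollout_percent` are assumptions for illustration; the actual percentage is still set by hand in the Firmware Path Tool.

```python
from typing import Optional, Sequence

ADVANCE_THRESHOLD = 0.90  # move on only when more than 90% of expected devices updated


def next_rollout_percent(phases: Sequence[int], current_percent: int,
                         updated: int, expected: int) -> Optional[int]:
    """Return the percentage for the next phase, or None to stay at the current one."""
    if expected == 0 or updated / expected <= ADVANCE_THRESHOLD:
        return None  # not enough devices have updated yet
    later = [p for p in phases if p > current_percent]
    # None also signals that the final phase has already been reached.
    return min(later) if later else None
```

With phases of 1, 3, 25, and 100 (the last value is an assumption), a phase at 1% with 95 of 100 expected devices updated would advance to 3%, while 90 of 100 would not, since the rate must exceed 90%.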

Monitor Rollout

  • If you’ve not already done so, launch Papertrail and filter on the keyword ‘release’
  • Monitor Papertrail for log errors or an unexpected deployment rate
  • Monitor the FW count using the firmware/count endpoint
  • Run the fw_list.py script to get the devices seen on the desired firmware version.
  • Compare expected devices to devices actually seen.
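
The final comparison can be done with plain set operations. The sketch below assumes both lists are available as collections of device ID strings; the output format of fw_list.py is not described here, so any parsing is left out, and `compare_devices` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple


def compare_devices(expected: Iterable[str],
                    seen: Iterable[str]) -> Tuple[Set[str], Set[str], float]:
    """Return (missing, unexpected, update_rate) for one rollout phase."""
    expected_set, seen_set = set(expected), set(seen)
    missing = expected_set - seen_set       # expected but not yet on the new FW
    unexpected = seen_set - expected_set    # on the new FW but not in the expected list
    if expected_set:
        rate = len(expected_set & seen_set) / len(expected_set)
    else:
        rate = 0.0
    return missing, unexpected, rate
```

The missing set is a natural starting point for the list of problematic devices, and the rate can be checked against the 90% threshold for moving to the next phase.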

What To Do When Things Go Wrong

  • When in doubt, set the ‘release’ configuration percentage to 0%.
    • This is a safe operation. The worst you’ll do is not OTA some people for a little while.
  • Verify functionality using the upload.py script and the ‘fake-sense’ device.