How a smart-energy utility eliminated firmware-related device loss with a resilient dual-bank bootloader, ECDSA-verified updates, and automatic rollback on ARM Cortex-M
By aquicksoft
TECHNICAL BLOG
Dual-Bank OTA Firmware Rollback: Zero Meters Bricked Across 120,000 Units
Industry: Smart Energy · Utilities
Stack: C · ARM Cortex-M · ECDSA P-256 · FreeRTOS · MQTT
3,400 Dead Meters and a Wake-Up Call
In the spring of 2022, a smart-meter deployment team at a mid-size European utility pushed a routine firmware update to its field of 14,000 residential electricity meters. The update was modest: a minor correction to the power-quality measurement algorithm and a revised tariff schedule for the upcoming summer season. Nothing extraordinary. Within 72 hours, field-operations teams began receiving alarms. Meters were going offline in clusters. By the end of the first week, 3,400 devices — nearly a quarter of the deployed fleet — were completely unresponsive. They had entered an irreversible boot loop caused by a timing sensitivity in the new firmware that interacted fatally with a specific crystal oscillator tolerance variant present in a subset of the hardware. The meters were not merely offline; they were bricked. Each required a physical site visit, disassembly, and JTAG-level reflash to recover.
The cost was staggering. At an average of €180 per site visit, including technician time, travel, and equipment, recovering 3,400 meters consumed roughly €612,000 in direct costs. Indirect costs (delayed billing, regulatory reporting penalties, and customer complaints) pushed the total impact well past €1.2 million. The root cause was not a bug per se, but an architectural vulnerability: the single-bank flash memory layout provided no mechanism for rollback. Once the firmware was written, it was permanent. If the new code failed to boot, the device had no fallback path.
This article tells the story of how that utility redesigned its entire firmware-update architecture from the ground up. The new system employs a dual-bank flash layout with an immutable bootloader, ECDSA P-256 signature verification, a heartbeat-based watchdog mechanism for automatic rollback after three failed boot attempts, and a staged MQTT-based fleet rollout strategy. Since deploying this architecture across an expanded fleet of 120,000 meters, the utility has experienced zero firmware-related brickings. Thirty-six devices with genuine hardware faults self-recovered via automatic rollback. The update success rate stands at 99.97%, and fleet-wide rollout time has been reduced from three days to six hours.
Over-the-air (OTA) firmware updates are the mechanism by which deployed IoT devices receive new software without physical intervention. In the smart-metering domain, OTA is not a convenience — it is a regulatory and operational necessity. Meters must be updated to comply with revised tariff structures, security patches, measurement-protocol changes, and new communication standards. Unlike consumer devices such as smartphones or smartwatches, smart meters are deployed in inaccessible locations: basements, utility cabinets, outdoor enclosures, and underground vaults. Physical access is expensive, time-consuming, and often requires appointment scheduling with the customer.
The OTA update process in a typical smart-meter deployment involves several stages: firmware image preparation on a build server, signature generation by a secure key-management system, distribution via a publish-subscribe transport (typically MQTT), reception and storage on the device, verification of the digital signature and integrity checksum, flashing to the active memory region, and finally rebooting into the new firmware. Each stage introduces potential failure points: network interruptions during download, signature verification failures due to key mismatches, flash write errors due to power loss, and post-boot software defects that cause application-level crashes.
Single-Bank vs. Dual-Bank Flash Architectures
Microcontrollers used in smart meters typically employ one of two flash memory architectures:
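| Characteristic | Single-Bank Flash | Dual-Bank Flash |
| --- | --- | --- |
| Memory layout | One contiguous region for firmware | Two firmware slots (A and B), one active and one inactive |
| Update mechanism | Erase active firmware, write new image in place | Write new image to the inactive slot, then switch slots |
| Rollback capability | None: once overwritten, previous firmware is lost | Automatic: the previous image remains in the other slot |
| Power-fail safety | Vulnerable: interrupted write corrupts the device | Active slot is untouched by an interrupted write |
| Downtime during update | Device is non-functional while flash is rewritten | Device keeps running while the inactive slot is written |
| Flash memory overhead | None (or small recovery partition) | Approximately 50% of application flash |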
The single-bank approach was historically favoured for its simplicity and minimal memory overhead. On a microcontroller with 256 KB of flash, a dual-bank layout effectively halves the space available for the application firmware, to at most 128 KB per slot and less once the bootloader and metadata are accounted for. For resource-constrained devices where every kilobyte matters, this trade-off was often deemed unacceptable. However, the 2022 incident at the utility demonstrated that the true cost of single-bank flash is not measured in kilobytes but in field-service trucks and regulatory penalties.
The Smart-Meter Lifecycle and Regulatory Context
Smart electricity meters in the European Union operate under the Measuring Instruments Directive (MID) 2014/32/EU and its national transpositions, which impose strict requirements on measurement accuracy, data integrity, and software change management. Firmware updates that alter measurement behaviour may require re-certification or at minimum a documented audit trail. The utility in this case study operates under the regulatory framework of a central European member state, where the national grid operator mandates that all deployed meters maintain a minimum 99.5% availability threshold. A fleet-wide outage affecting 3,400 devices directly violated that threshold and triggered a mandatory incident report to the regulator.
Beyond regulatory compliance, the operational lifecycle of a smart meter spans 15 to 20 years. Over that period, the meter will receive dozens of firmware updates addressing security vulnerabilities, communication-protocol evolutions, tariff changes, and feature additions. The probability of encountering a defective update over a 15-year lifecycle is not negligible. Industry data from the European Smart Metering Industry Group suggests that approximately 2 to 5 percent of OTA updates encounter some form of failure in the field — ranging from partial corruption to complete boot failure. Without a rollback mechanism, each of these failures potentially results in a permanently bricked device.
Architecture: The Dual-Bank Bootloader
Flash Memory Layout
The redesigned firmware architecture partitions the microcontroller's 256 KB internal flash into three distinct regions. At the lowest address (0x08000000) resides the immutable bootloader, a compact 8 KB region that is write-protected via the MCU's option bytes during manufacturing. This bootloader can never be modified in the field — not by an OTA update, not by a compromised application, not even by the device's own firmware. It serves as the single trusted anchor of the entire boot chain.
Immediately above the bootloader, at address 0x08002000, lies the Boot Descriptor Block (BDB), a 1 KB region containing structured metadata about the firmware slots. The BDB stores the following fields: a magic number for integrity validation, the active slot identifier (A or B), the boot attempt counter, a status flag for each slot (valid, invalid, or testing), and the firmware version and CRC-32 checksum for each slot. The BDB is the only persistent storage that the bootloader reads before deciding which slot to boot.
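The fields listed above could be laid out roughly as follows. This is a minimal sketch, not the utility's actual definition: the type names, field widths, field order, and the magic value are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the article does not give the real one. */
#define BDB_MAGIC 0xB007B10Cu

/* Slot identifiers (assumed encoding). */
enum { SLOT_A = 0, SLOT_B = 1 };

/* Per-slot status flag values named in the text (assumed encoding). */
enum { SLOT_INVALID = 0, SLOT_TESTING = 1, SLOT_VALID = 2 };

/* Metadata kept for each firmware slot. */
typedef struct {
    uint32_t status;   /* SLOT_INVALID, SLOT_TESTING, or SLOT_VALID */
    uint32_t version;  /* firmware version stored in the slot */
    uint32_t crc32;    /* CRC-32 of the image in the slot */
} slot_info_t;

/* Boot Descriptor Block as read by the bootloader from 0x08002000. */
typedef struct {
    uint32_t    magic;          /* must equal BDB_MAGIC */
    uint32_t    active_slot;    /* SLOT_A or SLOT_B */
    uint32_t    boot_attempts;  /* boot attempt counter */
    slot_info_t slot[2];        /* indexed by slot identifier */
} boot_descriptor_t;

/* A descriptor is usable only if it was initialised and names a real slot. */
bool bdb_is_valid(const boot_descriptor_t *bdb)
{
    return bdb != NULL
        && bdb->magic == BDB_MAGIC
        && bdb->active_slot <= (uint32_t)SLOT_B;
}
```

With these assumed widths the structure occupies a few dozen bytes, so it fits comfortably in the 1 KB region.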
The remaining 247 KB of flash is divided into two approximately equal slots: Slot A (0x08002400) and Slot B (0x0801C400). Each slot can hold a complete firmware image of up to approximately 112 KB, which is sufficient for the application firmware including the FreeRTOS kernel, the MQTT client, the TLS stack, the power-quality measurement module, and the tariff engine. At any given time, one slot contains the actively running firmware and the other is either empty or holds the previously validated image that serves as the rollback target.
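The memory map described above can be written down as constants, with region sizes derived from the start addresses. The macro and function names here are assumptions. Note that, taken literally, the stated start addresses give Slot A 104 KB and leave 143 KB above the base of Slot B, which does not match the "approximately 112 KB" figure, so the real map presumably differs in detail (for example, by reserving space for the read-only public-key page). Treat the numbers as illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_BASE_ADDR   0x08000000u
#define FLASH_TOTAL_SIZE  (256u * 1024u)
#define FLASH_END_ADDR    (FLASH_BASE_ADDR + FLASH_TOTAL_SIZE)

/* Region start addresses as stated in the article. */
#define BOOTLOADER_ADDR   0x08000000u   /* immutable bootloader */
#define BDB_ADDR          0x08002000u   /* Boot Descriptor Block */
#define SLOT_A_ADDR       0x08002400u
#define SLOT_B_ADDR       0x0801C400u

/* Region sizes derived from the start addresses above. */
#define BOOTLOADER_SIZE   (BDB_ADDR - BOOTLOADER_ADDR)
#define BDB_SIZE          (SLOT_A_ADDR - BDB_ADDR)
#define SLOT_A_SIZE       (SLOT_B_ADDR - SLOT_A_ADDR)
#define SLOT_B_SIZE       (FLASH_END_ADDR - SLOT_B_ADDR)

/* Base address of a slot (0 = Slot A, anything else = Slot B). */
uint32_t slot_base(unsigned slot)
{
    return slot == 0u ? SLOT_A_ADDR : SLOT_B_ADDR;
}

/* Size in bytes of a slot, as implied by the stated addresses. */
uint32_t slot_size(unsigned slot)
{
    return slot == 0u ? SLOT_A_SIZE : SLOT_B_SIZE;
}

/* Sanity check: 8 KB bootloader, 1 KB BDB, and 247 KB left for the slots. */
bool flash_layout_consistent(void)
{
    return BOOTLOADER_SIZE == 8u * 1024u
        && BDB_SIZE == 1u * 1024u
        && SLOT_A_SIZE + SLOT_B_SIZE == 247u * 1024u;
}
```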
Bootloader Design and Slot Selection Logic
The bootloader executes in three phases every time the device resets or powers on. In the first phase, it initialises the minimum hardware required for operation: the system clock, the SRAM, and a single GPIO pin for status indication (a dual-colour LED visible through the meter enclosure). It deliberately avoids initialising peripherals that could be sources of instability, such as the RF modem or the real-time clock, deferring those to the application firmware.
In the second phase, the bootloader reads the BDB from flash. It validates the magic number to confirm that the descriptor has been properly initialised. It then examines the active slot identifier and the boot attempt counter. If the boot attempt counter is zero, the bootloader sets it to one, marks the active slot as 'testing', and jumps to the application entry point in that slot.
In the third phase, which occurs only on subsequent resets after a failed boot, the bootloader checks whether the boot attempt counter has reached the threshold (configured to three in this deployment). If the counter is less than three, the bootloader increments it and attempts to boot the same slot again. If the counter reaches three, the bootloader concludes that the active firmware is non-functional. It flips the active slot identifier to the other bank, resets the boot attempt counter to zero, marks the failed slot as 'invalid', and jumps to the fallback slot. This entire decision process completes in under 200 milliseconds on the ARM Cortex-M4 running at 80 MHz.
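The counter logic of the second and third phases can be sketched as a single function. This is a simplified illustration under stated assumptions: the structure and function names are hypothetical, the state lives in RAM rather than in the flash-backed BDB, and the heartbeat check performed before the counter is touched is left out.

```c
#include <stdint.h>

#define MAX_BOOT_ATTEMPTS 3u   /* threshold configured in this deployment */

enum { SLOT_INVALID = 0, SLOT_TESTING = 1, SLOT_VALID = 2 };

/* Hypothetical in-RAM copy of the BDB fields the decision needs. */
typedef struct {
    uint32_t active_slot;    /* 0 = Slot A, 1 = Slot B */
    uint32_t boot_attempts;  /* boot attempt counter */
    uint32_t status[2];      /* per-slot status flag */
} boot_state_t;

/* Decide which slot to boot on this reset and update the state. */
uint32_t select_boot_slot(boot_state_t *s)
{
    if (s->boot_attempts == 0u) {
        /* Fresh boot: record the first attempt and mark the slot as on trial. */
        s->boot_attempts = 1u;
        s->status[s->active_slot] = SLOT_TESTING;
    } else if (s->boot_attempts < MAX_BOOT_ATTEMPTS) {
        /* Earlier attempt failed: try the same slot again. */
        s->boot_attempts++;
    } else {
        /* Threshold reached: condemn the slot and fall back to the other bank. */
        uint32_t failed = s->active_slot;
        s->status[failed] = SLOT_INVALID;
        s->active_slot = failed ^ 1u;
        s->boot_attempts = 0u;
    }
    return s->active_slot;
}
```

Calling the function on four consecutive resets without a successful boot yields the same slot three times and the other slot on the fourth call, which is the rollback described above.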
The 2022 Incident: What Went Wrong
Understanding the bootloader design requires revisiting the specific failure mode that motivated it. The 3,400-meter bricking incident was caused by a subtle interaction between the updated firmware and the hardware's crystal oscillator. The meters used a standard 32.768 kHz crystal for the real-time clock, sourced from two suppliers. Supplier A's crystals had a tolerance of ±20 parts per million (ppm), while Supplier B's crystals had a tolerance of ±50 ppm. The firmware update introduced a tighter timing loop in the UART communication driver that assumed an oscillator tolerance no wider than ±30 ppm. On meters with Supplier B's crystals, this timing loop sporadically failed, causing the UART driver to hang, which in turn caused the hardware watchdog timer to expire, triggering a reset. The meter would boot, the UART would hang, the watchdog would fire, and the cycle would repeat indefinitely.
In a single-bank architecture, this was fatal. The defective firmware was the only firmware on the device. In a dual-bank architecture, the bootloader would have detected three consecutive boot failures and automatically reverted to the previous, known-good firmware in the other slot. The meters with Supplier B's crystals would have experienced a brief interruption (approximately 90 seconds for three boot cycles) and then returned to normal operation on the older firmware. No site visits would have been required.
ECDSA P-256 Signature Verification
Before any firmware image is written to flash — whether to Slot A or Slot B — the device verifies its authenticity and integrity using an Elliptic Curve Digital Signature Algorithm (ECDSA) with the P-256 (also known as secp256r1) curve. ECDSA P-256 was chosen over RSA-2048 for several reasons critical to the constrained environment of a smart meter.
First, the signature size. An ECDSA P-256 signature is 64 bytes (two 32-byte integers, r and s), compared to 256 bytes for an RSA-2048 signature. Over a narrowband PLC (Power Line Communication) channel operating at 50 kbps, every image carries 192 fewer bytes of signature, a saving that is repeated for each of the 120,000 devices on every rollout. Second, the computational cost. ECDSA P-256 signature verification on an ARM Cortex-M4 using hardware-accelerated big-number arithmetic completes in approximately 12 milliseconds, compared to approximately 80 milliseconds for RSA-2048 verification. Third, the security margin. ECDSA P-256 provides approximately 128 bits of security, a level that NIST guidance treats as acceptable beyond 2030. RSA-2048 provides roughly 112 bits, so it offers a lower security level at a higher computational and bandwidth cost.
The firmware signing workflow operates as follows. On the build server, after the firmware binary is compiled and linked, a signing tool computes the SHA-256 hash of the binary, signs the hash with the ECDSA private key stored in an HSM (Hardware Security Module), and appends the 64-byte signature to the firmware image. The signed image is uploaded to the distribution server. When the meter downloads the image, the bootloader extracts the binary, computes its SHA-256 hash, retrieves the embedded ECDSA signature, and verifies it against the public key that is burned into the microcontroller's flash during manufacturing (stored in a read-only page that is not part of either firmware slot). Only if the signature verification succeeds does the bootloader proceed to write the image to the inactive flash slot.
This design ensures that no unsigned or tampered firmware can ever be written to the device's flash memory. Even if an attacker gains access to the MQTT broker and attempts to distribute a malicious firmware image, the ECDSA verification step will reject it before a single flash byte is modified. The verification occurs before the write, not after, which is a critical design distinction: it means that even a malicious image cannot cause a denial-of-service by filling the flash with garbage data.
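The verify-before-write ordering can be sketched as a small gate function. This is an illustration only: the SHA-256, ECDSA, and flash routines are injected as hypothetical callbacks rather than tied to any particular crypto library, and the `stub_*` functions at the end are host-side test doubles that perform no real cryptography.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_LEN  64u   /* ECDSA P-256 signature: r followed by s */
#define HASH_LEN 32u   /* SHA-256 digest */

/* Hypothetical hooks for the crypto and flash layers. */
typedef struct {
    void (*sha256)(const uint8_t *data, size_t len, uint8_t out[HASH_LEN]);
    bool (*ecdsa_p256_verify)(const uint8_t hash[HASH_LEN], const uint8_t sig[SIG_LEN]);
    bool (*flash_write_slot)(const uint8_t *data, size_t len);
} ota_ops_t;

typedef enum { OTA_OK, OTA_ERR_FORMAT, OTA_ERR_SIGNATURE, OTA_ERR_FLASH } ota_result_t;

/* Signed image layout assumed here: firmware binary, then the 64-byte signature. */
ota_result_t ota_install(const ota_ops_t *ops, const uint8_t *image, size_t image_len)
{
    if (ops == NULL || image == NULL || image_len <= SIG_LEN) {
        return OTA_ERR_FORMAT;
    }
    size_t bin_len = image_len - SIG_LEN;
    const uint8_t *sig = image + bin_len;
    uint8_t digest[HASH_LEN];

    ops->sha256(image, bin_len, digest);
    if (!ops->ecdsa_p256_verify(digest, sig)) {
        return OTA_ERR_SIGNATURE;   /* rejected before any flash write */
    }
    return ops->flash_write_slot(image, bin_len) ? OTA_OK : OTA_ERR_FLASH;
}

/* Host-side test doubles only. These are NOT cryptographic implementations. */
unsigned stub_flash_writes = 0;

void stub_sha256(const uint8_t *data, size_t len, uint8_t out[HASH_LEN])
{
    (void)data; (void)len;
    memset(out, 0, HASH_LEN);
}

bool stub_verify(const uint8_t hash[HASH_LEN], const uint8_t sig[SIG_LEN])
{
    (void)hash;
    return sig[0] == 0xA5u;   /* arbitrary marker standing in for a valid signature */
}

bool stub_flash_write(const uint8_t *data, size_t len)
{
    (void)data; (void)len;
    stub_flash_writes++;
    return true;
}

const ota_ops_t STUB_OPS = { stub_sha256, stub_verify, stub_flash_write };
```

Because the flash callback is reached only after the verification callback returns true, a rejected image never increments `stub_flash_writes`.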
Watchdog and Heartbeat Mechanism
The automatic rollback mechanism depends on the bootloader's ability to detect a 'failed boot'. The detection strategy uses a combination of the hardware watchdog timer and a software heartbeat signal. When the application firmware starts successfully and completes its initialisation sequence, it periodically writes a heartbeat value to a specific register in the BDB. This heartbeat is refreshed every 10 seconds under normal operation.
The bootloader, before incrementing the boot attempt counter, checks whether the heartbeat register has been written. If the application has successfully initialised and sent at least one heartbeat before the device reset, the bootloader considers the boot successful, resets the attempt counter to zero, and marks the active slot as 'valid'. If no heartbeat was detected, the bootloader assumes the boot failed and increments the counter. This approach distinguishes between intentional resets (such as a scheduled reboot for maintenance) and involuntary resets caused by firmware crashes.
The hardware watchdog timer provides a secondary safety net. Configured with a 30-second timeout, it triggers a system reset if the application firmware fails to service it within the timeout period. In the 2022 incident, the watchdog was the mechanism that detected the UART hang and initiated the reset cycle. In the new architecture, the watchdog and heartbeat work together: the watchdog detects application-level hangs and triggers a reset, while the bootloader uses the heartbeat to determine whether the reset was caused by a crash or was intentional.
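The heartbeat check can be sketched in isolation as follows. The names are hypothetical, the record is held in RAM for simplicity, and clearing the heartbeat flag at every reset is an assumption of this sketch (the article does not say how the flag is reset between boots).

```c
#include <stdbool.h>
#include <stdint.h>

enum { SLOT_INVALID = 0, SLOT_TESTING = 1, SLOT_VALID = 2 };

/* Hypothetical subset of the BDB used for boot-failure detection. */
typedef struct {
    uint32_t boot_attempts;   /* boot attempt counter */
    uint32_t heartbeat_seen;  /* non-zero once the application has reported in */
    uint32_t active_status;   /* status flag of the active slot */
} boot_record_t;

/* Application side: called about every 10 seconds after initialisation completes. */
void app_heartbeat(boot_record_t *r)
{
    r->heartbeat_seen = 1u;
}

/* Bootloader side: called at reset, before the slot-selection logic runs.
 * Returns true if the previous boot is judged successful. */
bool bootloader_classify_reset(boot_record_t *r)
{
    bool previous_boot_ok = (r->heartbeat_seen != 0u);

    if (previous_boot_ok) {
        r->boot_attempts = 0u;
        r->active_status = SLOT_VALID;
    } else {
        r->boot_attempts++;
    }
    /* Assumption: each boot must earn its own heartbeat. */
    r->heartbeat_seen = 0u;
    return previous_boot_ok;
}
```

A watchdog reset that fires before the first heartbeat therefore counts as a failed boot, while a reset after at least one heartbeat clears the counter.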
Power-Fail Safe Flash Writing
One of the most insidious failure modes in embedded firmware updates is power loss during flash programming. Flash memory on ARM Cortex-M microcontrollers is organised in pages (typically 2 KB or 4 KB per page), and the erase-then-write operation for a single page takes several milliseconds. If power is lost mid-page, the page contents are undefined: partially erased, partially written, and completely unreliable. In a single-bank architecture, this can corrupt the only copy of the firmware, rendering the device permanently inoperable.
The dual-bank architecture fundamentally eliminates this risk because the active firmware is never modified during an update. The new image is written exclusively to the inactive slot, which is not being executed. If power is lost during the write, the active slot remains intact and the device continues to operate on its current firmware when power is restored. The bootloader detects that the inactive slot contains an incomplete image (via the CRC-32 checksum stored in the BDB) and marks it as invalid, preventing any attempt to boot from it.
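The incomplete-image check can be illustrated with a plain CRC-32 comparison. The article does not say which CRC-32 variant is used; this sketch assumes the common reflected polynomial 0xEDB88320 (the one used by zlib and Ethernet), and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF). */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

/* True if the bytes in the slot match the checksum recorded in the BDB. */
bool slot_image_intact(const uint8_t *slot, size_t image_len, uint32_t expected_crc)
{
    return crc32_ieee(slot, image_len) == expected_crc;
}
```

A slot whose tail still holds erased bytes after an interrupted write fails this comparison, so the bootloader can mark it invalid. A table-driven CRC would be faster, at the cost of 1 KB of lookup table in a bootloader that is kept deliberately small.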
The firmware update module on the device implements additional safeguards for power-fail resilience. Flash writes are performed page-by-page, and after each page is successfully written and verified (by reading it back and comparing), the BDB is updated with a progress counter. If the update is interrupted and later resumed (either because power was restored or because the device reconnected to the MQTT broker), the update module reads the progress counter from the BDB and resumes writing from the last successfully completed page. This checkpoint-and-resume mechanism ensures that even a device that loses power multiple times during a single update will eventually complete the process without re-downloading the entire image.
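A minimal sketch of the checkpoint-and-resume loop is shown below. It simulates the inactive slot with a RAM buffer, assumes a 2 KB page, and uses hypothetical names throughout; on the target, `pages_done` would be persisted in the BDB and `write_page_verified` would call the MCU's flash driver. The `max_pages` budget exists only so that an interruption can be simulated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OTA_PAGE_SIZE 2048u   /* assumed page size */

typedef struct {
    uint8_t *flash;        /* base of the inactive slot (simulated in RAM here) */
    uint32_t pages_done;   /* progress counter; kept in the BDB on the target */
} ota_writer_t;

/* Stand-in for the flash driver: program the page, then read back and compare. */
bool write_page_verified(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
    return memcmp(dst, src, len) == 0;
}

/* Write at most max_pages pages, starting from the last checkpoint.
 * Returns true once the whole image has been written and verified. */
bool ota_write_resume(ota_writer_t *w, const uint8_t *image, size_t image_len,
                      uint32_t max_pages)
{
    uint32_t total = (uint32_t)((image_len + OTA_PAGE_SIZE - 1u) / OTA_PAGE_SIZE);

    for (uint32_t n = 0; n < max_pages && w->pages_done < total; n++) {
        size_t off = (size_t)w->pages_done * OTA_PAGE_SIZE;
        size_t len = image_len - off;
        if (len > OTA_PAGE_SIZE) {
            len = OTA_PAGE_SIZE;
        }
        if (!write_page_verified(w->flash + off, image + off, len)) {
            return false;   /* checkpoint stays on the last good page */
        }
        w->pages_done++;    /* advance only after the page verified */
    }
    return w->pages_done == total;
}
```

Because the counter advances only after a page has been read back successfully, a power loss in the middle of a page costs at most that one page on the next attempt.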
Staged Fleet Rollout via MQTT
The distribution of firmware images to 120,000 meters presents a logistical challenge that demands careful orchestration. The utility adopted a staged rollout strategy implemented over MQTT, the lightweight publish-subscribe protocol that is the de facto standard for IoT device communication. The staged approach ensures that any defect in a new firmware image is detected and contained before it can affect the entire fleet.
The rollout proceeds in four phases. In the first phase, the Canary group, 500 meters (approximately 0.4% of the fleet) receive the update. These meters are geographically distributed and represent a statistically significant sample of the hardware variants in the field. The update server monitors each canary meter for 60 minutes after the update, checking for heartbeat signals, boot success confirmations, and telemetry data. If any canary meter fails to boot and triggers a rollback, the rollout is automatically paused and an alert is raised to the firmware engineering team.
The second phase, the Early Adopters group, targets 5,000 meters (approximately 4% of the fleet). If the canary group achieves a 100% success rate, the rollout expands to this larger sample. The monitoring period is reduced to 30 minutes, as the canary group has already provided initial validation. The third phase, the Main Fleet, covers 45,000 meters and runs concurrently across multiple geographic regions. The fourth phase, the Final Wave, delivers the update to the remaining 69,500 meters. The entire process, from canary deployment to fleet completion, takes approximately six hours, compared to the previous three-day process that used a simple broadcast-to-all approach.
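The four phases can be captured in a small table, with a gate that decides whether the rollout may expand. This is a sketch under assumptions: the names are hypothetical, the monitoring periods for the last two phases are not stated in the article (recorded here as 0), and applying the canary rule (100% success, no rollbacks) to every phase is an assumption rather than documented policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    meters;           /* devices targeted in this phase */
    uint32_t    monitor_minutes;  /* 0 = not stated in the article */
} rollout_phase_t;

const rollout_phase_t ROLLOUT_PHASES[] = {
    { "canary",         500u,   60u },
    { "early-adopters", 5000u,  30u },
    { "main-fleet",     45000u, 0u  },
    { "final-wave",     69500u, 0u  },
};

#define ROLLOUT_PHASE_COUNT (sizeof ROLLOUT_PHASES / sizeof ROLLOUT_PHASES[0])

/* Total number of meters covered by all phases. */
uint32_t rollout_total_meters(void)
{
    uint32_t total = 0u;
    for (size_t i = 0; i < ROLLOUT_PHASE_COUNT; i++) {
        total += ROLLOUT_PHASES[i].meters;
    }
    return total;
}

/* Gate: any rollback pauses the rollout; expansion needs every meter confirmed. */
bool rollout_may_advance(const rollout_phase_t *phase,
                         uint32_t confirmed_ok, uint32_t rollbacks)
{
    return rollbacks == 0u && confirmed_ok == phase->meters;
}
```

The phase sizes sum to the full fleet of 120,000 meters, which is a useful consistency check when the group definitions are edited.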
The MQTT topic hierarchy used for firmware distribution follows a structured pattern. Each meter subscribes to a group-specific topic, for example, firmware/update/group/{group_id}/v{version}. The update server publishes the firmware image (in 4 KB chunks, the maximum MQTT payload size supported by the PLC channel) to the appropriate topic. The meter reassembles the chunks, verifies the ECDSA signature and CRC-32 checksum, writes the image to the inactive slot, and publishes a status message to firmware/status/{device_id} indicating success or failure. The update server aggregates these status messages and computes real-time fleet-wide success metrics.
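The topic patterns and the chunking arithmetic can be sketched as follows. The function names are hypothetical, and treating the version as an unsigned integer and the identifiers as plain strings is an assumption; the article only gives the `{group_id}`, `{version}`, and `{device_id}` placeholders.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define OTA_CHUNK_SIZE 4096u   /* 4 KB chunks, per the article */

/* Build "firmware/update/group/{group_id}/v{version}". Returns false if buf is too small. */
bool format_update_topic(char *buf, size_t cap, const char *group_id, unsigned version)
{
    int n = snprintf(buf, cap, "firmware/update/group/%s/v%u", group_id, version);
    return n >= 0 && (size_t)n < cap;
}

/* Build "firmware/status/{device_id}". Returns false if buf is too small. */
bool format_status_topic(char *buf, size_t cap, const char *device_id)
{
    int n = snprintf(buf, cap, "firmware/status/%s", device_id);
    return n >= 0 && (size_t)n < cap;
}

/* Number of chunks needed for an image, rounding the last partial chunk up. */
uint32_t ota_chunk_count(uint32_t image_len)
{
    return (image_len + OTA_CHUNK_SIZE - 1u) / OTA_CHUNK_SIZE;
}
```

For the 82 KB typical image size mentioned later in the article, `ota_chunk_count(82u * 1024u)` gives 21 chunks: twenty full chunks and one half-full chunk.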
Outcomes and Measured Results
The dual-bank bootloader architecture has been deployed across the full fleet of 120,000 meters for 18 months. The results are summarised in the table below and discussed in detail.
| Metric | Before (Single-Bank) | After (Dual-Bank) | Improvement |
| --- | --- | --- | --- |
| Devices bricked by firmware update | 3,400 of 14,000 (24.3%) | 0 of 120,000 (0%) | Eliminated |
| Update success rate | 75.7% | 99.97% | +24.27 percentage points |
| Fleet rollout time | 3 days (broadcast) | 6 hours (staged) | 91.7% reduction |
| Recovery cost per failed update | €180 per site visit | €0 (automatic rollback) | |
| Devices with auto-rollback recovery | N/A (no rollback mechanism) | 36 (hardware faults) | |
| Firmware signature verification | CRC-32 only | ECDSA P-256 + CRC-32 | |
| Power-fail resilience | None (single bank) | Full (checkpoint and resume) | |
The most striking result is the complete elimination of firmware-related brickings. Over 18 months and multiple firmware updates, not a single meter has been permanently disabled by a firmware defect. Thirty-six meters experienced genuine hardware faults (primarily flash memory degradation and crystal oscillator failures) that prevented the new firmware from booting. In each case, the dual-bank bootloader detected the failure, reverted to the previous firmware, and the meter resumed normal operation on the older code. These 36 devices were subsequently flagged for hardware replacement during the next scheduled maintenance cycle, but they remained operational in the interim rather than becoming immediate field-service emergencies.
The 99.97% update success rate means that approximately 36 out of every 120,000 update attempts result in an automatic rollback. Of these 36 rollbacks, all were traced to pre-existing hardware conditions rather than firmware defects. The zero-defect rollback rate for firmware issues is a direct consequence of the staged rollout strategy: the canary and early-adopter phases catch any firmware issues before they reach the main fleet, and the rollback capability provides a safety net for edge-case hardware combinations that were not caught during laboratory testing.
The 91.7% reduction in fleet rollout time from three days to six hours is attributable to two factors. First, the parallel staged rollout delivers updates to multiple device groups concurrently rather than sequentially. Second, the smaller MQTT payloads (ECDSA signatures are 64 bytes versus the previous 256-byte RSA signatures) and the checkpoint-resume capability reduce the average per-device update time from approximately 4 minutes to approximately 2.5 minutes. Over 120,000 devices, this translates to a net reduction of approximately 3,000 device-hours of update time per rollout cycle (120,000 devices × 1.5 minutes saved each).
Limitations and Trade-Offs
While the results are compelling, the dual-bank architecture introduces trade-offs that must be carefully evaluated in the context of each deployment:
Flash memory overhead: The dual-bank layout reserves approximately 50% of the application flash for the inactive slot. On the 256 KB microcontroller used in this deployment, this leaves approximately 112 KB per slot for application firmware. If the application grows beyond this limit, either a larger (and more expensive) microcontroller must be selected or firmware features must be removed. For the current application, 112 KB provides approximately 30 KB of headroom above the 82 KB typical firmware image size, which is sufficient for at least three years of feature additions at the current development pace.
Bootloader complexity: The bootloader is the most safety-critical software component in the system. A bug in the bootloader could prevent both slots from booting, effectively bricking the device with no recovery path. The utility mitigates this risk through three measures: the bootloader is kept deliberately small (under 6 KB of the 8 KB allocated region), it is subjected to a separate, more rigorous testing regime than the application firmware (including formal code review, static analysis with MISRA C compliance checking, and hardware-in-the-loop testing on all known oscillator variants), and its flash region is write-protected after manufacturing.
ECDSA key management: The security of the signature verification scheme depends on the protection of the ECDSA private key used to sign firmware images. The utility stores this key in an HSM with dual-authorisation access controls. The corresponding public key is embedded in the microcontroller's read-only flash during manufacturing and cannot be modified in the field. However, the public key itself occupies 64 bytes of storage, and if the key ever needs to be rotated (for example, due to a suspected compromise), the device would need to support multiple trusted public keys or require a manufacturing-level reflash. The current implementation supports a single trusted key, with key rotation planned for a future firmware version.
Hardware watchdog limitations: The 30-second watchdog timeout is a compromise between detection sensitivity and false-positive triggering. Some legitimate firmware initialisation sequences, particularly those involving cellular modem synchronisation or TLS handshakes, can take more than 30 seconds under poor network conditions. The utility addressed this by implementing a two-phase watchdog strategy: a 60-second timeout during the initial boot phase (first 5 minutes after reset) and a 30-second timeout during normal operation. The longer initial timeout reduces false-positive rollbacks while maintaining rapid detection of genuine firmware crashes.
Staged rollout adds latency: The four-phase staged rollout takes six hours to complete, compared to an estimated 90 minutes for an uncontrolled broadcast. For critical security patches that must be deployed urgently, the staged approach may be perceived as too slow. The utility's policy is to use a three-phase (canary, main fleet, final wave) accelerated rollout for critical patches, which completes in approximately three hours while still providing canary-level early detection.
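The two-phase watchdog strategy described in the list above reduces to a single timeout selection. This is a minimal sketch with hypothetical names; treating the boundary at exactly 300 seconds as the start of normal operation is an assumption, and reprogramming the watchdog at run time depends on what the specific MCU's watchdog peripheral allows.

```c
#include <stdint.h>

#define WDT_BOOT_TIMEOUT_S    60u          /* during the initial boot phase */
#define WDT_NORMAL_TIMEOUT_S  30u          /* during normal operation */
#define WDT_BOOT_PHASE_S      (5u * 60u)   /* first 5 minutes after reset */

/* Watchdog timeout to apply, given the seconds elapsed since reset. */
uint32_t watchdog_timeout_s(uint32_t uptime_s)
{
    return uptime_s < WDT_BOOT_PHASE_S ? WDT_BOOT_TIMEOUT_S : WDT_NORMAL_TIMEOUT_S;
}
```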
Conclusion and Broader Implications
The transition from a single-bank to a dual-bank OTA firmware architecture transformed the utility's approach to device management. The €1.2 million incident of 2022 — 3,400 bricked meters, weeks of field recovery, and a regulatory incident report — is now structurally impossible. The dual-bank layout, combined with the immutable bootloader, ECDSA P-256 signature verification, heartbeat-based boot failure detection, and power-fail safe writing, creates a defence-in-depth system where no single failure mode can permanently disable a device.
The key architectural principles that enabled this outcome are broadly applicable to any IoT deployment that performs OTA firmware updates:
Never overwrite the running firmware. Always write to an inactive slot and switch atomically after verification. This principle alone eliminates the largest class of firmware-update failures.
Verify before writing, not after. ECDSA signature validation should reject unsigned or tampered images before any flash modification occurs. This prevents both malicious attacks and accidental corruption from propagating to the device.
Design for automatic recovery, not manual intervention. The three-attempt rollback mechanism ensures that devices experiencing transient or hardware-specific failures recover autonomously, reducing operational costs and improving fleet availability.
Implement staged rollouts with real-time monitoring. The canary-to-fleet progression provides early defect detection that contains the blast radius of any firmware issue to a small, manageable subset of devices.
Make the bootloader immutable. The bootloader is the root of trust for the entire boot chain. If it can be modified — whether by a firmware bug, a malicious update, or a supply-chain attack — the entire security model collapses.
Looking ahead, the utility plans to extend the architecture in three directions. First, a hardware secure boot module (using Arm TrustZone for Armv8-M, available on newer Cortex-M33 devices) will provide hardware-enforced boot chain verification, complementing the software-based ECDSA checks. Second, the staged rollout system will be enhanced with A/B testing capabilities, allowing the utility to deploy experimental firmware features to a subset of the fleet and compare performance metrics before a full rollout. Third, a fleet analytics dashboard will aggregate boot attempt counters, rollback events, and firmware version distributions to provide predictive insights into fleet health and hardware degradation trends.
For any organisation deploying OTA-updatable IoT devices at scale — whether smart meters, industrial sensors, connected vehicles, or medical devices — the lesson from this case study is clear: the cost of a robust dual-bank update architecture is measured in kilobytes of flash memory, but the cost of its absence is measured in site visits, regulatory penalties, and customer trust. In an era where IoT fleets are growing from thousands to millions of devices, firmware-related brickings are not an acceptable operational risk. They are an engineering problem with a known, proven solution.