How a priority-based RTOS redesign eliminated cascading failures in an industrial IoT gateway and retained IEC 61508 SIL 1 compliance
By aquicksoft
TECHNICAL BLOG
FreeRTOS Gateway: Zero CAN Bus Interruptions from LTE Modem Faults via Task-Level Fault Isolation
Industry: Industrial Automation · IIoT
Stack: FreeRTOS · STM32 · CAN Bus · LTE Modem · JTAG
Field Trial: 4 months · Zero CAN interruptions
Embedded Systems Case Study · June 2025
The 47-Second Silence That Cost a Plant
At 03:42 on a Tuesday morning in February, the LTE modem embedded in a chemical processing plant's CAN bus gateway stopped responding to AT commands. In the original super-loop firmware, this was not merely a connectivity inconvenience — it was a systemic catastrophe. The main loop, blocked on a socket read waiting for the modem's response, could not service the CAN bus receive interrupt handler's message queue. For 47 seconds, every CAN frame from 23 field devices — flow controllers, pressure transmitters, valve actuators, and temperature sensors — went unacknowledged. The field devices, designed with their own internal watchdog timers, interpreted the silence as a bus-off condition and began triggering local fault states. Three valve actuators moved to their safe positions, shutting down a reactor feed line. The plant's safety system logged an unplanned shutdown event, triggering a mandatory incident review and a production loss estimated at €34,000.
This was not an isolated occurrence. Field logs revealed that the modem hung roughly twice per week, causing CAN bus interruptions lasting between 15 and 60 seconds each time. The super-loop architecture — a single infinite loop sequentially servicing all peripheral tasks — meant that any blocking operation in one subsystem (the modem) would starve all other subsystems (the CAN bus, the Ethernet interface, the local HMI polling). The firmware team had implemented timeouts on modem AT commands, but the timeouts were set conservatively at 30 seconds to avoid false-triggering modem resets on slow cellular networks. By the time the timeout expired, the damage to CAN bus timing was already done.
This article details how the engineering team eliminated every CAN bus interruption caused by modem faults by migrating from a bare-metal super-loop to a FreeRTOS-based multitasking architecture on an STM32 microcontroller. The redesign introduced priority-based preemptive scheduling, message-queue-based inter-task communication, and a supervised watchdog architecture that enabled per-subsystem fault recovery. Over a four-month field trial, the gateway achieved zero CAN bus interruptions from modem faults, a 94% modem self-recovery rate, and retained full IEC 61508 SIL 1 compliance.
The Controller Area Network (CAN) bus, originally developed by Bosch in 1986 for automotive applications, has become one of the most widely deployed fieldbus protocols in industrial automation. Its appeal lies in several characteristics that align precisely with the requirements of real-time industrial control: differential signalling that provides excellent electromagnetic immunity in electrically noisy plant environments, a non-destructive bitwise arbitration mechanism that guarantees deterministic message priority resolution, and a robust error detection and confinement architecture that prevents a single faulty node from corrupting the bus.
In the Industrial Internet of Things (IIoT) context, CAN bus gateways serve a critical bridging function. Field devices on the shop floor — sensors, actuators, motor drives, and programmable logic controllers — communicate over CAN bus at data rates up to 1 Mbit/s. The gateway translates these CAN frames into IP-based protocols (MQTT, OPC UA, REST APIs) that can traverse LTE, Wi-Fi, or Ethernet links to cloud platforms and enterprise SCADA systems. This translation is not merely a protocol conversion; it involves timestamping, message filtering, data aggregation, and local buffering against network outages.
The real-time requirements of this gateway role are stringent. CAN bus messages from safety-critical field devices typically require acknowledgement or forwarding within 5 to 50 milliseconds, depending on the device's own watchdog timeout settings. Missing these windows — as the original firmware demonstrated — can trigger cascading fault conditions across an entire plant floor, as devices independently enter their safe states when they lose communication with the gateway.
The Super-Loop Paradigm and Its Limitations
The original gateway firmware was built on the super-loop (also known as bare-metal or foreground-background) architecture, one of the most common patterns in resource-constrained embedded systems. In this model, the microcontroller's main function contains an infinite loop that sequentially calls handler functions for each subsystem: read CAN messages, process modem AT responses, update the Ethernet interface, service the HMI, and feed the watchdog timer.
This architecture offers genuine advantages for simple systems: it is easy to understand, has minimal memory overhead (no kernel stack or task control blocks), and provides deterministic execution order because the developer explicitly controls the call sequence. Research from Libre Solar and analyses on EmbeddedRelated.com confirm that for systems with a small number of well-understood, non-blocking tasks, the super-loop is often the optimal choice.
However, the super-loop's fundamental weakness becomes apparent when any single handler function can block. Because all handlers share a single execution thread, a blocking call in one handler — such as waiting for a modem AT command response — prevents all other handlers from executing. There is no preemption mechanism, no priority inversion handling, and no timeout-based context switching. The only escape is the hardware watchdog, which triggers a full system reset after a global timeout. This reset is a blunt instrument: it recovers the modem but also disrupts the CAN bus, the Ethernet link, and every other subsystem simultaneously.
Comparative analyses between real-time operating systems (RTOS) and super-loops consistently identify blocking I/O as the primary failure mode that motivates RTOS adoption. When a system must handle multiple I/O peripherals with unpredictable response times — exactly the case for an IIoT gateway with a cellular modem — the super-loop's sequential execution model becomes a liability.
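The starvation effect described above can be put into numbers with a small host-side model. This is only a sketch: the handler names and the per-handler durations are hypothetical, and the model assumes that each handler runs exactly once per pass of the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handler durations, in milliseconds, for one pass of the loop. */
typedef struct {
    uint32_t can_ms;
    uint32_t modem_ms;   /* grows to the AT command timeout when the modem hangs */
    uint32_t eth_ms;
    uint32_t hmi_ms;
} loop_costs_t;

/* In a super-loop the CAN handler runs once per pass, so the gap between
 * two CAN services is the sum of every handler's duration in that pass. */
uint32_t superloop_can_gap_ms(const loop_costs_t *c)
{
    assert(c != 0);
    return c->can_ms + c->modem_ms + c->eth_ms + c->hmi_ms;
}

/* Returns 1 if the gap exceeds a field device's watchdog window. */
int can_deadline_missed(const loop_costs_t *c, uint32_t device_watchdog_ms)
{
    return superloop_can_gap_ms(c) > device_watchdog_ms;
}
```

With a 30-second modem timeout in the modem handler, the gap between CAN services is at least 30 seconds regardless of how fast the other handlers are, which is far outside a 5 to 50 millisecond device window.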
FreeRTOS Redesign: Architecture and Implementation
Task Decomposition and Priority Assignment
The migration to FreeRTOS began with a systematic decomposition of the monolithic super-loop into independent, concurrent tasks. The design followed FreeRTOS's priority-based preemptive scheduling model, where each task is assigned a numeric priority and the kernel always runs the highest-priority task that is in the Ready state. When a higher-priority task becomes ready (for example, when a CAN bus interrupt posts a semaphore), it immediately preempts the currently running lower-priority task.
The team identified five primary tasks, each mapped to a distinct hardware or software subsystem:
| Task | Priority | Stack (words) | Responsibility | Watchdog Check-in Period |
| --- | --- | --- | --- | --- |
| vCANBusTask | 5 (Highest) | 512 | CAN frame rx/tx, message filtering, timestamping | 100 ms |
| vEthernetTask | 4 | 768 | TCP/IP stack polling, MQTT publish, OPC UA server | 500 ms |
| vWatchdogSupervisor | 3 | 256 | Monitor all task check-ins, trigger per-task restarts | n/a (refreshes the IWDG) |
| vModemTask | 2 | 1024 | AT command state machine, LTE connection management | 2 s |
| vHMITask | 1 (Lowest) | 384 | Local display update, LED status indicators | 5 s |
The priority assignment reflects the criticality hierarchy. The CAN bus task runs at the highest priority because any delay in processing CAN frames directly risks field-device watchdog expirations. The Ethernet task is second, as cloud connectivity is important but tolerates latency measured in hundreds of milliseconds. The watchdog supervisor runs at a middle priority — high enough to detect faults promptly, but not so high that it competes with the real-time CAN and Ethernet tasks. The modem task, the source of the original blocking problem, runs at a deliberately low priority. This means that even if the modem task enters an infinite loop or deadlocks on an AT command response, it can never preempt the CAN bus task. The HMI task, responsible for local display updates, runs at the lowest priority since its timing requirements are the most relaxed.
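The task set and the scheduling rule it relies on can be captured in a few lines of portable C. The sketch below is a host-side model rather than firmware: the priorities and stack sizes come from the table, while the type and function names (`task_cfg_t`, `pick_next_task`, `total_stack_bytes`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t     priority;     /* higher number = higher priority, as in FreeRTOS */
    uint16_t    stack_words;  /* 32-bit words on Cortex-M */
} task_cfg_t;

const task_cfg_t g_tasks[] = {
    { "vCANBusTask",         5, 512  },
    { "vEthernetTask",       4, 768  },
    { "vWatchdogSupervisor", 3, 256  },
    { "vModemTask",          2, 1024 },
    { "vHMITask",            1, 384  },
};
#define TASK_COUNT (sizeof g_tasks / sizeof g_tasks[0])

/* Fixed-priority preemptive rule: of the tasks whose bit is set in
 * ready_mask (bit i corresponds to g_tasks[i]), run the one with the
 * highest priority. Returns the task index, or -1 if nothing is ready. */
int pick_next_task(uint32_t ready_mask)
{
    int best = -1;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        if ((ready_mask & (1u << i)) == 0) continue;
        if (best < 0 || g_tasks[i].priority > g_tasks[best].priority) best = (int)i;
    }
    return best;
}

/* Total task stack in bytes (4 bytes per word on Cortex-M). */
uint32_t total_stack_bytes(void)
{
    uint32_t words = 0;
    for (size_t i = 0; i < TASK_COUNT; i++) words += g_tasks[i].stack_words;
    return words * 4u;
}
```

Whenever the CAN task is ready it is selected, whatever state the modem task is in. The stack sizes sum to 2,944 words, or 11,776 bytes.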
Priority-Based Preemptive Scheduling for Real-Time Guarantees
The theoretical foundation for this priority assignment comes from rate-monotonic scheduling (RMS), a well-established result in real-time systems theory. RMS assigns higher priorities to tasks with shorter periods (or tighter deadlines). The CAN bus task, with its 100-millisecond worst-case processing window and 1-millisecond response requirement, receives the highest priority. The modem task, with a 2-second check-in period and tolerance for multi-second response times, receives a low priority.
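For reference, the classic sufficient schedulability test for RMS (Liu and Layland) bounds total processor utilisation by a function of the number of tasks. The per-task execution times C_i are not given in this article, so only the bound itself can be evaluated here; treating the check-in periods as the task periods T_i is an assumption.

```latex
U \;=\; \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right),
\qquad
n = 5:\quad 5\left(2^{1/5} - 1\right) \approx 0.743
```

A five-task set whose measured utilisation stays below about 74% is therefore guaranteed schedulable under RMS. A set above the bound may still be schedulable, but needs an exact response-time analysis to show it.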
FreeRTOS implements fixed-priority preemptive scheduling on Cortex-M processors using the PendSV exception. When an interrupt service routine (such as the CAN receive interrupt) makes a higher-priority task ready, the scheduler sets the PendSV pending flag. Once interrupt processing completes, the PendSV handler performs the context switch, saving the current task's registers to its stack and loading the new task's registers from its stack. The hardware stacks part of the register set automatically on exception entry, which takes as little as 12 clock cycles on the Cortex-M4; the complete switch, including the software-saved registers and the scheduler's task selection, is on the order of a hundred cycles. At 168 MHz that is roughly a microsecond or less, negligible relative to the CAN bus timing requirements.
A critical design decision was to keep the CAN bus interrupt handler minimal. The CAN receive ISR does nothing more than read the received frame from the CAN peripheral's FIFO, post it to a FreeRTOS queue (using xQueueSendFromISR), and request a context switch. All frame parsing, filtering, and forwarding logic executes in the vCANBusTask context. This keeps interrupt latency low and ensures that the CAN peripheral's FIFO does not overflow under high bus load. On the STM32F407 running at 168 MHz, the measured ISR execution time was 1.8 microseconds, well within the CAN frame intermission time.
Message Queue Patterns for Fault-Tolerant Communication
Data exchange between tasks in the redesigned gateway uses FreeRTOS queues exclusively; the only shared memory is the mutex-protected heartbeat table used by the watchdog supervisor. Queues provide several advantages over shared-memory approaches (global variables with critical sections): they are thread-safe by design, they support blocking and non-blocking send and receive operations with configurable timeouts, and they naturally decouple producer and consumer timing.
The most architecturally significant queue is the one connecting the vCANBusTask to the vEthernetTask. When the gateway needs to publish a CAN frame to the cloud via MQTT, the vCANBusTask formats the message and posts it to a queue that the vEthernetTask consumes. However, if the LTE modem is hung and the Ethernet task's outbound buffer is full, this queue can fill up. In the original super-loop, this condition would have blocked the entire system. In the FreeRTOS redesign, the vCANBusTask posts to this queue with a zero timeout, so the call returns immediately whether or not the message was queued.
This non-blocking post pattern is the key to fault isolation. The CAN bus task never waits for the modem or the Ethernet task. If the modem is hung and the outbound queue is full, the CAN task simply increments a drop counter and continues processing the next CAN frame. The field devices on the CAN bus are completely unaware that cloud connectivity has degraded. When the modem recovers (either through the watchdog supervisor's intervention or the modem's own internal reset), the Ethernet task drains the queue and resumes normal cloud publishing. A separate diagnostics task periodically reports the drop counter to the SCADA system, enabling the operations team to monitor cloud connectivity health without impacting real-time CAN bus operations.
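A minimal host-side model of the non-blocking post with a drop counter is sketched below. It is not the gateway's code: the queue depth, the message layout and the names `cloud_post` and `cloud_take` are hypothetical. On the target the same behaviour would come from calling the FreeRTOS `xQueueSend` function with a block time of 0, which returns immediately with a failure code when the queue is full, and the kernel queue would supply the synchronisation that this single-threaded model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLOUD_QUEUE_LEN 8u   /* hypothetical depth; the article does not state it */

typedef struct {
    uint32_t id;
    uint8_t  dlc;
    uint8_t  data[8];
    uint32_t timestamp_ms;
} can_msg_t;

typedef struct {
    can_msg_t slots[CLOUD_QUEUE_LEN];
    uint32_t  head, count;
    uint32_t  dropped;        /* reported later by the diagnostics task */
} cloud_queue_t;

/* Non-blocking post: returns immediately whether or not there is room. */
bool cloud_post(cloud_queue_t *q, const can_msg_t *msg)
{
    assert(q && msg);
    if (q->count == CLOUD_QUEUE_LEN) {
        q->dropped++;         /* consumer is slow or hung: drop, never wait */
        return false;
    }
    q->slots[(q->head + q->count) % CLOUD_QUEUE_LEN] = *msg;
    q->count++;
    return true;
}

/* Consumer side: take the oldest message if there is one. */
bool cloud_take(cloud_queue_t *q, can_msg_t *out)
{
    assert(q && out);
    if (q->count == 0) return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1u) % CLOUD_QUEUE_LEN;
    q->count--;
    return true;
}
```

The producer's cost is the same whether the post succeeds or fails, which is what keeps the CAN task's timing independent of the consumer's health.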
Watchdog Supervision Architecture
The watchdog architecture in the redesigned gateway operates on two levels: a hardware independent watchdog (IWDG) as the ultimate safety net, and a software watchdog supervisor task that provides fine-grained, per-subsystem fault detection and recovery.
Hardware Watchdog (IWDG)
The STM32's Independent Watchdog (IWDG) is driven by its own internal low-speed oscillator (LSI), independent of the main clock tree. This independence is critical: if the main oscillator fails or the clock configuration is corrupted, the IWDG continues to operate and will still trigger a system reset. The IWDG was configured with a timeout of 4 seconds, several times the supervisor's 1-second check interval, so that normal scheduling jitter cannot cause a spurious reset. The IWDG is refreshed (kicked) only by the vWatchdogSupervisor task, and only after it has confirmed that all monitored tasks have checked in within their allotted periods.
This design ensures that a fault the supervisor cannot resolve at task level, or a fault in the supervisor itself, will eventually trigger a hardware reset. The IWDG is the last line of defence against a condition where the software supervisor has itself become corrupted or deadlocked.
Software Watchdog Supervisor Task
The vWatchdogSupervisor task implements a multi-tier supervision scheme. Each monitored task periodically writes a 'heartbeat' value to a shared memory location (protected by a FreeRTOS mutex). The supervisor task, running at a 1-second interval, checks each task's heartbeat against its expected check-in period:
CAN bus task: Must check in every 100 milliseconds. If it misses three consecutive check-ins (300 ms total), the supervisor flags a CAN fault. Because the CAN task runs at the highest priority and cannot be preempted by any other task, a missed check-in almost certainly indicates a hardware fault (CAN transceiver failure, bus-off condition) or a stack overflow. The supervisor logs the fault and triggers a full system reset via the IWDG.
Ethernet task: Must check in every 500 milliseconds. A missed check-in triggers a task-level restart: the supervisor deletes and recreates the Ethernet task, reinitialises the TCP/IP stack, and resumes operation without touching the CAN or modem tasks.
Modem task: Must check in every 2 seconds. A missed check-in triggers a modem-only restart: the supervisor pulses the modem's hardware reset pin (via a GPIO), waits for the modem's boot time (approximately 8 seconds for a typical LTE Cat-1 module), and recreates the modem task. During this recovery period, the CAN and Ethernet tasks continue operating normally. Outbound cloud messages are dropped silently (as described in the queue pattern above), and the diagnostics counter records the event.
HMI task: Must check in every 5 seconds. A missed check-in triggers a task restart. The HMI task is the least critical; if it faults, the gateway continues all real-time operations, and only the local display goes blank until the task is recreated.
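The supervision rules above reduce to a table-driven check. The sketch below is a host-side model with hypothetical names (`monitored_t`, `check_task`, `supervisor_scan`); the periods and recovery actions are the ones listed above. One assumption is worth flagging: the model withholds the IWDG refresh only when a full system reset is wanted, on the reasoning that an 8-second modem reboot would otherwise outlast the 4-second IWDG timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACT_NONE, ACT_RESTART_TASK, ACT_RESET_MODEM, ACT_SYSTEM_RESET } action_t;

typedef struct {
    const char *name;
    uint32_t    period_ms;     /* required check-in period */
    uint32_t    miss_limit;    /* consecutive missed periods that trigger action */
    action_t    on_fault;
    uint32_t    last_beat_ms;  /* written by the task on every check-in */
} monitored_t;

/* Decide what to do about one task at time now_ms.
 * Unsigned subtraction keeps the comparison valid across tick wrap-around. */
action_t check_task(const monitored_t *t, uint32_t now_ms)
{
    assert(t != NULL);
    uint32_t silent_ms = now_ms - t->last_beat_ms;
    return (silent_ms > t->period_ms * t->miss_limit) ? t->on_fault : ACT_NONE;
}

/* One supervisor pass. Writes one action per task and returns true if the
 * IWDG may be refreshed. Assumption: the refresh is withheld only when a
 * full system reset is wanted, so task-level restarts do not reset the MCU. */
bool supervisor_scan(const monitored_t *tasks, size_t n, uint32_t now_ms,
                     action_t *actions)
{
    bool refresh_iwdg = true;
    for (size_t i = 0; i < n; i++) {
        actions[i] = check_task(&tasks[i], now_ms);
        if (actions[i] == ACT_SYSTEM_RESET) refresh_iwdg = false;
    }
    return refresh_iwdg;
}
```

Keeping the policy in a table means a new monitored task is one more row rather than another branch in the supervisor loop.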
Fault Isolation and Recovery Mechanisms
The fundamental principle underlying the entire redesign is fault isolation: a fault in one subsystem must not propagate to other subsystems. In the original super-loop, this principle was violated by design — a blocking modem call stopped everything. In the FreeRTOS redesign, fault isolation is enforced by three complementary mechanisms:
Priority-based preemption: The CAN bus task cannot be delayed by any lower-priority task. Even if the modem task enters an infinite loop, the kernel will continue to schedule the CAN task at its assigned rate. This is the first and most important line of defence.
Non-blocking inter-task communication: Message queues with zero timeouts ensure that the CAN task never blocks waiting for the modem, Ethernet, or any other task. If a downstream task is slow or hung, the CAN task drops the message and continues.
Supervised per-task restart: The watchdog supervisor can restart individual tasks without affecting other tasks. This provides a granular recovery mechanism that is far less disruptive than a full system reset.
The recovery statistics from the four-month field trial demonstrate the effectiveness of this layered approach. There were 67 modem fault events during the trial. Of these, 63 (94%) were recovered by the supervisor's modem-only restart — the CAN bus and Ethernet tasks never noticed. The remaining 4 events (6%) required a full system reset via the IWDG, typically because the modem fault coincided with a power supply glitch that affected multiple subsystems simultaneously. In zero cases did a modem fault cause a CAN bus interruption.
The per-task restart mechanism is complemented by stack overflow detection. FreeRTOS provides two run-time checks, selected with configCHECK_FOR_STACK_OVERFLOW: method 1 verifies at each context switch that the outgoing task's stack pointer is still within its stack, and method 2 additionally verifies that the last bytes of the stack still hold the fill pattern (0xA5) written when the task was created. A failed check calls the application's vApplicationStackOverflowHook. The team enabled both checks and tracked each task's stack high-water mark. During the four-month trial, the worst-case stack utilisation was 78% (in the Ethernet task), well within safe limits. No stack overflow was detected.
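The high-water-mark idea can be illustrated off-target. The sketch below models it on a plain array: the stack is pre-filled with the 0xA5 pattern, and the untouched words are counted from the low-address end. The function names are hypothetical; on the target FreeRTOS reports the same quantity through `uxTaskGetStackHighWaterMark`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_WORD 0xA5A5A5A5u  /* 0xA5 repeated in every byte */

/* Fill a stack buffer with the pattern, as done when a task is created. */
void stack_fill(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) stack[i] = STACK_FILL_WORD;
}

/* Count words never written since creation. On Cortex-M the stack grows
 * downward, so untouched words sit at the low-address end of the buffer. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL_WORD) n++;
    return n;
}

/* Peak utilisation in whole percent, rounded down. */
unsigned stack_peak_percent(const uint32_t *stack, size_t words)
{
    assert(words > 0);
    size_t used = words - stack_unused_words(stack, words);
    return (unsigned)((used * 100u) / words);
}
```

For a 768-word stack such as the Ethernet task's, a peak of 600 words in use corresponds to the 78% utilisation figure quoted above. The 600-word value is only an illustration, not a measured number.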
IEC 61508 SIL 1 Compliance
The gateway operates in a safety-related context: it bridges CAN bus communications between field instruments and the plant's safety instrumented system (SIS). The plant's functional safety requirements mandate IEC 61508 SIL 1 compliance for the gateway. IEC 61508 is the foundational international standard for functional safety of electrical, electronic, and programmable electronic safety-related systems. SIL 1 represents the lowest of four safety integrity levels; for high-demand or continuous-mode operation it requires a probability of dangerous failure per hour (PFH) of between 10^-6 and 10^-5.
Compliance at SIL 1 does not require a pre-certified RTOS (such as SafeRTOS, which is pre-certified to IEC 61508 SIL 3 by TÜV SÜD). However, it does require that the software architecture demonstrates systematic capability for SIL 1, which includes documented architectural design, separation of safety-related and non-safety-related functionality, and robust fault detection and recovery mechanisms.
The FreeRTOS redesign strengthened the SIL 1 case in several ways:
Architectural separation: The CAN bus task, which handles safety-related communications, is completely independent of the modem task, which handles non-safety-related cloud connectivity. This separation ensures that a fault in the non-safety subsystem cannot affect the safety subsystem — a core requirement of IEC 61508's architectural design guidelines.
Diagnostic coverage: The supervised watchdog architecture provides diagnostic coverage for task-level faults. Each task's heartbeat check-in is a form of diagnostic test. The IWDG provides diagnostic coverage for overall system-level faults. The combined diagnostic coverage exceeds the 60% minimum recommended for SIL 1 hardware architectures.
Deterministic scheduling: The priority-based preemptive scheduler makes the CAN bus task's worst-case response time straightforward to bound. Because the CAN task runs at the highest priority and cannot be preempted by any other task, its worst-case response time is simply its own worst-case execution time (WCET) plus the maximum ISR latency, both of which can be measured and bounded through static analysis and JTAG-based profiling.
No dynamic memory allocation: All FreeRTOS objects (tasks, queues, semaphores, timers) are created during system initialisation, before the scheduler starts. No dynamic memory allocation (malloc/free) occurs during runtime, eliminating an entire class of potential faults (heap corruption, fragmentation, out-of-memory) that would complicate the SIL argument.
The CE mark was retained after the redesign because the functional safety case was actually strengthened relative to the original super-loop. The original firmware's reliance on a single-threaded execution model with a monolithic hardware watchdog provided less diagnostic coverage and less architectural separation than the redesigned FreeRTOS-based system.
CAN Bus Protocol Implementation Details
The CAN bus implementation on the STM32 uses the bxCAN peripheral, which supports both CAN 2.0A (11-bit standard identifiers) and CAN 2.0B (29-bit extended identifiers). The gateway is configured for CAN 2.0B at 500 kbit/s, matching the plant's existing fieldbus configuration. The bxCAN peripheral provides two receive FIFOs, each capable of storing three CAN frames. The CAN receive ISR drains both FIFOs and posts all received frames to the vCANBusTask's input queue, which is sized to hold 128 frames — sufficient to buffer the worst-case bus load without overflow.
CAN frame filtering is implemented using the bxCAN peripheral's hardware acceptance filters. The gateway accepts only a defined list of CAN identifiers corresponding to the 23 field devices on the bus. All other frames — including any from unknown or rogue devices — are rejected at the hardware level, reducing ISR load and queue traffic. This hardware-level filtering is a requirement of the plant's cybersecurity policy for industrial control systems.
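The effect of identifier-list filtering can be modelled in software as an exact-match lookup. This is a sketch only: the real filtering happens in the bxCAN filter banks, not in code, and the function names and example identifiers used with it are hypothetical. The bank-count helper relies on the fact that a bxCAN filter bank in 32-bit identifier-list mode holds two identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAN_EXT_ID_MASK 0x1FFFFFFFu   /* 29-bit extended identifier */

/* Identifier-list filtering: a frame passes only on an exact ID match. */
bool can_id_accepted(uint32_t id, const uint32_t *allow, size_t n)
{
    assert(allow != NULL || n == 0);
    id &= CAN_EXT_ID_MASK;
    for (size_t i = 0; i < n; i++) {
        if ((allow[i] & CAN_EXT_ID_MASK) == id) return true;
    }
    return false;
}

/* In 32-bit identifier-list mode each bxCAN filter bank holds two IDs. */
size_t filter_banks_needed(size_t id_count)
{
    return (id_count + 1u) / 2u;
}
```

If each of the 23 field devices used a single identifier, the list would need 12 filter banks. One identifier per device is an assumption, since the article does not say how many identifiers each device uses.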
The CAN bus task implements error handling for bus-off and error-passive conditions. If the bxCAN peripheral enters a bus-off state (typically caused by excessive transmit errors), the task logs the event, waits for the bus-off recovery sequence specified in the CAN specification (128 occurrences of 11 consecutive recessive bits), and resumes normal operation. While in bus-off, the peripheral can neither transmit nor receive, so frames sent by field devices during the recovery window are not seen by the gateway. In the error-passive state, by contrast, the gateway continues to transmit and receive, so safety-critical messages from field devices are still processed while the fault is logged.
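The fault confinement states follow directly from the two error counters defined in the CAN 2.0 specification, and the shortest possible recovery time follows from the bit rate. The sketch below encodes both; the function names are hypothetical, and on the STM32 the counters and the bus-off flag would be read from the bxCAN error status register rather than passed in.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CAN_ERROR_ACTIVE, CAN_ERROR_PASSIVE, CAN_BUS_OFF } can_err_state_t;

/* CAN 2.0 fault confinement: error-passive when either counter exceeds 127,
 * bus-off when the transmit error counter exceeds 255. */
can_err_state_t can_error_state(uint16_t tec, uint16_t rec)
{
    if (tec > 255u) return CAN_BUS_OFF;
    if (tec > 127u || rec > 127u) return CAN_ERROR_PASSIVE;
    return CAN_ERROR_ACTIVE;
}

/* Shortest possible bus-off recovery: 128 sequences of 11 recessive bits. */
uint32_t bus_off_min_recovery_us(uint32_t bitrate_bps)
{
    assert(bitrate_bps > 0);
    return (uint32_t)((128ull * 11ull * 1000000ull) / bitrate_bps);
}
```

At the plant's 500 kbit/s, the minimum recovery sequence is 1,408 bit times, or about 2.8 ms on an otherwise idle bus. Bus traffic lengthens it.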
LTE Modem Reliability Challenges
The LTE modem — a Cat-1 module providing low-bandwidth cellular connectivity for cloud telemetry — was the primary source of reliability problems in the original gateway. The modem communicates with the STM32 via a UART interface using AT commands, a text-based command set originally developed for dial-up modems and still widely used in cellular modules. AT commands are inherently unreliable for time-critical applications because response times vary enormously: a simple query like AT+CSQ (signal strength) may respond in 50 milliseconds, while AT+CGACT (activate PDP context) may take 5 to 30 seconds depending on network conditions.
The modem hung in several distinct failure modes during field operation:
Network registration timeout: The modem's cellular network registration process (AT+CEREG) could hang indefinitely if the SIM card's network operator was undergoing maintenance or if the signal was marginally below the registration threshold. The modem's internal firmware would retry registration silently without responding to the host's AT command.
Socket write blocking: When the LTE network was congested, the modem's TCP/IP stack could accept data into its internal buffer but fail to transmit it, eventually filling the buffer. Subsequent socket write AT commands would then block until buffer space became available — which could take seconds or never happen if the network connection was lost.
Module firmware crash: Rarely (approximately once per month), the modem's internal firmware would crash, leaving the UART interface in an unresponsive state. Only a hardware reset (pulsing the modem's reset pin) could recover from this condition.
The FreeRTOS redesign addressed all three failure modes through the combination of low-priority task assignment, non-blocking queue posts, and the supervised watchdog with hardware reset capability. The modem task's AT command state machine now includes per-command timeouts (configurable, typically 10 seconds for network commands and 2 seconds for simple queries) that cause the state machine to abandon the current command and attempt recovery, rather than blocking indefinitely.
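A per-command timeout of this kind is usually written as a polled state machine rather than a blocking wait. The following is a minimal sketch under that assumption; the state names, result codes and functions (`at_start`, `at_poll`) are hypothetical, and UART handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { AT_IDLE, AT_WAITING, AT_RECOVERING } at_state_t;
typedef enum { AT_RESULT_NONE, AT_RESULT_OK, AT_RESULT_TIMEOUT } at_result_t;

typedef struct {
    at_state_t state;
    uint32_t   sent_ms;
    uint32_t   timeout_ms;
} at_cmd_t;

/* Start a command; the caller has already written it to the UART. */
void at_start(at_cmd_t *c, uint32_t now_ms, uint32_t timeout_ms)
{
    assert(c != 0);
    c->state = AT_WAITING;
    c->sent_ms = now_ms;
    c->timeout_ms = timeout_ms;
}

/* Poll once per task iteration; never blocks. */
at_result_t at_poll(at_cmd_t *c, uint32_t now_ms, bool response_received)
{
    assert(c != 0);
    if (c->state != AT_WAITING) return AT_RESULT_NONE;
    if (response_received) {
        c->state = AT_IDLE;
        return AT_RESULT_OK;
    }
    if (now_ms - c->sent_ms >= c->timeout_ms) {
        c->state = AT_RECOVERING;   /* abandon the command, begin recovery */
        return AT_RESULT_TIMEOUT;
    }
    return AT_RESULT_NONE;
}
```

Because `at_poll` returns on every call, the modem task can keep writing its heartbeat while a slow network command is still pending, so a 10-second command timeout does not conflict with a 2-second check-in period.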
Limitations and Considerations
While the results are compelling, several limitations and trade-offs should be acknowledged by teams considering a similar migration:
Increased complexity: The FreeRTOS-based firmware is significantly more complex than the original super-loop. The task decomposition, priority assignment, queue management, and watchdog supervision logic collectively represent approximately 2,400 additional lines of code beyond the original 1,800-line super-loop firmware. This complexity increases the surface area for software defects and requires developers with RTOS expertise.
Memory overhead: FreeRTOS requires additional RAM for kernel data structures (task control blocks, stacks, queues) and ROM for the kernel code itself. On the STM32F407 with 192 KB of SRAM, the five tasks consume approximately 12 KB of stack space (including the kernel stack). This is well within the device's capacity, but on more constrained MCUs (such as the STM32F103 with 20 KB of SRAM), this overhead could be prohibitive.
Priority assignment is a design-time decision: The rate-monotonic priority scheme works well when task periods and deadlines are known and stable. If the system requirements change — for example, if a new task with an intermediate deadline is added — the priority assignment must be re-evaluated, and the entire schedule re-analysed for schedulability. There is no runtime priority adaptation.
Fault detection latency: The watchdog supervisor checks heartbeats at 1-second intervals. This means that a task fault is not detected until up to 1 second after it occurs (plus the task's own check-in period). For the CAN bus task with a 100 ms check-in period, the worst-case fault detection latency is 1.1 seconds. For the modem task with a 2-second check-in period, the worst-case is 3 seconds. These latencies are acceptable for this application, but applications with tighter fault detection requirements would need a higher-frequency supervisor or a dedicated hardware monitor.
No formal SIL certification of FreeRTOS itself: While the architectural argument for SIL 1 is strong, the FreeRTOS kernel used in this project is not itself certified to IEC 61508. For SIL 2 or higher applications, the team would need to either use SafeRTOS (the TÜV SÜD-certified variant of FreeRTOS) or perform a separate certification of the FreeRTOS kernel. This limitation was acceptable for SIL 1 but would not scale to higher integrity levels without additional investment.
Single-core limitation: The STM32F407 is a single-core Cortex-M4. All five tasks share a single execution core, and the priority-based scheduler ensures that only one task runs at a time. For applications requiring true parallelism (for example, simultaneous high-speed CAN bus processing and complex LTE protocol handling), a multi-core MCU or an application processor running Linux might be more appropriate — but at significantly higher cost, power consumption, and complexity.
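The fault detection latency figures in the list above follow from a simple bound: a task can fail just after a check-in, and the supervisor may have sampled just before the heartbeat went stale. Writing the supervisor interval as T_sup and the task's check-in period as T_check-in:

```latex
T_{\text{detect}} \;\le\; T_{\text{sup}} + T_{\text{check-in}}

\text{CAN: } 1\,\text{s} + 0.1\,\text{s} = 1.1\,\text{s}
\qquad
\text{Modem: } 1\,\text{s} + 2\,\text{s} = 3\,\text{s}
```

By the same formula the Ethernet task (500 ms check-in) is bounded at 1.5 s and the HMI task (5 s check-in) at 6 s. These two figures are derived here rather than quoted from the trial.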
Outcomes Summary
The four-month field trial produced the following results, comparing the original super-loop firmware with the FreeRTOS-based redesign:
| Metric | Super-Loop (Before) | FreeRTOS (After) | Improvement |
| --- | --- | --- | --- |
| CAN bus interruptions from modem faults | ~8 per month (15–60 s each) | 0 in 4 months | |
| Modem self-recovery (no gateway restart) | 0% (full reset required) | 94% (63 of 67 events) | |
| Stack overflows | Not detectable | 0 in 4 months | |
| CAN frame processing worst-case latency | Variable (up to 60 s) | < 2 ms measured | |
| IEC 61508 SIL 1 compliance | Marginal (single-thread) | Strong (separated tasks) | |
| Code size (Flash) | 42 KB | 58 KB | |
| RAM usage | 14 KB | 26 KB | |
The most striking result is the complete elimination of CAN bus interruptions caused by modem faults. Before the redesign, the plant experienced approximately 8 modem-related CAN interruptions per month, each lasting 15 to 60 seconds. Over four months, that amounts to roughly 32 interruptions and a cumulative CAN bus downtime of 8 to 32 minutes, during which field devices were operating without gateway communication, triggering false fault conditions and occasional safety shutdowns. After the redesign, this number dropped to zero.
Conclusion and Future Implications
This case study demonstrates that migrating from a super-loop architecture to FreeRTOS is not merely an academic exercise in real-time systems design — it is a practical, high-impact engineering decision that can eliminate an entire class of field failures in industrial IoT gateways. The key architectural insights from this project are broadly applicable to any embedded system that must handle multiple I/O peripherals with unpredictable timing:
Priority-based preemptive scheduling provides deterministic real-time guarantees that a super-loop cannot match when blocking I/O is present. Assigning the highest priority to the most timing-critical task ensures that it is never delayed by less critical operations.
Non-blocking inter-task communication via message queues is the primary mechanism for fault isolation. If a producer task must never be delayed by a slow or hung consumer, the queue post must use a zero timeout and the producer must handle the queue-full condition gracefully.
A supervised watchdog architecture with per-task restart capability provides granular fault recovery that is far less disruptive than a monolithic hardware reset. The ability to restart the modem task without touching the CAN or Ethernet tasks is what makes 94% self-recovery possible.
IEC 61508 SIL 1 compliance is not merely about certification paperwork — it is about designing an architecture that genuinely separates safety-related and non-safety-related functionality and provides measurable diagnostic coverage. The FreeRTOS redesign strengthened the SIL argument rather than merely maintaining it.
Looking ahead, the team is evaluating several enhancements to the gateway architecture. First, the addition of a redundant CAN bus interface (CAN A and CAN B with automatic failover) would further increase availability for the most safety-critical field devices. Second, migration to SafeRTOS would enable the gateway to target SIL 2 for future deployments in higher-risk process environments. Third, the integration of an LTE-M/NB-IoT modem with a more robust TCP/IP stack and built-in connection management would reduce the frequency of modem-related faults at the source, complementing the software-level fault isolation described in this article.
For embedded systems engineers working on industrial IoT gateways, process controllers, or any application where a cellular modem or other unpredictable I/O device shares a microcontroller with real-time control tasks, the lesson is clear: super-loop architectures are appropriate when all I/O operations are fast and predictable, but they become a liability when any single operation can block. FreeRTOS provides a mature, well-supported, and lightweight framework for introducing task-level fault isolation — and the cost of adoption (16 KB of Flash, 12 KB of RAM, and a one-time architecture redesign) is negligible compared to the cost of field failures, unplanned shutdowns, and lost production.