There’s a dirty secret in the history of real-time operating systems. For decades, schedulers used fixed-size priority arrays, basically a flat table where index i holds the list of all tasks at priority level i. On paper, this gives you O(1) insertion and O(1) lookup. In practice, it locks your entire RTOS to a hardcoded maximum number of priority levels and scales terribly on modern multi-core silicon.
The RTEMS (Real-Time Executive for Multiprocessor Systems) SuperCore made a decisive break from this legacy with its adoption of Red-Black Trees for priority aggregation. This post explains why the shift happened, how it works under the hood, and why it’s a hard requirement for modern SMP locking protocols like FMLP (Flexible Multiprocessor Locking Protocol) and DFLP (Distributed Flexible Locking Protocol).
1. The Legacy: Fixed-Size Priority Arrays
In the RTEMS 4.x era and many classic RTOSs, the scheduler maintained something like:
typedef struct {
  Chain_Control ready_chains[ PRIORITY_MAXIMUM ];
} Scheduler_Priority_Context;
Where PRIORITY_MAXIMUM might be 16, 32, or 256. Each slot holds a chain (linked list) of tasks at that priority. To find the highest-priority ready task, the scheduler scans a bitmap:
/* Find the first set bit = highest priority with a ready task */
uint32_t highest = _Bitfield_Find_first_bit( priority_bitmap );
Task *next = ready_chains[highest].first;
Why This Worked (On Single Cores)
On a uniprocessor system with 16 priority levels, this is fast. The bitmap fits in a single register. Find_first_bit compiles to a single hardware instruction on most architectures (bsf on x86, clz on ARM). Insertion and removal are O(1) because you’re just adding to a linked list at a known index.
Why This Breaks (On Modern SMP)
The problems stack up when you move to SMP with protocols like FMLP and DFLP:
Problem 1: Fixed Priority Granularity. A 16-slot priority_array gives you exactly 16 priority levels. But modern locking protocols need fine-grained priority manipulation. FMLP-S’s Priority Ceiling has to boost a task to a ceiling precisely higher than every other task that can access the resource. DFLP’s Priority Boosting during migration requires inserting the migrated task at a precise priority relative to all tasks on the destination core. With only 16 slots, priority collisions are inevitable, and collisions break the mathematical guarantees of the protocol.
Problem 2: Priority Aggregation with Inheritance. FMLP-L uses Priority Inheritance (PI). When a low-priority task holds a lock and a high-priority task waits, the holder inherits the waiter’s priority. In real systems, a task can hold multiple locks at the same time, each with different waiters. The task’s effective priority is the maximum across all inherited priorities plus its own base priority. With a flat array, tracking this aggregation means scanning all the task’s held locks, which is O(k) where k is the number of held locks.
Problem 3: Scheduler Node Complexity. In RTEMS 7 SMP, each task doesn’t just have a priority. It has a Priority_Aggregation that combines multiple Priority_Node entries from different sources (base priority, inherited priorities, ceiling boosts). A flat array can’t efficiently represent this multi-source priority aggregation.
2. The Modern Solution: Red-Black Trees
RTEMS 7’s SuperCore replaces the fixed-size array with Red-Black Trees (RB-Trees) at multiple levels:
2.1: Thread Queue Priority Trees
When tasks wait for a mutex (like an FMLP-L lock), they get inserted into a Red-Black Tree keyed by priority, not a flat FIFO list or a fixed array slot:
/* From the RTEMS SuperCore: Thread queue operations for priority-based ordering */
#define FMLPL_TQ_OPERATIONS &_Thread_queue_Operations_priority_inherit
The _Thread_queue_Operations_priority_inherit implementation uses RB-Trees to maintain waiters in sorted priority order. Finding the highest-priority waiter is O(log n). Insertion is O(log n). Removal is O(log n).
2.2: Priority Aggregation Trees
Each Scheduler_Node contains a Priority_Aggregation structure that collects all priority sources affecting the thread:
typedef struct {
  Priority_Node  Node;         /* Base priority */
  RBTree_Control Contributors; /* RB-Tree of all inherited/ceiling priorities */
} Priority_Aggregation;
When an FMLP-S lock is acquired and _FMLPS_Raise_priority() runs, a new Priority_Node is inserted into the thread’s Contributors tree:
_Priority_Node_initialize( priority_node, ceiling_priority );
_Thread_Priority_add( thread, priority_node, queue_context );
The thread’s effective (current) priority is always the minimum key in the RB-Tree (lowest numerical value = highest priority in RTEMS convention). Reading it is O(1) (just grab the tree’s leftmost node), but maintaining the tree after insertions and removals is O(log n).
When the lock is released, _FMLPS_Remove_priority() extracts the node:
_Thread_Priority_remove( thread, priority_node, queue_context );
2.3: Scheduler Ready Queues
The scheduler’s ready queue itself is an RB-Tree. When a task’s priority changes (from inheritance, ceiling, or migration), it gets repositioned in the tree. The next task to dispatch is always the leftmost node. O(1) to read, O(log n) to maintain.
3. Why O(log n) is Mandatory for SMP Locking
On a single core with 4 tasks, the difference between O(1) and O(log n) is interesting but doesn’t really matter. On a 16-core SoC running 64 tasks with multiple nested FMLP and DFLP locks, it’s the difference between deterministic and non-deterministic behavior.
Deterministic Worst-Case Execution Time (WCET)
In hard real-time systems, what matters isn’t average performance. It’s worst-case execution time (WCET). Every priority manipulation (inheritance, ceiling, aggregation) has to complete within a bounded time.
- Fixed array O(1) + bitmap scan: Looks constant-time, but only works for a fixed, small number of priority levels. If you increase PRIORITY_MAXIMUM to support fine-grained priorities, the bitmap grows, and the scan becomes O(p) where p is the bitmap width.
- Red-Black Tree O(log n): Scales gracefully. With 64 tasks, the worst case is log2(64) = 6 tree levels. With 256 tasks, it's 8. With 1024 tasks, it's 10. The growth is logarithmic, which is practically constant for any realistic embedded system.
Protocol-Specific Scaling
Think about a DFLP migration scenario:
- Task migrates from Core 3 to Core 1.
- On Core 1, the task has to be inserted into the ready queue at its boosted priority.
- The scheduler has to find the new highest-priority task on Core 1 (which might now be the migrated task).
- Priority aggregation has to recalculate Core 1’s dispatch decision.
With RB-Trees, steps 2-4 are each O(log n). Total cost: O(log n). Deterministic. Bounded. Certifiable.
With a fixed array, step 2 is O(1) only if the priority slot exists. Step 3 requires a bitmap scan whose cost depends on the array size. Step 4 requires rescanning all held locks. Total cost: unpredictable.
4. Impact on Pre-Qualification
RTEMS is used in safety-critical domains: aerospace, automotive, medical devices. These domains require pre-qualification, which means formal evidence that the RTOS meets standards like DO-178C (avionics) or ISO 26262 (automotive).
Pre-qualification demands that every kernel operation has a documented, bounded WCET. The O(log n) guarantee of Red-Black Trees is straightforward to certify:
“The maximum execution time of a priority insertion operation is bounded by C x log2(N), where N is the maximum number of concurrent tasks and C is the per-rotation cost measured on the target architecture.”
Try writing that sentence for a fixed-size priority array that requires bitmap rescanning and linked-list traversal. The certification auditor will send it back.
Spec-to-Code Traceability
This is also where the rtems-central YAML specifications become really important. The FMLP YAML specs I authored (MR #15) define requirements like:
- “When an FMLP-S semaphore is obtained, the executing task’s priority shall be raised to the ceiling priority.”
- “When an FMLP-S semaphore is released, the ceiling priority shall be removed from the executing task.”
These requirements map directly to _Thread_Priority_add() and _Thread_Priority_remove(), which are the RB-Tree insertion and removal operations. The spec-to-code traceability is clean because the data structure’s interface matches the requirement’s semantics one-to-one.
5. The Big Picture: Why DFLP Needs This
DFLP’s Temporary Task Migration creates a really demanding priority manipulation profile:
- On the source CPU: Remove the task from the ready queue. Cost: O(log n).
- Priority Boosting: Insert a ceiling Priority_Node into the task’s aggregation tree. Cost: O(log n).
- On the destination CPU: Insert the task into a completely different scheduler’s ready queue at the boosted priority. Cost: O(log n).
- On surrender (if handoff): Remove the task from the destination, clean its priority boost, reinsert on the source, and simultaneously migrate the next waiter. Each step: O(log n).
A single DFLP lock-acquire-release cycle can trigger 6-8 RB-Tree operations. With a fixed array, each of those operations would need special-case handling for priority slot collisions, bitmap synchronization across CPUs, and array bounds checking. With RB-Trees, they’re all the same uniform, bounded, certifiable operation.
6. Wrapping Up
The shift from fixed-size priority arrays to Red-Black Trees in the RTEMS 7 SuperCore wasn’t a performance optimization. It was an architectural prerequisite.
Without O(log n) deterministic priority aggregation, modern SMP locking protocols just can’t work correctly:
- FMLP-S’s Priority Ceiling requires instant, fine-grained priority insertion and removal.
- FMLP-L’s Priority Inheritance requires dynamic aggregation across multiple lock holders.
- DFLP’s Task Migration requires cross-CPU scheduler insertion with bounded WCET.
Red-Black Trees make all of this possible with a single, uniform data structure whose worst-case behavior is mathematically bounded and certifiable. For pre-qualification of RTEMS on modern multi-core silicon, there really isn’t an alternative.
TL;DR
- The Legacy Problem: Older RTOSs used fixed-size priority arrays (priority_array[16]) for O(1) scheduling. This breaks on modern SMP because it limits priority granularity, can’t handle multi-source priority aggregation (inheritance + ceiling), and scales unpredictably.
- The RTEMS 7 Solution: Red-Black Trees at every level: thread queue ordering, priority aggregation within Scheduler_Node, and scheduler ready queues. All operations are O(log n).
- Why O(log n) Matters: Hard real-time systems require bounded Worst-Case Execution Time (WCET). O(log n) is deterministic and certifiable. Fixed arrays with bitmap scanning are not.
- Protocol Impact: FMLP-S (ceiling insertion/removal), FMLP-L (inheritance aggregation), and DFLP (cross-CPU migration with priority boosting) all depend on fast, bounded priority tree operations. A single DFLP lock cycle can trigger 6-8 RB-Tree operations.
- Pre-Qualification: The O(log n) bound provides clean, auditable WCET documentation required by safety standards like DO-178C and ISO 26262. The YAML specs in rtems-central trace directly to these RB-Tree operations.