diff --git a/content/curriculum/hands-on/case-studies/explainable-atc-rl-agent.mdx b/content/curriculum/hands-on/case-studies/explainable-atc-rl-agent.mdx
index 89c29fbe..2b176f53 100644
--- a/content/curriculum/hands-on/case-studies/explainable-atc-rl-agent.mdx
+++ b/content/curriculum/hands-on/case-studies/explainable-atc-rl-agent.mdx
@@ -77,7 +77,7 @@ The agent's operation spans two distinct phases. During **training** (in simulat
| **State Representation** | Relative bearings, distances, vertical separations, time since last action, route deviation — a compact vector representation rather than raw radar imagery |
| **Action Space** | Heading changes (turn left/right) and altitude changes (climb/descend), composed into operationally natural compound clearances |
| **Reward Function** | Multi-objective: centreline tracking (route adherence), separation maintenance (safety), action damping (avoiding excessive or oscillatory commands) |
-| **Explainability Methods** | Under investigation — this is a central challenge of the case study. Candidate approaches include attention weight visualisation (revealing which aircraft the policy attends to for each decision), reward-component analysis (understanding learned behavioural priorities), value function monitoring (real-time confidence signal), policy entropy (a measure of how certain the agent is about what to do), and input perturbation methods (testing how the policy responds to changes in specific aircraft features) |
+| **Explainability Methods** | Under investigation — this is a central challenge of the case study. Candidate approaches include reward-component analysis (understanding learned behavioural priorities), value function monitoring (real-time confidence signal), policy entropy (a measure of how certain the agent is about what to do), and input perturbation methods (testing how the policy responds to changes in specific aircraft features). Attention-based architectures, which provide intrinsic interpretability by revealing which aircraft the policy attends to for each decision, are a promising future direction being explored within the Bluebird programme[^8] |
| **Pre-Trial Qualification** | Machine Basic Training (MBT) assessment: formative and summative exercises evaluated by certified ATCO assessors against competency standards for safety, controlling, planning, and coordination[^3] |
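+
+As an illustrative sketch of the state representation described in the table above, the snippet below shows what such a compact feature vector could look like in code. The field names, units, and the fixed `MAX_NEIGHBOURS` padding are assumptions for exposition, not the trial system's actual schema.
+
+```python
+from dataclasses import dataclass
+
+import numpy as np
+
+MAX_NEIGHBOURS = 5  # hypothetical cap on how many surrounding aircraft are encoded
+
+@dataclass
+class NeighbourState:
+    bearing_deg: float      # relative bearing from the subject aircraft
+    distance_nm: float      # horizontal separation, nautical miles
+    vertical_sep_ft: float  # vertical separation, feet
+
+@dataclass
+class SubjectState:
+    route_deviation_nm: float   # lateral deviation from the cleared centreline
+    time_since_action_s: float  # input for action damping: time since last clearance
+
+def encode_state(subject: SubjectState, neighbours: list[NeighbourState]) -> np.ndarray:
+    """Flatten hand-crafted features into the fixed-length vector the policy consumes."""
+    features = [subject.route_deviation_nm, subject.time_since_action_s]
+    for n in neighbours[:MAX_NEIGHBOURS]:
+        features.extend([n.bearing_deg, n.distance_nm, n.vertical_sep_ft])
+    # Zero-pad so the vector length stays constant regardless of traffic count
+    missing = MAX_NEIGHBOURS - min(len(neighbours), MAX_NEIGHBOURS)
+    features.extend([0.0] * (3 * missing))
+    return np.asarray(features, dtype=np.float32)
+```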

### Deployment Context

@@ -154,15 +154,11 @@ ATC relies on voice communication between controllers and pilots. Every clearanc
During the trial, the ATCO communicates with pilots in the normal way — the agent does not participate in the voice loop. However, the agent's recommendations must be presented to the ATCO in a form that can be seamlessly translated into standard voice clearances. If the agent's recommendations are expressed in terms that do not map naturally to CAP493 phraseology, or if the ATCO must mentally translate between the agent's representation and the language of the voice loop, this creates friction that may slow decision-making or introduce errors. The communicability of agent recommendations is not only a technical challenge but a human factors and procedural one.

-### Attention-Based Architectures and Intrinsic Interpretability
-
-The explainability methods discussed above are primarily *post-hoc* — they are applied to an already-trained policy to extract explanations after the fact. An alternative approach is to build interpretability into the network architecture itself. Recent work on multi-agent RL for air traffic separation assurance has demonstrated that attention networks can serve this dual purpose: they handle the practical challenge of processing a variable number of surrounding aircraft while simultaneously providing an intrinsic interpretability mechanism.[^8]
+### Governance, Documentation, and Audit

[^8]: Brittain, M. W., Alvarez, L. E., & Breeden, K. (2024). Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks. In *Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)*, 22857–22863. https://doi.org/10.1609/aaai.v38i21.30321

-In an attention-based architecture, the policy computes a set of attention weights that determine how much importance it assigns to each surrounding aircraft when deciding what action to take for a given subject aircraft. These weights are a natural byproduct of the network's operation, not a post-hoc overlay. By extracting and visualising the attention weights at each decision step, it becomes possible to see which aircraft the policy was attending to when it recommended a particular clearance. This does not explain *why* the agent chose a specific action, but it reveals *which aircraft were being considered* — information that maps naturally to how controllers reason about traffic conflicts (e.g. "I turned aircraft X because of aircraft Y closing from the east").
-
-Work within the Bluebird programme, building on the approach introduced by Brittain, Alvarez, and Breeden, has applied attention-based RL to en route air traffic management, enabling real-time visualisation of the policy's attention across aircraft in the sector. This represents a promising complement to the post-hoc methods listed in the [Recommended Techniques](#recommended-techniques-for-evidence) section: where input perturbation and saliency methods answer "which features mattered?", attention weights answer "which aircraft mattered?" — a distinction that is operationally intuitive for supervising controllers.
+Explainability is not only about technical methods that probe the model's internal states. It also requires that the *context* around the model — why it was built this way, what data shaped its behaviour, what design decisions were made and by whom — is documented to a standard that enables independent audit. For a supervised operational trial, this includes traceable design rationale for the reward function and its weightings, provenance of the training data (which sectors, which traffic conditions, which time periods were represented in the BluebirdDT simulations), version control of the trained policy, and structured records of pre-trial qualification outcomes. Without this documentation, even technically sound explainability methods cannot be placed in their proper context: knowing *which features* drove a recommendation is less useful if there is no record of *why those features were included in the state representation* or *how the reward function was designed to weight them*. The trial oversight board and CAA observers need access to this documentation layer to evaluate the trial independently — and it must be maintained as a living record throughout the trial, not produced after the fact.
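+
+As a minimal sketch of what this living documentation layer could look like in machine-readable form, the record below captures the traceability items named above. The schema, field names, and example values are hypothetical illustrations, not artefacts of the actual trial.
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class PolicyReleaseRecord:
+    """One auditable entry per trained policy version, kept under version control."""
+    policy_version: str                  # e.g. a git tag or content hash of the weights
+    reward_weights: dict[str, float]     # centreline / separation / damping weightings
+    reward_rationale: str                # why these weightings were chosen, and by whom
+    training_data_provenance: list[str]  # which BluebirdDT scenario sets shaped the policy
+    mbt_assessment_refs: list[str]       # pointers to formative/summative MBT outcomes
+    known_limitations: list[str] = field(default_factory=list)
+    sign_offs: list[str] = field(default_factory=list)  # named approvals, for traceability
+
+# Hypothetical example entry; every value here is invented for illustration
+record = PolicyReleaseRecord(
+    policy_version="policy-v0.3.1",
+    reward_weights={"centreline": 0.3, "separation": 0.6, "damping": 0.1},
+    reward_rationale="Separation weighted highest following oversight board review.",
+    training_data_provenance=["bluebird-dt/scenario-set-A"],
+    mbt_assessment_refs=["mbt/summative/run-042"],
+)
+```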

## Assurance Focus

@@ -174,8 +170,8 @@ The assurance case should demonstrate that:
1. An ATCO supervising a human trainee can ask "why did you do that?" and receive an answer in operational language. What would the equivalent interaction look like with an RL agent providing decision support, and is it achievable with current explainability methods?
2. If the agent consistently performs well on safety and efficiency metrics but its developers cannot fully explain *why* it makes specific recommendations, should it be considered safe for operational use? How should the assurance case handle the gap between performance evidence and explanatory evidence?
-3. The agent was trained on clean, complete state vectors in simulation, but at trial runtime it depends on live surveillance data streams (radar feeds, flight plan updates) that may be incomplete, noisy, delayed, or temporarily unavailable. How should the assurance case address the quality and reliability of runtime data? If a data feed degrades mid-session — for example, a radar gap or a stale flight plan — the agent's recommendations may become unreliable without the agent or the ATCO having any obvious signal that this has occurred. What continuous validation and monitoring mechanisms are needed to detect data quality issues in real time, and how should the system behave when data integrity cannot be assured?
-4. The agent was qualified in simulation using the MBT framework. How should the assurance case address whether explainability methods validated in simulation remain reliable under live operational conditions, where traffic patterns, weather, and controller behaviour may differ from the training environment?
+3. The agent was trained on clean, complete state vectors in simulation, but at trial runtime it depends on live surveillance data streams (radar feeds, flight plan updates) that may be incomplete, noisy, delayed, or temporarily unavailable. How should the assurance case address the quality and reliability of runtime data? If a data feed degrades mid-session — for example, a radar gap or a stale flight plan — the agent's recommendations may become unreliable without the agent or the ATCO having any obvious signal that this has occurred. What continuous validation and monitoring mechanisms are needed to detect data quality issues in real time, and how should the system behave when data integrity cannot be assured? More broadly, how should the assurance case address whether explainability methods and monitoring mechanisms validated in simulation remain reliable under live operational conditions, where traffic patterns, data quality, and controller behaviour may differ from the training environment?
+4. What governance structures and documentation standards are needed to ensure that the trial can be independently audited — both during and after the trial — and that decisions about the agent's design, training, and operational boundaries are traceable? Who has authority to halt the trial, and on what basis? How should the assurance case address the gap between the organisational oversight needed for a first-of-its-kind RL trial in live airspace and the governance frameworks that currently exist within NATS and the CAA?
5. If the agent's reward function encodes incorrect or incomplete priorities (e.g., underweighting a rare but critical safety scenario), the agent's behaviour will be consistently wrong in ways that are difficult to detect through explainability methods alone. How should the assurance case address the trustworthiness of the reward function itself?
6. Human controllers undergo a structured competency assessment where they must demonstrate and explain their decision-making. The MBT framework has been used to assess rules-based and optimisation-based agents against the same competency standards.[^3] Should the RL agent be assessed against this same framework, and if so, what counts as an "explanation" from a non-human agent? Does the MBT framework need adaptation for RL-specific challenges such as policy opacity and reward-driven behaviour?
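+
+To make the runtime data-quality concern in question 3 concrete, the sketch below shows one simple shape such a monitoring mechanism could take: gating recommendations on the freshness of the feeds they depend on. The thresholds and interface are illustrative assumptions, not values from the trial system.
+
+```python
+import time
+
+# Hypothetical freshness thresholds; real limits would be derived from
+# surveillance system requirements, not chosen as in this sketch.
+MAX_RADAR_AGE_S = 6.0         # e.g. roughly one missed radar sweep
+MAX_FLIGHT_PLAN_AGE_S = 60.0
+
+def data_integrity_ok(last_radar_update_s: float, last_plan_update_s: float) -> bool:
+    """Return False when any input feed is stale, so recommendations are
+    suppressed and the ATCO alerted rather than silently degraded.
+    Timestamps are assumed to come from the same monotonic clock."""
+    now = time.monotonic()
+    return (now - last_radar_update_s <= MAX_RADAR_AGE_S
+            and now - last_plan_update_s <= MAX_FLIGHT_PLAN_AGE_S)
+```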
@@ -183,29 +179,35 @@ The assurance case should demonstrate that:
### S1. Argument Over Reward-Component Attribution

-**Decompose the agent's multi-objective reward into its constituent components (centreline tracking, separation maintenance, action damping) and analyse how the balance between these components shapes the policy's learned behavioural tendencies.** Because RL agents optimise for *cumulative future reward* rather than responding to immediate reward signals, individual recommendations cannot be cleanly attributed to specific reward components in the moment — the agent's actions reflect an overall learned understanding of expected value, not a direct reaction to the current reward. Reward decomposition is therefore most useful at the *policy level*: understanding whether the agent has learned to systematically prioritise safety margins over route efficiency, or vice versa, and how changes to reward weightings produce qualitatively different behaviours. This provides a *developer-facing explanation* of why the agent learned to behave the way it does, and supports regulatory review of whether the reward function encodes appropriate priorities. For explaining *individual decisions*, input perturbation methods — such as testing the policy's response to changes in specific aircraft positions or characteristics (see [Recommended Techniques](#recommended-techniques-for-evidence)) — are likely to have more mileage, as they directly reveal which features of the current state are driving the agent's action.
+Decompose the multi-objective reward into its constituent components (centreline tracking, separation maintenance, action damping) and analyse how the balance between these components shapes the policy's learned behavioural priorities. Because RL agents optimise for cumulative future reward, reward decomposition is most useful at the *policy level* — understanding whether the agent has learned to systematically prioritise safety margins over route efficiency, or vice versa, and how changes to reward weightings produce qualitatively different behaviours. This provides a developer-facing explanation of why the agent behaves the way it does and supports regulatory review of whether the reward function encodes appropriate priorities.
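+
+As a sketch of what policy-level reward decomposition could look like in practice, the snippet below accumulates each reward component separately over evaluation episodes. It assumes a simulator that reports per-component rewards through the `info` dictionary, which is a hypothetical interface rather than the BluebirdDT API.
+
+```python
+from collections import defaultdict
+
+def decompose_rewards(env, policy, n_episodes: int = 100) -> dict[str, float]:
+    """Average each reward component (centreline, separation, damping)
+    over evaluation episodes to expose the policy's learned priorities."""
+    totals: dict[str, float] = defaultdict(float)
+    for _ in range(n_episodes):
+        obs, done = env.reset(), False
+        while not done:
+            obs, _, done, info = env.step(policy.act(obs))
+            # Assumed simulator contract: per-component rewards in `info`
+            for name, value in info["reward_components"].items():
+                totals[name] += value
+    return {name: total / n_episodes for name, total in totals.items()}
+```
+
+Comparing these per-component profiles across policies retrained with different reward weightings is what provides the policy-level evidence this strategy describes.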

-### S2. Argument Over Operational Envelope Characterisation
+### S2. Argument Over Cognitive Compatibility and Controller Workload

-**Systematically map the state-space regions where the agent behaves competently and where it does not, using the MBT curriculum and structured probe scenarios to identify boundaries of reliable performance.** Rather than explaining individual decisions, this strategy characterises the agent's overall competency — where it can be trusted and where human intervention is most likely to be needed. This is the most technically tractable approach given current RL explainability methods and aligns with how human controller competency is assessed. One method for mapping the envelope is comparative baseline testing: evaluating the RL agent against a rules-based agent on the same MBT scenarios, using divergences in behaviour to identify edge cases or novel strategies. During the trial, operational data progressively extends the characterisation beyond what was achievable in simulation alone.
+Ensure that the explanations provided to the supervising ATCO support — rather than degrade — situational awareness and operational decision-making. This strategy addresses the tension between providing enough explanation for trust calibration and providing so much that it becomes a cognitive burden competing with the controller's monitoring task. Evidence includes workload assessments, simulation trials measuring ATCO response times with and without explanations, and analysis of how explanation timing, modality, and complexity interact with the operational tempo of en route ATC and the voice communication loop.

-### S3. Argument Over Layered Explanation Architecture
+### S3. Argument Over Runtime Monitoring and Operational Envelope

-**Design separate explanation interfaces for different audiences: a simplified operational layer for the supervising ATCO (concise summaries of what the agent is recommending and why), a technical layer for the trial engineering team (reward decomposition, policy entropy, saliency, and value-function monitoring as a real-time confidence signal), and a structured evidence layer for regulators (TEA-formatted claims, MBT assessment results, and behavioural testing outcomes).** This acknowledges that a single explanation cannot serve all audiences and builds the explanation system as a deliberate multi-layered architecture.
+Continuously monitor the agent's confidence and input distribution during the trial, and characterise in advance where the agent behaves competently and where it does not. Pre-trial, this involves mapping the operational envelope using the MBT curriculum, structured probe scenarios, and comparative baseline testing against rules-based agents. At runtime, out-of-distribution detection, value-function monitoring, and policy entropy provide real-time signals that the agent is operating within its validated boundaries — or that it is not, triggering enhanced human oversight or reversion to manual control.
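+
+The runtime signals named above can be combined into a single supervision gate, sketched below. The thresholds are placeholders that would need calibration against the MBT scenario suite, and the inputs are assumed to come from the deployed policy and a separate out-of-distribution detector.
+
+```python
+import numpy as np
+
+# Placeholder thresholds: calibration against validated MBT scenarios is assumed
+ENTROPY_MAX = 1.2     # nats; high policy entropy signals an uncertain agent
+VALUE_MIN = -50.0     # unusually low value estimates signal a poor expected outcome
+OOD_SCORE_MAX = 0.9   # score from a separate out-of-distribution detector
+
+def within_validated_envelope(action_probs: np.ndarray,
+                              value_estimate: float,
+                              ood_score: float) -> bool:
+    """Return False when any signal suggests the agent has left the region
+    in which it was validated, triggering enhanced oversight or reversion."""
+    entropy = -float(np.sum(action_probs * np.log(action_probs + 1e-12)))
+    return (entropy <= ENTROPY_MAX
+            and value_estimate >= VALUE_MIN
+            and ood_score <= OOD_SCORE_MAX)
+```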

-### S4. Argument Over Reversion and Handover Protocol
+### S4. Argument Over Governance and Oversight Framework

-**Define clear protocols for when and how the agent's decision-support role should be suspended and full manual control restored, and ensure these protocols are themselves explainable.** In a supervised trial, the ability to revert cleanly to manual control is as important as the system's normal operation. The reversion protocol should include what information the agent provides to the controller during handover (traffic state, active recommendations, perceived conflicts), the conditions that trigger reversion (e.g. value-function confidence dropping below a threshold, out-of-distribution detection, or ATCO override rate exceeding a limit), and how quickly the transition must occur. The trial provides direct evidence of the protocol's effectiveness and identifies gaps that would need to be addressed before wider deployment.
+Establish that the trial has appropriate governance structures: clear roles and responsibilities, defined entry and exit criteria, escalation paths, incident response procedures, and periodic review mechanisms. This strategy argues that even where individual explanations are imperfect, the organisational and procedural wrapper around the trial ensures that problems are detected, investigated, and acted upon. Evidence includes documentation standards (design rationale, training data provenance, known limitations), monitoring of ATCO override patterns, and structured review processes involving the trial oversight board and CAA observers.

## Recommended Techniques for Evidence

The following techniques from the [TEA Techniques library](https://alan-turing-institute.github.io/tea-techniques/techniques/) may be useful when gathering evidence for this assurance case:

-- [Integrated Gradients](https://alan-turing-institute.github.io/tea-techniques/techniques/integrated-gradients/) — Attribute the agent's output to specific input features (e.g., which aircraft's relative position most influenced a turn recommendation), supporting both developer debugging and regulatory evidence of feature relevance
-- [Contrastive Explanation Method](https://alan-turing-institute.github.io/tea-techniques/techniques/contrastive-explanation-method/) — Generate contrastive explanations of the form "the agent recommended turning aircraft A left because aircraft B was closing from the right; had aircraft B been 5 nautical miles further away, no turn would have been recommended". Aligns with how ATCOs reason about traffic conflicts
-- [Partial Dependence Plots](https://alan-turing-institute.github.io/tea-techniques/techniques/partial-dependence-plots/) — Visualise how the agent's policy responds to changes in individual input features (e.g., how turn probability varies with separation distance), supporting developer understanding of learned behaviours and identification of unexpected sensitivities
-- [Out-of-Distribution Detector for Neural Networks](https://alan-turing-institute.github.io/tea-techniques/techniques/out-of-distribution-detector-for-neural-networks/) — Detect traffic scenarios that fall outside the agent's training distribution, flagging situations where the agent's recommendations (and their explanations) should not be trusted and enhanced human oversight is required
-- [Permutation Importance](https://alan-turing-institute.github.io/tea-techniques/techniques/permutation-importance/) — Assess which input features the agent relies on most by measuring performance degradation when each feature is randomly shuffled. Validates that the agent uses operationally meaningful features (e.g., separation distance) rather than spurious correlations
+- [Integrated Gradients](https://alan-turing-institute.github.io/tea-techniques/techniques/integrated-gradients/) — Attribute the agent's output to specific input features (e.g. which aircraft's relative position most influenced a turn recommendation), supporting developer debugging and regulatory evidence of feature relevance
+- [Partial Dependence Plots](https://alan-turing-institute.github.io/tea-techniques/techniques/partial-dependence-plots/) — Visualise how the agent's policy responds to changes in individual input features (e.g. how turn probability varies with separation distance), supporting developer understanding of learned behaviours. Note: assumes feature independence when averaging, so should be complemented with instance-level analysis where features are correlated
+- [Permutation Importance](https://alan-turing-institute.github.io/tea-techniques/techniques/permutation-importance/) — Assess which input features the agent relies on most by measuring performance degradation when each feature is randomly shuffled; validates that the agent uses operationally meaningful features (e.g. separation distance) rather than spurious correlations
+- [Contrastive Explanation Method](https://alan-turing-institute.github.io/tea-techniques/techniques/contrastive-explanation-method/) — Generate contrastive explanations of the form "the agent recommended turning aircraft A left because aircraft B was closing from the right; had aircraft B been 5 nautical miles further away, no turn would have been recommended". Contrastive explanations are argued to be the most cognitively natural form of explanation and align with how ATCOs reason about traffic conflicts (see the sketch after this list)
+- [Prototype and Criticism Models](https://alan-turing-institute.github.io/tea-techniques/techniques/prototype-and-criticism-models/) — Identify representative traffic scenarios where the agent behaves as expected (prototypes) and scenarios where its behaviour is unexpected or atypical (criticisms), helping ATCOs build intuition about the agent's behavioural tendencies through concrete examples
+- [Human-in-the-Loop Safeguards](https://alan-turing-institute.github.io/tea-techniques/techniques/human-in-the-loop-safeguards/) — Design structured checkpoints where the supervising ATCO reviews and approves agent recommendations before they are acted upon, with defined intervention criteria and escalation paths
+- [Out-of-Distribution Detector for Neural Networks](https://alan-turing-institute.github.io/tea-techniques/techniques/out-of-distribution-detector-for-neural-networks/) — Detect traffic scenarios that fall outside the agent's training distribution, flagging situations where the agent's recommendations should not be trusted. Note: originally designed for supervised classification; application to RL policy networks requires adaptation of the temperature scaling approach to action probability distributions
+- [Runtime Monitoring and Circuit Breakers](https://alan-turing-institute.github.io/tea-techniques/techniques/runtime-monitoring-and-circuit-breakers/) — Continuous surveillance of agent metrics (recommendation rates, override frequencies, value-function trends) with automated protective actions when thresholds are exceeded, supporting both real-time safety and post-trial analysis
+- [Safety Envelope Testing](https://alan-turing-institute.github.io/tea-techniques/techniques/safety-envelope-testing/) — Systematically evaluate the agent's performance at operational boundaries — high traffic density, unusual geometries, degraded data quality — to characterise where it can be trusted and where enhanced human oversight is needed
+- [Model Cards](https://alan-turing-institute.github.io/tea-techniques/techniques/model-cards/) — Standardised documentation recording the agent's architecture, training process, performance characteristics, known limitations, and intended use, enabling independent audit and regulatory review
+- [Model Development Audit Trails](https://alan-turing-institute.github.io/tea-techniques/techniques/model-development-audit-trails/) — Immutable records of all design decisions, reward function changes, training configurations, and evaluation results throughout the agent's development lifecycle, providing evidence of due diligence for the trial oversight board and CAA
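+
+As an illustration of the perturbation sweep behind contrastive explanations of the kind quoted above, the sketch below moves the conflicting aircraft progressively further away until the recommended action changes. The `move_intruder_away` helper is hypothetical; how the state is edited depends on the actual state layout.
+
+```python
+import numpy as np
+
+def separation_flip_point(policy, state, move_intruder_away,
+                          extra_nm=np.arange(0.5, 20.5, 0.5)):
+    """Sweep the intruder further out until the recommendation flips.
+
+    `move_intruder_away(state, d)` is a hypothetical helper returning a
+    copy of `state` with the conflicting aircraft displaced d nautical
+    miles further out along its current bearing.
+    """
+    baseline = policy.act(state)
+    for d in extra_nm:
+        if policy.act(move_intruder_away(state, d)) != baseline:
+            return d  # "had aircraft B been d NM further away, no turn"
+    return None  # recommendation stable across the swept range
+```

## Further Reading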