Iterative Research Process for Analyzing Runtime Governance in AI Systems

Shared conversation from DR-Tulu Research AssistantDR Tulu 8B (1.9k step)

Use an iterative research process: (1) collect recent papers on runtime governance of AI systems, (2) extract claims about system design and failure modes, (3) identify contradictions across papers, (4) revisit the literature to resolve gaps, and (5) produce a final structured synthesis of disagreements with supporting citations.

U
DR
Analyzed
Search: "LLM runtime governance AI syst..."
Analyzed
Search: "LLM jailbreak detection runtim..."
Analyzed
Search: "NIST AI Risk Management Framew..."
Analyzed
Search: "LLM runtime jailbreak detectio..."
Analyzed

Runtime Governance of AI Systems: Design, Failures, and Disagreements (2023–2025)

What "runtime governance" means and why it matters

Runtime governance covers the controls, monitoring, and human-in-the-loop processes deployed during inference to ensure safety, security, compliance, and reliability as models and agents operate in production. Practitioners describe guardrails as runtime filters that validate inputs and outputs against policies (e.g., safety, security, compliance) before responses reach users, explicitly positioning them as enforcement mechanisms between the model and the end-user or downstream systems. Guardrails are runtime controls that validate inputs and outputs against security, safety, and compliance policies before responses reach users[5] Guardrails moderates risky issues in prompts and responses, such as hallucinations, safety violations, prompt injection attacks, and jailbreaking[1] Guardrails.ai positions itself as an AI reliability platform for building, governing, and scaling production GenAI across any LLM and deployment environment[4]. Vendor platforms emphasize that governance must extend to agent workflows and infrastructure: Galileo and others propose "agent reliability" stacks that evaluate, observe, and apply guardrails across multi-step, tool-using agents. Galileo is the AI agent reliability platform that helps enterprise engineering teams evaluate, observe, and guardrail AI agents and LLM applications[6] Layered AI observability, from LLMs to agent workflows to platform metrics, is argued to enable proactive governance, risk mitigation, and safe operation[9].

Standards articulate high-level functions for operationalizing these controls. The NIST AI Risk Management Framework (AI RMF) frames governance as a cycle of Govern, Map, Measure, and Manage, with explicit "Operation and Monitoring" tasks including logging, anomaly reporting, alerting, and ongoing assessment of risk control effectiveness. NIST has developed a framework to better manage risks to individuals, organizations, and society associated with artificial intelligence (AI)[21] NIST AI RMF is described as a structured approach to identify, assess, and manage AI risks via functions such as Govern, Map, Measure, and Manage[29] NIST AI 600-1 instructs to monitor the robustness and effectiveness of risk controls and lists AI Actor Tasks, including Operation and Monitoring, Deployment, Governance and Oversight[28] Monitoring involves automated tools that log activities, report anomalies, and alert on potential noncompliance[27]. For agentic systems, the Cloud Security Alliance's agentic profile maps identity, authorization, security, and risk-management controls that translate RMF concepts into agent-specific operational requirements. CSALL's Agentic NIST AI RMF profile aims to develop guidelines addressing identity and authorization, security and risk management, and deployment for AI agents[24].

Core runtime mechanisms and tooling patterns

Across sources, a layered stack emerges:

  • Policy-based guardrails and validation: input sanitization, output-based safety/security/compliance checks, refusal and content filtering, and rule-based de-duplication. Guardrails moderates risky issues in prompts and responses such as hallucinations, safety violations, prompt injection, and jailbreaking[1] Guardrails validate inputs and outputs against policy before responses reach users[5] LLM guardrails are systems and mechanisms designed to limit and guide LLM behavior to keep outputs within safety and policy boundaries[10].
  • Detection and judges at inference time: specialized detectors for jailbreaks and prompt injection, sometimes implemented as an LLM "judge" that flags unsafe intent, combined with automated filtering or rejections. Defenses include behavioral AI and runtime monitoring to detect jailbreak attempts[14] A jailbreak detector uses an LLM-based judge to flag prompts attempting to bypass safety measures via prompt injection[19] JailGuard is a universal detection framework deployed atop LLM systems for prompt-based attacks across modalities[38] Latent Sentinel proposes real-time jailbreak detection using layer-wise probes as a lightweight add-on[32].
  • Inference-time defenses: methods that operate without changing the base model, such as gradient-based signatures, latent probes, or constrained decoding to detect or block unsafe prompts. Gradient-based analysis compares a prompt's gradient signature against unsafe references to detect jailbreaks at inference time without fine-tuning[12] Latent Sentinel implements real-time jailbreak detection via layer-wise probes appended to the model stack[32] NVIDIA NeMo Guardrails catalogues jailbreak protection as preventing adversarial manipulation into harmful content[35].
  • Observability and monitoring: metrics about prompts, outputs, latency, cost, safety flags, and quality signals; anomaly detection, alerting, and incident response tooling for continuous operation. LLM observability centers on collecting, analyzing, and visualizing key metrics and signals from LLMs to diagnose and optimize systems[8] Automated monitoring logs activities, reports anomalies, and alerts on potential noncompliance[27] Layered observability from LLMs to agent workflows to platform metrics is argued to enable proactive governance and safe operation[9].

System design patterns at runtime

Production practices increasingly emphasize:

  • Defense-in-depth for prompt injection and jailbreaks: combine input validation, output screening, sandboxing of untrusted content, and independent detection models (including LLM judges) with alerting and human review. Layered defense-in-depth using input/output safeguards is recommended to defend against prompt injection and unsafe actions[11] An LLM-judge-based detector can flag jailbreak attempts to preserve safety[19] Layered defenses include runtime filtering, independent detection, and model hardening[15].
  • Orchestration for agents: instrument tool calls, inter-process actions, and workflow graphs; validate tool arguments and sandbox effects; and monitor drift in behaviors and performance across workflows. Agent reliability platforms are used to evaluate, observe, and guardrail multi-step agent applications[6] Observability should span from LLM outputs to agent workflows and platform metrics to enable governance[9].
  • Continuous monitoring and control assessment: track the effectiveness of safeguards, robustness under load and adversarial conditions, and adjust policies with feedback loops. NIST AI 600-1 prescribes monitoring the robustness and effectiveness of risk controls and defines operation and monitoring tasks[28] Monitoring is described as automated logging, anomaly reporting, and alerting for governance and compliance[27].

Failure modes and risks surfaced at runtime

  • Prompt injection and jailbreaks bypassing guardrails: adversarially crafted instructions can subvert input/output validation and policy filters, especially when exploiting tool-use or memory channels. At inference time, carefully crafted prompt injection or jailbreak attacks exploit LLM vulnerabilities to bypass safety[18] Independent analyses show jailbreaking techniques that bypass safety protocols and guide models to act outside constraints[13].
  • Refusal/over-constraint and under-constraint: policies may overrefuse benign requests or underrefuse risky ones; detection/judge components and guardrails can conflict with useful user intent, creating UX degradation and reduced utility. While quantitative, run-time-specific evidence of these trade-offs remains limited in the current evidence set, multiple practitioner and tooling sources stress the need to calibrate guards and judges to balance safety and helpfulness. Guardrails aim to moderate risky issues but must also avoid unduly blocking legitimate use[1] Guardrails enforce policies but may impose constraints that affect user experience if not well-tuned[5] Runtime monitoring and independent detection are used to detect jailbreaks while minimizing false positives that harm user experience[14].
  • Latency, compute, and scalability: adding inference-time defenses (LLM judges, detectors, constrained decoding) increases inference time and compute, stressing throughput and cost, especially at scale. Defenses for jailbreaks increase inference time and compute requirements compared to a baseline LLM execution[35] Gradient-based detection requires additional computation to compare prompt signatures against unsafe references[12] Latent Sentinel aims to be a lightweight probe add-on for real-time detection, but still adds overhead[32].
  • Distribution shift and agentic complexity: as agents invoke tools and operate over time, behaviors can drift; monitoring must capture distributional changes in outputs, tool-use patterns, and compliance incidents to maintain safety and reliability. Layered observability across agent workflows and platform metrics is argued to be necessary to detect and mitigate drift and bias in operation[9] Adversarial techniques evolve for LLMs and agents; inference-time attacks require continuous monitoring and adaptive defenses[18].

Contradictions and areas of disagreement

  1. Scope and separation of guardrails vs. monitoring/detection:
  • Practitioners often treat guardrails as policy enforcement at the response boundary, while research emphasizes detection and independent evaluators as distinct but complementary layers. This can lead to overlapping responsibilities and unclear failure attribution when a guardrail misfires versus a detector fails. Guardrails act as runtime policy validators at input/output boundaries[5] Runtime monitoring and independent detection are used to identify jailbreaks and reduce bypasses of guardrails[14] Universal detection frameworks are deployed atop LLMs to detect prompt-based attacks, suggesting a separation from guardrails[38].
  • Resolution: Adopt a layered stack where guardrails handle immediate, policy-critical safety gates, detectors raise risk signals for review, and observability feeds continuous improvement of both. Standards like NIST AI RMF support this separation via Govern/Manage and Operation/Monitoring functions. NIST AI RMF's functions (Govern, Map, Measure, Manage) help separate policy governance from operational monitoring[29] NIST AI 600-1 explicitly calls for monitoring the robustness and effectiveness of controls[28].
  1. Policy specificity vs. generalization:
  • Productized guardrails advocate "one-size-fits-many" policy catalogs (e.g., safety, compliance) applied broadly across models and deployments; academic/system work shows attack diversity and modality gaps, raising risks of overfit detection and undetected bypasses on new attack families or modalities. Vendor guardrails present broad categories (security, safety, compliance) for runtime validation across environments[5] JailGuard targets prompt-based attacks across text and image modalities, indicating diversity and modality-specific challenges[38] GCG-based jailbreaks specifically compel LLMs to generate beyond constraints, highlighting the need for modality-aware defenses[36].
  • Resolution: Combine catalog-based rules with attack-specific detectors (e.g., jailbreak-specific judges, gradient/latent probes) and continuous data collection of new attack samples to update detection libraries. Latent Sentinel uses layer-wise probes for real-time detection, while acknowledging added overhead[18] Gradient-based inference-time detection provides attack-specific checks without retraining[12].
  1. Safety effectiveness vs. latency and UX:
  • Evidence consistently shows that stronger runtime defenses add latency and compute; whether this is acceptable depends on risk posture and application constraints. Quantitative, controlled comparisons across workloads are still sparse in the current evidence set. Defenses significantly increase inference time and compute for LLM deployments[35] Inference-time detection adds computation via gradient comparisons[12] Inference-time probes add overhead, though designed to be lightweight[32].
  • Resolution: Deploy AB testing and canarying with traffic splitting and rate limiting; measure latency/UX deltas against safety improvements; implement adaptive gates that scale defenses based on risk signals. Layered defense in depth with input/output safeguards suggests graduated control and monitoring strategies[11] Automated monitoring and alerting support adaptive responses to anomalies and noncompliance[27].
  1. Transparency, explainability, and governance alignment:
  • LLM-based judges and proprietary detectors can improve detection but introduce opacity, raising questions about auditability, drift, and responsibility. Industry guidance stresses transparency and logging for audit trails; standards push for operational monitoring and risk measurement. LLM-judge components can flag jailbreaks but need governance to ensure reliability[19] NIST AI RMF requires monitoring robustness and effectiveness of controls, supporting auditability[28] Logging, anomaly reporting, and alerts are core to governance and compliance monitoring[27].
  • Resolution: Require model cards/datasheets and runtime artifacts (policy versions, detector models, decisions) to be versioned and exposed; adopt open formats for model observability to standardize sharing of detection and governance metrics. Practitioners argue that layered observability across workflows enables proactive governance[9] LLM observability involves standardized collection and visualization of metrics to support governance[8].
  1. Agentic systems: assurance vs. practicality:
  • Standards and profiles (NIST RMF functions and CSALL agent profile) provide high-level operational requirements (identity/authorization, risk management), but practical agent runtimes face open challenges in enforcing context isolation, auditing tool effects, and maintaining safety across long-running workflows. The CSALL agentic RMF profile specifies identity, authorization, security, and risk management for AI agents[24] At inference time, agentic workflows are vulnerable to prompt injection/jailbreaks that exploit tool-use and multi-step execution[18] Agent reliability platforms aim to evaluate, observe, and guardrail LLM agents, reflecting operational challenges[6].
  • Resolution: Introduce per-action validation and provenance tracing for tools, with circuit breakers and rollbacks; enforce least-privilege and dynamic authorization; use runtime guardrails and detectors per action type (e.g., file system, network) and maintain audit trails. Agentic profiles stress identity/authorization and security/risk controls to constrain actions[24] Agent runtime guardrails and evaluation tools are designed to enforce policy across multi-step workflows[6].

Standards and organizational alignment

NIST AI RMF provides the broadest framework, mapping to operationalize governance into functions and actor tasks, with explicit emphasis on monitoring and control effectiveness. NIST AI RMF frames risk management across the AI lifecycle[21] NIST AI 600-1 highlights monitoring of control robustness and effectiveness and defines AI actor tasks for operation and oversight[28]. For agents specifically, the CSALL agent RMF profile operationalizes identity/authorization, security, and risk management controls for multi-agent, tool-using systems. CSALL's agentic profile defines guidelines for identity and authorization, security and risk management, and deployment for AI agents[24]. Vendor guardrails and observability platforms can thus be aligned to these functions to ensure runtime policies and monitoring are traceable and standardized. Guardrails as runtime policy enforcement can be aligned to organizational governance and compliance objectives[5] Observability practices collect and expose the metrics needed for governance and continuous improvement[8].

Evidence gaps and suggested next steps

  • Quantitative evaluations of guardrails and detection pipelines across diverse attack taxonomies and modalities are limited in the current evidence; more controlled benchmarks with latency, accuracy, false-positive/negative, and UX metrics are needed. Universal detection frameworks suggest cross-modality generality but lack broad comparative evaluations[38] Studies report added latency and compute for defenses but few comprehensive run-time performance comparisons[35].
  • Field studies and incident retrospectives comparing different stacks (policy-first vs. detection-first, agent-aware orchestration) are scarce; richer case studies would clarify governance effectiveness and failure modes in production. Scholarly/industry sources describe a spectrum of adversarial techniques at inference time but call for more empirical evidence[18] Practitioners argue for layered observability yet provide few standardized field measurements[9].
  • Tool-agnostic operational frameworks are needed to compare guardrail/detection libraries and unify observability; open standards for runtime artifacts (policy, detector versions, provenance) would ease audits and portability. NIST AI RMF's monitoring and control-effectiveness mandates suggest a path for artifact standardization[28] Observability frameworks should centralize metrics and artifacts to support governance[8].

Practical recommendations for runtime governance

  • Layer controls: combine input/output validation, independent detection (including LLM judges), and run-time risk scoring with escalation to human oversight. Layered defenses include runtime filtering, independent detection, and model hardening[15] LLM-based judges can flag jailbreak intent at inference time[19].
  • Monitor and learn: implement standardized observability over prompts, outputs, tool calls, latency, and safety flags; feed insights to adaptive policy tuning and detector updates. LLM observability requires collecting and visualizing key metrics and signals[8] Monitoring should assess the robustness and effectiveness of risk controls[28].
  • Engineer for trade-offs: quantify latency and UX impacts when adding guards and detectors; use AB canaries, rate limiting, and adaptive gating to manage risk vs. performance. Defenses increase inference time and compute[35] Defense-in-depth with layered safeguards supports graduated control[11].
  • Align to standards: map policies and controls to NIST AI RMF functions and actor tasks and, for agents, to CSALL agent profiles to ensure identity/authorization and risk controls are present and traceable. NIST AI RMF defines operation and monitoring tasks to manage and assess controls[28] CSALL agent profiles specify identity/authorization and security/risk management for agents[24].
  • Prepare for agentic risks: instrument tool-use with per-action validation and provenance; enforce least privilege; add circuit breakers and rollbacks for long-running workflows. Inference-time attacks exploit multi-step agent workflows; defenses must include inference-time protections[18] Agent reliability platforms propose evaluation, observability, and guardrails across multi-step agents[6].

Summary

Recent work converges on a layered runtime governance pattern: policy-based guardrails at the model boundary, inference-time detection and judges, and comprehensive observability and monitoring to assess control effectiveness and adapt policies in production. Guardrails can moderate risky issues like hallucinations, safety violations, prompt injection, and jailbreaking[1] Universal detection frameworks such as JailGuard deploy atop LLMs to detect prompt-based attacks across modalities[38] NIST AI RMF prescribes monitoring the robustness and effectiveness of risk controls and defines operation and oversight tasks[28]. Persistent disagreements center on the separation and specialization of guardrails versus detection, the balance of specificity and generalization, and trade-offs between safety effectiveness and latency/UX; resolving them requires standardized runtime benchmarks, field studies, and artifact-centric observability aligned to recognized frameworks like NIST AI RMF and CSALL agent profiles. Vendor guardrails emphasize runtime policy validation across environments[5] Inference-time defenses increase compute and latency[35] Agent-focused profiles clarify identity, authorization, security, and risk management controls for operationalizing governance[24].

DR Tulu can make mistakes. Please double-check the responses.

This is a read-only view of a shared conversation. Try DR-Tulu Research Assistant

Sources

38
Search: "LLM runtime governance AI..."
1

Fiddler Guardrails moderates risky issues in prompts and responses, such as hallucinations, safety violations, prompt injection attacks, and jailbreaking ...

fiddler.ai

2

Learn how to gain visibility into your LLM systems with effective monitoring practices. Discover key metrics, troubleshooting techniques, and best practices ...

konghq.com

3

Explore best practices for creating and evaluating LLM guardrails to secure your AI applications.

datadoghq.com

4

The AI Reliability Platform. The guardrails framework for building, governing, and scaling production GenAI across any LLM and deployment environment.

guardrailsai.com

5

Guardrails are runtime controls that validate inputs and outputs against security, safety, and compliance policies before responses reach users.

openlayer.com

6

> Galileo is the AI agent reliability platform that helps enterprise engineering teams evaluate, observe, and guardrail AI agents and LLM applications.

galileo.ai

7

Layered AI observability, from LLMs to agent workflows to platform metrics, is what enables proactive governance, risk mitigation, and safe ...

linkedin.com

8

LLM observability refers to the practice of collecting, analyzing, and visualizing key metrics and signals from large language models (LLMs) ...

witness.ai

9

AI observability is central to governance, giving leaders real-time visibility into drift, bias, safety issues, and compliance signals.

appinventiv.com

10

LLM guardrails are systems and mechanisms designed to limit and guide the behavior of AI models. Their purpose is to ensure that generated outputs stay within ...

orq.ai

Search: "LLM jailbreak detection r..."
11

Safeguard enterprise LLM applications against prompt injection. Learn how to implement layered defense in depth using input, output, ...

redhat.com

12

In this project, we first establish a comprehensive taxonomy of jailbreak methods and training-time defenses, followed by a proposal of robust ...

arxiv.org

13

This blog will explore the various jailbreaking techniques. We will discuss them with examples and understand how they bypass LLM security protocols.

pillar.security

14

Jailbreaking attacks manipulate LLM inputs to bypass safety controls. Learn how behavioral AI and runtime monitoring defend against prompt ...

sentinelone.com

15

Learn how prompt injection and LLM jailbreaks work and apply layered defenses: runtime filtering, independent detection, model hardening, ...

blockchain-council.org

16

Prompt injection is a type of prompt attack that manipulates an LLM-based AI system by embedding conflicting or deceptive instructions.

lakera.ai

17

This repository centralizes and summarizes practical and proposed defenses against prompt injection.

github.com

18

A spectrum of adversarial attack techniques has emerged to exploit LLM vulnerabilities. At inference time, prompt injection or jailbreak attacks use carefully ...

sciencedirect.com

19

The Jailbreak detector uses an LLM-based judge to flag prompts attempting to bypass AI safety measures via prompt injection techniques such as ...

hiddenlayer.com

20

Explore essential guardrails for safe AI implementation in LLMs. Discover best practices to mitigate risks and ensure responsible use.

kili-technology.com

Search: "NIST AI Risk Management F..."
21

NIST has developed a framework to better manage risks to individuals, organizations, and society associated with artificial intelligence (AI).

nist.gov

22

Step-by-Step Implementation of the AI RMF · Step 1: Establish a Strong Governance Structure · Step 2: Conduct a Comprehensive Risk Assessment.

ateam-oracle.com

23

The NIST AI RMF is voluntary but foundational. What it actually asks for, how it maps to the EU AI Act and ISO 42001, and how to implement ...

modulos.ai

24

The initiative aims to develop voluntary guidelines for AI agents addressing identity and authorization, security and risk management, and ...

labs.cloudsecurityalliance.org

25

Discover what AI risk management is, why it matters, and how to apply the NIST AI RMF with governance strategies and best practices.

witness.ai

26

Learn how to adopt the NIST AI RMF using Veeam and Securiti AI. Practical guide covering governance, risk mapping, LLM security, and AI ...

veeam.com

27

Monitoring involves automated security tools that log activities, report anomalies, and alert administrators to potential noncompliance issues. Security teams ...

paloaltonetworks.com

28

Monitor the robustness and effectiveness of risk controls ... AI Actor Tasks: AI Deployment, Governance and Oversight, Operation and Monitoring.

nvlpubs.nist.gov

29

A voluntary, sector-agnostic framework that helps organisations manage AI risks through the four functions: Govern, Map, Measure, Manage. How is it used in 2025 ...

digital.nemko.com

30

The NIST AI RMF is a structured approach designed to help organizations identify, assess, and manage risks associated with AI.

cybersaint.io

Search: "LLM runtime jailbreak det..."
31

A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.).

github.com

32

This paper proposes a method named Latent Sentinel to help LLMs detect jailbreak attacks. The proposed method is implemented by appending ...

openreview.net

33

This guide shows how to detect those moves early, reduce impact, and stay compliant. How AI jailbreaking puts security at risk. Jailbreaks punch ...

statsig.com

34

This dual capability ensures that the model is not only effective in detecting jailbreak prompts but also in evaluating their attack success.

ieeexplore.ieee.org

35

These defenses not only increase inference time but also the compute requirement from a system that needs to execute an LLM such as Llamaguard (Llama Team and ...

ojs.aaai.org

36

This paper focuses on a specific attack method known as the greedy coordinate gradient (GCG) jailbreak attack, which compels LLMs to generate responses beyond ...

mdpi.com

37

Jailbreak protection helps prevent adversarial attempts from bypassing safety measures and manipulating the LLM into generating harmful or unwanted content.

docs.nvidia.com

38

We propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities.

dl.acm.org